PhD Thesis: Greybox Automatic Exploit Generation for Heap Overflows in Language Interpreters

Over the summer I defended my PhD thesis. You can find it here.

To give a super quick summary (prior to a rather verbose one ;)):

  • Pre-2016 exploit generation was primarily focused on single-shot, completely automated exploits for stack-based buffer overflows in things like network daemons and file parsers. In my opinion, the architecture of such systems is unlikely to enable exploit generation for more complex bug classes, different target software, or in the presence of mitigations.
  • Inspired by the success of fuzzing at bug finding, I wanted to bring greybox input generation to bear on exploit generation. As a monolithic problem, exploit generation is too complex a task for this to be feasible, so I set about breaking exploit generation down into phases and implementing greybox solutions for each stage. I designed these phases to be composable, and used a template language to enable communication of solutions between phases.
  • Composable solver solutions and a human-writable template language give two new, powerful capabilities for an exploit generation engine: 1) “Human in the loop” means we can automate the things that are currently automatable, while allowing a human to solve the things we don’t have good solvers for, and 2) if solutions are composable, and if a mock solution can be produced for any stage, then we can solve stages out of order. This leads to efficiency gains when a solution to one stage is expensive to produce: we can mock it out, check whether such a solution would even be useful to later stages, and only then come back and solve for it. In practice I leveraged this to assume particular heap layouts, check if an exploit can be created from that point, and only come back and try to achieve a given heap layout if it turns out to enable an exploit.
  • My belief is that the future of exploit generation systems will be architectures in which fuzzing-esque input generation mechanisms are applied to granular tasks within exploit generation, and the solutions to these problems are composed via template languages that allow for human-provided solutions as necessary. Symbolic execution will play a limited but important role when precise reasoning is required, and other over-approximate static analyses will likely be used to narrow the search problems that both the input generation and symex engines have to consider.
  • You’ll find a detailed description of the assumptions I’ve made in Chapter 1, as well as an outline of some of the most interesting future work, in my opinion, in the final chapter.

Background

I originally worked on exploit generation in 2009 (MSc thesis here), and the approach I developed was to use concolic execution to build SMT formulas representing path semantics for inputs that cause the program to crash, then conjoin these formulas with a logical condition expressing what a successful exploit would look like, and ask an SMT solver to figure out what inputs are required to produce the desired output. This works in some scenarios where there are limited protection mechanisms (IoT junk, for example), but has a number of limitations that prevent it from being a reasonable path to real-world automatic exploit generation (AEG) solutions. There’s the ever present issue with concolic execution that scaling it up to something like a web browser is a bit of an open problem, but the killer flaw is more conceptual, and it is worth explaining as it is the same flaw found at the heart of every exploit generation system I am aware of, right up to and including everything that participated in the DARPA Cyber Grand Challenge.
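
To make the shape of that approach concrete, here is a toy sketch using Z3's Python API (illustrative only, not the actual tool from my MSc work): a path condition over symbolic input bytes is conjoined with an exploit condition pinning the bytes that overwrite the saved return address to an attacker-chosen value.

from z3 import BitVec, BitVecVal, Concat, Solver, sat

# Toy example: 8 symbolic input bytes; suppose taint analysis tells us
# bytes 4..7 overwrite the saved return address (little-endian).
inp = [BitVec("in%d" % i, 8) for i in range(8)]
path_cond = inp[0] == ord("A")                      # e.g. a header check on the path
saved_eip = Concat(inp[7], inp[6], inp[5], inp[4])  # the value that lands in EIP
exploit_cond = saved_eip == BitVecVal(0x41414141, 32)  # made-up trampoline address

s = Solver()
s.add(path_cond, exploit_cond)
if s.check() == sat:
    m = s.model()
    # Concrete input bytes for the candidate exploit (unconstrained
    # bytes are completed arbitrarily).
    print([m.eval(b, model_completion=True) for b in inp])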

This earlier work (mine and others) essentially treats exploit generation as a two-phase process. First, a path to a location considered to be exploitable is found using a normal fuzzer or symbolic execution tool. ‘Exploitable’ almost always translates to ‘return address on the stack overwritten’ in this case, and this input is then passed to a second stage to convert it to an exploit. That second stage consists of rewriting the bytes in the original input that corrupt a function pointer or return address, and potentially also rewriting some other bytes to provide a payload that will execute a shell, or something similar. This rewriting is usually done by querying an SMT solver.

The conceptual flaw in this approach is the idea that a crashing input as discovered by a fuzzer will, in one step, be transformable into a functioning exploit. In reality, a crashing input from a fuzzer is usually just the starting point of a journey to produce an exploit, and more often than not the initial input leading to that crash is largely discarded once the bug has been manually triaged. The exploit developer will then begin to use it as part of a larger exploit that may involve multiple stages, and leverage the bug as one component piece.

Open Problems Circa 2016

From 2009 to 2016 I worked on other things, but in 2016 I decided to return to academia to do a PhD and picked up the topic again. Reviewing the systems that had participated in the DARPA CGC [3], as well as prior work, there were a few apparent interesting open problems and opportunities:

  1. All systems I found were focused on entirely automated exploit generation, with no capacity for tandem exploit generation with a human in the loop.
  2. No research I could find had picked up on the notion of an ‘exploitation primitive’, which is fundamental in the manual construction of exploits, and will presumably be fundamental in any automated, or semi-automated, approach.
  3. All systems treated AEG as essentially a two-step process: 1) find a path that triggers a bug, 2) transform that path into an exploit in one shot, rather than treating it as a multi-stage ‘programming’ problem [4]. IMO, this is the primary limitation of these systems, as mentioned above, and the reason they are not extendable to more realistic scenarios.
  4. Nobody was really leveraging greybox input generation approaches (fuzzing) extensively, outside of the task of bug finding.
  5. Almost all systems were still focused on stack-based overflows.
  6. Nobody was working on integrating information leaks, or dealing with ASLR, in their exploit generation systems.
  7. Nobody was working on language interpreters/browsers.
  8. Nobody was working on kernels.

It’s likely that points 5-8 are the ‘reason’ for points 1-4. Once you decide to start tackling any of 5-8, by necessity, you must start thinking about multi-stage exploit generation, having human assistance, addressing bug classes other than stack-based overflows, and using greybox approaches in order to avoid the scalability traps in symbolic execution.

Research

With the above in mind (I didn’t touch 6 or 8), the primary goal for my PhD was to see if I could explore and validate some ideas that I think will push forward the state of the art, and form the foundation of AEG tools in the future. Those ideas are:

  1. Real exploits are often relatively complex programs, with multiple distinct phases in which different tasks must be solved. Much like with ‘normal’ program synthesis, it is likely that different tasks will require different solvers. Also, by breaking the problem down we naturally make it easier to solve, as long as solutions to distinct phases are composable. Thus, my first goal was to break the exploitation process down into distinct phases, with the intention of implementing different solvers for each. An interesting side effect of breaking the exploit generation task down into multiple distinct phases was that it allowed me to integrate the idea of ‘lazy’ resolution of these phases, and to solve them out of order. In practice, what this meant was that if a solver for an early stage was much more expensive than a later stage, my system allowed one to ‘mock’ out a solution to the early stage, and only come back to solve it for real once it was validated that having a solution for it would actually lead to an exploit. (A minimal sketch of this staged, mockable pipeline follows this list.)
  2. Once exploit generation is broken down into multiple phases, we have decisions to make about how to solve each. Symbolic execution has powered the core of most exploit generation engines, but it has terrible performance characteristics. OTOH, fuzzing has proven itself to be vastly more scalable and applicable to a larger set of programs. Why hadn’t fuzzing previously taken off as a primary component of exploit generation? Well, if you treat exploit generation as a single, monolithic problem then the state space is simply too large and there’s no reasonable feedback mechanism to navigate it. Greybox approaches that do input generation using mutation really need some sort of gradient to scale, and a scoring mechanism to tell them how they are doing. Once we have broken the problem down into phases it becomes much easier to design feedback mechanisms for each stage, and the scale of the state space at each stage is also drastically reduced. My second goal was therefore to design purely greybox, fuzzing-inspired solutions for each phase of my exploitation pipeline. I purposefully committed to doing everything greybox to see how far I could push the idea, although in reality you would likely integrate an SMT solver for a few surgical tasks.
  3. I knew it would be unlikely I’d hit upon the best solver for each pipeline stage at a first attempt, and it’s even possible that a real solution for a particular pipeline stage might be an entire PhD’s worth of research in itself. Thus it is important to enable one to swap out a particular automated solution for either a human provided solution, or an alternate solver. My third goal was then to figure out a way to enable multiple solvers to interact, be swappable, and to allow for a human in the loop. To enable this I designed a template approach whereby each stage, and a human exploit developer, could write and update exploits in a template language that further stages could understand and process. I wrote about what this looks like in practice here.
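
To make the composition and mocking ideas concrete, here is a minimal sketch (hypothetical names, not Gollum's actual API) of stages mediated by a shared exploit template: each solver reads the template, fills in what it can, and later stages, or a human, take over from wherever it stops.

class Stage:
    # One pipeline stage: reads an exploit template and fills in what
    # it can. solve() may be backed by an automated solver, a human-
    # provided solution, or an alternate implementation.
    def solve(self, template):
        raise NotImplementedError

def run_pipeline(stages, template, mocked=()):
    # Stages in `mocked` are assumed solvable and skipped for now; if
    # the rest of the pipeline succeeds, they are worth solving for real.
    for stage in stages:
        if stage in mocked:
            continue
        result = stage.solve(template)
        if result is None:
            return template      # hand the partial template to a human
        template = result
    return template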

Alongside these goals I had to decide on a bug class and target software type to work with. Stack-based overflows had been done to death, so I decided to go with heap-based overflows which had received no attention. I also decided to target language interpreters (PHP and Python, in practice), as they’re ubiquitous and no exploit generation research had focused on them up to that point.

In the end the research was a lot of fun, and hopefully useful to others. I think the ideas of breaking down exploitation into multiple phases, using greybox solutions as much as possible, and using template-driven exploitation to integrate various solvers, or optionally a human, have real legs and are likely to be found in practical AEG engines of the future.

P.S. Thanks to my colleagues and friends that allowed me to bounce ideas off them, that read drafts, and that were generally awesome. Thanks as well to the anonymous reviewers of CCS, USENIX Security, and IEEE S&P. A lot of words have been written about the problems of the academic review process for both authors and reviewers, but on an individual and personal level I can’t help but appreciate the fact that total strangers took their time to work through my papers and provide feedback.

Further Reading

Thesis – Contains all of my research, along with a lot of introductory and background material, as well as ideas for future work. One of the things I wanted to be quite clear about in my work was the limitations, and in those limitations it’s clear there’s plenty of scope for multiple more PhDs from this topic. In Section 1.2 I outline my assumptions, and by thinking about how to alleviate these you’ll find multiple interesting problems. In Section 6.1 I explicitly outline a bunch of open problems that are interesting to me. (Btw, if you do end up working on research in the area and would like feedback on anything feel free to drop me a mail).

Automatic Heap Layout Manipulation for Exploitation (USENIX Security 2018) – A paper on the heap layout problem, and a greybox solution for it. Also discusses exploit templates.

Gollum: Modular and Greybox Exploit Generation for Heap Overflows in Interpreters (ACM CCS 2019) – A paper on an end-to-end architecture for exploit generation using entirely greybox components.

Gollum: Modular and Greybox Exploit Generation for Heap Overflows in Interpreters

At the upcoming ACM Conference on Computer and Communications Security (CCS) I’ll be presenting a paper on Automatic Exploit Generation (AEG), with the same title as this blog post. You can find the paper here. In the paper I discuss a system for automatically discovering primitives and constructing exploits using heap overflows in interpreters. The approach taken in the paper is a bit different from most other AEG solutions in that it is entirely greybox, relying on lightweight instrumentation and various kinds of fuzzing-esque input generation. The following diagram shows the stages of the system, and each is explained in detail in the paper. 

[Figure: workflow diagram showing how Gollum produces exploits and primitives]

In terms of evaluation, I used 10 vulnerabilities in the PHP and Python interpreters as tests, and provided these as input to Gollum for it to use in its search for primitives and to build exploits.

[Figure: exploit generation and primitive search results]

There are three main takeaways from the paper that I think are worth highlighting (see the paper for details!):

1. AEG is a multi-stage process, and by breaking the problem into distinct phases it becomes reasonable to attack a number of those phases using a fuzzing-esque combination of lightweight instrumentation and relatively dumb input generation. Traditionally, AEG systems used symbolic execution as their main driver and, while there are some positives to this, it also encounters all of the scalability issues that one expects with symbolic execution. In a paper last year at USENIX Security, I showed how with lightweight instrumentation one could use the existing tests of an application, combined with a fuzzer, to discover language fragments that could be used to perform heap layout manipulation, as well as to allocate interesting objects to corrupt. In the upcoming CCS paper, I show how one can use a similar approach to also discover exploitation primitives, and in certain situations to even build exploits. It’s worth noting that in their paper on FUZE, Wu et al. take a similar approach, and you should check out their paper for another example system. My guess is that in the next couple of years fuzzing-driven exploit generation is likely to be the predominant flavour, with symbolic execution being leveraged in scenarios where its bit-precise reasoning is required and state-space explosion can be limited.

2. When automatically constructing exploits for heap overflows one needs a solution for achieving the desired heap layout and then another solution for building the rest of the exploit. In the CCS paper I introduce the idea of lazy resolution of tasks in exploit generation. Essentially, this is an approach for solving the problems of heap layout manipulation and the rest of the exploit generation in the reverse order. The reason one might want to do this is simple: in an engine where the process for achieving the heap layout can potentially take much longer than the one to create the rest of the exploit (as is the case in Gollum), it makes sense to check if it is feasible to produce an exploit under the assumption that the more difficult problem is solvable, and then only bother to solve it if it enables an exploit. Specifically, in the case of Gollum, I constructed a heap allocator that allows you to request a particular layout, then you can attempt to generate the exploit under the assumption the layout holds, and only later figure out how to achieve it once you know it enables the exploit.
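
As a rough illustration of lazy resolution (the names and interfaces here are illustrative, not Gollum's real ones), the flow looks something like this:

class CooperativeHeap:
    # Toy stand-in for the modified allocator: it records layout
    # requests and simply grants them, so adjacency holds by construction.
    def __init__(self):
        self.requests = []
    def require_adjacent(self, src_id, dst_id):
        self.requests.append((src_id, dst_id))
        return True

def lazy_exploit(vuln, build_exploit, solve_layout):
    heap = CooperativeHeap()        # assume the layout problem is solved
    exploit = build_exploit(vuln, heap)
    if exploit is None:
        return None                 # solving the layout would be wasted effort
    layout_input = solve_layout(heap.requests)
    # If the automatic layout solver fails, the exploit-minus-layout is
    # still useful: a human can solve the remaining layout problem.
    return exploit, layout_input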

I think this sort of idea might be more generally useful in exploit generation for other vulnerability types as well, e.g. in an exploit generation system for race conditions, one could have a solution that allows one to request a particular scheduling, check if that scheduling enables an exploit, and only then search for the input required to provide the scheduling. Such an approach also allows for a hybrid of manual and automatic components. For example, in our case we assume a heap layout holds, then generate an exploit from it, and finally try and automatically solve the heap layout problem. Our solution for the heap layout problem has a number of preconditions though, so in cases where those conditions are not met we can still automatically generate the rest of the exploit under the assumption that the layout problem is solved, and eventually the exploit developer can manually solve it themselves.

3. In last year’s USENIX Security paper I discussed a random search algorithm for doing heap layout manipulation. As you might expect, a genetic algorithm, optimising for distance between the two chunks you want to place adjacent to each other, can perform this task far better, albeit at the cost of a more complicated and time consuming implementation.
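
The core loop is simple to sketch (a minimal illustration, not the implementation evaluated in the paper): candidates are sequences of heap-manipulating interactions, and fitness is the distance between the two allocations we want adjacent, with zero meaning solved.

import random

def evolve(population, crossover, mutate, distance, generations=1000):
    # distance(candidate) runs the candidate interaction sequence and
    # returns |addr(first chunk) - addr(second chunk)|; 0 means adjacent.
    for _ in range(generations):
        population.sort(key=distance)
        if distance(population[0]) == 0:
            return population[0]
        parents = population[: len(population) // 2]
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(len(population) - len(parents))]
        population = parents + children
    return None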

[Figure: % of heap layout benchmarks solved by random search (rand) versus the genetic algorithm (evo)]

Automation in Exploit Generation with Exploit Templates

At last year’s USENIX Security conference I presented a paper titled “Automatic Heap Layout Manipulation for Exploitation” [paper][talk][code]. The main idea of the paper is that we can isolate heap layout manipulation from much of the rest of the work involved in producing an exploit, and solve it automatically using blackbox search. There’s another idea in the paper though which I wanted to draw attention to, as I think it might be generally useful in scaling automatic exploit generation systems to more real world problems. That idea is exploit templates.

An exploit template is simply a partially completed exploit where the incomplete parts are to be filled in by some sort of automated reasoning engine. In the case of the above paper, the parts filled in automatically are the inputs required to place the heap into a particular layout. Here’s an example template, showing part of an exploit for the PHP interpreter. The exploit developer wants to position an allocation made by imagecreate adjacent to an allocation made by quoted_printable_encode.

$quote_str = str_repeat("\xf4", 123);

#X-SHRIKE HEAP-MANIP 384
#X-SHRIKE RECORD-ALLOC 0 1
$image = imagecreate(1, 2); 

#X-SHRIKE HEAP-MANIP 384 
#X-SHRIKE RECORD-ALLOC 0 2 
quoted_printable_encode($quote_str); 

#X-SHRIKE REQUIRE-DISTANCE 1 2 384

SHRIKE (the engine that parses the template and searches for solutions to heap layout problems) takes as input a .php file containing a partially completed exploit, and searches for problems it should solve automatically. Directives used to communicate with the engine begin with the string X-SHRIKE. They are explained in full in the above paper, but are fairly straightforward: HEAP-MANIP tells the engine it can insert heap manipulating code at this location, RECORD-ALLOC tells the engine it should record the nth allocation that takes place from this point onwards, and REQUIRE-DISTANCE tells the engine that at this point in the execution of the PHP program the allocations associated with the specified IDs must be at the specified distance from each other. The engine takes this input and then starts searching for ways to put the heap into the desired layout. The above snippet is from an exploit for CVE-2013-2110 and this video shows SHRIKE solving it, and the resulting exploit running with the heap layout problem solved. For a more detailed description of what is going on in the video, view its description on YouTube.
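
The directive format is simple enough that a parser for it fits in a few lines. The following sketch is illustrative (SHRIKE's actual parser lives in the released code), but it shows the shape of the format:

import re

DIRECTIVE = re.compile(
    r"#X-SHRIKE\s+(HEAP-MANIP|RECORD-ALLOC|REQUIRE-DISTANCE)\s*(.*)")

def parse_template(path):
    # Yield (line number, directive name, integer arguments) for every
    # X-SHRIKE directive found in a partially completed exploit.
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            m = DIRECTIVE.match(line.strip())
            if m:
                name, args = m.group(1), m.group(2).split()
                yield lineno, name, [int(a) for a in args]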

So, what are the benefits of this approach? The search is black-box, doesn’t require the exploit developer to analyse the target application or the allocator, and, if successful, outputs a new PHP file that achieves the desired layout and can then be worked on to complete the exploit. This has the knock-on effect of making it easier for the exploit developer to explore different exploitation strategies for a particular heap overflow. In ‘normal’ software development it is accepted that things like long build cycles are bad, while REPLs are generally good. The reason is that the latter supports a tight loop of forming a hypothesis, testing it, refining and repeating, while the former breaks this process. Exploit writing has a similar hypothesis refinement loop and any technology that can make this loop tighter will make the process more efficient.

There’s lots of interesting work to be done still on how exploit templates can be leveraged to add automation to exploit development. In automatic exploit generation research there has been a trend to focus exclusively on full automation and, because that is hard for almost all problems, we haven’t explored in any depth what aspects can be partially automated. As such, there’s a lot of ground still to be broken. The sooner we start investigating these problems the better, because if the more general program synthesis field is anything to go by, the future of automatic exploit generation is going to look more like template-based approaches than end-to-end solutions.

SMT Solvers for Software Security (USENIX WOOT’12)

At WOOT’12 a paper co-written by Julien Vanegue, Rolf Rolles and me will be presented under the title “SMT Solvers for Software Security”. An up-to-date version can be found in the Articles/Presentation section of this site.

In short, the message of this paper is “SMT solvers are well capable of handling decision problems from security properties. However, specific problem domains usually require domain specific modeling approaches. Important limitations, challenges, and research opportunities remain in developing appropriate models for the three areas we discuss – vulnerability discovery, exploit development, and bypassing of copy protection”. The motivation for writing this paper is to discuss these limitations, why they exist, and hopefully encourage more work on the modeling and constraint generation sides of these problems.

A quick review of the publication lists from major academic conferences focused on software security will show a massive number of papers discussing solutions based on SMT technology. There is good reason for this: 1) SMT-backed approaches such as symbolic/concolic execution have proved powerful tools on certain problems, and 2) there are an increasing number of freely available frameworks.

The primary domain where SMT solvers have shone, in my opinion, is in the discovery of bugs related to unsafe integer arithmetic using symbolic/concolic execution. There’s a fairly obvious reason why this is the case; the quantifier free, fixed size, bitvector logic supported by SMT solvers provides direct support for the precise representation of arithmetic at the assembly level. In other words, one does not have to do an excessive amount of work when modeling the semantics of a program to produce a representation suitable for the detection of unsafe arithmetic. It suffices to perform a near direct translation from the executed instructions to the primitives provided by SMT solvers.
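
As a small example of that directness, asking whether a 32-bit addition used for, say, an allocation size can wrap around requires essentially no modelling beyond the operation itself:

from z3 import BitVec, Solver, ULT, sat

a, b = BitVec("a", 32), BitVec("b", 32)
s = Solver()
s.add(ULT(a + b, a))      # unsigned wrap-around: the machine-level sum
                          # of a and b is smaller than a
if s.check() == sat:
    print(s.model())      # concrete witness values for a and b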

The exploit generation part of the paper deals with what happens when one takes the technology for solving the above problem and applies it to a new problem domain: in particular, a domain in which the model produced simply by tracking transformations and constraints on input data no longer contains enough information to inform a solution. In the case of exploit generation, for example, such models do not account for things like the relationship between user input and memory layout. Obviously enough, when reasoning about a formula produced from such a model a solver cannot account for information that is not present. Thus, no amount of computational capacity or solver improvement can produce an effective solution.

SMT solvers are powerful tools and symbolic/concolic execution can be an effective technique. However, one thing I’ve learned over the past few years is that they don’t remove the obligation and effort required to accurately model the problem you’re trying to solve. You can throw generic symbolic execution frameworks at a problem but if you’re interested in anything more complex than low level arithmetic relationships you’ve got work to do!

Misleading the Public for Fun and Profit

Sometimes I read a research paper, usually in the area where computer science meets application, and it’s obvious that the authors are far overstating the practical impact of the work. This can be due to the researchers simply not having any exposure to the practical side of the field they are investigating, and thus accidentally (through ignorance) overstating their claims. Alternatively it can be a purposeful and deliberate attempt to mislead and posture in front of a readership that hopefully won’t know any better.

The first case is presumably simple ignorance but is still lamentable. The obvious solution here is to avoid making such claims at all. If the research cannot stand on its own then perhaps it is not worthwhile? Researchers (both academic and industrial) have a habit of jumping on problems they underestimate, throwing a variety of techniques at them, hoping one sticks and then calling the problem solved. This typically occurs when they are not actually required to solve the problem correctly and robustly but merely as a ‘prototype’. They then get pilloried by anyone who actually has to solve the problem properly and almost always because of a disparity between claims made and the real impact rather than issues with methodology, recording or technical aspects.

The second case is far more insidious and unfortunately I think not uncommon. In academic research it can be easy to impress by combining cutting edge, but not necessarily original, research with a practical problem, sort-of solving parts of it and like before declaring it solved. Often followed quickly by phrases involving ‘game changing’, ‘paradigm shifting’ and so forth. Personally, I think this is a serious problem in the research areas that are less theoretical and more practical. Often the investigators refuse to accept they aren’t actually aware of the true nature of the problem they are dealing with or how it occurs in the real world. Egotistically this is difficult as they are often lauded by their academic peers and therefore surely must grasp the trivialities of the practical world, no? At this point a mixture of ego, need to impress and lack of ethics combine to give us papers that are at best deluded and at worst downright wrong.

Regardless of whether a paper ends up making such claims mistakenly for the first or the second reason, the result is the same. It cheapens the actual value of the research, results in a general loss of respect for the capabilities of academia, deludes the researchers further and causes general confusion as to where research efforts should be focused. Worse still is when attempts to overstate the impact are believed by both the media and other researchers, resulting in a complete distortion between the actual practical and theoretical value of the research and its perceived impact.

Now, on to the paper that has reminded me of this most recently: the latest paper from David Brumley’s group at CMU, titled AEG: Automatic Exploit Generation. I was looking forward to reading this paper as it was the area I worked on during my thesis, but quite honestly it’s incredibly disappointing at best and has serious factual issues at worst. For now let’s focus on the topic at hand: ‘overstating the impact of academic research cheapens it and spreads misinformation’. With the original Patch-Based Exploit Generation paper we had all sorts of stories about how it would change the way in which patches had to be distributed, how attackers would be pushing buttons to generate their exploits in no time at all and in general how the world was about to end. Naturally none of this happened and people continued to use PatchDiff. Unfortunately this is more of the same.

Near the beginning of the most recent paper we have the following claim: “Our automatic exploit generation techniques have several immediate security implications. First, practical AEG fundamentally changes the perceived capabilities of attackers”. This statement is fundamentally flawed. It assumes that practical AEG is currently possible on bugs that people actually care about. This is patently false. I’ve written one of these systems. Did it generate exploits? Yes it did. Is it going to pop any program running on a modern operating system with the kinds of vulnerabilities we typically see? Nope. That would require at a minimum another 2 years of development, and at that point I would expect a system that is usable by a skilled exploit writer as an augmentation of his skillset rather than a replacement. The few times I did use the tool I built for real exploits it was in this context rather than full-blown exploit generation. The system discussed in the mentioned paper has more bells and whistles in some areas and is more primitive in others, and it is still an unfathomable distance from having any impact on a realistic threat model.

Moving on: “For example, previously it has been believed that it is relatively difficult for untrained attackers to find novel vulnerabilities and create zero-day exploits. Our research shows this assumption is unfounded”. It’s at this point the distance between the authors of this paper and the realities of industrial/government/military vulnerability detection and exploit development can be seen. Who are the people we are to believe have this view? I would assume the authors themselves do, and that they then extrapolated to the general exploit-creating/consuming community. This is an egotistical flaw that has been displayed in many forays by academia into the vulnerability detection/exploit generation world.

Let’s discuss this in two parts. Firstly, in the context of the exploits discussed in this paper and secondly in the context of exploits seen in the real world.

In the case of the bug classes considered in the paper this view is entirely incorrect. Anyone who looks at Full Disclosure can regularly see low hanging bugs being fuzzed and exploited in a cookie cutter style. Fuzz the bug, overwrite the SEH chain, find your trampoline, jump to your shellcode bla bla bla rinse and repeat, start a leet h4x0r group and flood Exploit DB. All good fun, no useful research dollars wasted. The bugs found and exploited by the system described are of that quality. Low hanging, fuzzable fruit. The ‘training’ involved here is no more than would be required to set up, install and debug whatever issues come up in the course of running the AEG tool. In our most basic class at Immunity I’ve seen people who’ve never seen a debugger before writing exploits of this quality in a couple of days.

For more complex vulnerabilities and exploits that require a skilled attacker, this AEG system doesn’t change the threat model. It simply doesn’t apply. A fully functional AEG tool that I can point at Firefox before pressing the ‘hack’ button would of course change the threat model (as would any tool that had some sort of impact on real threats; I’d be happy with exploit assistance rather than exploit generation, as long as it works), but we are a long, long way from that. This is not to say we won’t get there, or that this paper isn’t a step in the right direction, but making the claim now is simply laughable. To me it just reeks of a research group desperate to shout ‘FIRST!’ and ignoring the real issues.

A few more choice phrases for your viewing pleasure:

“Automated exploit generation can be fed into signature generation algorithms by defenders without requiring real-life attacks” – Fiction again. This would be possible *if* one had a usable AEG system. The word I presume they are looking for is *could*, “could be fed into”.

“In order to extend AEG to handle heap-based overflows we would need to also consider heap management structures, which is a straight-forward extension” – Again, this displays a fundamental ignorance of what has been required to write a heap exploit for the past six or so years. I presume they heard about the unlink() technique and investigated no further. Automatic generation of heap exploits requires one to be able to discover and trigger heap manipulation primitives, as well as whatever else must be done. This is a difficult problem to solve automatically and one that is completely ignored.

In reference to overflows that smash local variables and arguments that are dereferenced before the function returns and therefore must be valid – “If there is not enough space to place the payload before the return address, AEG can still generate an exploit by applying stack restoration, where the local variables and function arguments are overwritten, but we impose constraints that their values should remain unchanged. To do so, AEG again relies on our dynamic analysis component to retrieve the runtime values of the local variables and arguments” – It’s at this point that I start to wonder if anyone even reviewed this thing. In any program with some amount of heap non-determinism, through normal behaviour or heap base randomisation, this statement makes no sense. Any pointers to heap allocated data passed as arguments or stored as local variables will be entirely different. You may be lucky and end up with that pointer being in an allocated heap region but the chances of it pointing to the same object are rather slim in general. Even in the context of local exploits where you have much more information on heap bases etc. this statement trivialises many problems that will be encountered.

Conclusion

With the above paper I have two main issues. One is with the correctness of some of the technical statements made and the other is with distortion between reality and the stated impact and generality of the work. For the technical issues I think the simple observation that they are there is enough to highlight the problem. The flawed statements on impact and generality are more problematic as they display a fundamental corruption of what a scientific paper should be.

I have a deep respect for scientific research and the ideals that I believe it should embody. Much of this research is done by university research groups, and some of the papers produced in the last century are among humanity’s greatest intellectual achievements. Not all papers can be revolutionary of course, but even those that aren’t should aim to uphold a level of scientific decorum so that they may contribute to the sum of our knowledge. In my opinion this single idea should be at the heart of any university researcher being funded to perform scientific investigation. A researcher is neither a journalist nor a politician, and their papers should not be opinion pieces or designed to promote themselves at the expense of facts. There is nothing wrong with discussing the perceived impact of a paper within the paper itself, but these statements should be subjected to the same scientific rigour as the theoretical content of the paper. If one finds themselves unqualified (as in the above paper) to make such statements then they should be excluded. Facts are all that matter in a scientific paper; distorting them through ignorance is incompetence, distorting them on purpose is unethical and corrupt.

Game Over! Thank you for playing Academia

I’ve recently finished my MSc dissertation, titled “Automatic Generation of Control Flow Hijacking Exploits for Software Vulnerabilities”. A PDF copy of it is available here should you feel the need to trawl through 110 or so pages of prose, algorithms, diagrams and general ramblings. The abstract is the following:

Software bugs that result in memory corruption are a common and dangerous feature of systems developed in certain programming languages. Such bugs are security vulnerabilities if they can be leveraged by an attacker to trigger the execution of malicious code. Determining if such a possibility exists is a time consuming process and requires technical expertise in a number of areas. Often the only way to be sure that a bug is in fact exploitable by an attacker is to build a complete exploit. It is this process that we seek to automate. We present a novel algorithm that integrates data-flow analysis and a decision procedure with the aim of automatically building exploits. The exploits we generate are constructed to hijack the control flow of an application and redirect it to malicious code.

Our algorithm is designed to build exploits for three common classes of security vulnerability; stack-based buffer overflows that corrupt a stored instruction pointer, buffer overflows that corrupt a function pointer, and buffer overflows that corrupt the destination address used by instructions that write to memory. For these vulnerability classes we present a system capable of generating functional exploits in the presence of complex arithmetic modification of inputs and arbitrary constraints. Exploits are generated using dynamic data-flow analysis in combination with a decision procedure. To the best of our knowledge the resulting implementation is the first to demonstrate exploit generation using such techniques. We illustrate its effectiveness on a number of benchmarks including a vulnerability in a large, real-world server application.

The implementation of the described system is approx. 7000 lines of C++. I probably won’t be releasing the code as I’m fairly sure I signed over my soul (and anything I might create) to the University earlier in the year. The two core components are a data-flow/taint analysis library and a higher-level library that uses the former’s API to perform data-flow/taint analysis over x86 instructions (as given to us by Pin). Both of these components are useful in their own right, so I think I’m going to do a full rewrite (with added GUI + DB) and open source the code in the next couple of months. Hopefully they’ll prove useful for others working on dynamic analysis problems.

Exploit generation, a specialisation of testing?

It sounds like a silly question, doesn’t it? Nobody would consider exploit development to be a special case of vulnerability detection. That said, all research on exploit generation that relies on program analysis/verification theory (From now on assume these are the projects I’m discussing. Other approaches exist based on pattern matching over program memory but they are riddled with their own problems.) has essentially ridden on the coat-tails of research and tools developed for test-case generation. The almost standard approach to test-case generation consists of data flow analysis in combination with some sort of decision procedure. We then generate formulae over the paths executed to create inputs that exercise new paths. This is also the exact approach taken by all exploit generation projects.

There are pros and cons to this relationship. For instance, some activities are crucial to both test-case generation and exploit generation, e.g., data flow and taint analysis. Algorithms for these activities are almost standardised at this stage and when we work on exploit generation we can basically lift code from test generation projects. Even for these activities though there are sufficient differences and opportunities presented by exploit generation that it is worth doing some re-engineering. For example, during my research I extended the taint analysis to reflect the complexity of the instructions involved in tainting a location. When building a formula to constrain a buffer to shellcode we can then use this information to pick the locations that result in the least complex formulae. An exploit only needs a single successful formula (usually) so we can pick and choose the locations we want to use; testing on the other hand typically requires exhaustive generation and thus this optimisation hasn’t been previously applied because the benefits are less evident (but still might be a decent way of increasing the number of test cases generated in a set time frame).
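
A minimal sketch of that selection step might look as follows, assuming the taint analysis has already produced a per-byte complexity score (the function and representation here are illustrative, not my original implementation):

def best_shellcode_window(complexity, length):
    # complexity: dict mapping tainted byte address -> formula
    # complexity score. Return (score, start address) of the cheapest
    # contiguous window able to hold `length` shellcode bytes.
    addrs = sorted(complexity)
    best = None
    for i in range(len(addrs) - length + 1):
        window = addrs[i:i + length]
        if window[-1] - window[0] != length - 1:
            continue                      # bytes must be contiguous
        score = sum(complexity[a] for a in window)
        if best is None or score < best[0]:
            best = (score, window[0])
    return best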

The two problems share other similarities as well. In both cases we find ourselves often dissatisfied with the results of single path analysis. When generating an exploit the initial path we analyse might not be exploitable but there may be another path to the same vulnerability point that is. Again in this case we can look to test case generation research for answers. It is a common problem to want to focus on testing different sub-paths to a given point in a program and so there are algorithms that use cut points and iterative back-tracking to find relevant paths. So with such research available one might begin to think that exploit generation is a problem that will be inadvertently ‘solved’ as we get better at test case generation.

Wrong.

With test case generation all test cases are essentially direct derivatives from the analysis of a previous test case. We build a formula that describes a run of the program, negate a few constraints or add on some new ones, and generate a new input. Continue until boredom (or some slightly more scientific measure). What I am getting at is that all the required information for the next test is contained within the path executed by a previous test. Now consider an overflow on Windows where we can corrupt the most significant byte of a function pointer that is eventually used. If you decide to go down the ‘heap spray’ route to exploit this vulnerability you immediately hit a crucial divergence from test case generation. In order to successfully manipulate the structure of a program’s heap(s) we will almost always require information that is not contained in the path executed to trigger the vulnerability initially. Discovering heap manipulation primitives is a problem that requires an entirely different approach to the test case generation approach of data flow analysis + decision procedure over a single path. It is also not a problem that will likely ever be solved by test case generation research as it really isn’t an issue in that domain. Whole classes of vulnerabilities relating to memory initialisation present similar difficulties.

What about vulnerability classes that fit slightly better into the mould carved out by test generation research? One of the classes I considered during my thesis was write-4-bytes-anywhere style vulnerabilities. Presuming we have a list of targets to overwrite in such cases (e.g. the .dtors address) this is a solvable problem. But what if we only control the least significant byte (or word) and can’t modify the address to equal one of the standard targets? Manually one would usually see what interesting variables fall within the controllable range, looking for those that will be at a static offset from the pointer base. But what is an ‘interesting variable’? Let’s assume there are function pointers within that range. How do we automatically detect them? Well we’d need to monitor the usage of all byte sequences within the range we can corrupt. It’s a problem we can approach using data flow/taint analysis but once you start to consider that solution it starts to look a lot like a multi-path analysis problem but over a single path. We are no longer considering just data that is definitely tainted by user input, we are considering data that might be, and as we can only control a single write we have different ‘paths’ depending on what bytes we choose to modify… and we’re doing this analysis over a single concrete path? Fun.

I guess the core issue is that test-case generation and exploit generation are close enough that we can get adequate results by applying the algorithms developed for the former to the latter. To get consistently good results though we need to consider the quirks and edge cases presented by exploit generation as a separate problem. Obviously there are many useful algorithms from test case generation research that can be applied to exploit generation but to apply these blindly misses opportunities for optimisations and improvements (e.g. the formula complexity issue mentioned). On the other hand there are problems that will likely never be considered by individuals working on test case generation; these problems will require focused attention in their own right if we are to begin to generate exploits for certain vulnerability classes.

Automatic exploit generation: Lessons learned so far

Here are a couple of thoughts that are bouncing around in my head as I come to the concluding stages of my v1 prototype. I’ve made a number of design decisions in the initial implementation that I now see issues with, but hopefully by documenting them v2 will be better!

Using in-process analysis tools might not be such a good idea: Early on I decided to use dynamic analysis to gather information about taint data propagation and path conditions. Due to previous work (such as catchconv) using dynamic binary instrumentation frameworks, like Valgrind, I pretty much immediately decided I would do the same. After writing a couple of basic apps for Valgrind, Pin and DynamoRio, I eventually settled on Pin due to its cross platform support and C++ codebase. One critical factor I ignored is that these tools really aren’t designed with malicious code in mind. So, when you do things like trash a stored instruction pointer it can really confuse the DBI tool.

Other problems can occur if the vulnerability ends up writing over several hundred megabytes of junk in the application address space. This can lead to difficult to debug problems, where the memory in use by the injected analysis client is being corrupted, as well as that of the application under test.

More basic, but just as time consuming, problems stem from the fact that these in-process analysis clients are rather difficult to debug once something goes wrong. The three frameworks mentioned vary in their support for debugging and error logging, but in general it is exceedingly annoying to debug anything non-trivial. Simple segfaults have eaten hours of my time and often you are left resorting to printf based ‘debugging’.

The final issue I’ve come across is obvious, but should still be mentioned. Complex runtime instrumentation, such as dataflow analysis, really affects the responsiveness and runtime of the application. My current code, which is admittedly incredibly unoptimised, can increase the runtime of ls from milliseconds to about 20 seconds. This isn’t much of an issue if you don’t need to interact with the application to trigger the vulnerability, but in a case where some buttons need to be clicked or commands entered, it can become a significant inconvenience.

Assisted may be better than automated: The idea of this project is to investigate what vulnerability classes are automatically exploitable, and to develop a prototype that can show the results. I’ve achieved this goal for a sufficient variety of vulnerabilities and shown that automation is in fact possible.

There is a but here though; to continue this project would require constant attention and coding to replace the human effort in exploit generation each time a new class of vulnerability comes along, or a change in exploit technique. As I’ve moved from basic stack overflows to considering more complicated scenarios, the differences in exploit types become more time consuming to encode. Because of this, I intend (when I implement v2 of the tool in the coming months) to move away from complete automation. By putting the effort into providing a decent user interface, it will be possible to inform an exploit writer of the results of data flow and constraint analysis and have them make an educated judgement on the type of exploit to attempt, and specify some parameters. Working from this point of view should make the entire tool much less effort to port between operating systems also.

Information on memory management is very important: This is an obvious point for anyone that has had to write heap exploits in the last 5 years or so. It is near impossible to automatically generate a Linux heap exploit without having some information on the relationship between user input and the structure of the process’s heap. When manually writing an exploit we will often want to force the program to allocate large amounts of memory, and the usual way to do this involves jumping into the code/disassembly and poking around for a while until you find a memory allocation dependent on the size of some user-supplied field, or a loop doing memory allocation with a user-influenceable bound.

Essentially, to have a solution that takes care of both scenarios we need a way to infer relationships between counters and program input. The first paper to discuss how to do this using symbolic execution was published in March of this year, and is a good read for anyone considering implementing this kind of tool.

As I hadn’t added loop detection, or other required functionality described in that paper, my current tool is unable to do the analysis described. I consider this a rather annoying drawback, and it will be among my highest priorities for v2. Hacky solutions are possible, such as modelling the result of strlen type functions on user input, but this would miss a number of scenarios and is in general quite an ugly approach.

Extending to new vulnerability classes

One of the things I really want to support is automatic generation of exploits for modern heap-based vulnerabilities. At the moment I have other features to implement so this has gotten pushed back for the time being. In the meantime, I wanted to see how hard it was to extend my current functionality from stack overflows that trash the EIP to other vulnerability/exploit combinations. So, yesterday I went and found an example from a wargame I used to play. It contains a pretty silly vulnerability, as follows:

(The aim here was to see how hard it was to extend my current code, not to see how complicated I could make a test case. The array pointed to by input is controllable by the user and 256 bytes in size. The function some_func is some benign function that just exits the program)

void func_ptr_smash(char *input)
{
    void (*func_ptr)(int);
    char buffer[248];

    func_ptr = &some_func;
    strcpy(buffer, input);   /* input is 256 bytes, so the copy overruns
                                buffer and smashes func_ptr */

    (*func_ptr)(z);          /* z is an int defined elsewhere; the
                                corrupted pointer is called here */
}

Extending the tool turned out to be pretty simple. In a stack smash that overwrites the stored EIP we attempt to generate a query that expresses the constraint memLocation == trampolineAddr, where memLocation is the value in ESP at the ret instruction, and trampolineAddr is the address of some usable trampoline.

Modifying this to handle a function pointer overwrite can be done in a couple of ways, depending on what parts of the address space are randomised and how generic we want the solution to be. The most straightforward solution is simply to treat a function pointer overwrite like a slight twist on the previous situation. Essentially, instead of a ret instruction popping a tainted value into the EIP, which we can then redirect to a trampoline, we have a call instruction where the argument is tainted and can be used to redirect to a trampoline. So, instead of generating a constraint on the value of ESP we have to express the constraint on whatever register/memory location the call instruction uses as an argument.

Another potential approach is usable if some non-randomised data locations exist that we can use as a shellcode buffer. Once again the same data flow analysis can be used to find the location of a suitable home for shellcode in these areas. In this case we avoid the requirement for a register to contain the address of one of these buffers and can just jump right into it by generating a constraint that specifies the operand to the call is equal to the static memory address of our shellcode.
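
A toy version of the resulting constraint, with made-up addresses, might look like this (the real tool conjoins it with the full path condition relating the operand to program input):

from z3 import BitVec, BitVecVal, Solver, sat

call_operand = BitVec("call_operand", 32)     # tainted register/memory used by the call
TRAMPOLINE = BitVecVal(0x08048abc, 32)        # hypothetical jmp *%eax gadget
STATIC_SHELLCODE = BitVecVal(0x0804a040, 32)  # hypothetical non-randomised buffer

s = Solver()
# First approach: bounce through a register trampoline.
# s.add(call_operand == TRAMPOLINE)
# Second approach: jump straight to shellcode at a static address.
s.add(call_operand == STATIC_SHELLCODE)
if s.check() == sat:
    print(s.model())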

Here is an example run of the tool in which we use the latter method. I tested it on Ubuntu 8.04 which has a non-randomised heap by default and thus we can hardcode the address of shellcode buffers on the heap. The vulnerable program was compiled with -O0 -fno-stack-protector -D_FORTIFY_SOURCE=0, otherwise gcc would have repositioned the overflow buffer and optimised the call so that the address is calculated at compile time.

In conclusion, my current approach was easily extendible in this case due to the similarity between a function pointer overwrite and smashing the stored EIP. Both cases essentially only differ at the point where the tainted memory location or register is moved to the EIP. They are detectable in exactly the same way and have the same symptom; namely, the attempted movement of tainted data into the EIP register. I would hypothesise that any vulnerability* with the same symptom can be dealt with in a similar way. So then we must ask, what other symptoms are there? Well, what can cause a program to crash? We have seen the case where the program attempts to execute data at an unmapped memory location, so that leaves invalid reads and writes.

Exploiting vulnerabilities detected via an invalid read/write

Old glibc heap exploits are a simple example of those that would be detectable as a result of an invalid write. I won’t go into the details of the method but the unlink macro essentially has the following effect, where next->fd and next->bk are under our control:

 *( next->fd + 12 ) = next->bk
 *( next->bk + 8 ) = next->fd
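
To make the classic technique concrete, here is a worked example (all addresses made up) of choosing next->fd and next->bk so the first write lands on a GOT entry:

GOT_FREE  = 0x08049f0c   # hypothetical GOT slot for free()
SHELLCODE = 0x0804a008   # hypothetical heap address of our shellcode

fd = GOT_FREE - 12       # *(fd + 12) = bk  =>  GOT slot := SHELLCODE
bk = SHELLCODE           # *(bk + 8)  = fd  =>  shellcode + 8 gets clobbered,
                         # so the shellcode must begin with a short jump
                         # over that 4-byte corruption.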

In this case the vulnerability will probably be discovered when the application tries the first write. This is a crucial difference between the earlier vulnerabilities and this one when considering how to automatically generate an exploit. Detecting the potential for an exploit is still simple – when an instruction writes memory we can simply check the destination operand to see if it is tainted**. The problem now becomes ‘how do we automatically determine a useful location to write?’. Assuming all data sources are randomised our options are limited somewhat. Depending on the protections in place the DTORS/GOT might be usable. Other than that (and depending on what the value we are corrupting actually points to) we could potentially just change the lower bytes of the location being written and attempt to modify some useful program variable on the heap***. This has the disadvantage of requiring some sort of input from the user, as determining what is a useful variable would seem to be mostly application specific.

If the location being written is on the stack we could potentially modify the lower bytes to change a stored EIP/EBP and then proceed in the same fashion as before. To automate this we could note the value of the ESP when a call instruction occurs and calculate the difference between this value and the location of the variable being written.

To sum these options up we have two potential output types. The first is a direct exploit; It writes the DTORS/GOT or changes the lower bytes of a stack variable to modify a stored EIP and may or may not be possible depending on the protections in place and the location being written. The second is a new program input that may lead to another crash at a different location. It is basically another fuzz input except with potentially some program variables corrupted. In this case, the result is that the program can continue execution and we hope that it hits another exploitable crash further on****.

For a read from an invalid address we also have a couple of options. If the variable being read into is later used for something such as memory allocation or as a bound on a loop/function involved in moving memory we might attempt to control this value. To do this automatically would require some form of static analysis to determine if the variable was ever used in such a context (or perhaps multiple runs of dynamic analysis). Once again, as in the write case, we also have the option to just manipulate the destination read in such a way that it is any valid address and hope that another vulnerability is later triggered. Unlike the write case this is pretty trivial though as we can just read from any address in the .text section.

* Obviously this doesn’t apply to certain vulnerability classes. It applies to most I can think of that we will exploit by somehow changing the execution pointer. Not trust problems, many race conditions, other design flaws etc.
** Some slightly more complicated analysis might be required here as sometimes a destination operand in a write can be legitimately tainted but restricted to the bounds of some safe chunk of data
*** We could also change the metadata of some other chunks on the heap but right now (3am in the morning) I can’t think of an obvious way to leverage this for code execution
**** In some cases the address written might be constrained in such a way that a heap spray is required to ensure it is mapped. Same idea applies for reads. Another potential problem is that all writable data stores might be randomised. In this case heap spraying could again be useful.

Morphing shellcode using CFGs and SAT

A friend of mine is working on a project that involves building a metamorphic engine for x86 assembly. One of the core parts of such a project is a context free grammar describing valid mutations for a given instruction. To use one of his examples, an instruction to put v in r can be modelled in a BNF style grammar as follows:

put_v_in_r(v1,r1) := pushl_v(v1) popl_r(r1) | movl_v_r(v1,r1)

This could be quite useful in the process of automatic exploit generation. I expect the user to provide shellcode expressing what they want the exploit to do, but what if that shellcode is constrained by conditionals in the program? e.g. the shellcode is \x41\x42\x43\x44 but there is a conditional in the program that causes it to abort if the byte \x41 is detected in the input stream.

One solution in this situation is to simply ask the user for different shellcode, but in the presence of complex modifications and conditions it quickly becomes impossible to provide any kind of meaningful feedback on what they should do to pass the filters. How about an automated solution? We have the conditions in propositional form and, due to the other project mentioned, we also have access to a grammar of potential shellcode mutations. If we can combine the two into a new propositional formula, describing the effect of mutating the shellcode using this grammar and the conditions from the program, we might be able to use the power of SMT/SAT solvers to brute force a solution! If a solution is found then we can preserve the effects of the original shellcode, as well as passing the filters.

The main difficulty in implementing this solution comes from encoding the grammar in propositional form. The typical solution to encoding a transition relation as a propositional formula involves iteratively applying the transition relation to existing states until a fixed point is reached. That is, starting with our original shellcode we would expand the first non-terminal to one or more potential shellcodes. We then expand these in the same fashion, and so on until no further new states are added; we then have the set of all potential shellcodes given the starting shellcode and the provided grammar. Obviously, there can be problems storing the states if the number grows exponentially on each expansion, but this is a common verification problem and there are a number of approaches to resolve it, varying from “hope it doesn’t happen” to using a compact storage mechanism like binary decision diagrams. We can then extract the states that contain no non-terminals, i.e. they are a valid x86 instruction stream, conjoin this with the path condition and ask our SAT/SMT solver to enumerate the shellcodes that satisfy these conditions.

There is a bit of a snag with this fixed point approach though. What if our grammar contains an expansion A ->* aB, which states that A expands to some terminal a and some non-terminal B in one or more steps, and also the rule B ->* A? In this case we are no longer guaranteed to reach a fixed point, as we can enter an infinite loop. The obvious solution is to transform our grammar into one that doesn’t contain such rules. We can do this by setting an upper bound, k, on the number of such recursive iterations allowed and including these expansions as new non-terminals. This new grammar has lost some of the potential outcomes of the original but it can now be processed using our fixed point approach. It should be noted that this upper bound on the number of self-recursions isn’t quite as artificial or as limiting as it may seem. There will always be some upper bound on the size of the buffer used for shellcode and thus it may be practical, in some cases, to use quite a low value of k.
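
A minimal sketch of the bounded expansion follows (storage concerns such as BDDs are ignored, and the constraint check is reduced to a simple predicate standing in for the program's real path condition):

def expand(grammar, start, k, ok):
    # grammar: dict mapping non-terminal names to lists of alternative
    # expansions, each a tuple of symbols; terminals are byte strings.
    # start: initial sentential form, e.g. ("put_v_in_r",).
    # ok(code): stand-in for the path condition / bad-byte filter.
    frontier, results = {start}, set()
    for _ in range(k + 1):
        nxt = set()
        for sent in frontier:
            i = next((j for j, s in enumerate(sent) if s in grammar), None)
            if i is None:                # fully terminal: candidate shellcode
                code = b"".join(sent)
                if ok(code):
                    results.add(code)
            else:
                for alt in grammar[sent[i]]:
                    nxt.add(sent[:i] + tuple(alt) + sent[i + 1:])
        frontier = nxt
    return results

Enumerating the returned set and conjoining each candidate with the program's real constraints is then a job for the SAT/SMT solver, as described above.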

(In fact, for a real world implementation it is well worth considering how to involve the length of the buffer we intend to use as we perform the fixed point computation. There is no reliable way to favour shorter shellcodes that also satisfy the program constraints, but one optimisation would be to exclude any shellcodes that have already exceeded the length limit from future expansions.)