Finding use-after-free bugs with static analysis

Earlier this year I had a lot of fun using run-time instrumentation and analysis to tackle a variety of program analysis problems. In doing so I became curious about static solutions to the common problems encountered during dynamic analysis. One of the most obvious issues is that you can only examine those paths that you can execute and thus you need to find some way to drive a program through different paths before you can do any kind of analysis. The flip side of that is that you are guaranteed that any path you analyse is valid and so false positives in that regard are avoided. With those experiences in mind I decided to try my hand at purely static analysis recently (recently as in August, I suck at finding time to document stuff apparently). Static analysis has a large variety of sources of error and annoyances of its own but for certain bug classes it seemed like a purely static approach could find some interesting results with a tolerable level of false positives.

Before I started I considered a few different bug classes in terms of the strengths/weaknesses of static analysis. Certain types of vulnerabilities that depend on the values in registers/memory locations are probably best avoided unless you have the time to either implement a hell of a lot of theoretical ideas to limit the inaccuracy or have the time to sift through thousands of false positives. The bug classes I decided to focus on were those that could generally be reduced to deviations from expected orderings of ‘events’ on a path. Such bug classes include use of memory before initialisation, unchecked use of return values from functions, use after X vulnerabilities (where X is an API call that marks an argument as supposedly unusable e.g. free() or fclose()) and incorrectly ordered sequences of system calls. Eventually I settled on looking for use-after-free vulnerabilities, although the code built to do that works for almost all the mentioned bug classes with a little extending.

(Before continuing I should mention exactly what assumptions and bounds I put on the work I wanted to do. Firstly, I decided to do path-insensitive analysis – that is, I considered all paths through a control flow graph regardless of whether the path condition is satisfiable or not. To do otherwise would have been extra work and for the vulnerability classes I was considering it didn’t turn out to be a big problem.

Secondly, I decided to do data flow analysis over both registers and memory locations. Some static dataflow analysis (I think binaudit) only considers registers, but in my opinion that misses out on a lot of useful information available at practically zero cost. For instance, if we are doing taint analysis and eax is tainted then after mov [ecx+44], eax; I consider the abstract location denoted by the literal [ecx+44] to also be tainted. It remains in the tainted set until the ecx register is untainted but up until that point it is usable as a source of information.
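
To make that concrete, here is a minimal sketch of the idea, assuming operands are reduced to plain strings like 'eax' or '[ecx+44]' (the helper names are illustrative, not lifted from my actual tool):

def propagate_taint(tainted, dst, src):
    # mov-style instruction: the destination is tainted iff the source is
    if src in tainted:
        tainted.add(dst)        # e.g. mov [ecx+44], eax taints '[ecx+44]'
    else:
        tainted.discard(dst)    # overwritten with untainted data

def invalidate_register(tainted, reg):
    # once reg is rewritten, a literal like '[ecx+44]' no longer denotes
    # the same address, so drop any such locations from the analysis
    for loc in [t for t in tainted if t.startswith('[') and reg in t]:
        tainted.discard(loc)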

Thirdly, this project was basically just to give me a taste of the practical side of static analysis so spending a month implementing dataflow analysis for the entire x86 instruction set was out of the question. As a result I implemented the data flow analysis for a small subset of the instruction set I considered to be most relevant to the bug classes considered and wrote a conservative default for all other instructions. That is, for any instructions not specifically considered I make an attempt to guess only their destination operand. The effective result of this is that for such instructions the destination will be removed from whatever analysis I’m doing. This potentially introduces false negatives but also frees up about a month of my life. 😉

Fourthly, when considering loops I only expanded them for one iteration. I haven’t thought too much about the implications of this as there didn’t seem to be a better method without investing time into implementing a bounded loop unwinding technique. It would seem like it might introduce false positives but in my testing I did not encounter anything that could be directly traced back to this weakness. It’s hard to tell if that was due to my small test set, the characteristics of the vulnerabilities I was considering or that my conservative data flow analysis hid the potential false positives.

Fifthly (at this point I’m wishing I’d used 1, 2, 3…), I initially aimed at intraprocedural analysis and then tacked on some interprocedural fun at the end. I’m quite happy with how my interprocedural hackery worked out but there are probably better (and more time consuming) ways to do it.)

Finding equivalent variables

When doing static analysis you can essentially start at any address that suits the problem at hand. This allows us to divide our algorithm for finding use-after-free bugs into two parts (well three actually, but more about that later). The first part of the algorithm is to decide at a free call what memory locations and registers we want to consider free’d. We want to know all variables that contain pointers to the memory location about to be free’d so that we can consider a use of any of these variables to be a bug.

The most common theoretical framework for this problem is called use-def analysis and is common in compiler theory. In such an analysis we take a variable use (e.g. the push of whatever variable is to be free’d) and compute all its definitions at that point by traversing backwards over all paths to that point. We then want to apply this function recursively over all the definition variables we have found, treating them as uses. By modifying the standard use-def algorithm we can do this to find all variables equivalent to the free’d variable.

In my implementation as I traverse the paths backwards from a free call to find definitions I maintain a list of all locations that are used as destination operands in instructions i.e. they are rewritten before the free call. I will refer to this as the clearedVars list. So, if we are looking for the definition of a variable x and it is found at an instruction mov x, y; we know y is equivalent to the free’d variable if and only if y is not a member of clearedVars. Regardless of whether y is in clearedVars or not we then apply our use-def algorithm to find its definitions with the same considerations. This iterative approach is continued until all paths leading to the free call have been analysed and we are left with a set of variables equivalent to the argument to free.

The following snippet of Python code will hopefully help the explanation. It shows the logic applied to the destination operand of a mov instruction.

if dst in state['needDefList']:
    src = insDataFlow.getSrcFor(dst)
    state['needDefList'].append(src)
                    
    if src not in state['clearedVars']:
        if src not in self.equivVarList:
            self.equivVarList.append(src)
    
    state['needDefList'].remove(dst)
    state['clearedVars'].append(dst)
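
A hypothetical driver for the backwards traversal might look like the following, assuming an IDAPython-style helper preds() that yields the instructions flowing into an address, and with processInstruction() applying the per-instruction logic above. The state is copied at each split so needDefList and clearedVars stay path-local (cycle handling, i.e. the single loop expansion, is omitted for brevity):

def findEquivVars(self, freeCallEa, freedVar):
    worklist = [(freeCallEa, {'needDefList': [freedVar], 'clearedVars': []})]
    while worklist:
        ea, state = worklist.pop()
        self.processInstruction(ea, state)  # the mov logic shown above
        for predEa in self.preds(ea):
            # copy the state so each backwards path is analysed independently
            worklist.append((predEa, {'needDefList': list(state['needDefList']),
                                      'clearedVars': list(state['clearedVars'])}))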

When finding the set of equivalent variables we can generally avoid many sources of false positives in static analysis by accepting false negatives. The main source of false positives I encountered (that weren’t caused by my lack of coding talent) derived from considering paths with unsatisfiable conditions.

So at this point we have an algorithm for computing a set of equivalent variables given a starting point and a source variable. In case it isn’t obvious, I should point out that the equivalent variables along each path should be stored separately. Otherwise it becomes an avoidable source of false positives in the next part of our algorithm.

Finding use-after-free instances

Most of the work for this algorithm is in the previous stage. Once we have computed the equivalent variables per backwards path we are ready to look for uses of these variables in an unsafe context. We also use the data flow information from each instruction to add/remove variables from the set of locations we consider to hold pointers to free’d memory.

For use-after-free vulnerabilities I considered unsafe use to be any instruction that used the free’d pointer in the computation of a memory address. (I also eventually extended this definition to include any instruction that pushed a free’d pointer to the stack.) Checking for such a condition is relatively simple. For each instruction we just iterate over all base registers used in computing the pointer and determine whether any of them is in the set of variables we know to hold free’d pointers.

def checkInsForFreeVarUse(self, ea, dst, src, state):
    for op in (dst, src):
        if isinstance(op, TaintLoc_MEMREF_REGS):
            # Usage e.g. mov [ebp+10], eax; and ebp is a free'd var
            for usedReg in op.regs:
                if usedReg in state['freedVars']:
                    self.b00m(ea, usedReg)

In my code the TaintLoc_MEMREF_REGS class represents an operand that is a memory location constructed using at least one register. The objects src and dst are the source and destination operands for a given instruction (a single source and destination suffices for the instructions instrumented in this case).

That is basically the bones of an algorithm that can operate within the bounds of any function containing a call to free; this is useful in its own right but as I thought about it I began to consider cases where a function may free one of its arguments. This could be a problem in two ways: 1) the programmer using the API call is unaware of this functionality, or forgets, or 2) the function should not free the argument, and the fact that it does is a flaw in its implementation. The latter case is more interesting in my opinion but the same approach will detect both.

Function aliases and interprocedural analysis

There is probably some other name for what I’m about to describe but let’s call it a ‘function alias’ for now. True interprocedural analysis of a function f containing a call to free would involve relatively complex analysis of the relationship between non-local variables in f and the variables defined at the call sites of f. Doing this properly is non-trivial but we can perform a more limited analysis with much less effort.

The type of interprocedural analysis I implemented begins with searching for aliases to the API call that we are using as the starting point for our algorithms. For use-after-free bugs this is the free function. Let us consider this trigger function as a tuple (f, n), where f is the function in question and n is the argument we consider to be free’d or closed or whatever the action we are interested in happens to be. An alias for a function (f, n) is then a function (g, m) such that calling the function g with the variable x as argument m results in a call to f with x as argument n. If we are provided with one or more aliases for (f, n) we can then apply the exact same analysis algorithm as described in the previous sections to find interprocedural bugs.

The question then becomes: how do we build such a list? The answer is pretty simple at a high level – say we are in a function g and analysing a call to the function f that free’s its nth argument. Our first algorithm will build the set of variables equivalent to the nth argument to f, let’s call it E. If any of g’s arguments are in this set then we know (g, m) is an alias for (f, n) where m is the index of that variable. The only tricky part of this is tracking the function arguments to g. IDA takes care of this for us and renames memory locations automatically using an identifier similar to arg_X but if using another framework you may have to do that yourself. (This turns out to be trickier than you might think; a description of how the simplex method is used in IDA can be found here.)
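
A sketch of that check, assuming computeEquivVars() implements the algorithm from the earlier section and that arguments follow IDA’s arg_X naming with 4-byte stack slots (all names here are illustrative):

def findAliases(self, g, callToF_ea, n):
    aliases = []
    # E: variables in g equivalent to the n-th argument of f at this call site
    E = self.computeEquivVars(callToF_ea, n)
    for var in E:
        if var.startswith('arg_'):
            m = int(var[len('arg_'):], 16) // 4  # stack offset -> argument index
            aliases.append((g, m))               # (g, m) is an alias for (f, n)
    return aliases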

Righty, so at this point we can take any function and compute its aliases. For maximum fun we apply this recursively to find aliases for aliases and so on, applying our basic analysis algorithms to each discovered alias in turn.

That basically sums up the details of my approach. I used it to look for use-after-free bugs but with varying amounts of effort it could be changed to look for all sorts of fun stuff.

Testing

I tested my implementation (~850 lines of Python using IDAPython as a framework) on libxml2, which has a history of use-after-free bugs, Clam Antivirus, which has a history of sucking, and TightVNC, because the name makes me giggle.

The following table summarises the results of my rather brief and unscientific testing.

[Table: Analysis results]

Both libxml2 and Clam AV came up good for a few bugs each and multiple aliases for free. Some of these aliases were obviously on purpose with names like cl_engine_free, whereas others were more difficult to determine. All three results found in Clam AV were as a result of analysis on an alias function so it is possible that this function has unintended side effects. For libxml2 the bugs were split, two resulting from direct calls to free and two resulting from a call to an alias function. Despite its giggle-inducing name, TightVNC showed no bugs and, disappointingly, not even an alias to free. For both libxml2 and Clam AV the false positives were as a result of paths with unsatisfiable conditions.

Conclusion

In the above table I’ve purposely used the term ‘bugs’ instead of ‘vulnerabilities’. The reason for this is the same reason that I’ve now moved on to investigating hybrid dynamic/static implementations. Finding a bug using static analysis of the type I have described gets you a location in a binary where some broken coding idiom occurs. This is useful in its own right and all valid bugs detected during my testing are concrete paths leading to use of a dangling pointer.

From an offensive point of view we must consider the issue further though. Without more analysis you have no idea how to trigger this code or even if it is triggerable in a malicious fashion. In fact, as soon as you get over the initial rush of ‘oh look, a shiny new bug’ you quickly realise you have a mountain of work to actually exercise that code, let alone craft an exploit. From a defensive point of view none of this really matters; the code is incorrect regardless and should be fixed. But from an exploit writer’s point of view discovering the bug in this fashion merely puts your shoes on. You still have a marathon to run in order to get to an exploit.

Game Over! Thank you for playing Academia

I’ve recently finished my MSc dissertation, titled “Automatic Generation of Control Flow Hijacking Exploits for Software Vulnerabilities”. A PDF copy of it is available here should you feel the need to trawl through 110 or so pages of prose, algorithms, diagrams and general ramblings. The abstract is the following:

Software bugs that result in memory corruption are a common and dangerous feature of systems developed in certain programming languages. Such bugs are security vulnerabilities if they can be leveraged by an attacker to trigger the execution of malicious code. Determining if such a possibility exists is a time consuming process and requires technical expertise in a number of areas. Often the only way to be sure that a bug is in fact exploitable by an attacker is to build a complete exploit. It is this process that we seek to automate. We present a novel algorithm that integrates data-flow analysis and a decision procedure with the aim of automatically building exploits. The exploits we generate are constructed to hijack the control flow of an application and redirect it to malicious code.

Our algorithm is designed to build exploits for three common classes of security vulnerability; stack-based buffer overflows that corrupt a stored instruction pointer, buffer overflows that corrupt a function pointer, and buffer overflows that corrupt the destination address used by instructions that write to memory. For these vulnerability classes we present a system capable of generating functional exploits in the presence of complex arithmetic modification of inputs and arbitrary constraints. Exploits are generated using dynamic data-flow analysis in combination with a decision procedure. To the best of our knowledge the resulting implementation is the first to demonstrate exploit generation using such techniques. We illustrate its effectiveness on a number of benchmarks including a vulnerability in a large, real-world server application.

The implementation of the described system is approx. 7000 lines of C++. I probably won’t be releasing the code as I’m fairly sure I signed over my soul (and anything I might create) to the University earlier in the year. The two core components are a data-flow/taint analysis library and a higher-level library that uses the previous API to perform data-flow/taint analysis over x86 instructions (as given to us by Pin). Both of these components are useful in their own right so I think I’m going to do a full rewrite (with added GUI + DB) and open source the code in the next couple of months. Hopefully they’ll prove useful for others working on dynamic analysis problems.

Exploit generation, a specialisation of testing?

It sounds like a silly question, doesn’t it? Nobody would consider exploit development to be a special case of vulnerability detection. That said, all research on exploit generation that relies on program analysis/verification theory (From now on assume these are the projects I’m discussing. Other approaches exist based on pattern matching over program memory but they are riddled with their own problems.) has essentially ridden on the coat-tails of research and tools developed for test-case generation. The almost standard approach to test-case generation consists of data flow analysis in combination with some sort of decision procedure. We then generate formulae over the paths executed to create inputs that exercise new paths. This is also the exact approach taken by all exploit generation projects.

There are pros and cons to this relationship. For instance, some activities are crucial to both test-case generation and exploit generation, e.g., data flow and taint analysis. Algorithms for these activities are almost standardised at this stage and when we work on exploit generation we can basically lift code from test generation projects. Even for these activities though there are sufficient differences and opportunities presented by exploit generation that it is worth doing some re-engineering. For example, during my research I extended the taint analysis to reflect the complexity of the instructions involved in tainting a location. When building a formula to constrain a buffer to shellcode we can then use this information to pick the locations that result in the least complex formulae. An exploit only needs a single successful formula (usually) so we can pick and choose the locations we want to use; testing, on the other hand, typically requires exhaustive generation and thus this optimisation hasn’t been previously applied because the benefits are less evident (but it still might be a decent way of increasing the number of test cases generated in a set time frame).
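
As a sketch of that optimisation (the scoring scheme here is illustrative rather than what my implementation actually records): each candidate buffer byte carries a rough complexity score, say the number of arithmetic operations between it and the raw input, and we place shellcode where the total is lowest.

def pick_shellcode_buffer(candidates, shellcode_len):
    # candidates: hypothetical objects with a size and per-byte taint records
    viable = [c for c in candidates if c.size >= shellcode_len]
    return min(viable, key=lambda c: sum(b.complexity for b in c.bytes))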

The two problems share other similarities as well. In both cases we find ourselves often dissatisfied with the results of single path analysis. When generating an exploit the initial path we analyse might not be exploitable but there may be another path to the same vulnerability point that is. Again in this case we can look to test case generation research for answers. It is a common problem to want to focus on testing different sub-paths to a given point in a program and so there are algorithms that use cut points and iterative back-tracking to find relevant paths. So with such research available one might begin to think that exploit generation is a problem that will be inadvertently ‘solved’ as we get better at test case generation.

Wrong.

With test case generation all test cases are essentially direct derivatives from the analysis of a previous test case. We build a formula that describes a run of the program, negate a few constraints or add on some new ones, and generate a new input. Continue until boredom (or some slightly more scientific measure). What I am getting at is that all the required information for the next test is contained within the path executed by a previous test. Now consider an overflow on Windows where we can corrupt the most significant byte of a function pointer that is eventually used. If you decide to go down the ‘heap spray’ route to exploit this vulnerability you immediately hit a crucial divergence from test case generation. In order to successfully manipulate the structure of a program’s heap(s) we will almost always require information that is not contained in the path executed to trigger the vulnerability initially. Discovering heap manipulation primitives is a problem that requires an entirely different approach to the test case generation approach of data flow analysis + decision procedure over a single path. It is also not a problem that will likely ever be solved by test case generation research as it really isn’t an issue in that domain. Whole classes of vulnerabilities relating to memory initialisation present similar difficulties.
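
For reference, the standard loop I am contrasting against looks roughly like this sketch, where trace() stands in for whatever dynamic analysis collects the path condition and solve() for any decision procedure (both are placeholders):

def generate_tests(program, seed):
    inputs, seen = [seed], set()
    while inputs:
        pc = trace(program, inputs.pop())        # branch conditions of this run
        for i, branch in enumerate(pc):
            flipped = pc[:i] + [negate(branch)]  # force the other direction
            model = solve(flipped)               # a satisfying input, if any
            if model and model not in seen:
                seen.add(model)
                inputs.append(model)             # a test exercising a new path
    return seen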

What about vulnerability classes that fit slightly better into the mould carved out by test generation research? One of the classes I considered during my thesis was write-4-bytes-anywhere style vulnerabilities. Presuming we have a list of targets to overwrite in such cases (e.g. the .dtors address) this is a solvable problem. But what if we only control the least significant byte (or word) and can’t modify the address to equal one of the standard targets? Manually one would usually see what interesting variables fall within the controllable range, looking for those that will be at a static offset from the pointer base. But what is an ‘interesting variable’? Let’s assume there are function pointers within that range. How do we automatically detect them? Well we’d need to monitor the usage of all byte sequences within the range we can corrupt. It’s a problem we can approach using data flow/taint analysis but once you start to consider that solution it starts to look a lot like a multi-path analysis problem but over a single path. We are no longer considering just data that is definitely tainted by user input; we are considering data that might be, and as we can only control a single write we have different ‘paths’ depending on what bytes we choose to modify… and we’re doing this analysis over a single concrete path? Fun.

I guess the core issue is that test-case generation and exploit generation are close enough that we can get adequate results by applying the algorithms developed for the former to the latter. To get consistently good results though we need to consider the quirks and edge cases presented by exploit generation as a separate problem. Obviously there are many useful algorithms from test case generation research that can be applied to exploit generation but to apply these blindly misses opportunities for optimisations and improvements (e.g. the formula complexity issue mentioned). On the other hand there are problems that will likely never be considered by individuals working on test case generation; these problems will require focused attention in their own right if we are to begin to generate exploits for certain vulnerability classes.

Automatic exploit generation: Lessons learned so far

Here are a couple of thoughts that are bouncing around in my head as I come to the concluding stages of my v1 prototype. I’ve made a number of design decisions in the initial implementation that I now see issues with, but hopefully by documenting them v2 will be better!

Using in-process analysis tools might not be such a good idea: Early on I decided to use dynamic analysis to gather information about taint data propagation and path conditions. Since previous work (such as catchconv) used dynamic binary instrumentation frameworks like Valgrind, I pretty much immediately decided I would do the same. After writing a couple of basic apps for Valgrind, Pin and DynamoRio, I eventually settled on Pin due to its cross-platform support and C++ codebase. One critical factor I ignored is that these tools really aren’t designed with malicious code in mind. So, when you do things like trash a stored instruction pointer it can really confuse the DBI tool.

Other problems can occur if the vulnerability ends up writing over several hundred megabytes of junk in the application address space. This can lead to difficult to debug problems, where the memory in use by the injected analysis client is being corrupted, as well as that of the application under test.

More basic, but just as time-consuming, problems stem from the fact that these in-process analysis clients are rather difficult to debug once something goes wrong. The three frameworks mentioned vary in their support for debugging and error logging, but in general it is exceedingly annoying to debug anything non-trivial. Simple segfaults have eaten hours of my time and often you are left resorting to printf-based ‘debugging’.

The final issue I’ve come across is obvious, but should still be mentioned. Complex runtime instrumentation, such as dataflow analysis, really affects the responsiveness and runtime of the application. My current code, which is admittedly incredibly unoptimised, can increase the runtime of ls from milliseconds to about 20 seconds. This isn’t much of an issue if you don’t need to interact with the application to trigger the vulnerability, but in a case where some buttons need to be clicked or commands entered, it can become a significant inconvenience.

Assisted may be better than automated: The idea of this project is to investigate what vulnerability classes are automatically exploitable, and to develop a prototype that can show the results. I’ve achieved this goal for a sufficient variety of vulnerabilities and shown that automation is in fact possible.

There is a but here though; to continue this project would require constant attention and coding to replace the human effort in exploit generation each time a new class of vulnerability or a change in exploit technique comes along. As I’ve moved from basic stack overflows to considering more complicated scenarios, the differences in exploit types become more time-consuming to encode. Because of this, I intend (when I implement v2 of the tool in the coming months) to move away from complete automation. By putting the effort into providing a decent user interface, it will be possible to inform an exploit writer of the results of data flow and constraint analysis and have them make an educated judgement on the type of exploit to attempt, and specify some parameters. Working from this point of view should also make the entire tool much less effort to port between operating systems.

Information on memory management is very important: This is an obvious point for anyone that has had to write heap exploits in the last 5 years or so. It is near impossible to automatically generate a Linux heap exploit without having some information on the relationship between user input and the structure of the process’s heap. When manually writing an exploit we will often want to force the program to allocate large amounts of memory, and the usual way to do this involves jumping into the code/disassembly and poking around for a while until you find a memory allocation dependent on the size of some user supplied field, or a loop doing memory allocation with a user-influenceable bound.

Essentially, to have a solution that takes care of both scenarios we need a way to infer relationships between counters and program input. The first paper to discuss how to do this using symbolic execution was published in March of this year, and is a good read for anyone considering implementing this kind of tool.

As I hadn’t added loop detection, or other required functionality described in that paper, my current tool is unable to do the analysis described. I consider this a rather annoying drawback, and it will be among my highest priorities for v2. Hacky solutions are possible, such as modelling the result of strlen type functions on user input, but this would miss a number of scenarios and is in general quite an ugly approach.

Extending to new vulnerability classes

One of the things I really want to support is automatic generation of exploits for modern heap based vulnerabilities. At the moment I have other features to implement so this has gotten pushed back for the time being. In the meantime, I wanted to see how hard it was to extend my current functionality from stack overflows that trash the EIP to other vulnerability/exploit combinations. So, yesterday I went and found an example from a wargame I used to play. It contains a pretty silly vulnerability, as follows:

(The aim here was to see how hard it was to extend my current code, not to see how complicated I could make a test case. The array pointed to by input is controllable by the user and 256 bytes in size. The function some_func is some benign function that just exits the program.)

void func_ptr_smash(char *input)
{
    void (*func_ptr)(int);
    char buffer[248];
    
    func_ptr = &some_func;
    strcpy(buffer, input);

    (*func_ptr) (0);
}

Extending the tool turned out to be pretty simple. In a stack smash that overwrites the stored EIP we attempt to generate a query that expresses the constraint memLocation == trampolineAddr, where memLocation is the value stored at the address in ESP at the ret instruction, and trampolineAddr is the address of some usable trampoline.

Modifying this to handle a function pointer overwrite can be done in a couple of ways, depending on what parts of the address space are randomised and how generic we want the solution to be. The most straightforward solution is simply to treat a function pointer overwrite like a slight twist on the previous situation. Essentially, instead of a ret instruction popping a tainted value into the EIP, which we can then redirect to a trampoline, we have a call instruction where the argument is tainted and can be used to redirect to a trampoline. So, instead of generating a constraint on the value of ESP we have to express the constraint on whatever register/memory location the call instruction uses as an argument.

Another potential approach is usable if some non-randomised data locations exist that we can use as a shellcode buffer. Once again the same data flow analysis can be used to find the location of a suitable home for shellcode in these areas. In this case we avoid the requirement for a register to contain the address of one of these buffers and can just jump right into it by generating a constraint that specifies the operand to the call is equal to the static memory address of our shellcode.
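
A sketch of how either constraint might be emitted, assuming the call operand has been traced back to four tainted input bytes b0..b3 (little-endian; the names and example addresses are illustrative):

def call_target_constraint(byte_vars, target_addr):
    # constrain the dword formed by the operand bytes to equal target_addr
    word = '(concat %s)' % ' '.join(reversed(byte_vars))  # b3 b2 b1 b0
    return '(= %s bv%d[32])' % (word, target_addr)

# first method: point the call at a trampoline such as jmp %esp
# call_target_constraint(['b0', 'b1', 'b2', 'b3'], 0x08048439)
# second method: point it straight at a non-randomised shellcode buffer
# call_target_constraint(['b0', 'b1', 'b2', 'b3'], 0x0804a008)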

Here is an example run of the tool in which we use the latter method. I tested it on Ubuntu 8.04 which has a non-randomised heap by default and thus we can hardcode the address of shellcode buffers on the heap. The vulnerable program was compiled with -O0 -fno-stack-protector -D_FORTIFY_SOURCE=0, otherwise gcc would have repositioned the overflow buffer and optimised the call so that the address is calculated at compile time.

In conclusion, my current approach was easily extendible in this case due to the similarity between a function pointer overwrite and smashing the stored EIP. Both cases essentially only differ at the point where the tainted memory location or register is moved to the EIP. They are detectable in exactly the same way and have the same symptom; namely, the attempted movement of tainted data into the EIP register. I would hypothesise that any vulnerability* with the same symptom can be dealt with in a similar way. So then we must ask: what other symptoms are there? Well, what can cause a program to crash? We have seen the case where the program attempts to execute data at an unmapped memory location, so that leaves invalid reads and writes.

Exploiting vulnerabilities detected via an invalid read/write

Old glibc heap exploits are a simple example of those that would be detectable as a result of an invalid write. I won’t go into the details of the method but the unlink macro essentially has the following effect, where next->fd and next->bk are under our control:

 *( next->fd + 12 ) = next->bk
 *( next->bk + 8 ) = next->fd

In this case the vulnerability will probably be discovered when the application tries the first write. This is a crucial difference between the earlier vulnerabilities and this one when considering how to automatically generate an exploit. Detecting the potential for an exploit is still simple – when an instruction writes memory we can simply check the destination operand to see if it is tainted**. The problem now becomes ‘how do we automatically determine a useful location to write?’. Assuming all data sources are randomised our options are limited somewhat. Depending on the protections in place the DTORS/GOT might be usable. Other than that (and depending on what the value we are corrupting actually points to) we could potentially just change the lower bytes of the location being written and attempt to modify some useful program variable on the heap***. This has the disadvantage of requiring some sort of input from the user, as determining what is a useful variable would seem to be mostly application specific.

If the location being written is on the stack we could potentially modify the lower bytes to change a stored EIP/EBP and then proceed in the same fashion as before. To automate this we could note the value of the ESP when a call instruction occurs and calculate the difference between this value and the location of the variable being written.
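
As a sketch, assuming the trace records the value of ESP immediately after each call instruction (at which point it holds the address of the saved return address); the names are illustrative:

def stored_eip_delta(esp_after_call, write_target):
    # distance from the location we can corrupt to the stored EIP; adding
    # this to the corrupted pointer's lower bytes redirects the write
    return esp_after_call - write_target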

To sum these options up we have two potential output types. The first is a direct exploit; it writes the DTORS/GOT or changes the lower bytes of a stack variable to modify a stored EIP, and may or may not be possible depending on the protections in place and the location being written. The second is a new program input that may lead to another crash at a different location. It is basically another fuzz input except with potentially some program variables corrupted. In this case, the result is that the program can continue execution and we hope that it hits another exploitable crash further on****.

For a read from an invalid address we also have a couple of options. If the variable being read into is later used for something such as memory allocation or as a bound on a loop/function involved in moving memory we might attempt to control this value. To do this automatically would require some form of static analysis to determine if the variable was ever used in such a context (or perhaps multiple runs of dynamic analysis). Once again, as in the write case, we also have the option to just manipulate the address being read so that it is any valid address and hope that another vulnerability is later triggered. Unlike the write case this is pretty trivial though as we can just read from any address in the .text section.

* Obviously this doesn’t apply to certain vulnerability classes. It applies to most I can think of that we will exploit by somehow changing the execution pointer. Not trust problems, many race conditions, other design flaws etc.
** Some slightly more complicated analysis might be required here as sometimes a destination operand in a write can be legitimately tainted but restricted to the bounds of some safe chunk of data
*** We could also change the metadata of some other chunks on the heap but right now (3am in the morning) I can’t think of an obvious way to leverage this for code execution
**** In some cases the address written might be constrained in such a way that a heap spray is required to ensure it is mapped. Same idea applies for reads. Another potential problem is that all writable data stores might be randomised. In this case heap spraying could again be useful.

Gathering constraints from conditional branches

Dataflow analysis without considering the effects of conditional branching is of limited use when trying to generate an exploit or find a vulnerability. For lightweight analysis the inaccuracies introduced may be acceptable, but when creating an input that should drive the program to a specific point we must consider all conditional branches affected by user input and represent the inherent restrictions in the path condition.

Consider the example:

cmp eax, ebx
jg VULNERABLE_FUNCTION

In this case, if either eax or ebx are tainted then our choice of input can influence the condition. We would therefore like to have this condition represented in the formula we will generate to represent the path condition. We can achieve this by separating instructions into two categories: the set A, those that write the EFLAGS, and the set B, those that read the EFLAGS. Associated with every entry in A and B is the list of EFLAGS it writes or reads respectively. We represent the EFLAGS register using a vector E = <e_0, e_1, ..., e_31>. Each entry in this vector stores a list containing the operands of the last instruction to write this particular eflag. If the last instruction to write the eflag was not tainted by user input then the operand list will be empty.

(In practice each entry in the vector E is an object containing 1 to N vectors, where N is the number of operands of the instruction that wrote e. Each operand vector is of size 1 to DWORD_SIZE, with each entry representing the taint information for a single byte of that operand. The cost of instrumenting at the byte level is slower runtimes due to the extra overhead, but it means we gain accuracy and expressiveness in our formulae)

Using this infrastructure we can then begin to record the operands involved in instructions that write the EFLAGS register at runtime. Our instrumentation routine for every instruction ins looks something like this:

if ins in A:
    operandList = ins.operandList
    if operandList.tainted:
        for eflag in ins.eflagsWritten:
            eflag.operandList = operandList
    else:
        for eflag in ins.eflagsWritten:
            eflag.operandList = []

Once this algorithm is in place we can then process instructions from set B, those that read the EFLAGS register. For now I will just consider conditional jumps and exclude other instructions like cmpxchg, adc etc. The majority of eflag write/read combinations can be easily expressed by simply taking any of the eflags read by a given instruction, extracting the operand list (if there is one), and directly encoding the semantics of the condition on the operands. e.g. if the instruction is cmp x, y and the condition is jg, and the jump is taken, then we can express this as (> x y). To determine whether the jump has been taken or not, and hence whether the condition should be negated, we take an approach similar to that of Catchconv/Smartfuzz. That is, on the taken branch we simply insert a callback to our analysis routine with the condition and on the fallthrough path we insert a callback with the negated condition. Once this callback is triggered we basically have all the components required to store the condition for later processing – the operands involved, the conditional jump and whether the condition evaluated to true or false.
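
The encoding step might look like the following sketch, using the signed bit-vector operators rather than the (> x y) shorthand above, and covering only a few of the conditional jumps (a real implementation would also handle the unsigned variants; operands are reduced to strings for brevity):

JCC_TO_SMT = {'jg': 'bvsgt', 'jge': 'bvsge', 'jl': 'bvslt',
              'jle': 'bvsle', 'je': '=', 'jz': '='}

def encode_condition(jcc, x, y, taken):
    # cmp x, y followed by a conditional jump; negate on the fallthrough path
    cond = '(%s %s %s)' % (JCC_TO_SMT[jcc], x, y)
    return cond if taken else '(not %s)' % cond

# encode_condition('jg', 'eax', 'ebx', True)  ->  '(bvsgt eax ebx)'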

When we decide to generate an input it becomes necessary to convert these conditions to an SMT-LIB compliant formula. We first filter out the relevant conditions. Basically we only need to include information on conditions if our changes to the original input might result in a different outcome. This is relatively simple to determine. For every condition we can extract all TaintByte objects that represent the taint information and dataflow history of a given byte. We can trace this back to a specific byte from user input and if we then decide to change this byte, the conditions it occurs in will be included in the formula. SMT-LIB format has support for concatenation of bit-vectors so we generate our formula by concatenating the information on single bytes into words and double words where necessary. For example, if some of our input was treated as a 4-byte integer and compared to 0 we would end up with conditions like (not (= (concat n1 n2 n3 n4) bv0[32])), which expresses the concatenation of 4 bytes (which would earlier be declared as being of size 8) and the condition that these 4 bytes should not equal 0 when concatenated together. SMT-LIB bitvector arithmetic is modulo-X, where X is the size of the bit-vectors involved, and so it accurately models integer overflows/underflows.

(I mentioned that this approach works for most instructions. One case where some hackery is required is the test instruction. Typically something like test eax, eax is followed by a jz/je instruction. Obviously the intention here is that we are checking if eax is 0, not checking if eax == eax. This situation is easily detected by simply comparing the operands that set the zero flag on a jz/je for equality)
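
That detection might be sketched as follows, with operands again reduced to strings:

def encode_zero_flag_jump(op1, op2, taken):
    if op1 == op2:                        # test eax, eax ; jz ...
        cond = '(= %s bv0[32])' % op1     # really a check against zero
    else:                                 # cmp x, y ; jz ...
        cond = '(= %s %s)' % (op1, op2)
    return cond if taken else '(not %s)' % cond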

To demonstrate how these conditions can be gathered and used I have put together an annotated run of my prototype against a simple vulnerable program. It runs user input through a filter to ensure it is alphanumeric and if so a strcpy vulnerability is triggered. It can be found here. To get an idea of the kind of SMT constraints this generated I’ve also uploaded the entire formula here. The constraints that exist specifying every byte cannot be 0 are gathered from strcpy.

Morphing shellcode using CFGs and SAT

A friend of mine is working on a project that involves building a metamorphic engine for x86 assembly. One of the core parts of such a project is a context-free grammar describing valid mutations for a given instruction. To use one of his examples, an instruction to put v in r can be modelled in a BNF style grammar as follows:

put_v_in_r(v1,r1) := pushl_v(v1) popl_r(r1) | movl_v_r(v1,r1)
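
In code, such a rule might be represented as simply as the following (an illustrative encoding, not necessarily the one his engine uses):

GRAMMAR = {
    'put_v_in_r(v1,r1)': [
        ['pushl_v(v1)', 'popl_r(r1)'],  # push v ; pop r
        ['movl_v_r(v1,r1)'],            # mov v, r
    ],
}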

This could be quite useful in the process of automatic exploit generation. I expect the user to provide shellcode expressing what they want the exploit to do, but what if that shellcode is constrained by conditionals in the program? e.g. the shellcode is \x41\x42\x43\x44 but there is a conditional in the program that causes it to abort if the byte \x41 is detected in the input stream.

One solution in this situation is to simply ask the user for different shellcode, but in the presence of complex modifications and conditions it quickly becomes impossible to provide any kind of meaningful feedback on what they should do to pass the filters. How about an automated solution? We have the conditions in propositional form and, due to the other project mentioned, we also have access to a grammar of potential shellcode mutations. If we can combine the two into a new propositional formula, describing the effect of mutating the shellcode using this grammar and the conditions from the program, we might be able to use the power of SMT/SAT solvers to brute force a solution! If a solution is found then we can preserve the effects of the original shellcode, as well as passing the filters.

The main difficulty in implementing this solution comes from encoding the grammar in propositional form. The typical solution to encoding a transition relation as a propositional formula involves iteratively applying the transition relation to existing states until a fixed point is reached. That is, starting with our original shellcode we would expand the first non-terminal to one or more potential shellcodes. We then expand these in the same fashion, and so on until no further new states are added; we then have the set of all potential shellcodes given the starting shellcode and the provided grammar. Obviously, there can be problems storing the states if the number grows exponentially on each expansion, but this is a common verification problem and there are a number of approaches to resolve it, varying from “hope it doesn’t happen” to using a compact storage mechanism like binary decision diagrams. We can then extract the states that contain no non-terminals, i.e. they are a valid x86 instruction stream, conjoin this with the path condition and ask our SAT/SMT solver to enumerate the shellcodes that satisfy these conditions.

There is a bit of a snag with this fixed point approach though. What if our grammar contains an expansion A ->* aB, which states that A expands to some terminal a and some non-terminal B in one or more steps, and also the rule B ->* A? In this case we are no longer guaranteed to reach a fixed point, as we can enter an infinite loop. The obvious solution is to transform our grammar into one that doesn’t contain such rules. We can do this by setting an upper bound, k, on the number of such recursive iterations allowed and including these expansions as new non-terminals. This new grammar has lost some of the potential outcomes of the original but it can now be processed using our fixed point approach. It should be noted that this upper bound on the number of self-recursions isn’t quite as artificial or as limiting as it may seem. There will always be some upper bound on the size of the buffer used for shellcode and thus it may be practical, in some cases, to use quite a low value of k.

(In fact, for a real world implementation it is well worth considering how to involve the length of the buffer we intend to use as we perform the fixed point computation. There is no reliable way to favour shorter shellcodes that also satisfy the program constraints, but one optimisation would be to exclude any shellcodes that have already exceeded the length limit from future expansions.)
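
Putting the pieces together, the bounded expansion might look like this sketch, with a cap on sentence length standing in for the recursion bound k (GRAMMAR as in the earlier sketch; a sentence is a tuple of symbols and the non-terminals are exactly the grammar’s keys):

def expand_all(start, grammar, max_len):
    states, frontier = set(), {start}
    while frontier:
        states |= frontier
        nxt = set()
        for sent in frontier:
            for i, sym in enumerate(sent):
                for rhs in grammar.get(sym, []):  # expand one non-terminal
                    new = sent[:i] + tuple(rhs) + sent[i+1:]
                    if new not in states and len(new) <= max_len:
                        nxt.add(new)
        frontier = nxt
    # keep only fully-terminal sentences, i.e. valid instruction streams
    return [s for s in states if not any(sym in grammar for sym in s)]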

Fun uses for an SMT solver

An SMT solver (such as Z3, Yices or STP) is a decision procedure that can handle various types of arithmetic and other decidable theories. Such solvers make use of procedures specific to the theories they handle, such as linear arithmetic, in combination with the brute-force power of a SAT solver. At a high level, they proceed by iteratively replacing the sub-expressions in a formula like (x + y < 10 || x > 9) && (y + z < 5) with propositional variables, to give something like (A || B) && (C). At this point we can apply a SAT solver to look for a solution. If none exists then we need not bother with analysing the abstracted expressions, as the formula is unsatisfiable. If the SAT solver finds a satisfiable solution we then restore the original expressions and pass the conjunction of this off to the core decision procedure which can deal with the semantics of the given theory. This process then proceeds in the standard DPLL method of iteration, involving finding UNSAT cores and backtracking. One of the best collections of resources on how they actually work is Leonardo De Moura’s MSR research page.

So, theory aside, how can we use this to entertain ourselves/do useful things? Well, an advantage of an SMT solver over a regular SAT solver is that we can quite easily express the sort of operations that tend to go on during program execution. We can model conditions, arithmetic and arrays. Using a theory that can model operations of this kind we can represent a path through a program (or several different paths in fact). By appending extra constraints we can then use an SMT solver to generate inputs that will take different conditional branches (useful for guiding a fuzzer), ensure memory locations have a specific value (useful for avoiding shellcode filters) and/or model conditions we want to check for, such as integer overflows.

A practical example may help illuminate why you might want to use an SMT solver. Consider the problem of a program containing an exploitable vulnerability where we have located a suitable buffer for shellcode but we now need to know what input to provide such that the desired shellcode is in that buffer when we jump to it. One solution is to use dynamic analysis to gather the entire path condition over user influenced data and then append the condition that the buffer we are interested in contains the shellcode. To do this we will gather all data movement and conditional instructions on tainted data (I won’t discuss how to do this here). At the vulnerability point we can then express this trace as the input to an SMT solver. Most solvers have their own API but they should also all accept an SMT-LIB formatted text file input, which is what I use so as not to tie myself to a single solver. You can read more about the format here, but essentially our input will have 4 sections.

Suppose, for the sake of having a tractable example, that our vulnerable program just takes two bytes of input and moves them once into a new location, and we want to determine what input to provide such that these new locations have the values 0x41 and 0x42. Our specification to the solver would then proceed as follows:

We begin with a pretty standard header that is basically static unless you want to use a different logic. I use quantifier-free bit-vector logic (QF_BV) because it is easier to model mod-2^32 arithmetic and data movement/conditionals at the byte and bit level than it would be with linear arithmetic or another logic.

(benchmark exploitSample
:status unknown
:logic QF_BV

Following this we then specify the name and type of all variables that we intend to use in the formula itself. The format of valid names is discussed in a document linked from the SMT-LIB website, but approximately follows the same rules as C.

Here I declare 4 variables, <i0, i1, n256, n257> (names unimportant), and I declare them to be bit-vectors of size 8, i.e. each models a single byte.

:extrafuns ((n256 BitVec[8])(i0 BitVec[8])(n257 BitVec[8])(i1 BitVec[8]))

Next is the assumptions section. Here we can specify the path condition, i.e. all data movement and conditionals that occurred during the dynamic trace. We could just as easily express these in the next section but for ease of comprehension I use the assumptions section (according to the docs some parsers might also work slightly faster this way).

The format of this section should be familiar with anyone that has dabbled in programming languages that are list orientated. It is pretty much (operator operand1 operand2 ....)

This assumption is basically the expression of the conjunction (n256 := i0) AND (n257 := i1)

:assumption (and (= n256 i0) (= n257 i1))

Finally, we have our formula. This is of the same format as the assumption section, but I usually use it to express the part of the problem I want to solve. e.g. I encode the shellcode here.

(The form for a bitvector constant is bvDECIMAL_VAL[SIZE_OF_VECTOR])

:formula (and (= n256 bv65[8])(= n257 bv66[8]))

Obviously, this is a trivial example and we can quite easily spot the solution, but in a situation where there are 30+ movements of every byte of input, as well as conditionals and arithmetic, it quickly becomes impossible to solve by hand. I’ve used this technique to produce satisfying inputs for formulae over hundreds of input variables in a matter of seconds. As for the actual running, we can concatenate the above sections into a text file, appending a final closing paren to balance the opening (benchmark (probably best to use a .smt extension as some solvers seem to look for it), and invoke our solver to get something like the following:

nnp@test:~/Desktop/yices-1.0.21/bin/$ yices -smt -e < constraints.smt 
sat
(= i0 0b01000001)
(= i1 0b01000010)

This can be parsed back to a C/Python/etc array using a relatively simple script.
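
For instance, a parser along these lines would do for yices-style output (a sketch assuming the '(= name 0bBITS)' form shown above):

import re

def parse_model(solver_output):
    model = {}
    for name, bits in re.findall(r'\(= (\w+) 0b([01]+)\)', solver_output):
        model[name] = int(bits, 2)   # e.g. {'i0': 0x41, 'i1': 0x42}
    return model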

This approach is exactly what I’m doing to auto-generate my exploits. In the assumptions, I specify everything I’ve traced during program execution, and in the formula I specify the shellcode and the desired location I want to use it in (determined by the method described in a previous post of mine), as well as the address of a shellcode trampoline. By also including the information on what the original user input was, the SMT solver can produce an output with the exploit information at the desired indexes and valid input at all others. The finished product is currently something that looks like this:

exploitArray = ['\x99\x31\xc0\x52\x68\x6e\x2f\x73\x68\x68\x2f\x2f\x62\x69\x89\xe3'
                '\x52\x53\x89\xe1\xb0\x0b\xcd\x80\x58\x58\x58\x58\x58\x58\x58\x58'
                '\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42'
                '\x42\x42\x42\x42\x42'
                ...
                '\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x39\x84\x04\x08\x42\x42\x42\x42'
                ...
                ']

oFile = open('pyExploit.in', 'w')
oFile.write(''.join(exploitArray))
oFile.close()

It was produced from an input file containing only ‘B’s, the required shellcode and a set of potential trampoline addresses. The tool has determined that at the vulnerability point the ESP register points to a buffer containing data from the start of the user input, and thus has filled in the start of the input with the shellcode. It has also found the correct bytes that overwrite the EIP and replaced them with the address of a jmp %esp (0x08048439).

Pin problem solved!

After some discussion on the Pin mailing list, I came to the conclusion that the best approach to my previous issue with Pin is simply to check the integrity of all values that get put in the EIP register from a potentially tainted source. This is arguably what I should have been doing anyway to detect the potential for an exploit. This means instrumenting ret, call, loop, jmp and its conditional variants, and checking the integrity of the value being used. In this context, this means checking that the value isn’t tainted by user input, and it is a fairly trivial task. For example, this code is called from my instruction instrumentation routine, registered with Pin:

VOID
processRET(INS ins)
{
    INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(checkRETIntegrity),
            IARG_REG_VALUE, LEVEL_BASE::REG_ESP,
            IARG_CONTEXT,
            IARG_INST_PTR,
            IARG_END);
}

Which results in the following function being called before any ret instructions are executed, and passed the value currently stored in the ESP register:

(tmgr is a custom object that maintains taint propagation information)

VOID
checkRETIntegrity(ADDRINT espVal, CONTEXT *ctx, ADDRINT pc)
{
    if (tmgr.isMemLocTainted(espVal, DWORD_SIZE)) {
        postRunAnalysis(true, ctx, TAINTED_RET);
        exit(0);
    }
}

Using this same technique I can also check for corruption of the stored frame pointer. I should also extend these checks to reads from and writes to memory. The problem gets a little more complex here as it is quite common for user input to taint the offset of a read/write. The only practical approach that springs to mind is to check that the location being read/written is mapped, as attempting to detect if the tainted index is within a certain bound involves solving a whole other array of associated problems.

The romance is over…

..with Pin that is. Pin has been my DBI framework of choice for all of 3 weeks now and I haven’t had a single problem, until today. It would appear that smashing the stored EIP causes all sorts of problems for the Pin analysis code. What appears to happen is that at the return from the function, Pin assumes the stored EIP points to a valid instruction and passes it to its analysis engine. If the smashed EIP points to the heap, for example, Pin will then start to disassemble the data located there in an attempt to find the end of the basic block. It then keeps progressing through memory until it is eventually killed by the operating system (presumably for trying to disassemble an unreadable address).

The following code illustrates the issue:

#include <stdlib.h>

void smashySmashy(char *heapArr)
{
    asm("movl %0, 4(%%ebp)\n\t" 
        :   
        : "D"(heapArr));
}

int main(int argc, char *argv[])
{
    char *heapArr = NULL; 
    heapArr = malloc(256*sizeof(char));
    
    smashySmashy(heapArr);

    return 0;
}

Running the above code with any of the Pin instrumentation tools will result in the Pin process eventually aborting.

This is quite the showstopper for me, in terms of my usage of Pin, as functions like this are exactly the kind I need to analyse for exploit generation. I’ll update this post when I get a response from the developers.