Extending to new vulnerability classes

One of the things I really want to support is automatic generation of exploits for modern heap based vulnerabilities. At the moment I have other features to implement so this has gotten pushed back for the time being. In the meantime, I wanted to see how hard it was to extend my current functionality from stack overflows that trash the EIP to other vulnerability/exploit combinations. So, yesterday I went and found an example from an wargame I used play. It contains a pretty silly vulnerability, as follows:

(The aim here was to see how hard it was to extend my current code, not to see how complicated I could make a test case. The array pointed to by input is controllable by the user and 256 bytes in size. The function some_func is some benign function that just exits the program)

void func_ptr_smash(char *input)
{
    void (*func_ptr)(int);
    char buffer[248];
    
    func_ptr = &some_func;
    strcpy(buffer, input);

    (*func_ptr) (z);
}

Extending the tool turned out to be pretty simple. In a stack smash that overwrites the stored EIP we attempt to generate a query that expresses the constraint memLocation == trampolineAddr, where memLocation is the value in ESP at the ret instruction, and trampolineAddr is the address of some usable trampoline.

Modifying this to handle a function pointer overwrite can be done in a couple of ways, depending on what parts of the address space are randomised and how generic we want the solution to be. The most straight forward solution is simply to treat a function pointer overwrite like a slight twist on the previous situation. Essentially, instead of a ret instruction popping a tainted value into the EIP, which we can then redirect to a trampoline, we have a call instruction where the argument is tainted and can be sused to redirect to a trampoline. So, instead of generating a constraint on the value of ESP we have to express the constraint on whatever register/memory location the call instruction uses as an argument.

Another potential approach is usable if some non-randomised data locations exist that we can use as a shellcode buffer. Once again the same data flow analysis can be used to find the location of a suitable home for shellcode in these areas. In this case we avoid the requirement for a register to contain the address of one of these buffers and can just jump right into it by generating a constraint that specifies the operand to the call is equal to the static memory address of our shellcode.

Here is an example run of the tool in which we use the latter method. I tested it on Ubuntu 8.04 which has a non-randomised heap by default and thus we can hardcode the address of shellcode buffers on the heap. The vulnerable program was compiled with -O0 -fno-stack-protector -D_FORTIFY_SOURCE=0, otherwise gcc would have repositioned the overflow buffer and optimised the call so that the address is calculated at compile time.

In conclusion, my current approach was easily extendible in this case due to the similarity between a function pointer overwrite and smashing the stored EIP. Both cases essentially only differ at the point where the tainted memory location or register is moved to the EIP. They are detectable in exactly the same way and have the same symptom; namely, the attempted movement of tainted data into the EIP register. I would hypothesise that any vulnerability* with the same symptom can be dealt with in a similar way. So then we must ask, what other symtoms are there? Well, what can cause a program to crash? We have seen the case where the program attempts to execute data at an unmapped memory location, so that leaves invalid reads and writes.

Exploiting vulnerabilities detected via an invalid read/write

Old glibc heap exploits are a simple example of those that would be detectable as a result of an invalid write. I won’t go into the details of the method but the unlink macro essentially has the following effect, where next->fd and next->bk are under our control:

 *( next->fd + 12 ) = next->bk
 *( next->bk + 8 ) = next->fd

In this case the vulnerability will probably be discovered when the application tries the first write. This is a crucial difference between the earlier vulnerabilities and this one when considering how to automatically generate an exploit. Detecting the potential for an exploit is still simple – when an instruction writes memory we can simply check the destination operand to see if it is tainted**. The problem now becomes ‘how do we automatically determine a useful location to write?’. Assuming all data sources are randomised our options are limited somewhat. Depending on the protections in place the DTORS/GOT might be usable. Other than that (and depending on what the value we are corrupting actually points to) we could potentially just change the lower bytes of the location being written and attempt to modify some useful program variable on the heap***. This has the disadvantage of requiring some sort of input from the user, as determining what is a useful variable would seem to be mostly application specific.

If the location being written is on the stack we could potentially modify the lower bytes to change a stored EIP/EBP and then proceed in the same fashion as before. To automate this we could note the value of the ESP when a call instruction occurs and calculate the difference between this value and the location of the variable being written.

To sum these options up we have two potential output types. The first is a direct exploit; It writes the DTORS/GOT or changes the lower bytes of a stack variable to modify a stored EIP and may or may not be possible depending on the protections in place and the location being written. The second is a new program input that may lead to another crash at a different location. It is basically another fuzz input except with potentially some program variables corrupted. In this case, the result is that the program can continue execution and we hope that it hits another exploitable crash further on****.

For a read from an invalid address we also have a couple of options. If the variable being read into is later used for something such as memory allocation or as a bound on a loop/function involved in moving memory we might attempt to control this value. To do this automatically would require some form of static analysis to determine if the variable was ever used in such a context (or perhaps multiple runs of dynamic analysis). Once again, as in the write case, we also have the option to just manipulate the destination read in such a way that it is any valid address and hope that another vulnerability is later triggered. Unlike the write case this is pretty trivial though as we can just read from any address in the .text section.

* Obviously this doesn’t apply to certain vulnerability classes. It applies to most I can think of that we will exploit by somehow changing the execution pointer. Not trust problems, many race conditions, other design flaws etc.
** Some slightly more complicated analysis might be required here as sometimes a destination operand in a write can be legitimately tainted but restricted to the bounds of some safe chunk of data
*** We could also change the metadata of some other chunks on the heap but right now (3am in the morning) I can’t think of an obvious way to leverage this for code execution
**** In some cases the address written might be constrained in such a way that a heap spray is required to ensure it is mapped. Same idea applies for reads. Another potential problem is that all writable data stores might be randomised. In this case heap spraying could again be useful.

Morphing shellcode using CFGs and SAT

A friend of mine is working on a project that involves building a metamorphic engine for x86 assembly. One of the core parts of such a project is a context free grammar describing valid mutations for a given instruction. To use one of his examples, an instruction to put v in r can be modelled in a BNF style grammar as follows:

put_v_in_r(v1,r1) := pushl_v(v1) popl_r(r1) | movl_v_r(v1,r1)

This could be quite useful in the process of automatic exploit generation. I expect the user to provide shellcode expressing what they want the exploit to do, but what if that shellcode is constrained by conditionals in the program? e.g. the shellcode is \x41\x42\x43\x44 but there is a conditional in the program that causes it to abort if the byte \x41 is detected in the input stream.

One solution in this situation, is to simply ask the user for different shellcode, but in the presence of complex modifications and conditions it quickly becomes impossible to provide any kind of meaningful feedback on what they should do to pass the filters. How about an automated solution? We have the conditions in propositional form and, due to the other project mentioned, we also have access to a grammar of potential shellcode mutations. If we can combine the two into a new propositional formula, describing the effect of mutating the shellcode using this grammar and the conditions from the program, we might be able to use the power of SMT/SAT solvers to brute force a solution! If a solution is found then we can preserve the effects of the original shellcode, as well as passing the filters.

The main difficulty in implementing this solution comes from encoding the grammar in propositional form. The typical solution to encoding a transition relation as a propositional formula involves iteratively applying the transition relation to existing states until a fixed point is reached. That is, starting with our original shellcode we would expand the first non-terminal to one or more potential shellcodes. We then expand these in the same fashion, and so on until no further new states are added; we then have the set of all potential shellcodes given the starting shellcode and the provided grammar. Obviously, there can be problems storing the states if the number grows exponentially on each expansion, but this is a common verification problem and there are a number of approaches to resolve it, varying from “hope it doesn’t happen” to using a compact storage mechanism like binary decision diagrams. We can then extract the states that contain no non-terminals, i.e. they are a valid x86 instruction stream, conjoin this with the path condition and ask our SAT/SMT solver to enumerate the shellcodes that satisfy these conditions.

There is a bit of a snag with this fixed point approach though. What if our grammar contains an expansion A ->* aB , which states that A expands to, some terminal, a and, some non-terminal, B in one or more steps, and also the rule B ->* A? In this case we are no longer guaranteed to reach a fixed point, as we can enter an infinite loop. The obvious solution is to transform our grammar into one that doesn’t contain such rules. We can do this by setting an upper bound, k, on the number of such recursive iterations allowed and including these expansions as new non-terminals. This new grammar has lost some of the potential outcomes of the original but it can now be processed using our fixed point approach. It should be noted that this upper bound on the number of self-recursions isn’t quite as artificial or as limiting as it may seem. There will always be some upper bound on the size of the buffer used for shellcode and thus it may be practical, in some cases, to use quite a low value of k.

(In fact, for a real world implementation it is well worth considering how to involve the length of the buffer we intend to use as we perform the fixed point computation. There is no reliable way to favour shorter shellcodes that also satisfy the program constraints, but one optimisation would be to exclude any shellcodes that have already exceeded the length limit from future expansions.)

Fun uses for an SMT solver

An SMT solver (such as Z3, Yices or STP) is a decision procedure that can handle various types of arithmetic and other decidable theories. They make use of procedures specific to the theories they handle, such as linear arithmetic, in combination with the brute force power of a SAT solver. At a high level, proceed by iteratively by replacing the sub-expressions in a formula like (x + y < 10 || x > 9) && (y + z < 5) with propositional variables, to give something like (A || B) && (C). At this point we can apply a SAT solver to look for a solution. If none exists then we need not bother with analysing the abstracted expressions, as the formula is unsatisfiable. If the SAT solver finds a satisfiable solution we then restore the original expressions and pass the conjunction of this off to the core decision procedure which can deal with the semantics of the given theory. This process then proceeds in the standard DPLL method of iteration, involving finding UNSAT cores and backtracking. One of the best collection of resources on how they actually work is Leonardo De Moura’s MSR research page.

So, theory aside, how can we use this to entertain ourselves/do useful things? Well, an advantage of an SMT solver over a regular SAT solver is that we can quite easily express the sort of operations that tend to go on during program execution. We can model conditions, arithmetic and arrays. Using a theory that can model operations of this kind we can represent a path through a program (or several different paths in fact). By appending extra constraints we can then use an SMT solver to generate inputs that will take different conditional branches (useful for guiding a fuzzer), ensure memory locations have a specific value (useful for avoiding shellcode filters) and/or model conditions we want to check for, such as integer overflows.

A practical example may help illuminate why you might want to use an SMT solver. Consider the problem of a program containing an exploitable vulnerability where we have located a suitable buffer for shellcode but we now need to know what input to provide such that the desired shellcode is in that buffer when we jump to it. One solution is to use dynamic analysis to gather the entire path condition over user influenced data and then append the condition that the buffer we are interested in contains the shellcode. To do this we will gather all data movement and conditional instructions on tainted data (I won’t discuss how to do this here). At the vulnerability point we can then express this trace as the input to an SMT solver. Most solvers have their own API but they should also all accept a SMT-LIB formatted text file input, which is what I use so as not to tie myself to a single solver. You can read more about the format here, but essentially our input will have 4 sections.

Suppose, for the sake of having a tractable example, that our vulnerable program just takes two bytes of input and moves them once into a new location, and we want to determine what input to provide such that these new locations have the values 0x41 and 0x42. Our specification to the solver would then proceed as follows:

We begin with a pretty standard header that is basically static unless you want to use a different logic. I use quantified bitvector logic because it is easier to model mod32 arithmetic and data movement/conditionals at the byte and bit level than it would be with linear arithmetic or another logic.

(benchmark exploitSample
:status unknown
:logic QF_BV

Following this we then specify the name and type of all variables that we intend to use in the formula itself. The format of valid names is discussed in a document linked from the SMT-LIB website, but approximately follows the same rules as C.

Here I declare 4 variables, <i0, i1, n256, n257> (names unimportant), and I declare them to be bitvectors of site 8 e.g. they model a single byte

:extrafuns ((n256 BitVec[8])(i0 BitVec[8])(n257 BitVec[8])(i1 BitVec[8]))

Next is the assumptions section. Here we can specify the path condition, i.e. all data movement and conditionals that occured during the dynamic trace. We could just as easily express these in the next section but for ease of comprehension I use the assumptions section (according to the docs some parsers might also work slightly faster this way).

The format of this section should be familiar with anyone that has dabbled in programming languages that are list orientated. It is pretty much (operator operand1 operand2 ....)

This assumption is basically the expression of the conjuction (n256 := i0) AND (n257 := i1)

:assumption (and (= n256 i0)(= n257 i1)

Finally, we have our formula. This is of the same format as the assumption section, but I usually use it to express the part of the problem I want to solve. e.g. I encode the shellcode here.

(The form for a bitvector constant is bvDECIMAL_VAL[SIZE_OF_VECTOR])

:formula (and (= n256 bv65[8])(= n257 bv66[8]))

Obviously, this is a trivial example and we can quite easily spot the solution, but in a situation where there are 30+ movements of every byte of input, as well as conditionals and arithmetic, it quickly becomes impossible to solve by hand. I’ve used this technique to produce satisfying inputs for formulae over hundreds of input variables in a matter of seconds. As for the actual running, we can concatenate the above sections into a text file (probably best to use a .smt extension as some solvers seem to look for it) and invoke our solver to get something like the following:

nnp@test:~/Desktop/yices-1.0.21/bin/$ yices -smt -e < constraints.smt 
sat
(= i0 0b01000001)
(= i1 0b01000010)

Which can be parsed back to a C/Python/etc array using a relatively simple script.

This approach is exactly what I’m doing to auto-generate my exploits. In the assumptions, I specify everything I’ve traced during program execution, and in the formula I specify the shellcode and the desired location I want to use it in (determined by the method described in a previous post of mine), as well as the address of a shellcode trampoline. By also including the information on what the original user input was, the SMT solver can produce an output with the exploit information at the desired indexes and valid input at all others. The finished product is currently something that looks like this:

exploitArray = ['\x99\x31\xc0\x52\x68\x6e\x2f\x73\x68\x68\x2f\x2f\x62\x69\x89\
xe3\x52\x53\x89\xe1\xb0\x0b\xcd\x80\x58\x58\x58\x58\x58\x58\x58\x58\x42
\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42
...
42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x39\x84\x04\x08\x42\x42\x42\x42
...
']

oFile = open('pyExploit.in', 'w')
oFile.write(''.join(exploitArray))
oFile.close()

Which was produced from an input file containing only ‘B’s, the required shellcode and a set of potential trampoline addresses. The tool has determined that at the vulnerability point the ESP register points to a buffer containing data from the start of the user input, and thus has filled in the start of the input with the shellcode. It has also found the correct bytes that overwrite the EIP and replaced them with the address of a jmp %esp (0x08048439)

Not all shellcode locations are made equal

When automatically generating an exploit, one of the core tasks is to find a hospitable buffer that we can store our shellcode in so that at the vulnerability point we have somewhere to jump to. Some tools (like Byakugen) do this by scanning processes memory and searching for sequences of bytes from the input (or slight variations thereof, such as the upper/lower case conversion said input). There are a number of problems with this, first of all you can easily miss perfectly usable buffers. If all user input is passed through a function that adds 4 to every byte then this will be missed, but it is still a usable shellcode buffer. The second major problem is that if you do find a suitable buffer this way you have no guarantee that the values you’re changing weren’t used in a path condition on the way to the vulnerability point. Replacing this input with your shellcode could result in the vulnerability not being triggered.

I’m addressing the first problem by using dynamic data flow analysis to track all user input as it moves through registers and process memory, and the second problem by noting all path conditions that operate on such ‘tainted’ data. Already we can see a division between potential shellcode buffers. There are bytes that are used in path conditions, there are those that are not and in both cases there are those that are tainted by simple assignment instructions like mov, versus those that are tainted as a result of arithmetic operations like add.

In the case of a buffer consisting only of bytes that are not used in path conditions and are tainted via simple assignment then we can simply swap out those bytes for our shellcode, but what about the other cases? The approach I am taking is to record all tainting operations on a byte, as well as all path conditions and as a result, at a given program location we can express the potential values of a given byte via an algebraic formula. For example, assume in the following eax is tainted by the 10th byte of user input:

mov ecx, eax
add ecx, 4
mov [1234], ecx

At the end of this block we can represent the constraints on location ‘1234’ by the equation, t1 := in10 & t2 := t1 + 4 & mem1234 := t2. If we then want to discover if there exists an assignment to the 10th input byte that would result in location ‘1234’ being equal to 0x41, we can include this additional constraint in the previous formula to give, t1 := in10 & t2 := t1 + 4 & mem1234 := t2 & mem1234 == 0x41. A SAT/SMT solver will take this and return us an assignment to the literal ‘in10’ that will satisfy the formula e.g. 0x3E.

Now obviously real world constraint formulae will be much more complex than this, they could have hundreds of thousands of literals and conditions. A modern SAT/SMT solver will be able to handle such an input, but it could take a number of hours depending on the complexity. Because of this, it would be an advantage to be able to select those buffers that have the least complex constraint formulae (preferably one no more complex than a sequence of assignments, in which case a solver will not be needed). The basic classification scheme I’m using is represented by the following lattice:

Constraint complexity lattice
Constraint complexity lattice

Formulae are associated with a single point in the lattice, with the more complex formulae towards the top and the less complex towards the bottom. We can classify based on the complexity of arithmetic in the formula, as well as whether it contains a path condition, like a jump conditioned on user input, or not.

When an object is created, representing the constraints on a given byte, it starts at ‘DIRECT COPY’ and its position in the lattice is updated as operations are performed on it. A byte can only move up in the lattice (i.e. there is only a join function, no meet), with the lattice position of a byte resulting from the interaction of two tainted bytes being the join of the those two source bytes. The join function is defined as the greatest upper bound of two inputs. We can then use this lattice to represent the constraint complexity of a single byte, or a sequence of bytes by taking the join of all bytes in the sequence. Currently, I have classified a path constrained direct copy and non-linear arithmetic as being of equivalent complexity but, with the arithmetic support in modern SMT solvers, it might turn out that a formula made up entirely of arithmetic operations, linear or non-linear, is easier to solve than one containing predicates that must be satisfied.

The advantage of the above scheme is that we now have a way to classify potential shellcode buffers by the amount of effort it will be to use them as such. By doing this we can potentially avoid calling a SAT/SMT solver altogether, and in cases where we must we can select the easier formulae to solve.

Difficulties in taint data propagation without an IR

I’m currently building a tool that propagates taint information through a program as it executes. Essentially this consists of marking the data read by certain syscalls (e.g. read and recv) as ‘tainted’, or user controllable, followed by tracking the flow of this data through the program. The ‘propagation’ occurs when tainted data is used as a source in an instruction, thus tainting the result. For example, if the data at address 0x100 is tainted and the instruction mov ebx, [0x100] is executed then ebx is now tainted.

To track all possible taint propagation is a fairly involved task. The mov instruction for instance has 4 potential configurations (memory to memory, register to memory etc.) and many instructions (arithmetic ones for instance), do more than simply copying memory. Considering that most binaries from the /bin directory on Linux have between 70-100 unique instructions, and many of these instructions have unique semantics in terms of how they spread the data from source operands into the destination operands, modelling these effects for each instruction would seem to be a fairly involved task.

It’s also a task I’m having to deal with at the moment due to my choice of dynamic binary analysis (DBI) framework. I am using Pin, a free but closed source DBI. One of Pin’s primary goals is speed and thus it can’t afford something like converting each basic block to an intermediary representation (IR) (like Valgrind does). This means we are left with convenience functions to access operators and operands and are required to perform a case by case analysis on each instruction. Luckily, the functions provided by Pin are quite well thought out. Let us consider how to process an instruction like mov dword ptr ds[esi+edx*4+0x2a0], ecx, where ecx is tainted. First of all, we must register a function with Pin to instrument each instruction. This is about as simple as you’d expect.

INS_AddInstrumentFunction(instruction, 0);

This function will then be passed each instruction and will have to decide the correct way to handle it.

VOID
instruction(INS ins, VOID *v)
{
    UINT32 cat = INS_Category(ins);

    switch (cat) {
        case XED_CATEGORY_DATAXFER:
            switch (INS_Opcode(ins)) {
                case XED_ICLASS_XCHG:
                    processXCHG(ins);
                    break;
                case XED_ICLASS_MOV:
                    processMov(ins);
                    break;
etc etc

As you can see, we can work at two levels of granularity. The instruction categories groups together operators with similar emantics e.g there are seperate categories for string operations like rep movsb, arithmetic operations, interrupts etc. The operators in certain categories all have the same taint semantics and we need not go any deeper, but others contain instructions that must be handled on a case by case basis. DATAXFER for instance, contains xchg, mov and bswap, among others, all of which require special attention.

In our example we will end up in the processMov function. The purpose of this function is basically to figure out which of the 4 mov cases we are considering. Pin provides the INS_IsMemoryRead/Write functions, and others, that allow us to figure this out. Quite usefully, it also provides a mechanism to determine the exact address written/read and the number of bytes written/read at run time. This saves us having to calculate the effective address ourselves. For our example we would use these functions to determine that we are writing to memory, from a register, and then use the following call to insert a run time hook to our taint propagation function:

INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(simMov_RM),
    IARG_MEMORYWRITE_EA,
    IARG_MEMORYWRITE_SIZE,
    IARG_UINT32, INS_RegR(ins, INS_MaxNumRRegs(ins)-1),
    IARG_END);

You can see here the use of IARG_MEMORYWRITE_EA to get the effective address of the write and the use of INS_RegR to get the last read register (ecx in this case). At runtime this will result in the simMov_RM (simulate mov register to memory) function being called with the specified arguments. It is this function that will finally do the taint propagation.

As you can see there are many cases that must be considered, even for a seemingly trivial operator like mov. Even the above explanation doesn’t deal with the case where a register used in calculating an effective address is itself tainted and requires more code to process correctly. A potential solution, suggested by Silvio might be to do a case split in some cases and then have a default case for instructions not explicitly handled. A solution like this, that over-approximates the tainted data, could lead to extensive false positives though and is not really suitable for the kind of work I’m doing. This leaves one other obvious option for a ‘default case’, and that is to untaint all destination operands in instructions I do not handle. This will lead to an under-approximation of the amount of tainted data, until I encode the semantics of all instructions, but will eliminate false positives and allow me to work on other parts of the algorithm once I have encoded the most common instructions.