Anatomy of a Symbolic Emulator, Part 1: Trace Generation

A couple of months ago there was an ACM article on the SAGE whitebox fuzzing system from Microsoft Research. SAGE is one of the most interesting products of research on automated program testing in recent years and, according to Microsoft, has been used to find a massive number of bugs in their various file parsers.

At its core, SAGE contains a symbolic emulator for executing instruction traces over symbolic data. Beyond whitebox fuzzing, symbolic emulators are useful for a variety of reverse engineering, vulnerability discovery and program analysis tasks. Essentially, a symbolic emulator is a CPU emulator that supports operations not only on concrete numeric values but also on abstract values that may represent a range of concrete values.

In this series of posts I’m going to give an overview of a Python-based symbolic emulator for x86 that I’ve been working on (called JESTER) and show how it can be applied to the problem of whitebox fuzzing. Hopefully this will give an idea of how symbolic emulation works (it’s fairly simple) and also provide some insight into how systems like SAGE operate.

Consider the x86 instruction add eax, ebx. Operating over concrete values an emulator will do the obvious thing of taking the value in EAX, adding it to EBX and then storing the result back in EAX. It will also update the various flags that are affected. Over symbolic values, however, the result is a bit more interesting. Let’s assume that EAX contains the abstract value V1, which represents an unconstrained 32-bit variable, and EBX contains the concrete value 0x10. In this case the emulator will create a new abstract value V2, which represents the addition of V1 and 0x10, and store that back in EAX. Diagrammatically, we can see that EAX now contains something that is a function rather than a single value.

      v1   10
       \   /
    EAX: +    

A slightly more complex diagram shows what the Zero Flag would hold after the above instruction.

      v1   10
       \   /
         +    0
          \  /
           ==   1    0
            \   |   /
        ZF: if-then-else
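
As a rough illustration (this is not JESTER’s actual representation), the expression trees in these diagrams can be modelled with a small class like the following, where concrete values stay as plain integers and symbolic values are nodes recording an operator and its operands:

class SymExpr(object):
    def __init__(self, op, *operands):
        self.op = op
        self.operands = operands

    def __repr__(self):
        return "(%s %s)" % (self.op, " ".join(map(repr, self.operands)))

# add eax, ebx with EAX = V1 (symbolic) and EBX = 0x10 (concrete)
v1 = SymExpr("input", "v1")
eax = SymExpr("+", v1, 0x10)

# the zero flag after the addition, as in the second diagram
zf = SymExpr("ite", SymExpr("==", eax, 0), 1, 0)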

I purposefully used the word ‘function’ because what we end up with, in registers and memory, are expression trees that map from a given set of inputs to an output. As more instructions are emulated these trees get bigger and more difficult to reason about, so people usually take the approach of exporting them to an SMT solver and querying their models that way. The obvious applications are input crafting, tracking user-influenced data and checking security properties. This is fairly well documented in previous posts and in a decade’s worth of academic literature so I won’t delve into the details.
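
To give a flavour of what that looks like, here is a minimal sketch using the Z3 Python bindings (Z3 is simply my choice for illustration; nothing here depends on a particular solver) that asks for a value of V1 which sets the zero flag after the add above:

from z3 import BitVec, If, Solver, sat

v1 = BitVec('v1', 32)       # the unconstrained input V1
eax = v1 + 0x10             # EAX after 'add eax, ebx'
zf = If(eax == 0, 1, 0)     # the zero flag expression from the diagram

s = Solver()
s.add(zf == 1)              # ask for an input that sets ZF
if s.check() == sat:
    print(s.model()[v1])    # 4294967280, i.e. 0xfffffff0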

The point of this post is instead to look at the overall architecture of a symbolic emulator with the aim of illuminating some of the components involved, more directly than is typically done in formal descriptions. I also want to give people an idea of how much or how little effort is involved in building these tools. In order to demonstrate the use of a symbolic emulator I’ll apply it to the problem of whitebox fuzzing, i.e. using a symbolic emulator in combination with an SMT solver to generate inputs for a program that are guaranteed to force execution down a new path.

While writing this series of posts Rolf Rolles posted a great video/blog entry on the topic of input crafting using an SMT solver. Taking a leaf out of his book I’ve decided to accompany these with a video that demonstrates the tools described in operation and should ideally give some insight into their construction. The video is linked at the end but the following wall of text will give some context and might be worth glancing over. This isn’t the most entertaining of entries in the series and is mostly for completeness so if you’re bored out of your mind I accept minimal responsibility =)

1. Trace generation

An emulator needs some way to know what instructions execute and it also needs a starting memory and thread context. There are a few different approaches to getting such information. The Bitblaze/BAP research groups modified Qemu and hook in directly there, the guys working on S2E do something similar, and I previously wrote a C++ library that was used as part of a Pintool at run time. There are a couple of problems with tying your emulation directly into the run time environment of the tool however. Firstly, it’s a lot more annoying to debug an extension to Qemu or PIN than a separate emulator and secondly, it prevents you from doing the emulation on a separate machine to the tracing. The second issue is probably the most important in the long run, as really scaling whitebox fuzzing to the point where it is useful requires parallelism.

The approach I took this time around is directly inspired by the work of MSR on their Nirvana/iDNA tool, but much more simplistic. Instead of using the Pintool to do the emulation I use a lightweight one to just trace the instructions executed and other interesting events, like image loads/unloads, system calls and debugging info. If you’ve used PIN before then most of what I’m about to describe will be obvious and fairly boring so you might want to skip on to part 2 of this series of entries.

The trace format is uncompressed and unoptimised and to date I’ve not had any problems with that. A typical segment just looks as follows (L denotes an image load, I an instruction execution and C provides debugging information as discussed below):

L;4;/lib32/libc.so.6;f5c54000;157244
L;5;/lib32/libm.so.6;f5c2e000;24790
C;0;EAX:ffb292a4;EBX:f5da9ff4;ECX:53f78923;EDX:5;ESP:ffb291f8;EBP:ffb291f8 ... 
I;0;8048fc5
C;0;EAX:ffb292a4;EBX:f5da9ff4;ECX:53f78923;EDX:5;ESP:ffb291f0;EBP:ffb291f8 ... 
I;0;8048fc8
C;0;EAX:ffb292a4;EBX:f5da9ff4;ECX:53f78923;EDX:5;ESP:ffb291ec;EBP:ffb291f8 ... 

In the early stages of the project I worried about this and thought I’d have to come up with some compression method but that hasn’t been the case. Most file parsers generate traces that can be measured in 10s of millions of instructions and things of that scale easily fit in a few gigabytes of storage.
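
To make the format concrete, walking such a trace needs nothing more than splitting each line on ‘;’. The following is an illustrative sketch rather than the actual tool, with the field meanings taken from the snippet above:

def parse_trace(path):
    for line in open(path):
        fields = line.strip().split(';')
        if fields[0] == 'L':        # image load: id, path, base address, size
            img_id, img_path, base, size = fields[1:5]
        elif fields[0] == 'I':      # instruction: thread id, address
            tid, addr = int(fields[1]), int(fields[2], 16)
            # hand the instruction at 'addr' off to the emulator here
        elif fields[0] == 'C':      # register check data, used in section 1.1
            regs = dict(f.split(':') for f in fields[2:])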

1.1 Debugging Assistance

Writing an emulator of any kind can be tedious work. It’s easy to make mistakes and get the semantics of an instruction slightly wrong or set a flag incorrectly. Initially I tried to counter this by writing unit-tests but it quickly became obvious that 1) These were never going to be exhaustive and 2) They were as likely to have mistakes as the emulator code itself. Instead, I added a debug mode to the tracer that logs the register values after each instruction (The lines starting with a “C” above). This then allows the emulator to compare its register values to the ones we know it should have and highlight any discrepancies. Tracing everything in /usr/bin/ and checking these values is a hell of a lot more exhaustive than any unit-testing I would have done! The only reason I’m mentioning this is that I’d recommend it to anyone writing something of this nature. The best tests are by far those you can extract from real binaries.
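
In code the check itself amounts to very little. A hedged sketch (not the actual emulator; emu_regs is assumed to be a dictionary of the emulator’s current register values):

def check_registers(emu_regs, check_record):
    # check_record is a "C" line, e.g. "C;0;EAX:ffb292a4;EBX:f5da9ff4;..."
    for field in check_record.strip().split(';')[2:]:
        reg, _, val = field.partition(':')
        if reg in emu_regs and emu_regs[reg] != int(val, 16):
            print("%s mismatch: emulator has %08x, trace has %s" % (reg, emu_regs[reg], val))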

1.2 Handling system calls

One of the disadvantages of using a user-land tracer is that you miss out on any updates to memory that happen within the kernel. The only real way to handle this correctly is to define per-system-call handlers that know which memory addresses a system call will update based on its arguments or return value. In PIN this is fairly straightforward: you register a syscall entry and exit handler, get the syscall args or return value and then log whatever you need on exit.

int main(int argc, char *argv[])
{
        ...

        PIN_AddSyscallEntryFunction(SyscallEntry, 0);

        ...
}

VOID SyscallEntry(THREADID tid, CONTEXT *ctxt, SYSCALL_STANDARD std, VOID *v)
{
        ADDRINT syscall_num = PIN_GetSyscallNumber(ctxt, std);
        ...
#ifdef LINUX
        // Handle Linux syscalls
        switch (syscall_num) {
        case SYS_read:          // from <sys/syscall.h>
        {
                // read(fd, buf, count): arg 1 is the destination buffer,
                // arg 2 the maximum number of bytes the kernel will write to it
                ADDRINT read_buf_ptr = PIN_GetSyscallArgument(ctxt, std, 1);
                size_t read_sz = PIN_GetSyscallArgument(ctxt, std, 2);

                // t_data is the tool's per-thread bookkeeping (not shown); the
                // recorded addresses can then be logged on syscall exit
                t_data->log_addrs[read_buf_ptr] = read_sz;
                break;
        }
        ...
}

Handling each and every system call might seem like an onerous task but if you’re working on particular types of software (e.g. file parsers) then you can get away with a minimal subset, e.g. open, read, lseek, mmap and a few others. My general approach is to just add them as necessary. You’ll encounter many more along the way but it turns out that not many of them end up having any interaction with the user-controlled data you’re interested in.

In the trace log format I included support for events other than those shown in the above snippet. For syscalls, as just discussed, there is the M event, which looks as follows and tells the emulator to update the memory address given with the contents of a file.

M;0;f5f97000:syscall_c0_f5f97000_1000_1

There is also the ‘R’ event which tells the emulator to update a register with a particular value. This is useful for instructions you can’t handle for whatever reason. Other than that there isn’t really anything to capturing a trace. The only thing I haven’t mentioned is that on starting tracing, either at a given address or the program’s entry point, you also need to log the program’s memory and thread contexts at that point in order to give your emulator starting values. This is fairly straightforward though and PIN provides all the required API calls.
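
For completeness, here is roughly how the emulator side might consume an M record. This is a hedged sketch: I’m assuming the file named in the record holds the raw bytes the kernel wrote, and emu stands in for whatever emulator object you have, with a hypothetical write_mem method:

def apply_mem_event(emu, line):
    # e.g. "M;0;f5f97000:syscall_c0_f5f97000_1000_1"
    fields = line.strip().split(';')
    addr_str, dump_file = fields[2].split(':', 1)
    data = open(dump_file, 'rb').read()
    emu.write_mem(int(addr_str, 16), data)      # hypothetical emulator API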

(You probably want to click the “Watch on YouTube” option on the bottom right of the video and set the quality to 720p. The tools discussed are not publicly available but that may change in the future.)

Applying Taint Analysis and Theorem Proving to Exploit Development

Last week I spoke at REcon on the above topic. The slides can be found here. The abstract for this talk is as follows:

As reverse engineers and exploit writers we spend much of our time trying to illuminate the relationships between input data, executed paths and the values we see in memory/registers at a later point. This work can often be tedious, especially in the presence of extensive arithmetic/logical modification of input data and complex conditions.

Using recent (and not so recent) advances in run-time instrumentation we can go a long way towards automating the process of tracking input data and its effects on execution. In this talk we will discuss possible approaches to solving this problem through taint analysis. The solutions we will discuss are useful in many scenarios e.g. determining the set of conditional jumps under our control, discovering buffers in memory that are useful for injecting shellcode, tracking parameters to potentially insecure function calls, discovering ‘bad bytes’ for exploits and so on.

Building on this we will delve into the construction of logical formulae expressing the relationships between input and data in memory and ways in which these formulae can be manipulated and solved for interesting results. Depending on how we manipulate the initial formulae we can use theorem provers to automatically solve many problems e.g. ‘unraveling’ arithmetic/logical modifications on input, generating inputs that trigger specific paths, discovering the bounds on given variables and so forth.

I will update this post whenever the REcon videos etc go online. In the meantime I should probably finish at least one of the 5 or so blog posts in a semi-complete state 😛

Pin problem solved!

After some discussion on the Pin mailing list, I came to the conclusion that the best approach to my previous issue with Pin is simply to check the integrity of all values that get put in the EIP register from a potentially tainted source. This is arguably what I should have been doing anyway to detect the potential for an exploit. This means instrumenting ret, call, loop, jmp and its conditional variants, and checking the integrity of the value being used. In this context, this means checking that the value isn’t tainted by user input, and it is a fairly trivial task. For example, this code is called from my instruction instrumentation routine, registered with Pin:

VOID
processRET(INS ins)
{
    INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(checkRETIntegrity),
            IARG_REG_VALUE, LEVEL_BASE::REG_ESP,
            IARG_CONTEXT,
            IARG_INST_PTR,
            IARG_END);
}

Which results in the following function being called before any ret instructions are executed, and passed the value currently stored in the ESP register:

(tmgr is a custom object that maintains taint propagation information)

VOID
checkRETIntegrity(ADDRINT espVal, CONTEXT *ctx, ADDRINT pc)
{
    // At a ret, ESP points at the stored return address; if those bytes are
    // tainted then user input can influence EIP
    if (tmgr.isMemLocTainted(espVal, DWORD_SIZE)) {
        postRunAnalysis(true, ctx, TAINTED_RET);
        exit(0);
    }
}

Using this same technique I can also check for corruption of the stored frame pointer. I should also extend these checks to reads from and writes to memory. The problem gets a little more complex here as it is quite common for user input to taint the offset of a read/write. The only practical approach that springs to mind is to check that the location being read/written is mapped, as attempting to detect whether the tainted index is within a certain bound involves solving a whole other array of associated problems.

The romance is over…

…with Pin that is. Pin has been my DBI framework of choice for all of 3 weeks now and I haven’t had a single problem, until today. It would appear that smashing the stored EIP causes all sorts of problems for the Pin analysis code. What appears to happen is that at the return from the function, Pin assumes the stored EIP is a valid instruction and passes it to its analysis engine. If the smashed EIP points to the heap for example, Pin will then start to disassemble the data located there in an attempt to find the end of the basic block. It then keeps progressing through memory until it is eventually killed by the operating system (presumably for trying to disassemble an unreadable address).

The following code illustrates the issue:

#include <stdlib.h>

/* Overwrite the saved return address (at ebp+4 in a standard 32-bit
   stack frame) with a pointer into the heap */
void smashySmashy(char *heapArr)
{
    asm("movl %0, 4(%%ebp)\n\t"
        :
        : "D"(heapArr));
}

int main(int argc, char *argv[])
{
    char *heapArr = NULL;
    heapArr = malloc(256*sizeof(char));

    smashySmashy(heapArr);

    return 0;
}

Running the above code with any of the Pin instrumentation tools will result in the Pin process eventually aborting.

This is quite the showstopper for me, in terms of my usage of Pin, as functions like this are exactly the kind I need to analyse for exploit generation. I’ll update this post when I get a response from the developers.

Not all shellcode locations are made equal

When automatically generating an exploit, one of the core tasks is to find a hospitable buffer that we can store our shellcode in, so that at the vulnerability point we have somewhere to jump to. Some tools (like Byakugen) do this by scanning process memory and searching for sequences of bytes from the input (or slight variations thereof, such as the upper/lower case conversion of said input). There are a number of problems with this. First of all, you can easily miss perfectly usable buffers: if all user input is passed through a function that adds 4 to every byte then the resulting buffer will be missed, but it is still a usable shellcode buffer. The second major problem is that if you do find a suitable buffer this way you have no guarantee that the values you’re changing weren’t used in a path condition on the way to the vulnerability point. Replacing this input with your shellcode could result in the vulnerability not being triggered.

I’m addressing the first problem by using dynamic data flow analysis to track all user input as it moves through registers and process memory, and the second problem by noting all path conditions that operate on such ‘tainted’ data. Already we can see a division between potential shellcode buffers: there are bytes that are used in path conditions and those that are not, and in both cases there are bytes tainted by simple assignment instructions like mov versus those tainted as a result of arithmetic operations like add.

In the case of a buffer consisting only of bytes that are not used in path conditions and are tainted via simple assignment, we can simply swap out those bytes for our shellcode, but what about the other cases? The approach I am taking is to record all tainting operations on a byte, as well as all path conditions, so that at a given program location we can express the potential values of a given byte via an algebraic formula. For example, assume in the following that eax is tainted by the 10th byte of user input:

mov ecx, eax
add ecx, 4
mov [1234], ecx

At the end of this block we can represent the constraints on location ‘1234’ by the equation t1 := in10 & t2 := t1 + 4 & mem1234 := t2. If we then want to discover if there exists an assignment to the 10th input byte that would result in location ‘1234’ being equal to 0x41, we can include this additional constraint in the previous formula to give t1 := in10 & t2 := t1 + 4 & mem1234 := t2 & mem1234 == 0x41. A SAT/SMT solver will take this and return us an assignment to the literal ‘in10’ that satisfies the formula, i.e. 0x3D.
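
As a sanity check, the same query expressed in the Z3 Python bindings (used here purely for illustration) looks like this:

from z3 import BitVec, Solver, sat

in10 = BitVec('in10', 8)    # the 10th input byte
t1 = in10
t2 = t1 + 4
mem1234 = t2

s = Solver()
s.add(mem1234 == 0x41)      # we want location '1234' to hold 0x41
if s.check() == sat:
    print(hex(s.model()[in10].as_long()))   # 0x3d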

Now obviously real world constraint formulae will be much more complex than this; they could have hundreds of thousands of literals and conditions. A modern SAT/SMT solver will be able to handle such an input, but it could take a number of hours depending on the complexity. Because of this, it would be an advantage to be able to select those buffers that have the least complex constraint formulae (preferably ones no more complex than a sequence of assignments, in which case a solver will not be needed). The basic classification scheme I’m using is represented by the following lattice:

[Figure: constraint complexity lattice]

Formulae are associated with a single point in the lattice, with the more complex formulae towards the top and the less complex towards the bottom. We can classify based on the complexity of arithmetic in the formula, as well as whether it contains a path condition, like a jump conditioned on user input, or not.

When an object is created, representing the constraints on a given byte, it starts at ‘DIRECT COPY’ and its position in the lattice is updated as operations are performed on it. A byte can only move up in the lattice (i.e. there is only a join function, no meet), with the lattice position of a byte resulting from the interaction of two tainted bytes being the join of those two source bytes. The join function is defined as the least upper bound of two inputs. We can then use this lattice to represent the constraint complexity of a single byte, or of a sequence of bytes by taking the join of all bytes in the sequence. Currently, I have classified a path-constrained direct copy and non-linear arithmetic as being of equivalent complexity but, given the arithmetic support in modern SMT solvers, it might turn out that a formula made up entirely of arithmetic operations, linear or non-linear, is easier to solve than one containing predicates that must be satisfied.
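
One simple way to implement such a lattice (a sketch only; the element names below are illustrative and the real structure is the one in the figure) is to record the set of ‘complexity features’ a formula has accumulated, with DIRECT COPY as the empty set and join as set union:

PATH_CONSTRAINED = 'path constrained'
LINEAR_ARITH = 'linear arithmetic'
NONLINEAR_ARITH = 'non-linear arithmetic'

def join(a, b):
    # least upper bound of two lattice positions
    return a | b

direct_copy = frozenset()
b1 = join(direct_copy, frozenset([LINEAR_ARITH]))   # e.g. add ecx, 4
b2 = join(b1, frozenset([PATH_CONSTRAINED]))        # later used in a conditional jump
buf = join(b1, b2)    # complexity of a buffer containing both bytes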

The advantage of the above scheme is that we now have a way to classify potential shellcode buffers by the amount of effort required to use them as such. By doing this we can potentially avoid calling a SAT/SMT solver altogether, and in cases where we must, we can select the easier formulae to solve.

Difficulties in taint data propagation without an IR

I’m currently building a tool that propagates taint information through a program as it executes. Essentially this consists of marking the data read by certain syscalls (e.g. read and recv) as ‘tainted’, or user controllable, followed by tracking the flow of this data through the program. The ‘propagation’ occurs when tainted data is used as a source in an instruction, thus tainting the result. For example, if the data at address 0x100 is tainted and the instruction mov ebx, [0x100] is executed then ebx is now tainted.

To track all possible taint propagation is a fairly involved task. The mov instruction for instance has 4 potential operand configurations (register to register, register to memory, memory to register, immediate to register or memory) and many instructions (arithmetic ones for instance) do more than simply copy data. Considering that most binaries from the /bin directory on Linux have between 70 and 100 unique instructions, and many of these instructions have unique semantics in terms of how they spread the data from source operands into the destination operands, modelling these effects for each instruction would seem to be a fairly involved task.

It’s also a task I’m having to deal with at the moment due to my choice of dynamic binary instrumentation (DBI) framework. I am using Pin, a free but closed source DBI framework. One of Pin’s primary goals is speed and thus it can’t afford something like converting each basic block to an intermediate representation (IR), as Valgrind does. This means we are left with convenience functions to access operators and operands and are required to perform a case-by-case analysis on each instruction. Luckily, the functions provided by Pin are quite well thought out. Let us consider how to process an instruction like mov dword ptr ds:[esi+edx*4+0x2a0], ecx, where ecx is tainted. First of all, we must register a function with Pin to instrument each instruction. This is about as simple as you’d expect.

INS_AddInstrumentFunction(instruction, 0);

This function will then be passed each instruction and will have to decide the correct way to handle it.

VOID
instruction(INS ins, VOID *v)
{
    UINT32 cat = INS_Category(ins);

    switch (cat) {
        case XED_CATEGORY_DATAXFER:
            switch (INS_Opcode(ins)) {
                case XED_ICLASS_XCHG:
                    processXCHG(ins);
                    break;
                case XED_ICLASS_MOV:
                    processMov(ins);
                    break;
etc etc

As you can see, we can work at two levels of granularity. The instruction categories group together operators with similar semantics, e.g. there are separate categories for string operations like rep movsb, arithmetic operations, interrupts and so on. The operators in certain categories all have the same taint semantics and we need not go any deeper, but other categories contain instructions that must be handled on a case-by-case basis. DATAXFER, for instance, contains xchg, mov and bswap, among others, all of which require special attention.

In our example we will end up in the processMov function. The purpose of this function is basically to figure out which of the 4 mov cases we are considering. Pin provides the INS_IsMemoryRead/Write functions, and others, that allow us to figure this out. Quite usefully, it also provides a mechanism to determine the exact address written/read and the number of bytes written/read at run time. This saves us having to calculate the effective address ourselves. For our example we would use these functions to determine that we are writing to memory, from a register, and then use the following call to insert a run time hook to our taint propagation function:

INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(simMov_RM),
    IARG_MEMORYWRITE_EA,
    IARG_MEMORYWRITE_SIZE,
    IARG_UINT32, INS_RegR(ins, INS_MaxNumRRegs(ins)-1),
    IARG_END);

You can see here the use of IARG_MEMORYWRITE_EA to get the effective address of the write and the use of INS_RegR to get the last read register (ecx in this case). At runtime this will result in the simMov_RM (simulate mov register to memory) function being called with the specified arguments. It is this function that will finally do the taint propagation.
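
Conceptually the propagation step itself is simple. A minimal Python sketch of what a routine like simMov_RM has to do (the real handler is part of the C++ Pintool; tainted_bytes and tainted_regs are just illustrative stand-ins for the taint manager’s state):

tainted_bytes = set()   # addresses of tainted memory bytes
tainted_regs = set()    # names of tainted registers

def sim_mov_rm(write_addr, write_size, src_reg):
    dest = range(write_addr, write_addr + write_size)
    if src_reg in tainted_regs:
        # the written bytes now carry user-controlled data
        tainted_bytes.update(dest)
    else:
        # an untainted store wipes out any taint previously stored there
        tainted_bytes.difference_update(dest)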

As you can see there are many cases that must be considered, even for a seemingly trivial operator like mov. Even the above explanation doesn’t deal with the case where a register used in calculating an effective address is itself tainted, which requires more code to process correctly. A potential solution, suggested by Silvio, might be to handle some instructions case by case and then have a default case for instructions not explicitly handled. A solution like this, which over-approximates the tainted data, could lead to extensive false positives though and is not really suitable for the kind of work I’m doing. This leaves one other obvious option for a ‘default case’, and that is to untaint all destination operands in instructions I do not handle. This will lead to an under-approximation of the amount of tainted data until I encode the semantics of all instructions, but will eliminate false positives and allow me to work on other parts of the algorithm once I have encoded the most common instructions.

Granular instrumentation with Pin

Of the DBI frameworks I’ve used, Pin has the best support for instrumentation beyond the basic block/instruction level. The designers seem to have recognised that not all instrumentation needs to be done at the basic block or instruction level, and thus you are provided with ways to instrument events at the routine level, as groups of basic blocks (traces), on thread creation/deletion and on image loading/unloading. More importantly, you are provided with convenience functions to access important data when these events occur. I required this level of instrumentation earlier today when I ran into the following problem:

I have a Pin tool that performs data flow analysis at run time by marking data from certain function calls (e.g. read and recv) as tainted and then tracking this data through the program. We can do this relatively easily in Pin by registering a hook on all system calls as follows:

PIN_AddSyscallEntryFunction(syscallEntry, 0);
PIN_AddSyscallExitFunction(syscallExit, 0);

These functions can then access syscall arguments and return values. I noticed an issue earlier though where the number of tainted bytes was far higher than it should have been. The reason for this is that those syscall hooks also catch syscalls that take place while the executable is being loaded. I considered using a counter to denote a certain number of calls to skip over, but then Silvio suggested a much cleaner alternative: disable syscall hooking until the entry point has been executed. This turns out to be incredibly easy using Pin’s image level instrumentation.

We first register a function to be called when an image is loaded:

IMG_AddInstrumentFunction(image, 0);

Then using some functions provided by Pin we can extract the entry point of the main executable, and store it in a global variable entryPoint, as follows:

VOID
image(IMG img, VOID *v)
{
    if (IMG_IsMainExecutable(img))
        entryPoint = IMG_Entry(img);
}

And finally within our instruction level instrumentation function we can simply check the address of the instruction against this entry point value and set a global flag, passedEntryPoint, when the entry point is executed:

VOID
instruction(INS ins, VOID *v)
{
    if (!passedEntryPoint && INS_Address(ins) == entryPoint)
        passedEntryPoint = true;
...

A check on this flag within the syscallExit function, before we mark any data as tainted, then allows us to avoid spurious tainting.

VOID
syscallExit(THREADID tid, CONTEXT *ctx, SYSCALL_STANDARD std, VOID *v)
{
    int bufLen = 0;
    // readData and readBuf are set in the syscall entry hook (not shown)
    // when a read/recv is seen
    if (readData) {
        bufLen = PIN_GetSyscallReturn(ctx, std);
        if (passedEntryPoint && bufLen > 0) {
            tmgr.createTaintSourceM((unsigned)readBuf, bufLen, readCount++);
            totalBytesRead += bufLen;
        }
    }
}