SMT Solvers for Software Security (USENIX WOOT’12)

At WOOT’12 a paper co-written by Julien Vanegue, Rolf Rolles and me will be presented under the title “SMT Solvers for Software Security”. An up-to-date version can be found in the Articles/Presentation section of this site.

In short, the message of this paper is “SMT solvers are quite capable of handling decision problems derived from security properties. However, specific problem domains usually require domain-specific modeling approaches. Important limitations, challenges, and research opportunities remain in developing appropriate models for the three areas we discuss – vulnerability discovery, exploit development, and bypassing of copy protection”. The motivation for writing this paper is to discuss these limitations, why they exist, and hopefully to encourage more work on the modeling and constraint generation sides of these problems.

A quick review of the publication lists from major academic conferences focused on software security will show a massive number of papers discussing solutions based on SMT technology. There is good reason for this: 1) SMT-backed approaches such as symbolic/concolic execution have proved to be powerful tools on certain problems, and 2) there is an increasing number of freely available frameworks.

The primary domain where SMT solvers have shone, in my opinion, is in the discovery of bugs related to unsafe integer arithmetic using symbolic/concolic execution. There’s a fairly obvious reason why this is the case: the quantifier-free, fixed-size bitvector logic supported by SMT solvers provides direct support for the precise representation of arithmetic at the assembly level. In other words, one does not have to do an excessive amount of work when modeling the semantics of a program to produce a representation suitable for the detection of unsafe arithmetic. It suffices to perform a near-direct translation from the executed instructions to the primitives provided by SMT solvers.
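As a rough illustration (this example is mine, not taken from the paper), the following sketch uses the Z3 Python bindings to express the kind of query a symbolic execution tool might hand to a solver for an unsafe 32-bit addition: a sum that wraps around and still passes a small-size check.

# Minimal sketch, assuming the Z3 Python bindings are installed; the variable
# names and constraints are invented for illustration.
from z3 import BitVec, Solver, ULT, sat

x = BitVec('x', 32)           # e.g. a user-controlled length field
y = BitVec('y', 32)           # e.g. a second user-controlled field
total = x + y                 # 32-bit wrap-around addition, as on x86

s = Solver()
s.add(ULT(total, x))          # unsigned overflow: the sum is smaller than an operand
s.add(ULT(total, 64))         # and the wrapped result still passes a small-size check

if s.check() == sat:
    print(s.model())          # concrete inputs that trigger the unsafe arithmetic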

The exploit generation part of the paper deals with what happens when one takes the technology for solving the above problem and applies it to a new problem domain; in particular, a domain in which the model produced simply by tracking transformations and constraints on input data no longer contains enough information to inform a solution. In the case of exploit generation, for example, such models do not account for things like the relationship between user input and memory layout. Obviously enough, when reasoning about a formula produced from such a model a solver cannot account for information that is not present. Thus, no amount of computational capacity or solver improvement can produce an effective solution.

SMT solvers are powerful tools and symbolic/concolic execution can be an effective technique. However, one thing I’ve learned over the past few years is that they don’t remove the obligation and effort required to accurately model the problem you’re trying to solve. You can throw generic symbolic execution frameworks at a problem but if you’re interested in anything more complex than low level arithmetic relationships you’ve got work to do!

Anatomy of a Symbolic Emulator, Part 3: Processing Symbolic Data & Generating New Inputs

In this final video in the series we go over how to generate new inputs for a program once we detect a user-influenced conditional branch. At the end there is also an example of the type of condition/resulting formula that we get from working on a real file parser, in this case libwebp.

(You probably want to click the “Watch on YouTube” option on the bottom right of the video and set the quality to 720p)

Conclusion
This type of emulation, input generation and formula checking does not need to be limited to conditional jumps. As I discussed in a previous post, you can use a similar approach to discover variable ranges, check for variable relationships and assist in figuring out complex algorithms. For example, one could generate a query to the solver every time an argument to malloc is found to be influenced by the user, or export a list of all functions that operate on user-influenced data to IDA for manual review. (In fact, a light-weight version of this approach in combination with fuzzing and an IDA importer is possibly more generally useful to an individual auditor than going down the route of full-on whitebox fuzzing. More on that later =))
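To make the malloc example a little more concrete, here is a hypothetical sketch of the kind of query that could be issued when a user-influenced size argument is spotted. This is not JESTER’s code or API; the variable names and constraints are invented, and the Z3 Python bindings stand in for whatever solver interface is actually used.

# Hypothetical illustration only: 'user_len' stands in for a tainted input
# field and the constraints are made up for the example.
from z3 import BitVec, Solver, ULT, And, sat

user_len = BitVec('user_len', 32)
path_constraints = [user_len != 0]          # constraints collected along the trace
malloc_size = user_len * 8 + 4              # expression tree reaching malloc's argument

s = Solver()
s.add(And(path_constraints))
s.add(ULT(malloc_size, 8))                  # can the size be made to wrap to something tiny?
if s.check() == sat:
    print("user-influenced malloc size can be made < 8: %s" % s.model())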

Anyway, I hope these videos provide some insight into how a whitebox fuzzer might work as well as one approach to building a symbolic emulator. To give an idea of the effort involved – the combined whitebox fuzzing, trace parsing and emulation code (along with supporting libraries) comes to around 10,000 lines of Python. Of this, the emulator itself is only 3000 lines or so. The PIN tracer is just under 1000 lines of C++.

Tracing is currently fairly unoptimised, and parsing something like a video or image while tracing can result in a factor of 10-100 increase in running time. This usually means a wait of 30 seconds, which isn’t too bad for whitebox fuzzing as tracing is not performed too often, but for other uses of a symbolic emulator (like tracing while fuzzing normally) this will require some work. The emulator itself is Python based and as such is not lightning fast. In the default run-mode it emulates ~5000 instructions per second. What this translates to is about 30-40 minutes per trace of an average file parser. This isn’t as bad as you might think however, as the test cases generated tend to be much more effective at hitting new code than what you would get from dumb fuzzing. Despite this we still need performance gains and I’m working on a few different solutions for that. Somewhere around 30,000+ instructions per second would be what I would consider approaching acceptable =)

To preempt the inevitable questions – for now JESTER is not publicly available but that may change in the future. It’s very much a research prototype at the moment where we’re testing out several approaches to improving performance and general usefulness. However, if you are interested in working on this type of research post a comment with a contact address (it won’t appear publicly) as I’m fairly sure we are currently hiring.

Anatomy of a Symbolic Emulator, Part 2: Introducing Symbolic Data

In the previous post I discussed one way to go about gathering a trace for emulation. In this post I’m going to talk about how we go about emulating such a trace, how and why we hook functions as they are emulated, and how symbolic operations are performed.

As before, this post is accompanied by a video which demonstrates the code in action. Unlike the previous post I’ve decided to skip the paragraphs of rambling and instead most of the info is in the actual video itself =)

Topics covered:
– Introducing symbolic data via function hooks
– Performing computations on symbolic data

(You probably want to click the “Watch on YouTube” option on the bottom right of the video and set the quality to 720p. Btw, near the end of the video I said something along the lines of “one of the advantages of whitebox fuzzing over symbolic emulation”. That makes no sense =) What I meant to say was “one of the advantages of whitebox fuzzing over normal symbolic execution”.)

Anatomy of a Symbolic Emulator, Part 1: Trace Generation

A couple of months ago there was an ACM article on the SAGE whitebox fuzzing system from Microsoft Research. SAGE is one of the most interesting products of research on automated program testing in recent years and, according to Microsoft, has been used to find a massive number of bugs in their various file parsers.

At its core, SAGE contains a symbolic emulator for executing instruction traces over symbolic data. Beyond whitebox fuzzing, symbolic emulators are useful for a variety of reverse engineering, vulnerability discovery and program analysis tasks. Essentially, a symbolic emulator is a CPU emulator that not only supports operations on concrete numeric values but also on abstract values that may represent a range of concrete values.

In this series of posts I’m going to give an overview of a Python-based symbolic emulator for x86 that I’ve been working on (called JESTER) and show how it can be applied to the problem of whitebox fuzzing. Hopefully this will give an idea of how symbolic emulation works (it’s fairly simple) and also provide some insight into how systems like SAGE operate.

Consider the x86 instruction add eax, ebx. Operating over concrete values, an emulator will do the obvious thing: take the value in EAX, add it to EBX and store the result back in EAX. It will also update the various flags that are affected. Over symbolic values, however, the result is a bit more interesting. Let’s assume that EAX contains the abstract value V1, which represents an unconstrained 32-bit variable, and EBX contains the concrete value 0x10. In this case the emulator will create a new abstract value V2, which represents the addition of V1 and 0x10, and store that back in EAX. Diagrammatically, we can see that EAX now contains something that is a function rather than a single value.

      v1   10
       \   /
    EAX: +    

A slightly more complex diagram shows what the Zero Flag would hold after the above instruction.

      v1   10
       \   /
         +    0
          \  /
           ==   1    0
            \   |   /
        ZF: if-then-else

I purposefully used the word ‘function’ because what we end up with, in registers and memory, are expression trees that map from a given set of inputs to an output. As more instructions are emulated these trees get bigger and more difficult to reason about, so people usually take the approach of exporting them to an SMT solver and querying their models that way. The obvious applications are input crafting, tracking user-influenced data and checking security properties. This is fairly well documented in previous posts and in a decade’s worth of academic literature so I won’t delve into the details.
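To make the notion of expression trees a little more concrete, here is a stripped-down sketch of how an emulator might represent them internally. The classes and names are illustrative rather than JESTER’s actual implementation.

# Illustrative only: a tiny expression-tree representation and a symbolic ADD.
class SymVar(object):
    def __init__(self, name, width=32):
        self.name, self.width = name, width

class SymExpr(object):
    def __init__(self, op, operands):
        self.op, self.operands = op, operands

def emulate_add(dst, src):
    # Both operands concrete: behave like a normal emulator (mod 2**32, as on x86)
    if isinstance(dst, int) and isinstance(src, int):
        return (dst + src) & 0xffffffff
    # Otherwise build a tree node; the register now holds a function of its inputs
    return SymExpr('+', [dst, src])

eax = SymVar('v1')             # unconstrained 32-bit input
ebx = 0x10                     # concrete value
eax = emulate_add(eax, ebx)    # eax is now the tree (+ v1 0x10)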

The point of this post is instead to look at the overall architecture of a symbolic emulator, with the aim of illuminating some of the components involved more directly than is typically done in formal descriptions. I also want to give people an idea of how much or how little effort is involved in building these tools. In order to demonstrate the use of a symbolic emulator I’ll apply it to the problem of whitebox fuzzing, i.e. using a symbolic emulator in combination with an SMT solver to generate inputs for a program that are guaranteed to force execution down a new path.

While writing this series of posts Rolf Rolles posted a great video/blog entry on the topic of input crafting using an SMT solver. Taking a leaf out of his book I’ve decided to accompany these with a video that demonstrates the tools described in operation and should ideally give some insight into their construction. The video is linked at the end but the following wall of text will give some context and might be worth glancing over. This isn’t the most entertaining of entries in the series and is mostly for completeness so if you’re bored out of your mind I accept minimal responsibility =)

1. Trace generation

An emulator needs some way to know what instructions execute, and it also needs a starting memory and thread context. There are a few different approaches to getting such information. The Bitblaze/BAP research groups modified Qemu and hook in directly there, the guys working on S2E do something similar, and I previously wrote a C++ library that was used as part of a Pintool at run time. There are a couple of problems with tying your emulation directly into the run time environment of the tool however. Firstly, it’s a lot more annoying to debug an extension to Qemu or PIN than a separate emulator, and secondly, it prevents you from doing the emulation on a separate machine to the tracing. The second issue is probably the more important in the long run, as really scaling whitebox fuzzing to the point where it is useful requires parallelism.

The approach I took this time around is directly inspired by the work of MSR on their Nirvana/iDNA tool, but much more simplistic. Instead of using the Pintool to do the emulation, I use a lightweight one to just trace the instructions executed and other interesting events, like image loads/unloads, system calls and debugging info. If you’ve used PIN before then most of what I’m about to describe will be obvious and fairly boring, so you might want to skip on to part 2 of this series of entries.

The trace format is uncompressed and unoptimised and to date I’ve not had any problems with that. A typical segment looks as follows (L denotes an image load, I an instruction execution and C provides debugging information as discussed below):

L;4;/lib32/libc.so.6;f5c54000;157244
L;5;/lib32/libm.so.6;f5c2e000;24790
C;0;EAX:ffb292a4;EBX:f5da9ff4;ECX:53f78923;EDX:5;ESP:ffb291f8;EBP:ffb291f8 ... 
I;0;8048fc5
C;0;EAX:ffb292a4;EBX:f5da9ff4;ECX:53f78923;EDX:5;ESP:ffb291f0;EBP:ffb291f8 ... 
I;0;8048fc8
C;0;EAX:ffb292a4;EBX:f5da9ff4;ECX:53f78923;EDX:5;ESP:ffb291ec;EBP:ffb291f8 ... 

In the early stages of the project I worried about this and thought I’d have to come up with some compression method, but that hasn’t been the case. Most file parsers generate traces that can be measured in tens of millions of instructions, and things of that scale easily fit in a few gigabytes of storage.
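Parsing the records is also trivial. As a rough sketch (not the actual code; the field meanings are inferred from the sample above, and treating the second field as a thread id is my assumption):

# Rough sketch only; field meanings are inferred from the sample records.
def parse_record(line):
    parts = line.strip().split(';')
    kind = parts[0]
    if kind == 'L':        # image load: index, path, base address, size
        return ('load', parts[2], int(parts[3], 16), parts[4])
    elif kind == 'I':      # instruction executed at the given address
        return ('insn', int(parts[1]), int(parts[2], 16))
    elif kind == 'C':      # register checkpoint used for debugging (see 1.1)
        regs = dict(f.split(':') for f in parts[2:])
        return ('ctx', int(parts[1]), dict((r, int(v, 16)) for r, v in regs.items()))
    return ('other', parts)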

1.1 Debugging Assistance

Writing an emulator of any kind can be tedious work. It’s easy to make mistakes and get the semantics of an instruction slightly wrong or set a flag incorrectly. Initially I tried to counter this by writing unit-tests but it quickly became obvious that 1) These were never going to be exhaustive and 2) They were as likely to have mistakes as the emulator code itself. Instead, I added a debug mode to the tracer that logs the register values after each instruction (The lines starting with a “C” above). This then allows the emulator to compare its register values to the ones we know it should have and highlight any discrepancies. Tracing everything in /usr/bin/ and checking these values is a hell of a lot more exhaustive than any unit-testing I would have done! The only reason I’m mentioning this is that I’d recommend it to anyone writing something of this nature. The best tests are by far those you can extract from real binaries.
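The consistency check itself is tiny. A sketch (illustrative only, using a plain dictionary of register values rather than the emulator’s real interface):

# Illustrative sketch: 'emulator_regs' maps register names to the emulator's
# current concrete values, 'trace_regs' comes from a parsed 'C' record.
def check_registers(emulator_regs, trace_regs, address):
    for reg, expected in trace_regs.items():
        actual = emulator_regs.get(reg)
        if actual is not None and actual != expected:
            print("Mismatch on %s after %x: emulator has %x, trace says %x"
                  % (reg, address, actual, expected))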

1.2 Handling system calls

One of the disadvantages of using a user-land tracer is that you miss out on any updates to memory that happen within the kernel. The only real way to handle this correctly is to define per-system-call handlers that know which memory addresses a system call will update based on its arguments or return value. In PIN this is fairly straightforward: you register syscall entry and exit handlers, get the syscall arguments or return value, and then log whatever you need on exit.

// Assumes the standard PIN header and the Linux syscall numbers from <sys/syscall.h>
#include "pin.H"
#include <sys/syscall.h>

VOID SyscallEntry(THREADID tid, CONTEXT *ctxt, SYSCALL_STANDARD std, VOID *v);

int main(int argc, char *argv[])
{
        ...

        // Register a callback that fires on entry to every system call
        PIN_AddSyscallEntryFunction(SyscallEntry, 0);

        ...
}

VOID SyscallEntry(THREADID tid, CONTEXT *ctxt, SYSCALL_STANDARD std, VOID *v)
{
        ADDRINT syscall_num = PIN_GetSyscallNumber(ctxt, std);
        ...
#ifdef LINUX
        // Handle Linux syscalls
        switch (syscall_num) {
        case SYS_read:
        {
                // read(fd, buf, count): remember the buffer address and size so the
                // bytes written by the kernel can be logged on syscall exit
                ADDRINT read_buf_ptr = PIN_GetSyscallArgument(ctxt, std, 1);
                size_t read_sz = PIN_GetSyscallArgument(ctxt, std, 2);

                // t_data is the tracer's per-thread bookkeeping structure (elided here)
                t_data->log_addrs[read_buf_ptr] = read_sz;
                break;
        }
        ...
}

Handling each and every system call might seem like an onerous task, but if you’re working on particular types of software (e.g. file parsers) then you can get away with a minimal subset, e.g. open, read, lseek, mmap and a few others. My general approach is to just add them as necessary. You’ll encounter many more along the way, but it turns out not many of them end up having any interaction with the user-controlled data you’re interested in.

In the trace log format I included support for events other than those shown in the above snippet. For syscalls, as just discussed, there is the M event, which looks as follows and tells the emulator to update the given memory address with the contents of a file.

M;0;f5f97000:syscall_c0_f5f97000_1000_1
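As a rough illustration (this is not the actual emulator code, and the dump-file naming scheme is treated as opaque), the emulator-side handling of such a record might look something like the following.

import os

# Illustrative only: write the dumped bytes into the emulator's memory map,
# overwriting any symbolic state for those addresses with concrete values.
def handle_memory_update(memory, record, dump_dir):
    addr_str, dump_name = record.strip().split(';')[2].split(':')
    addr = int(addr_str, 16)
    with open(os.path.join(dump_dir, dump_name), 'rb') as f:
        data = f.read()
    for offset, byte in enumerate(bytearray(data)):
        memory[addr + offset] = byte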

There is also the ‘R’ event, which tells the emulator to update a register with a particular value. This is useful for instructions you can’t handle for whatever reason. Other than that there isn’t really anything to capturing a trace. The only thing I haven’t mentioned is that on starting tracing, either at a given address or the program’s entry point, you also need to log the program’s memory and thread contexts at that point in order to give your emulator starting values. This is fairly straightforward though and PIN provides all the required API calls.

(You probably want to click the “Watch on YouTube” option on the bottom right of the video and set the quality to 720p. The tools discussed are not publicly available but that may change in the future.)

Code Analysis Carpentry (Ruxcon 2010)

Ruxcon is next month and I’ll be giving a talk titled Code Analysis Carpentry (Or how not to brain yourself when handed an SMT solving hammer). Here’s the abstract:


This talk will be one part “Oh look what we can do when we have a Python API for converting code into equations and solving them” and one part “Here’s why the world falls apart when we try to attack every problem in this way”.

One popular method of automated reasoning in the past few years has been to build equational representations of code paths and then use an SMT solver to resolve queries about their semantics. In this talk we will look at a number of problems that seem amenable to this type of analysis, including finding ROP gadgets, discovering variable ranges, searching for bugs resulting from arithmetic flaws, filtering valid paths, generating program inputs to trigger code and so on.

At their core many of these problems appear similar when looked at down the barrel of an SMT solver. On closer examination certain quirks divide them into those which are perfectly suited to such an approach and those that have to be beaten into submission, often with only a certain subset of the problem being solvable. Our goal will be to discover what problem attributes place them in each class by walking through implemented solutions for many of the tasks. Along the way the capabilities and limitations of the modern crop of SMT solvers will become apparent. We will conclude by mentioning some other techniques from static analysis that can be used alongside an SMT solver to complement its capabilities and alleviate some of the difficulties encountered.

The schedule is full of talks that look like fun. I’m really looking forward to seeing a few in particular, especially those by Silvio Cesare, Ben Nagy and kuza55. Looks like it’ll be just as entertaining as REcon (with hopefully not quite as much sunburn)! Mostly I’m just looking forward to watching 30 people get on stage and try to outdo each other with sheep related innuendo. If there isn’t at least one drunken presenter abusing the crowd I’m calling it a failure!

Determining variable ranges (Part I)

This post is the first of a two-part discussion in which we’ll tackle the title problem with some scripts on top of Immunity Debugger 2.0. To begin, let’s pose the question as follows: ‘Given an output register or memory location v, of width w, and a set of m input registers/memory locations (x_0, w_0), …, (x_m-1, w_m-1), what are the possible values for v after the execution of a sequence of n instructions i_0, i_1, …, i_n-1?’. Let’s call this sequence p; it represents a concrete path through a program. Before we continue it may be worth your while to think about some solutions to this problem given whatever tools and techniques you think appropriate.

The most primitive solution is probably to just run the code and iterate over all possible combinations of input variables. This requires exactly 2^w_0 * … * 2^w_m-1 runs, giving a running time exponential in the combined width of the input variables. Presuming we only want to run our variable range analysis on a specific sub-path, this solution also needs a way to snapshot a process and re-run the sub-path multiple times or, if feasible, re-run the entire path and enable the analysis for the required sub-path.

Let’s now presume we have at our disposal a method of converting p into a formula F(p), interpretable by an SMT solver, that gives the precise semantics of p. Given what we know about SMT solvers one obvious solution to the above problem immediately presents itself. By conjoining F(p) with a formula expressing v = x, for each x in the range 0 to 2^w-1, we can determine the full range of possible values for v with 2^w queries of our solver: those x values resulting in a satisfiable formula are achievable, and the satisfying assignments give the inputs that produce each of them.
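Written out with the Z3 Python bindings, purely as a sketch in which F(p) and v are placeholders, the naive approach looks something like this:

# Sketch only: 'path_formula' stands in for F(p) and 'v' for the 32-bit output
# variable; both would come from the translation of the instruction sequence.
from z3 import BitVec, BitVecVal, Solver, sat

v = BitVec('v', 32)
path_formula = []                 # placeholder for the constraints encoding p

s = Solver()
s.add(path_formula)
feasible = []
for x in range(2**32):            # impractical as written; shown to fix ideas
    s.push()
    s.add(v == BitVecVal(x, 32))
    if s.check() == sat:
        feasible.append(x)
    s.pop()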

How long will this many queries take to run? Naturally this will be a function of n and the complexity of the instructions in p (due to the relative solving costs for linear vs non-linear arithmetic and so on), but let’s choose an instruction sequence and do some testing. The following instruction sequence was chosen as it is short enough to serve as a good example but arithmetically complex enough that the potential range of values for ESI at 0x76792186 requires some manual labour to figure out.

Our tests will be for the range 0x767920c6:0x76792186

The input variables to this sequence are EDX and ECX and the output variable we are interested in is ESI (although we could of course look for results for any variable written to during this sequence). Now it’s time to start thinking about our algorithm a little more. For certain output values over this instruction sequence our solving code can hit upwards of 6000 queries a minute but doing this 2^32 times is still going to take a prohibitively long time to finish. Not to mention the fact that we’ll want to run this solution on much longer instruction sequences in the future.

The approach I decided to take here was to try to divide the possible output values into ranges and try to weed out entire ranges of unsatisfiable output values at once by constructing queries like v >= LOWER_BOUND AND v <= UPPER_BOUND. If such a formula is satisfiable then we can conclude that within the range LOWER_BOUND:UPPER_BOUND is at least one output value that can be achieved given the correct input.
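Purely as an illustration of the shape of these queries (the script uses its own solver wrapper rather than the Z3 bindings shown here), such a range query helper might look like:

# Illustration only: unsigned bounds on the solver variable for the output register.
from z3 import BitVecVal, UGE, ULE, And

def create_range_query(lower, output_register, upper):
    return And(UGE(output_register, BitVecVal(lower, 32)),
               ULE(output_register, BitVecVal(upper, 32)))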

Effectively this results in a binary search of the output value space. In the best case scenario the largest possible ranges are proven to be unsatisfiable, or in other words no possible output value lies within them. For example, if our range is 0:2**32-1 and we split it in two to query the ranges 0:(2**32-1)/2 and (2**32-1)/2+1:2**32-1, and discover that the first range gives an unsatisfiable result, then we have immediately removed 2**32/2 potential output values that we may otherwise have had to investigate. The algorithm is essentially:


lower = 0
upper = 2**32 - 1
bucket_size = 32768     # granularity of the coarse search
sat_ranges = []         # ranges containing at least one reachable output value

ranges = [(lower, upper)]
while len(ranges) > 0:
    lower, upper = ranges.pop()
    # build the query v >= lower AND v <= upper for the output register
    query = create_range_query(lower, output_register, upper)
    if solver.isSatisfiable(query):
        diff = upper - lower
        if diff <= bucket_size:
            sat_ranges.append((lower, upper))
        else:
            half = diff // 2
            ranges.append((lower, lower + half))
            ranges.append((lower + half + 1, upper))

Once this code has run, the list sat_ranges contains our candidate ranges. In each of these ranges there is at least one valid output value that could potentially end up in v given the correct input values. The larger we set bucket_size, the fewer queries we require in this first phase to discover our candidate ranges, but the more coarse-grained our initial results are. The following screenshot shows an analysis of our earlier code using a bucket size of 32768.

Coarse analysis searching for candidate ranges

1159 queries were required to discover 256 candidate ranges, with a total run time of 456 seconds. While these results are still far from giving us exact ranges for v (8388352 queries away in fact, which is still far too many to be practical), a definite pattern has emerged in the output values. This pattern is of course an artefact of the logical instructions found in the instruction sequence, and we can see quite clearly in the above screenshot that all our valid ranges are of the form 0xY4X00000:0xY4X07FFF for all X, Y in [1,..,9]. A quick glance over the instructions involved in the code sequence is enough to indicate that if we can resolve the valid indexes into any of these ranges then the same indexes will be valid for the other ranges.

The script varbounds.py also allows us to specify the initial lower and upper bounds for the ranges so we can use that to investigate a particular range.

Here we check the 0x400000:0x407fff range

Now we’re starting to get somewhere! By setting the bucket size to 256 and querying the range 0x400000:0x407fff we’ve managed to reduce our potential valid results to living within 16 different buckets of size 256. All this with only 37 queries to the solver and a run time of just over a second. The reason for so few queries is essentially that large ranges were proved unsatisfiable meaning we didn’t have to split/investigate their sub-ranges. In order to discover the exact values that are possible within these ranges we need to take one more step; instead of querying ranges we need to query for concrete values within these ranges as initially proposed. The difference now is that we have reduced our untenable 2**32 queries to a mere 4000 or so.

(In the following screenshot the -p option tells the script to use precise queries instead of ranges. The ranges under ** Valid values for ESI ** were constructed by querying the individual values within these ranges and then merging the results after the fact to make them easier to read.)

Discovering the exact values that are possible within the range 0x400000:0x407fff

In the end you can see that we have proven there are exactly 256 possible valid values from that initial large range of size 32768. Precise results required 4133 queries which only took 87 seconds to complete. At this point we can extrapolate our results back to the other ranges and come up with a pattern we know any result in ESI will match. It is 0xY4X4Zc0 to 0xY4X4Zcf for all X, Y, Z in [0,..,9].

A quick test on another range confirms this pattern.

Confirming our byte pattern holds for the other ranges

Improving the above solution with abstract interpretation

So for the above test case the results are pretty cool. It is of course easy to think of examples where blindly applying this technique will result in failure to reduce the initial problem of a large potential output space. Consider any sequence of instructions built purely from arithmetic instructions like add, mul, div etc. In this case the binary search would simply split the full output range into a sequence of bucket sized ranges without actually removing any of them as all values are possible given at least one free input variable.

Another pretty obvious point is that by ignoring the semantics of logical operators we are effectively ignoring an even simpler way to perform the initial filtering. In our solution we blindly convert the initial instruction sequence to equations in our SMT solver and then entirely ignore the original instructions for the remainder of our task. Much like the above case we are losing valuable information in doing so. Take the instruction and eax, 0xffff. If such an instruction appears right before the value in eax is moved into our output register then an analysis with the ability to include information on logical operator semantics could pre-limit its search to values in the range 0:0xffff.

A very simple application of abstract interpretation could drastically reduce the number of cases where we need to go to a solver at all, by maintaining information on the state of each bit for each register or memory location in play. With an abstract domain consisting of {MUST_BE_SET, CANT_BE_SET, MIGHT_BE_SET} and abstract versions of and, or, xor etc. we could easily propagate the required information, e.g. MUST_BE_SET ^ CANT_BE_SET => CANT_BE_SET, MUST_BE_SET ^ MUST_BE_SET => MUST_BE_SET, MUST_BE_SET ^ MIGHT_BE_SET => MIGHT_BE_SET and so on. I haven’t thought this one through entirely and I’m sure there’s a nicer way to present the theory, but merging information gained from abstract interpretation into solutions with SMT solving is, in my opinion, one of the most powerful approaches to program analysis at the moment.
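A toy version of that per-bit domain might look like the following; the representation and helper names are invented purely for illustration.

# Toy sketch of the abstract domain: one state per bit, LSB first.
MUST_BE_SET, CANT_BE_SET, MIGHT_BE_SET = 'must', 'cant', 'might'

def abstract_and_bit(a, b):
    if a == CANT_BE_SET or b == CANT_BE_SET:
        return CANT_BE_SET            # and-ing with a known-zero bit gives zero
    if a == MUST_BE_SET and b == MUST_BE_SET:
        return MUST_BE_SET            # both bits known-one gives one
    return MIGHT_BE_SET               # otherwise we cannot tell

def abstract_and(val_bits, mask_bits):
    return [abstract_and_bit(a, b) for a, b in zip(val_bits, mask_bits)]

# e.g. for 'and eax, 0xffff' bits 16..31 of the mask are CANT_BE_SET, so
# afterwards bits 16..31 of eax are CANT_BE_SET and a later range search can
# be limited to 0:0xffff before any solver query is made.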

A third instruction category worth considering is conditional jumps. By their nature every conditional jump imposes constraints on the path in which it occurs: if it is taken the positive constraint holds, otherwise its negation holds. Once again it is obvious that by ignoring these constraints at a higher level than SMT equations we risk pointless queries to the solver, but I haven’t considered the problem enough to come up with a good lightweight abstract analysis. The emphasis here is on ‘lightweight’: if the solution is more expensive than running the solver then it isn’t worth it.

Scalability

One interesting question posed to me about this script is ‘How does it compare to simply running the code and iterating over the possible input values?’. The honest answer is that I have no metrics on that approach to compare it with. What I would say however is that practically implementing that kind of solution for longer program paths involving memory accesses is likely to be no walk in the park either. As mentioned earlier, each run in such a solution is going to have to restore a ‘snapshot’ of some kind and will require exactly 2^w_0 * … * 2^w_m-1 iterations. The question then becomes ‘How does the solver’s running time scale with the number of input variables and their width?’. Again my answer is really the same, in that I haven’t run enough experiments to come to any valid conclusions yet. I would imagine that in the worst case the scaling is just as bad given the nature of the problem, but that the average case is far better. Gathering some decent stats on this is actually my next intended project when I get some free time.

Conclusion

1. If you have a specific question you need answering then dealing with annoying instruction sequences doesn’t have to be a nightmare. Solvers. Use them.
2. Solvers. Don’t use them for everything. There’s a world of other awesome program analysis techniques and tools out there that in many situations complement theorem provers and decision procedures very nicely, and in other cases can give far superior results. Also, as in the above situation, there’s always room for a healthy amount of human interaction and guidance.

Applying Taint Analysis and Theorem Proving to Exploit Development

Last week I spoke at REcon on the above topic. The slides can be found here. The abstract for this talk is as follows:

As reverse engineers and exploit writers we spend much of our time trying to illuminate the relationships between input data, executed paths and the values we see in memory/registers at a later point. This work can often be tedious, especially in the presence of extensive arithmetic/logical modification of input data and complex conditions.

Using recent (and not so recent) advances in run-time instrumentation we can go a long way towards automating the process of tracking input data and its effects on execution. In this talk we will discuss possible approaches to solving this problem through taint analysis. The solutions we will discuss are useful in many scenarios e.g. determining the set of conditional jumps under our control, discovering buffers in memory that are useful for injecting shellcode, tracking parameters to potentially insecure function calls, discovering ‘bad bytes’ for exploits and so on.

Building on this we will delve into the construction of logical formulae expressing the relationships between input and data in memory and ways in which these formulae can be manipulated and solved for interesting results. Depending on how we manipulate the initial formulae we can use theorem provers to automatically solve many problems e.g. ‘unraveling’ arithmetic/logical modifications on input, generating inputs that trigger specific paths, discovering the bounds on given variables and so forth.

I will update this post whenever the REcon videos etc go online. In the meantime I should probably finish at least one of the 5 or so blog posts in a semi-complete state 😛