Moving location!

A few months back I started Persistence Labs with the goal of developing better tools for bug discovery, reverse engineering and exploit development. I’ve also moved my blog over to that domain and the new RSS feed is here.

Anyway, that's about it really =) I'll be making any future blog posts over there, starting with the release of a new research paper by Agustin Gianni and me titled Augmenting Vulnerability Analysis of Binary Code, published at ACSAC this year. It describes an approach to attack surface identification and code prioritisation during vulnerability auditing. Go check it out!

SMT Solvers for Software Security (USENIX WOOT’12)

At WOOT'12 a paper co-written by Julien Vanegue, Rolf Rolles and me will be presented under the title "SMT Solvers for Software Security". An up-to-date version can be found in the Articles/Presentation section of this site.

In short, the message of this paper is “SMT solvers are well capable of handling decision problems from security properties. However, specific problem domains usually require domain specific modeling approaches. Important limitations, challenges, and research opportunities remain in developing appropriate models for the three areas we discuss – vulnerability discovery, exploit development, and bypassing of copy protection”. The motivation for writing this paper is to discuss these limitations, why they exist, and hopefully encourage more work on the modeling and constraint generation sides of these problems.

A quick review of the publication lists from major academic conferences focused on software security will show a massive number of papers discussing solutions based on SMT technology. There are good reasons for this: 1) SMT-backed approaches such as symbolic/concolic execution have proved to be powerful tools on certain problems, and 2) there are an increasing number of freely available frameworks.

The primary domain where SMT solvers have shone, in my opinion, is in the discovery of bugs related to unsafe integer arithmetic using symbolic/concolic execution. There's a fairly obvious reason why this is the case: the quantifier-free, fixed-size bitvector logic supported by SMT solvers provides direct support for the precise representation of arithmetic at the assembly level. In other words, one does not have to do an excessive amount of work when modeling the semantics of a program to produce a representation suitable for the detection of unsafe arithmetic. It suffices to perform a near direct translation from the executed instructions to the primitives provided by SMT solvers.
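To make that concrete, here is a minimal sketch using Z3's Python bindings (purely an illustration of the modeling, not tooling from the paper) that asks whether a 32-bit addition taken from a length check can wrap around:

from z3 import BitVec, BitVecVal, ULT, Solver, sat

# hypothetical attacker-controlled length, e.g. read from a file header
length = BitVec('length', 32)

# model "length + 0x10" exactly as the CPU does it: a 32-bit wrap-around add
total = length + BitVecVal(0x10, 32)

s = Solver()
# the add is unsafe if the result is unsigned-less-than an operand, i.e. it wrapped
s.add(ULT(total, length))

if s.check() == sat:
    print("overflow possible, e.g. length =", s.model()[length])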

The exploit generation part of the paper deals with what happens when one takes the technology for solving the above problem and applies it to a new problem domain. In particular, a new domain in which the model produced simply by tracking transformations and constraints on input data no longer contains enough data to inform a solution. For example, in the case of exploit generation, models that do not account for things like the relationship between user input and memory layout. Obviously enough, when reasoning about a formula produced from such a model a solver cannot account for information not present. Thus, no amount of computational capacity or solver improvement can produce an effective solution.

SMT solvers are powerful tools and symbolic/concolic execution can be an effective technique. However, one thing I’ve learned over the past few years is that they don’t remove the obligation and effort required to accurately model the problem you’re trying to solve. You can throw generic symbolic execution frameworks at a problem but if you’re interested in anything more complex than low level arithmetic relationships you’ve got work to do!

Better Interpreter Fuzzing with Clang

Last weekend I decided to finally put some effort into investigating Clang, the C/C++/Objective-C frontend for LLVM. Clang is interesting as it is not only designed to provide efficient parsing and processing during compilation, but also as a platform for program analysis tools. Other options for static analysis of C/C++ exist e.g. GCC extensions, cscope hackery, and the various things built on top of CIL (e.g. Frama-C). However, Clang is attractive to me as it supports both C and C++, has a very permissive license, includes a well designed, extensive and documented API and is itself written in C++ (which might be a negative point depending on your view of the world but I like it :P). It also appears to have the backing of both Apple and Google, with a number of their engineers contributing regularly to both LLVM and Clang. All in all, it feels like a maturing project with a skilled and active community behind it.

Clang supports three approaches to writing analysis tools. Using libclang is the encouraged method for most tasks and is probably what you’re looking for if all you want to do is iterate over ASTs and pull out data. The next option is to write a plugin for Clang to be run during the build phase of a project. This is slightly more involved and doesn’t have the benefit of allowing you to use a Python/whatever wrapper around the API. However, if you want to do things like automated source rewriting, or more involved static analysis as part of a build, then this is where you want to be. The final approach is using libtooling. I haven’t looked into writing anything based on this yet but it appears to offer the benefits of writing a standalone tool provided by libclang with the power of full AST control offered by a plugin. The only potential downsides, in comparison to libclang, that I can see are that you cannot write your code in a higher level language and it does not have access to as convenient an API.

After a bit of digging around in documentation and source, I think libclang offers the most gentle introduction to Clang as a program analysis platform. It’s not as powerful as the other two methods but has a stable API (check out include/clang-c/Index.h for the C interface and bindings/python/clang/cindex.py for the Python bindings) and supports a nice visitor based approach for writing tools that traverse an AST. From reading over the above code, and following Eli Bendersky’s introduction to using the Python bindings, you should be able to get up and running fairly easily. As a side note, libclang has progressed somewhat since that blog post so the limitations related to missing cursor kinds are now largely alleviated. There are still some missing features but we’ll get to that later.
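To give a flavour of the bindings, here's a minimal sketch (the file name and include path are made up, and you need the clang Python package pointing at a matching libclang) that just prints every function declaration in a translation unit:

import clang.cindex as clang

index = clang.Index.create()
tu = index.parse('test.c', args=['-I/usr/include'])  # hypothetical file and arguments

def visit(cursor, depth=0):
    # recursively walk the AST, reporting function declarations as we go
    if cursor.kind == clang.CursorKind.FUNCTION_DECL:
        print('  ' * depth + cursor.spelling, cursor.location)
    for child in cursor.get_children():
        visit(child, depth + 1)

visit(tu.cursor)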

A relevant question about now is probably “What the hell has all of that got to do with interpreter fuzzing?”. In general, not a whole lot =) However, when trying to think of some small problem I could tackle to get a feel for libclang I recalled an issue I previously encountered when fuzzing PHP. When fuzzing an interpreter there are typically two high level stages that are somewhat unique to interpreter fuzzing; 1) Discover the functions you can call and 2) Discover how to correctly call them e.g. the number and types of their arguments.

In this regard, PHP is a slightly easier target than say, Python or Ruby, because on the first point it has a lot of global functions that can be called without instantiating a class or figuring out what module to import. These functions can be easily discovered using get_defined_functions. Using this approach it is simple to enumerate a fairly large attack surface that calls directly into the C backend (somewhere upwards of 500 functions in a default configuration) and get fuzzing.

Point 2 remains a problem however. Each PHP function implemented in the C backend takes a number of typed parameters. Some type conversion can and does take place, which means you can specify a boolean when an int is required or other such things. Naturally, there are some types that are completely incompatible, e.g. a function callback is required and you provide an integer type. If you get this wrong the function will bail out without performing any further processing. As well as incorrect argument types, specifying an incorrect number of arguments will also often result in the function execution terminating before anything interesting happens. Precise implementation details can be found in the zend_parse_va_args and zend_parse_arg_impl functions of Zend/zend_API.c.

When fuzzing PHP we are then left with a bit of a conundrum. We can easily discover a substantial attack surface but we still need to figure out the types and number of arguments that each function expects. One solution is to just play a numbers game. Fuzzing an interpreter is fast… really fast. It is also easily distributed and instrumented. So, if we just guess at the types and parameters we’ll probably be correct enough of the time [1]. This has a slight issue, in that there is going to be a bias towards functions that take a smaller number of arguments for purely probabilistic reasons. An improvement is to just parse the error messages from PHP which give an indication as to why the function bailed out e.g. incorrect types or incorrect number of arguments. This is all kind of ugly however and it would be nice if we could just know up front the correct argument types and the exact number required.
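For what it's worth, the error message parsing can be as simple as the following sketch (the exact wording of PHP's warnings differs between versions, so treat the regex as an assumption to be tuned):

import re

# e.g. "Warning: levenshtein() expects at least 2 parameters, 1 given"
ARG_COUNT_RE = re.compile(r'expects (?:exactly|at least|at most) (\d+) parameters?')

def expected_arg_count(php_output):
    # return the argument count hinted at by the interpreter warning, or None
    m = ARG_COUNT_RE.search(php_output)
    return int(m.group(1)) if m else None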

Enter Clang! An efficient solution for this problem is simply to parse every C file included in the PHP build, find all calls to zend_parse_parameters, extract the format string that specifies the arguments and their types and then relate this information back to the function names discovered via get_defined_functions. It's a fairly straightforward task on the PHP code base and could probably be achieved via regular expressions and some scripting. That said, using libclang we can come up with a much tidier and less fragile solution. Before I delve into the solution I should mention that I've uploaded the code to GitHub under the name InterParser, minus the fuzzer, should you want to try it out for yourself, extend it or whatever.

I’ll explain the code using questions for which the answers were not immediately obvious from the Clang documentation or are things that were interesting/surprising to me.

How do I run my libclang based tool on all files in a build?

The solution to this is also a solution to the next problem so I’ll deal with it there!

How do I know the correct compiler arguments to use when parsing a file?

The clang_parseTranslationUnit function in Index.h allows you to specify the arguments that would be passed to the compiler when parsing a particular source file. In some cases it is necessary to specify these arguments correctly or the code that is parsed will not accurately reflect the code that is included when you run make. For example, if a chunk of code is enclosed within #if BLA / #endif directives and BLA is supposed to be defined, or not, via the compiler arguments, then unless you provide the same argument when creating the translation unit the code will not be parsed.

This problem, and the previous one, could easily be solved if we could tell 1) every file processed when make is run and 2) the command line arguments passed to the compiler on each invocation. After browsing the Clang documentation for a solution a friend of mine suggested the obvious answer of just wrapping the compiler in a script that logs the relevant information during the build. The script creplace.py provides this functionality. By pointing the CC and CXX environment variables at it you should get a log of all files processed by the compiler when make is run, as well as the arguments passed.
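The real creplace.py lives in the InterParser repository; the idea boiled down to its essentials looks something like this sketch (the log format and environment variables here are made up for illustration):

#!/usr/bin/env python
import json, os, subprocess, sys

LOG = os.environ.get('CC_LOG', '/tmp/cc_invocations.log')
REAL_CC = os.environ.get('REAL_CC', 'gcc')

# record the source files being compiled and the full argument list used
entry = {'args': sys.argv[1:],
         'files': [a for a in sys.argv[1:] if a.endswith(('.c', '.cc', '.cpp'))]}
with open(LOG, 'a') as log:
    log.write(json.dumps(entry) + '\n')

# then run the real compiler so the build continues as normal
sys.exit(subprocess.call([REAL_CC] + sys.argv[1:]))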

The ccargparse.py module provides the load_project_data function which you can then use to build a Python dictionary of the above information. If there’s an inbuilt Clang API to do all this I didn’t find it [2], but if not you should be able to just drop in my code and work from there! See the main function of parse_php.py for exact usage details.

Once the compiler arguments have been extracted and converted to a list they can just be provided to the index.parse method when constructing the translation unit for a given file, e.g.:

comp_args = load_project_data(cc_file)  # maps each source file to its compiler args
args = comp_args[file_path]
index = clang.Index.create()
tu = index.parse(file_path, args=args)

How do I find all functions within a file?

Our first task is to find all calls to zend_parse_parameters and note the name of the top level backend API function so that we can later relate this to a user-space PHP function call. To do this we first need to find all function declarations within a translation unit for a source file, e.g. the process_all_functions method:

for c in tu.cursor.get_children():
    # we only care about function declarations
    if c.kind == clang.CursorKind.FUNCTION_DECL:
        f = c.location.file
        # skip declarations that were pulled in from #include'd files
        if file_filter and f.name != file_filter:
            continue
        fmt_str = process_function(c)

This is a fairly simple task: we iterate over all children of the top level translation unit for the given source file and compare the cursor kind against FUNCTION_DECL, for a function declaration. It's worth noting that the children of a translation unit can include various things from files that are #include'd. As a result, you may want to check that the file name associated with the location for the child is the name of the file passed to index.parse.

How do I find all calls to a particular function?

For each function declaration we can then iterate over all of its children and search for those with the kind CALL_EXPR. This is performed in the process_function method. Firstly, we need to check if the function being called has the correct name. The first child of a CALL_EXPR AST node will represent the function itself, with the rest representing the arguments to this function.

The last statement isn’t quite correct. The C and C++ standards specify a number of conversions that can take place on a function call, so the direct children of a CALL_EXPR AST node will represent these conversions and then the direct child of these conversion nodes will contain the details we require. Clang does have a mode to dump the AST for a given source file so if you’re curious what the AST actually looks like try passing -ast-dump or -ast-dump-xml as parameters to the clang compiler.

As an example, on a function call we encounter a CALL_EXPR node and its first child represents the decay of the function to a pointer as specified by the C/C++ standards. This node will have the kind UNEXPOSED_EXPR. It will have a single child of type DECL_REF_EXPR from which we can retrieve the function name via the displayname attribute. The following code in process_function performs the function name extraction and comparison.

if n.kind == clang.CursorKind.CALL_EXPR:
    # the first child represents the function being called (via an UNEXPOSED_EXPR),
    # the remaining children represent its arguments
    unexposed_exprs = list(n.get_children())
    func_name_node = get_child(unexposed_exprs[0], 0)
    if func_name_node.displayname == "zend_parse_parameters":
        ...
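(get_child is just a small helper from the InterParser code; a reconstruction of it would be something like the following.)

def get_child(node, idx):
    # return the idx-th direct child of an AST node
    return list(node.get_children())[idx]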

How do I extract the parameters to a function call?

Once we know we have the correct function call we then need to extract the format string parameter that provides the type specification for the user-space API call. As with the node representing the function, arguments may also undergo conversions. The second argument to zend_parse_parameters is the one we are interested in, and it is usually specified as a string literal. However, a string literal passed as an argument undergoes array-to-pointer decay. To account for this, we again go through an UNEXPOSED_EXPR node representing the conversion to access the node for the string literal itself.

Things get a little inconvenient at this point. Once we have the node representing the string literal we ideally want some way to access the actual value specified in the source code. It turns out that this functionality hasn’t been added to the Python bindings around libclang yet. Fortunately, a guy by the name of Gregory Szorc has been working on a branch of libclang and the associated Python bindings that adds the get_tokens method to AST nodes. His code is great but also contains a bug when get_tokens is called on an empty string literal. Until that gets fixed I’ve mirrored his cindex.py and enumerations.py files. Drop both into your clang/bindings/python/clang directory and you should be good to go!

The actual code to extract the string literal given the node representing it is then straightforward.

if tkn_container.kind != clang.CursorKind.STRING_LITERAL:
    # in rare cases the format string is passed via a variable; skip those
    raise VariableArgumentError()

tokens = list(tkn_container.get_tokens())
if tokens[0] is None:
    # empty string literal, i.e. a function taking no arguments
    return ""

The check on the kind of the node is required as in a minuscule number of cases the format string parameter is passed via a variable rather than directly specified. This situation is so rare that it isn't worth handling. The only other slightly odd thing is that when get_tokens is called on an AST node representing a string literal, like the first parameter in func("xyz", a), it will return a generator for two elements. The first will be the string literal itself, including the double quotes, while the second will represent the comma separator. I'm not sure if this is a bug or not but it's not a big deal. If the string literal is empty then the generator will return None.
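Pulling the actual type specification out of the token is then just a matter of stripping the quotes, roughly as follows (a sketch rather than the exact InterParser code):

# tokens[0] is the literal including its double quotes, e.g. '"ss|l"'
spec = tokens[0].spelling.strip('"')
return spec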

Conclusion

At this point we can associate the functions calling zend_parse_parameters with the format string passed to that call, and thus the types and number of arguments that the function expects. In a fairly small amount of Python we have a robust solution to problem 2, discussed earlier. A question you may have is "How do you then associate the function name/argument specification from the C code with the corresponding user-space PHP function?". Well, PHP is nice in that it doesn't do a massive amount of name mangling: if we have a user-space function called func_one then the backend C function will consistently be called zif_func_one (in the code it will probably look like PHP_FUNCTION(func_one) but Clang will expand the macro for you).

When parse_php.py terminates it will output something like the following:

INFO:main:API info for 518 functions in 68 files written to php_func_args.txt

Inspecting the output file we’ll see function to type specification mappings:

# /Users/sean/Documents/git/php-src/ext/standard/levenshtein.c
zif_levenshtein ss sss sslll
# /Users/sean/Documents/git/php-src/main/main.c
zif_set_time_limit l
# /Users/sean/Documents/git/php-src/ext/standard/versioning.c
zif_version_compare ss|s
...

This tells us that a user-space PHP script can correctly call levenshtein with two strings, three strings or two strings and three longs. It can call set_time_limit with a single long only. version_compare can be called with two or three strings and so on and so forth. So, with this information at hand we can construct a PHP fuzzer that no longer has to guess at the number of arguments to a function or their types.
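To give a rough idea of how this output can drive a fuzzer, the sketch below maps a few of the Zend format characters to random PHP argument generators and emits call statements for a PHP script. The character-to-type mapping is deliberately simplified (see zend_parse_arg_impl for the full set) and the value generators are placeholders:

import random

def rand_string():
    chars = ''.join(random.choice('ABC123%/') for _ in range(random.randint(0, 64)))
    return "'" + chars + "'"

GENERATORS = {
    's': rand_string,                                  # string
    'l': lambda: str(random.randint(-2**31, 2**31)),   # long
    'd': lambda: str(random.uniform(-1e9, 1e9)),       # double
    'b': lambda: random.choice(['true', 'false']),     # boolean
}

def build_call(c_name, spec):
    php_name = c_name[len('zif_'):]           # e.g. zif_levenshtein -> levenshtein
    required, _, optional = spec.partition('|')
    args = [GENERATORS[c]() for c in required if c in GENERATORS]
    for c in optional:
        if c in GENERATORS and random.random() < 0.5:
            args.append(GENERATORS[c]())
    return '%s(%s);' % (php_name, ', '.join(args))

print(build_call('zif_version_compare', 'ss|s'))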

Extensions

The most obvious extension of this code that comes to mind is in pulling out numeric values and string literals from comparison operators. For example, say we’re analyzing a function with the following code:

if (x == 0xc0)
    ...
else
    ...

It might be an idea to pull out the constant 0xc0 and add it to a list of possible values for numeric input types when fuzzing the function in question. Naturally, you’ll end up with extra constants that are not compared with the function inputs but I would guess that this approach would lead to some gain in code coverage during fuzzing.
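A sketch of how that might look with the same Python bindings: walk the AST, and for every binary operator record any integer literal operands via their tokens (this grabs constants from all binary operators, not just comparisons, but that's fine when all you want is to seed a fuzzer's value pool):

import clang.cindex as clang

def collect_constants(cursor, found=None):
    # gather the spellings of integer literals appearing under binary operators
    if found is None:
        found = set()
    if cursor.kind == clang.CursorKind.BINARY_OPERATOR:
        for child in cursor.get_children():
            if child.kind == clang.CursorKind.INTEGER_LITERAL:
                tokens = list(child.get_tokens())
                if tokens:
                    found.add(tokens[0].spelling)   # e.g. '0xc0'
    for child in cursor.get_children():
        collect_constants(child, found)
    return found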

Resources

Introduction to writing Clang tools
Eli Bendersky’s blog post on parsing C with libclang
Unofficial Clang tutorials

[1] This is what I did when I first fuzzed PHP a long time ago. You hit a lot of incorrect type/argument count errors but you can get so many executions per second that in general it works out just fine =)

[2] Two days after I wrote this it seems that support for the compilation database functionality was added to the Python bindings. You can read about this feature here. Using this feature is probably a more portable way of gathering compiler arguments and pumping them into whatever APIs require them.

Anatomy of a Symbolic Emulator, Part 3: Processing Symbolic Data & Generating New Inputs

In this final video in the series we go over how to generate new inputs for a program once we detect a user-influenced conditional branch. At the end there is also an example of the type of condition/resulting formula that we get from working on a real file parser, in this case libwebp.

(You probably want to click the “Watch on YouTube” option on the bottom right of the video and set the quality to 720p)

Conclusion
This type of emulation, input generation and formula checking does not need to be limited to conditional jumps. As I discussed in a previous post you can use a similar approach to discover variable ranges, check for variable relationships and assist in figuring out complex algorithms. For example, one could generate a query to the solver every time an argument to malloc is found to be influenced by the user, or export a list of all functions that operate on user-influenced data to IDA for manual review. (In fact, a light-weight version of this approach in combination with fuzzing and an IDA importer is possibly more generally useful to an individual auditor than going down the route of full on whitebox fuzzing. More on that later =))

Anyway, I hope these videos provide some insight into how a whitebox fuzzer might work as well as one approach to building a symbolic emulator. To give an idea of the effort involved – the combined whitebox fuzzing, trace parsing and emulation code (along with supporting libraries) comes to around 10,000 lines of Python. Of this, the emulator itself is only 3000 lines or so. The PIN tracer is just under 1000 lines of C++.

Tracing is currently fairly unoptimised and parsing something like a video or image while tracing can result in a factor of 10-100 increase in running time. This usually means a wait of 30 seconds, which isn't too bad for whitebox fuzzing as tracing is not performed too often, but for other uses of a symbolic emulator (like tracing while fuzzing normally) this will require some work. The emulator itself is Python based and as such is not lightning fast. In the default run-mode it emulates ~5000 instructions per second. What this translates to is about 30-40 minutes per trace of an average file parser. This isn't as bad as you might think however, as the test cases generated tend to be much more effective at hitting new code than what you would get from dumb fuzzing. Despite this we still need performance gains and I'm working on a few different solutions for that. Somewhere around 30,000+ instructions per second would be what I would consider approaching acceptable =)

To preempt the inevitable questions – for now JESTER is not publicly available but that may change in the future. It’s very much a research prototype at the moment where we’re testing out several approaches to improving performance and general usefulness. However, if you are interested in working on this type of research post a comment with a contact address (it won’t appear publicly) as I’m fairly sure we are currently hiring.

Anatomy of a Symbolic Emulator, Part 2: Introducing Symbolic Data

In the previous post I discussed one way to go about gathering a trace for emulation. In this one I'm going to talk about how we go about emulating such a trace, how and why we hook functions as they are emulated, and how symbolic operations are performed.

As before, this post is accompanied by a video which demonstrates the code in action. Unlike the previous post I’ve decided to skip the paragraphs of rambling and instead most of the info is in the actual video itself =)

Topics covered:
– Introducing symbolic data via function hooks
– Performing computations on symbolic data

(You probably want to click the “Watch on YouTube” option on the bottom right of the video and set the quality to 720p. Btw, near the end of the video I said something along the lines of “one of the advantages of whitebox fuzzing over symbolic emulation”. That makes no sense =) What I meant to say was “one of the advantages of whitebox fuzzing over normal symbolic execution”.)

Anatomy of a Symbolic Emulator, Part 1: Trace Generation

A couple of months ago there was an ACM article on the SAGE whitebox fuzzing system from Microsoft Research. SAGE is one of the most interesting products of research on automated program testing in recent years and, according to Microsoft, has been used to find a massive amount of bugs in their various file parsers.

At its core, SAGE contains a symbolic emulator for executing instruction traces over symbolic data. As well as whitebox fuzzing, symbolic emulators are fairly useful things for a variety of reverse engineering, vulnerability discovery and program analysis tasks. Essentially, a symbolic emulator is a CPU emulator that not only supports operations on concrete numeric values but also on abstract values that may represent a range of concrete values.

In this series of posts I’m going to give an overview of a Python-based symbolic emulator for x86 that I’ve been working on (called JESTER) and show how it can be applied to the problem of whitebox fuzzing. Hopefully this will give an idea of how symbolic emulation works (it’s fairly simple) and also provide some insight into how systems like SAGE operate. Part 2 in this series covers how to introduce symbolic data into the system, and part 3 covers how the system can generate new inputs for the target program that take new paths.

Consider the x86 instruction add eax, ebx. Operating over concrete values an emulator will do the obvious thing of taking the value in EAX, adding it to EBX and then storing the result back in EAX. It will also update the various flags that are affected. Over symbolic values however the result is a bit more interesting. Let's assume that EAX contains the abstract value V1 which represents an unconstrained 32-bit variable, and EBX contains the concrete value 0x10. In this case the emulator will create a new abstract value V2 which represents the addition of V1 and 0x10 and store that back in EAX. Diagrammatically, we can see that EAX now contains something that is a function rather than a single value.

      v1   10
       \   /
    EAX: +    

A slightly more complex diagram shows what the Zero Flag would hold after the above instruction.

      v1   10
       \   /
         +    0
          \  /
           ==   1    0
            \   |   /
        ZF: if-then-else

I purposefully used the word 'function' because what we end up with, in registers and memory, are expression trees that map from a given set of inputs to an output. As more instructions are emulated these trees get bigger and more difficult to reason about, so people usually take the approach of exporting them to an SMT solver and querying their models that way. The obvious applications are input crafting, tracking user-influenced data and checking security properties. This is fairly well documented in previous posts and in a decade's worth of academic literature so I won't delve into the details.
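In solver terms the two trees above translate almost directly. Here is a sketch in Z3's Python bindings (purely for illustration; the emulator builds its own expression objects rather than Z3 terms during emulation) that asks for an input forcing the zero flag to be set:

from z3 import BitVec, BitVecVal, If, Solver, sat

v1 = BitVec('v1', 32)                  # the unconstrained value in EAX
eax = v1 + BitVecVal(0x10, 32)         # EAX after 'add eax, ebx' with EBX = 0x10
zf = If(eax == 0, BitVecVal(1, 1), BitVecVal(0, 1))

s = Solver()
s.add(zf == 1)                         # which input makes the add produce zero?
if s.check() == sat:
    print(s.model()[v1])               # 4294967280, i.e. 0xfffffff0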

The point of this post is instead to look at the overall architecture of a symbolic emulator with the aim of illuminating some of the components involved, more directly than is typically done in formal descriptions. I also want to give people an idea of how much or how little effort is involved in building these tools. In order to demonstrate the use of a symbolic emulator I'll apply it to the problem of whitebox fuzzing, i.e. using a symbolic emulator in combination with an SMT solver to generate inputs for a program guaranteed to force execution down a new path.

While writing this series of posts Rolf Rolles posted a great video/blog entry on the topic of input crafting using an SMT solver. Taking a leaf out of his book I’ve decided to accompany these with a video that demonstrates the tools described in operation and should ideally give some insight into their construction. The video is linked at the end but the following wall of text will give some context and might be worth glancing over. This isn’t the most entertaining of entries in the series and is mostly for completeness so if you’re bored out of your mind I accept minimal responsibility =)

1. Trace generation

An emulator needs some way to know what instructions execute and it also needs a starting memory and thread context. There are a few different approaches to getting such information. The Bitblaze/BAP research groups modified Qemu and hook in directly there, the guys working on S2E do something similar, and I previously wrote a C++ library that was used as part of a Pintool at run time. There are a couple of problems with tying your emulation directly into the run time environment of the tool however. Firstly, it's a lot more annoying to debug an extension to Qemu or PIN than a separate emulator, and secondly, it prevents you from doing the emulation on a separate machine to the tracing. The second issue is probably the more important in the long run, as really scaling whitebox fuzzing to the point where it is useful requires parallelism.

The approach I took this time around is directly inspired by the work of MSR on their Nirvana/iDNA tool, but much more simplistic. Instead of using the Pintool to do the emulation I use a lightweight one to just trace the instructions executed and other interesting events, like image loads/unloads, system calls and debugging info. If you’ve used PIN before then most of what I’m about to describe will be obvious and fairly boring so you might want to skip on to part 2 of this series of entries.

The trace format is uncompressed and unoptimised and to date I’ve not had any problems with that. A typical segment just looks as follows (L denotes an image load, I an instruction execution and C provides debugging information as discussed below):

L;4;/lib32/libc.so.6;f5c54000;157244
L;5;/lib32/libm.so.6;f5c2e000;24790
C;0;EAX:ffb292a4;EBX:f5da9ff4;ECX:53f78923;EDX:5;ESP:ffb291f8;EBP:ffb291f8 ... 
I;0;8048fc5
C;0;EAX:ffb292a4;EBX:f5da9ff4;ECX:53f78923;EDX:5;ESP:ffb291f0;EBP:ffb291f8 ... 
I;0;8048fc8
C;0;EAX:ffb292a4;EBX:f5da9ff4;ECX:53f78923;EDX:5;ESP:ffb291ec;EBP:ffb291f8 ... 

In the early stages of the project I worried about this and thought I’d have to come up with some compression method but that hasn’t been the case. Most file parsers generate traces that can be measured in 10s of millions of instructions and things of that scale easily fit in a few gigabytes of storage.

1.1 Debugging Assistance

Writing an emulator of any kind can be tedious work. It’s easy to make mistakes and get the semantics of an instruction slightly wrong or set a flag incorrectly. Initially I tried to counter this by writing unit-tests but it quickly became obvious that 1) These were never going to be exhaustive and 2) They were as likely to have mistakes as the emulator code itself. Instead, I added a debug mode to the tracer that logs the register values after each instruction (The lines starting with a “C” above). This then allows the emulator to compare its register values to the ones we know it should have and highlight any discrepancies. Tracing everything in /usr/bin/ and checking these values is a hell of a lot more exhaustive than any unit-testing I would have done! The only reason I’m mentioning this is that I’d recommend it to anyone writing something of this nature. The best tests are by far those you can extract from real binaries.
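The comparison itself is trivial. Here is a sketch of checking the emulator's registers against a 'C' line (the field layout is taken from the trace snippet above; how you read registers out of your emulator is obviously implementation-specific):

def parse_context_line(line):
    # turn 'C;0;EAX:ffb292a4;EBX:f5da9ff4;...' into {'EAX': 0xffb292a4, ...}
    fields = line.strip().split(';')[2:]
    return {name: int(val, 16)
            for name, val in (f.split(':') for f in fields if ':' in f)}

def check_registers(emu_regs, context_line, insn_addr):
    expected = parse_context_line(context_line)
    for reg, value in expected.items():
        if emu_regs.get(reg) != value:
            print('Mismatch at %x: %s is %x, trace says %x'
                  % (insn_addr, reg, emu_regs.get(reg, 0), value))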

1.2 Handling system calls

One of the disadvantages of using a user-land tracer is that you miss out on any updates to memory that happen within the kernel. The only real way to handle this correctly is to define per-system-call handlers that know which memory addresses a system call will update based on its arguments or return value. In PIN this is fairly straightforward: you register a syscall entry and exit handler, get the syscall args or return value and then log whatever you need on exit.

int main(int argc, char *argv[])
{
        ...

        PIN_AddSyscallEntryFunction(SyscallEntry, 0);

        ...
}

VOID SyscallEntry(THREADID tid, CONTEXT *ctxt, SYSCALL_STANDARD std, VOID *v)
{
        ADDRINT syscall_num = PIN_GetSyscallNumber(ctxt, std);
        ...
#ifdef LINUX
        // Handle Linux syscalls
        switch (syscall_num) {
        case SYS_READ:
        {
                // read(fd, buf, count): remember the destination buffer and its size
                // so the data the kernel writes there can be logged on syscall exit
                ADDRINT read_buf_ptr = PIN_GetSyscallArgument(ctxt, std, 1);
                size_t read_sz = PIN_GetSyscallArgument(ctxt, std, 2);

                t_data->log_addrs[read_buf_ptr] = read_sz;
                break;
        }
        ...
}

Handling each and every system call might seem like an onerous task but if you’re working on particular types of software (e.g. file parsers) then you can get away with a minimal subset e.g. open, read, lseek, mmap and a few others. My general approach is to just add them as necessary. You’ll encounter many more along the way but it turns out not a whole lot end up having any interaction with the user controlled data you’re interested in.

In the trace log format I included support for events other than those shown in the above snippet. For syscalls, as just discussed, there is the M event, which looks as follows and tells the emulator to update the given memory address with the contents of a file.

M;0;f5f97000:syscall_c0_f5f97000_1000_1

There is also the 'R' event which tells the emulator to update a register with a particular value. This is useful for instructions you can't handle for whatever reason. Other than that there isn't really anything to capturing a trace. The only thing I haven't mentioned is that on starting tracing, either at a given address or the program's entry point, you also need to log the program's memory and thread contexts at that point in order to give your emulator starting values. This is fairly straightforward though and PIN provides all the required API calls.
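Replaying those events in the emulator is then straightforward. Here is a sketch of the dispatch; write_memory and set_register are hypothetical emulator methods, the M event's file name is treated as an opaque path relative to the trace directory, and the R line layout is assumed to mirror the M one:

import os

def replay_event(emu, line, trace_dir):
    kind = line[0]
    if kind == 'M':
        # M;0;f5f97000:syscall_c0_f5f97000_1000_1 -> load file contents at address
        _, _, rest = line.strip().split(';')
        addr, fname = rest.split(':')
        with open(os.path.join(trace_dir, fname), 'rb') as f:
            emu.write_memory(int(addr, 16), f.read())
    elif kind == 'R':
        # R;<tid>;<reg>:<value> -> force a register to a concrete value
        _, _, rest = line.strip().split(';')
        reg, val = rest.split(':')
        emu.set_register(reg, int(val, 16))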

(You probably want to click the “Watch on YouTube” option on the bottom right of the video and set the quality to 720p. The tools discussed are not publicly available but that may change in the future.)

Parts 2 and 3 of this series can be found here and here.