Some Early-Stage Work on Statistical Crash Triage

Last week at Infiltrate I presented some early-stage work on crash triage under the title “Automated Root Cause Identification for Crashing Executions”. The slides can be found here, and the presentation notes contain further details in draft form.

A better title for this work would probably have been “Statistical Crash Triage”, or possibly something involving the term “predicate synthesis”, as those ideas are really at the core of what was presented. The main algorithm can be summarised fairly quickly as follows:

  1. Run the target application on all the good inputs and all the bad inputs. Record coverage information.
  2. Use the coverage information to predict what functions are likely relevant to the crash.
  3. Rewrite those functions to insert instrumentation which will record all computed values and other properties.
  4. Rerun the instrumented version of the target application on all the good and bad inputs.
  5. From the recorded state and a set of predicate templates, synthesise and evaluate predicates over the program’s variables.
  6. Apply a lightweight statistical analysis to the evaluations of these predicates to separate those relevant to the crash from those that are not (a rough sketch of these last two steps follows below).
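
The real details of the predicate templates and the statistical analysis are in the slides, but to give a flavour of steps 5 and 6, here is a deliberately tiny sketch in Python. Everything in it is an assumption made for illustration: the two templates, the scoring function (in the spirit of statistical debugging, ranking predicates by how much more often they hold in crashing runs than in passing runs) and the helper names are all hypothetical, not the implementation from the talk.

# Toy sketch of predicate synthesis and ranking; the templates, score and
# names below are illustrative assumptions only.
def synthesise_predicates(var_names):
    """Instantiate a couple of simple templates over each recorded variable."""
    preds = []
    for name in var_names:
        preds.append(("%s == 0" % name, lambda s, n=name: s[n] == 0))
        preds.append(("%s < 0" % name, lambda s, n=name: s[n] < 0))
    return preds

def score(pred, crashing_states, passing_states):
    """Fraction of crashing runs where the predicate holds, minus the
    fraction of passing runs where it holds."""
    crash_hits = sum(1 for s in crashing_states if pred(s))
    pass_hits = sum(1 for s in passing_states if pred(s))
    return (float(crash_hits) / len(crashing_states)
            - float(pass_hits) / len(passing_states))

# Each run's recorded state is a dict of variable name -> value.
crashing = [{"len": -1, "idx": 4}, {"len": -8, "idx": 2}]
passing = [{"len": 16, "idx": 4}, {"len": 32, "idx": 0}]
ranked = sorted(((score(fn, crashing, passing), desc)
                 for desc, fn in synthesise_predicates(["len", "idx"])),
                reverse=True)
print(ranked[0])  # (1.0, 'len < 0'): scores highest, i.e. likely crash-relevant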

The actual details are in the slides so I won’t go into them any further. The advantages of this approach are that it isn’t tuned towards one bug type or class of applications, and simply relies on large amounts of input data and lightweight analyses to discover ‘interesting’ things. Alongside that, the analyses at each stage are parallelisable and so it’s quite simple to scale up. On a reasonably sized target (PHP) the approach is fast enough to fit into a normal workflow.

The most significant downside is that this generality means a bug best described by a very complicated/precise predicate may not be discovered (although simpler predicates may be, which can help a user find the precise predicate manually). The work as a whole is also limited by the small evaluation so far, but this is something I’m currently in the process of addressing. In particular I’m working on testing with a wider variety of memory management issues, as those pose some unique challenges for the algorithm above.

Besides the analysis algorithm itself, I also want to quickly mention CrashCorpus*. This is a corpus I’m currently working on for the evaluation of crash triage (and related) tools. A significant issue with the academic literature is that the algorithms tend to be tested almost exclusively on toy programs or synthetic bugs. The goal of CrashCorpus is to eliminate that by providing easy-to-build, easy-to-run, real-world targets. I’ll have more info on this once I get closer to a public release.

* I’m still building towards a public release of the corpus as it needs more targets before it is truly useful, but if you’d like to contribute drop me an email (sean _at_ vertex.re). All I require is a set of crashing inputs, a set of non-crashing inputs and the source code for the target. I’ll take care of the rest.

 

Infiltrate 2011 Slides

The slides for most of the talks from this year’s Infiltrate have gone online! Among them you can find the slide deck for Attacking the WebKit Heap, which Agustin and I gave.

The talks were awesome and I’d recommend grabbing them all. Halvar’s, titled State Spaces and Exploitation, was one of my favourites. I generally believe that the reason we in industry, as well as university-based research groups, sometimes fail to build better tools is that we don’t spend enough time reflecting on the nature of exploitation and bug finding, and as a result end up solving the wrong problems. Halvar spent about half of his talk addressing one way to think about exploits: as programming a ‘weird machine’ that lives inside a program and is unlocked by a bug trigger. It’s an interesting way to look at things and, as he mentioned, similar to how many of us think of exploit development when it comes down to it.

As far as I know, we’ll only be releasing audio for one of the talks, Nico’s keynote on Strategic Surprise, which can be found here. It’s also worth checking out: educational, funny and just a little bit troll-y … what more can you ask =D

“We don’t care about nulls because this ain’t no strcpy shit” – Ryan Austin, by consensus the best quote of the conference.

Exploit Necromancy in TCMalloc – Reviving the 4-to-N Byte Overflow Primitive with Insert to FreeList[X]

A couple of months back while looking into a heap overflow in Chrome* I found myself poking around in the internals of TCMalloc. What I found there was pretty interesting. Perhaps the implementers have been watching Twitter and decided to take a proactive approach to security researchers moaning about the difficulties of modern heap exploitation. Maybe they just got caught up in making a blazing fast allocator and couldn’t bring themselves to slow it down with nasty things like integrity checks. Either way, TCMalloc provides a cosy environment for heap exploits to thrive and is worth fully exploring to discover the possibilities it provides.

At Infiltrate next weekend, Agustin Gianni and I will be discussing (among other things) TCMalloc from the point of view of exploitation. Essentially our research into TCMalloc wasn’t so much ‘research’ as it was vulnerability necromancy. In the name of speed TCMalloc has forsaken almost every sanity check you might think of. There are annoyances of course, but they are artefacts of the algorithms rather than coherent attempts to detect corruption and prevent exploitation. As a result we can revive a number of primitives from heap exploits past, as well as some unique to TCMalloc.

Like most custom allocators, TCMalloc operates by requesting large chunks of memory from the operating system via mmap, sbrk or VirtualAlloc and then uses its own methods to manage these chunks on calls to malloc, free, new, delete etc. Agustin wrote a high level overview here and Google’s project page is also useful for gaining a general idea of its workings. In this post I’m not going to go into the exact details of how the memory is managed (come see our presentation for that!) but instead briefly discuss how thread local free lists function and one useful exploit primitive we gain from it.

ThreadCache FreeList allocation
The front end allocator for TCMalloc (with TC standing for ThreadCache) consists of kNumClasses** per-thread free lists for allocations of size <= 32768. The FreeLists store chunks of the same size. The bucket sizes are generated in SizeMap::Init and allocations that don’t map directly to a bucket size are simply rounded up to the next size. For allocations larger than 32768 the front end allocator is the PageHeap, which stores Spans (runs of contiguous pages) and is not thread specific. The following code shows the interface to the allocator through a call to malloc***.

3604 static ALWAYS_INLINE void* do_malloc(size_t size) {
3605   void* ret = NULL;
3606 
...
3611   // The following call forces module initialization
3612   TCMalloc_ThreadCache* heap = TCMalloc_ThreadCache::GetCache();
...
3621   if (size > kMaxSize) {
3622     // Use page-level allocator
3623     SpinLockHolder h(&pageheap_lock);
3624     Span* span = pageheap->New(pages(size));
3625     if (span != NULL) {
3626       ret = SpanToMallocResult(span);
3627     }
3628   } else {
3629     // The common case, and also the simplest.  This just pops the
3630     // size-appropriate freelist, afer replenishing it if it's empty.
3631     ret = CheckedMallocResult(heap->Allocate(size));
3632   }
...

At line 3612 the ThreadCache pointer for the current thread is retrieved. This object contains a number of thread-specific details but the one we are interested in is the FreeList array. A FreeList object contains some metadata and a singly-linked list of free chunks that are managed by very simple primitives such as SLL_Pop, SLL_Push etc. If the allocation size is less than or equal to 32768 then the following code is called to retrieve a chunk of the required size.

2888 ALWAYS_INLINE void* TCMalloc_ThreadCache::Allocate(size_t size) {
2889   ASSERT(size <= kMaxSize);
2890   const size_t cl = SizeClass(size);
2891   FreeList* list = &list_[cl];
2892   size_t allocationSize = ByteSizeForClass(cl);
2893   if (list->empty()) {
...
2896   }
2897   size_ -= allocationSize;
2898   return list->Pop();
2899 }

The correct FreeList is retrieved at line 2891 and presuming that it is not empty we retrieve a chunk by calling the Pop method. This results in a call to SLL_Pop with the address of the pointer to the head of the free list.

 761 static inline void *SLL_Next(void *t) {
 762   return *(reinterpret_cast<void**>(t));
 763 }
...
 774 static inline void *SLL_Pop(void **list) {
 775   void *result = *list;
 776   *list = SLL_Next(*list);
 777   return result;
 778 }

Insert to FreeList[X] – Reviving the 4-to-N byte Overflow Primitive
The effect of SLL_Pop is to follow the list head pointer via SLL_Next, retrieve the DWORD stored there and make that DWORD the new list head. Notice there are no checks of any kind on the value of the *list pointer. When I first saw a crash within TCMalloc it was at line 762 in the SLL_Next function. The value of t was 0x41414141, which came from data I had previously used to overflow an allocated chunk. What this means is that on that call to SLL_Pop the list head was an address that I controlled. How could this happen? Consider the following FreeList layout:

FreeList example

The above FreeList has 3 chunks. For simplicity assume that the addresses of A and B are contiguous****. On the first call to malloc chunk A is returned to the application.

FreeList allocation

Assume an overflow then occurs when the application writes data to A, partially corrupting B.

Overflow of Chunk A into Chunk B

The first DWORD of B is the pointer to the next chunk in the FreeList, so we have corrupted the singly-linked list. The first chunk is now B and, as we control the first DWORD of this chunk, we control what the next chunk in the FreeList will be. The chunk C is effectively no longer part of the FreeList. The next allocation of this size will then return chunk B (presuming no free calls on chunks of this size have occurred within this thread; free chunks are prepended to the head of the list) and it will give us control of the list head pointer when *list = SLL_Next(*list) executes.

Allocation from a corrupted FreeList

As we control the DWORD at **list we control what the new list head is. In my initial TCMalloc crash this set the list head to 0x41414141. One more allocation of this size will then call SLL_Pop with a controlled list head pointer. The crash I encountered was when SLL_Next attempted to read the next pointer at 0x41414141.
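
To make that sequence concrete, here is a toy model in Python of the unchecked list walk described above. It is an illustrative assumption, not TCMalloc code: ‘memory’ simply maps a chunk address to its first DWORD, which doubles as the next pointer while the chunk is free.

# Toy model of the unchecked FreeList walk (illustration only, not TCMalloc
# source). Each free chunk's first DWORD is its next pointer.
memory = {
    0x1000: 0x2000,  # chunk A -> B
    0x2000: 0x3000,  # chunk B -> C
    0x3000: 0x0,     # chunk C -> end of list
}
free_list_head = 0x1000

def sll_pop():
    """Mirrors SLL_Pop/SLL_Next: no sanity checks of any kind."""
    global free_list_head
    result = free_list_head
    free_list_head = memory[result]  # *list = SLL_Next(*list)
    return result

a = sll_pop()                # application gets A, head is now B
memory[0x2000] = 0x41414141  # overflow out of A corrupts B's first DWORD
b = sll_pop()                # application gets B, head is now 0x41414141
print(hex(free_list_head))   # the next sll_pop() would read from 0x41414141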

The important part of all this is that by controlling the list head pointer we control the address of the chunk returned and can give back any usable memory region we want to the application. So after one more allocation the situation is as follows:

A memory region of our choosing is handed back to the application

The result of this allocation is a pointer which we control. It is important to note that the first DWORD at this address then becomes the new list head. For a clean exploit it is desirable for either this DWORD to be 0x0 or the FreeList’s length_ attribute to be 0 after returning our controlled pointer. If we cannot force either of these conditions then future allocations of this size will follow this DWORD, which may not be under our control, and may result in instability in the program before we gain code execution. Another point to note is that if a free call occurs on a pointer that has not previously been noted by TCMalloc as part of a memory region under its control then it will raise an exception that will likely terminate the program. This is important to keep in mind if the memory location we are inserting into the free list is not controlled by TCMalloc. If this is the case then we have to prevent this address from being passed to free before we gain control of the program’s execution.

To gain code execution from this primitive it is necessary to find the address of some useful structure to hand back to the application and then have it overwrite that structure with data that we control. On modern systems this will require a memory leak of some kind to find such an address reliably (although heap spraying of useful objects might be an interesting alternative if we can write over one of these objects and then trigger a call through a function pointer or something similar). Despite this requirement, we have a functioning exploit primitive from a heap overflow, and one that has not required us to deal with any heap integrity checks or protections. This is not a bad place to be in given the effort that has gone into securing the default allocators for most operating systems. The primitive basically gives the same effect as the Insert to Lookaside technique that was popular on Windows XP but has since been killed on Windows Vista and 7.

Conclusion
In this post I’ve given a quick overview of how we can convert a 4-byte overflow into one that overflows N bytes in an application using TCMalloc. This type of technique has been seen in various places in the past, most notably when overflowing chunks in the Windows XP Lookaside list. It has been largely killed elsewhere but, due to the lack of integrity checks in TCMalloc, it lives on in a relatively easy-to-use fashion.

At Infiltrate Agustin and I will discuss a variety of topics related to exploiting WebKit and TCMalloc in much greater detail. The Insert to FreeList[X] technique is but one example of a simple and old heap exploitation tactic that has been revived. Many other exploitation vectors exist and with a grasp of how the allocator works the possibilities are interesting and varied. In our presentation we will describe some of these techniques and also the details of how TCMalloc manages the processes of allocation and deallocation such as is relevant to manipulating the heap for exploitation.


* TCMalloc is not the allocator used by the Chrome Javascript engine V8. That has its own allocator that can be found in the src/v8/src/ directory of the Chrome source. Have a look at spaces.[cc|h] and platform_X.cc to get started.
** For WebKit in Safari kNumClasses is 68 but for Chrome it is 61. There seem to be some other differences between the various uses of TCMalloc that are worth keeping in mind. For example, in Chrome the maximum free list length is 8192 whereas in Safari it is 256.
*** We’re looking at the TCMalloc code embedded in WebKit revision 79746 Source/JavaScriptCore/wtf/FastMalloc.cpp. It’s effectively the same as that in the Chrome release and elsewhere, modulo the previous point.
**** FreeLists are created from Spans which are contiguous pages in memory that get subdivided to give the FreeList chunks. The chunks are prepended to the list during creation though so they end up in reverse address order. This means that if A, B are two contiguous addresses then the FreeList will initially be ordered as B -> A. In order to get the desired ordering we need to rearrange the fresh FreeList by allocating B, allocating A, free’ing B and then free’ing A. This will result in the ordering A -> B and another allocation from this FreeList will give back the address A.

Heap Scripts for TCMalloc with GDB’s Python API

When writing heap exploits it’s necessary to be able to view the heap state during debugging. As part of our work on TCMalloc, Agustin and I have written up some heap scripts for Immunity Debugger and GDB that make the process of tracking what TCMalloc is up to quite easy. There are quite a few scripts that do similar things for ID and different heap implementations so in this post I’m going to focus on the GDB side of things. Recent versions of GDB contain an embedded Python interpreter that provides easy access to the internals of an application being debugged. Being able to write scripts in Python to automate debugging tasks is really useful and hopefully the GDB guys will maintain the project.

The scripts I will be discussing in this post can be found here. Copy the gdbinit file to your home directory or the one where you will be launching gdb from and rename it to .gdbinit. Modify the pyscripts_dir variable to point to the directory where you extracted the scripts. To load the commands run source /path/to/scripts/dump_free_lists.py or source /path/to/scripts/search_free_lists.py. This will make the commands dump_free_lists and search_free_lists available within gdb.

The data structures used by TCMalloc are relatively simple so the scripts have little to do besides reading memory and walking lists. Each thread in an application using TCMalloc has its own TCMalloc_ThreadCache object which contains information on the FreeLists for that specific thread. The FreeLists themselves are ThreadCache_FreeList objects, each of which contains a list_ attribute that points to the head of a singly linked list of free chunks. We can access the TLS for a given thread via pthread_self() in GDB and find the TCMalloc_ThreadCache pointer at offset -4 from there. The file tcmalloc.py contains abstractions for the ThreadCache_FreeList and TCMalloc_ThreadCache structures. In dump_free_lists.py we can see the initialisation of the ThreadCache abstraction via the following code:

tls = gdb.parse_and_eval("pthread_self()")
# threadlocal_heap is at TLS - 4
threadlocal_heap = buf_to_le(
    self.cur_proc.read_memory(tls - 4, DWORD_SIZE))

tc = ThreadCache(self.cur_proc, threadlocal_heap)

The ThreadCache instance then provides access to the FreeLists for the current thread through getFreeLists. The size classes that TCMalloc uses to bucket chunks together for allocations less than 32768 in size are generated at run-time. To view them run tcmalloc.py outside of gdb with no arguments.

The number of size classes, and hence the number of FreeLists per thread, depends on a constant (kNumClasses) that may change between applications embedding TCMalloc. For Chrome it is 61 and hence we have 61 different FreeLists per thread. If we run dump_free_lists within gdb with no arguments it will dump all free lists, their lengths and the chunk size each list is responsible for.

TCMalloc FreeLists within Chrome on Linux

The getFreeLists function will return each of these FreeLists as a FreeList object. This object contains a list_ptr attribute that corresponds to the list_ pointer found in the TCMalloc source for each FreeList. It points to a singly linked list of free chunks that we can iterate over using the getChunks function of a FreeList object.
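
As a quick illustration, the snippet below shows how these abstractions might be used from GDB’s Python interpreter once tcmalloc.py is loaded. It is a hypothetical sketch built from the attribute names mentioned above (getFreeLists, getChunks, list_ptr); the exact constructor arguments and the shape of the returned chunk objects are assumptions rather than documented usage.

# Hypothetical usage sketch, assuming cur_proc is the process abstraction
# used by the scripts and tcmalloc.py has already been loaded.
tls = gdb.parse_and_eval("pthread_self()")
threadlocal_heap = buf_to_le(cur_proc.read_memory(tls - 4, DWORD_SIZE))
tc = ThreadCache(cur_proc, threadlocal_heap)

for fl in tc.getFreeLists():
    chunks = list(fl.getChunks())
    print("FreeList @ 0x%x : %d free chunks" % (fl.list_ptr, len(chunks)))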

If we run dump_free_lists with one or more space separated addresses it will treat them as pointers to ThreadCache_FreeList structures and dump them accordingly.

The chunks in a given FreeList

The scripts archive also contains the search_free_lists command, which will check whether an address lies within a chunk on any of the FreeLists. After Infiltrate we will release the rest of the scripts for ID and GDB but these should be enough to get you started with TCMalloc and GDB’s Python API.

Validity, Satisfiability and Code Semantics

In a recent post to DD I mentioned some interesting features that will be available in Immunity Debugger 2.0. These features rely on a translation layer from x86 code to SMT formulae that Pablo constructed as part of his work on DEPLIB 2.0. In the past few weeks I got some time to put together a few scripts for ID that replicate some of the more useful tasks that we can use a solver for. In this post I am going to elaborate on one such script, which is designed to find ROP gadgets meeting user-specified conditions.

find_gadget.py is a relatively simple piece of software which is a testament to Pablo’s API development, the Python programming language and, I think quite importantly, the suitability of mathematical logic for reasoning about machine code. This final point is of course unsurprising but it provides good motivation when considering the merits of abstracting x86 code to more logical/mathematical representations for analysis.

We begin our gadget search by running gadgets.py. Currently this script finds any instruction sequence ending in a ret instruction (with support for pop REG/jmp REG etc. in the works). This takes about 360 seconds to run on QuickTimeAuthoring.qtx (2.16MB) and finds 80,000 candidate gadgets. For now no further analysis is done on these gadgets but we’re working on some fun stuff in that area.

At this point we can use find_gadget.py to search the candidate gadgets for one that meets whatever semantics we have in mind. For now the semantics are specified via the -d, -n and -v/-s options, i.e. DESTINATION RELATION SOURCE/VALUE. For example, to represent EAX <= [EBX+24] we would use -d EAX -n <= -s [EBX+24]. (This is a little cumbersome and not as flexible as one might like, so we built a BNF grammar based on pyparsing that allows arbitrary semantics to be specified as constraints; that should be swapped in pretty soon.) The full set of arguments is shown in this screenshot:

Arguments to find_gadget.py

Once we have specified our arguments the script gets on with the fun part. Our algorithm is as follows:


for gadget in candidate_gadgets:
    sa = SequenceAnalyzer(reg_model, flag_model, mem_model) # 1
    sa.analyze(gadget.addr, gadget.depth) # 2

    if relation == '=':
        rel_expr = sa.solver.eqExpr(dest, src) # 3
    elif relation == '<=':
        ...

    if preserve_regs != None: # 4
        preserve_expr = None
        for reg in preserve_regs:
            eq_expr = sa.solver.eqExpr(sa.regs_before[reg], sa.regs_after[reg])
            if preserve_expr is None:
                preserve_expr = eq_expr
            else:
                preserve_expr = sa.solver.boolAndExpr(preserve_expr, eq_expr)
        rel_expr = sa.solver.boolAndExpr(rel_expr, preserve_expr)

    res = sa.solver.queryFormula(rel_expr) # 5
    if generic_gadgets_only:
        if res == VALID:
            pass # VALID GADGET FOUND
    else:
        if res == SATISFIABLE:
            pass # SATISFIABLE GADGET FOUND

#1 Modelling registers, flags and memory
One of the arguments you may have noticed to the script is -c. This argument tells the script that we want to look for ‘context specific’ gadgets that satisfy our semantics when the current values in registers, memory and flags are taken into account. The alternative is to look for generic gadgets that will meet the required semantics regardless of the context, e.g. inc eax will only set eax to 0 if it previously contained -1, while xor eax, eax will always set it to zero. We’ll discuss this a little more under point #5.

#2 Gadgets to equations
The main coding effort in building this framework was writing a transformer for each x86 instruction into the domain of our theorem prover. This requires one to accurately encode the semantics of the instruction over flags, registers and memory, and is the backbone of any tool that reasons about assembly code using a solver. This long and drawn out functionality is hidden behind the analyze function, which iterates over each instruction and updates the solver state. Once this has completed, our solver internally has an accurate representation of the gadget’s semantics in the form of several equations.
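
I can’t reproduce the ID/DEPLIB transformer here, but to illustrate what ‘a gadget as equations’ means, the following sketch models a tiny pop eax; ret sequence using the z3 SMT solver. z3 is only a stand-in for the solver used by the scripts, memory is modelled at word granularity and flag side effects are ignored, so treat this as an assumption-laden toy rather than the real encoding.

# Toy encoding of 'pop eax; ret' as equations using z3 (a stand-in solver).
from z3 import Array, BitVec, BitVecSort, Solver

mem        = Array('mem', BitVecSort(32), BitVecSort(32))  # word-granular memory
esp_before = BitVec('esp_before', 32)
eax_after  = BitVec('eax_after', 32)
esp_after  = BitVec('esp_after', 32)

s = Solver()
s.add(eax_after == mem[esp_before])  # pop eax : eax <- [esp]
s.add(esp_after == esp_before + 8)   # pop plus ret move esp up 8 bytes in total

# Example query: can the gadget load an attacker-chosen value into eax?
s.add(eax_after == 0x41414141)
print(s.check())  # sat, e.g. when the word at [esp] is 0x41414141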

#3 & #4 Adding our gadget constraints
Once one has a representation of a sequence of instructions as equations, the next step is typically to append some further constraints derived from the task at hand and then query the solver to determine if they are valid or satisfiable and, if so, to retrieve an input that makes it so. In our case the constraints simply specify a relation between an output register/memory location and an input register/memory location or a constant. We also allow the user to specify a number of registers that the gadget should preserve. We end up with a formula specifying something like (ESP_after == EAX_before+4 AND EBX_after == EBX_before), which is essentially a stack swap to the buffer pointed to by EAX that preserves the EBX register.

#5 Context-specific vs. generic gadgets
After encoding our constraints as a formula in steps #3 and #4 we can then query the solver to determine the status of these constraints in the context of the gadget encoding constructed in step #2. What do we mean by ‘status’? Well, when considering the SAT problem a formula can be valid, invalid, satisfiable or unsatisfiable. A formula is satisfiable if there exists an assignment to its free variables that makes it true, and unsatisfiable otherwise. A formula is valid if its negation is unsatisfiable, and invalid otherwise. In other words, a formula is valid if every variable assignment makes it true. Why is this important? If we’re looking for generic gadgets then we want them to meet our constraints regardless of the context; therefore we leave all memory locations, registers and flags free and require that our constraints are VALID in the context of the gadget. That is, there exists no assignment to the memory locations, registers and flags that would result in the gadget not meeting our constraints.
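
To make the distinction concrete, here is a small stand-alone check written against z3 (again a stand-in, not the ID solver API), using the inc eax vs. xor eax, eax example from #1: a constraint is reported VALID only if its negation is unsatisfiable.

# Validity vs. satisfiability, illustrated with z3 rather than the ID API.
from z3 import BitVec, BitVecVal, Solver, Not, sat, unsat

def status(constraint):
    """VALID if the negation is unsatisfiable; otherwise check satisfiability."""
    s = Solver()
    s.add(Not(constraint))
    if s.check() == unsat:
        return "VALID"
    s = Solver()
    s.add(constraint)
    return "SATISFIABLE" if s.check() == sat else "UNSATISFIABLE"

eax_before = BitVec('eax_before', 32)

# 'inc eax' leaves eax_after == eax_before + 1, so requiring eax_after == 0
# only holds when eax_before was -1: satisfiable, but not valid.
print(status(eax_before + 1 == 0))    # SATISFIABLE

# 'xor eax, eax' leaves eax_after == 0 in every context, so the same
# requirement is valid, i.e. a generic gadget.
print(status(BitVecVal(0, 32) == 0))  # VALID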

The next few screenshots show some usage examples.

A generic gadget to set EAX to 1
A generic gadget to set EAX to 1 while preserving ESI
Finding multiple gadgets to achieve the same result

find_gadget.py is a very simple script but at the same time it can find some very complicated instruction sequences in a short amount of time. In particular it can find tricky stack-swapping gadgets and can process about 200,000 gadgets in 3 hours, which is probably a bit quicker than doing it by hand. In some upcoming posts I’ll elaborate a bit more on what can be done with the x86 -> SMT transformer -> solver code. For an idea of some of the things we’re hoping to implement, many of the papers published in the last few years on symbolic execution and derived techniques are a good starting point, and quite a few are linked from the Practical Software Verification Wiki.

ISSA Ireland seminar

I got back from the latest ISSA Ireland seminar today. The event was held in Dublin and consisted of a number of talks and a panel discussion. There was an excellent crowd and some really interesting people, with conversation ranging from program analysis to governmental cyber-security policy.

I gave a presentation titled ‘VoIP Security: Implementation and Protocol Problems’, a relatively high-level talk about finding bugs in VoIP applications and deployments. It consisted of an overview of finding vulnerabilities in VoIP stack implementations and auxiliary services, and introduced some of the common tools and methods for discovering/enumerating/attacking VoIP deployments.

Hart Rossman, of SAIC, gave an excellent talk which touched on a number of different issues around developing and implementing cyber-defence policies. Aidan Lynch, of Ernst and Young, discussed security issues in deploying VoIP in a corporate environment. The panel discussion focused on securing national infrastructure (or so I’m told, because I managed to miss it). And finally there were a number of lightning talks; of particular interest was one on the application security process in Dell, which introduced me to the concept of threat modelling and Microsoft’s TAM tool. (There is an MSDN blog here which contains a lot of good information on the topic in general.)

It was an educational day all round and I’d like to thank the organisers for inviting me to present and being such excellent hosts.