Dataflow analysis without considering the effects of conditional branching is of limited use when trying to generate an exploit or find a vulnerability. For lightweight analysis the inaccuracies introduced may be acceptable, but when creating an input that should drive the program to a specific point we must consider all conditional branches affected by user input and represent the inherent restrictions in the path condition.
Consider the example:
cmp eax, ebx jg VULNERABLE_FUNCTION
In this case, if either eax or ebx are tainted then our choice of input can influence the condition. We would therefore like to have this condition represented in the formula we will generate to represent the path condition. We can achieve this be separating intructions into two categories; the set A, those that write the EFLAGS, and the set B, those that read the EFLAGS. Associated with every entry in A and B is the list of EFLAGS it writes or reads respectively. We represent the EFLAGS register using vector E <e_0, e_1, ... , e_31>. Each entry in this vector stores a list containing the operands of the last instruction to write this particular eflag. If the last instruction to write the eflag was not tainted by user input then the operand list will be empty.
(In practice each entry in the vector E is an object containing 1 to N vectors, where N is the number of operands of the instruction that wrote e. Each operand vector is of size 1 to DWORD_SIZE, with each entry representing the taint information for a single byte of that operand. The cost of instrumenting at the byte level is slower runtimes due to the extra overhead, but it means we gain accuracy and expressiveness in our formulae)
Using this infrastructure we can then begin to record the operands involved in instructions that write the EFLAGS register at runtime. Our instrumentation routine for every instruction ins looks something like this:
if ins in A operandList = ins.operandList; if (operandList.tainted) for eflag in ins.eflagsWritten: eflag.operandList = operandList; else for eflag in ins.eflagsWritten: eflag.operandList = ;
Once this algorithm is in place we can then process instructions from set B, those that read the EFLAGS register. For now I will just consider conditional jumps and exclude other instructions like cmpxchg, adc etc. The majority of eflag write/read combinations can be easily expressed by simply taking any of the eflags read by a given instruction, extracting the operand list (if there is one), and directly encoding the semantics of the condition on the operands. e.g. if the instruction is cmp x, y and the condition is jg, and the jump is taken, then we can express this as (> x y). To determine whether the jump has been taken or not, and hence whether the condition should be negated, we take an approach similar to that of Catchconv/Smartfuzz. That is, on the taken branch we simply insert a callback to our analysis routine with the condition and on the fallthrough path we insert a callback with the negated condition. Once this callback is triggered we basically have all the components required to store the condition for later processing – the operands involved, the conditional jump and whether the condition evaluated to true or false.
When we decide to generate an input it becomes necessary to convert these conditions to an SMT-LIB compliant formula. We first filter out the relevant conditions. Basically we only need include information on conditions if our changes to the original input might result in a different outcome. This is relatively simple to determine. For every condition we can extract all TaintByte objects that represent the taint information and dataflow history of a given byte. We can trace this back to a specific byte from user input and if we then decide to change this byte, the conditions it occur in will be included in the formula. SMT-LIB format has support for concatenation of bit-vectors so we generate our formula by concatenating the information on single bytes into words and double words where necessary. For example, if some of our input was treated as a 4 byte integer and compared to 0 we would end up with conditions like (not (= (concat n1 n2 n3 n4) bv0)), which expresses the concatenation of 4 bytes (which would earlier be declared as being of size 8 ) and the condition that these 4 bytes should not equal 0 when concatenated together. SMT-LIB bitvector arithmethic is modulo-X, where X is the size of the bit-vectors involved, and so it accurately models integer overflows/underflows.
(I mentioned that this approach works for most instructions. One case where some hackery is required is the test instruction. Typically something like test eax, eax is followed by a jz/je instruction. Obviously the intention here is that we are checking if eax is 0, not checking if eax == eax . This situation is easily detected by simply comparing the operands that set the zero flag on a jz/je for equality)
To demonstrate how these conditions can be gathered and used I have put together an annotated run of my prototype against a simple vulnerable program. It runs user input through a filter to ensure it is alphanumeric and if so a strcpy vulnerability is triggered. It can be found here. To get an idea of the kind of SMT constraints this generated I’ve also uploaded the entire formula here. The constraints that exist specifying every byte cannot be 0 are gathered from strcpy.