An SMT solver (such as Z3, Yices or STP) is a decision procedure that can handle various types of arithmetic and other decidable theories. They make use of procedures specific to the theories they handle, such as linear arithmetic, in combination with the brute force power of a SAT solver. At a high level, proceed by iteratively by replacing the sub-expressions in a formula like (x + y < 10 || x > 9) && (y + z < 5) with propositional variables, to give something like (A || B) && (C). At this point we can apply a SAT solver to look for a solution. If none exists then we need not bother with analysing the abstracted expressions, as the formula is unsatisfiable. If the SAT solver finds a satisfiable solution we then restore the original expressions and pass the conjunction of this off to the core decision procedure which can deal with the semantics of the given theory. This process then proceeds in the standard DPLL method of iteration, involving finding UNSAT cores and backtracking. One of the best collection of resources on how they actually work is Leonardo De Moura’s MSR research page.
So, theory aside, how can we use this to entertain ourselves/do useful things? Well, an advantage of an SMT solver over a regular SAT solver is that we can quite easily express the sort of operations that tend to go on during program execution. We can model conditions, arithmetic and arrays. Using a theory that can model operations of this kind we can represent a path through a program (or several different paths in fact). By appending extra constraints we can then use an SMT solver to generate inputs that will take different conditional branches (useful for guiding a fuzzer), ensure memory locations have a specific value (useful for avoiding shellcode filters) and/or model conditions we want to check for, such as integer overflows.
A practical example may help illuminate why you might want to use an SMT solver. Consider the problem of a program containing an exploitable vulnerability where we have located a suitable buffer for shellcode but we now need to know what input to provide such that the desired shellcode is in that buffer when we jump to it. One solution is to use dynamic analysis to gather the entire path condition over user influenced data and then append the condition that the buffer we are interested in contains the shellcode. To do this we will gather all data movement and conditional instructions on tainted data (I won’t discuss how to do this here). At the vulnerability point we can then express this trace as the input to an SMT solver. Most solvers have their own API but they should also all accept a SMT-LIB formatted text file input, which is what I use so as not to tie myself to a single solver. You can read more about the format here, but essentially our input will have 4 sections.
Suppose, for the sake of having a tractable example, that our vulnerable program just takes two bytes of input and moves them once into a new location, and we want to determine what input to provide such that these new locations have the values 0x41 and 0x42. Our specification to the solver would then proceed as follows:
We begin with a pretty standard header that is basically static unless you want to use a different logic. I use quantified bitvector logic because it is easier to model mod32 arithmetic and data movement/conditionals at the byte and bit level than it would be with linear arithmetic or another logic.
(benchmark exploitSample :status unknown :logic QF_BV
Following this we then specify the name and type of all variables that we intend to use in the formula itself. The format of valid names is discussed in a document linked from the SMT-LIB website, but approximately follows the same rules as C.
Here I declare 4 variables, <i0, i1, n256, n257> (names unimportant), and I declare them to be bitvectors of site 8 e.g. they model a single byte
:extrafuns ((n256 BitVec)(i0 BitVec)(n257 BitVec)(i1 BitVec))
Next is the assumptions section. Here we can specify the path condition, i.e. all data movement and conditionals that occured during the dynamic trace. We could just as easily express these in the next section but for ease of comprehension I use the assumptions section (according to the docs some parsers might also work slightly faster this way).
The format of this section should be familiar with anyone that has dabbled in programming languages that are list orientated. It is pretty much (operator operand1 operand2 ....)
This assumption is basically the expression of the conjuction (n256 := i0) AND (n257 := i1)
:assumption (and (= n256 i0)(= n257 i1)
Finally, we have our formula. This is of the same format as the assumption section, but I usually use it to express the part of the problem I want to solve. e.g. I encode the shellcode here.
(The form for a bitvector constant is bvDECIMAL_VAL[SIZE_OF_VECTOR])
:formula (and (= n256 bv65)(= n257 bv66))
Obviously, this is a trivial example and we can quite easily spot the solution, but in a situation where there are 30+ movements of every byte of input, as well as conditionals and arithmetic, it quickly becomes impossible to solve by hand. I’ve used this technique to produce satisfying inputs for formulae over hundreds of input variables in a matter of seconds. As for the actual running, we can concatenate the above sections into a text file (probably best to use a .smt extension as some solvers seem to look for it) and invoke our solver to get something like the following:
nnp@test:~/Desktop/yices-1.0.21/bin/$ yices -smt -e < constraints.smt sat (= i0 0b01000001) (= i1 0b01000010)
Which can be parsed back to a C/Python/etc array using a relatively simple script.
This approach is exactly what I’m doing to auto-generate my exploits. In the assumptions, I specify everything I’ve traced during program execution, and in the formula I specify the shellcode and the desired location I want to use it in (determined by the method described in a previous post of mine), as well as the address of a shellcode trampoline. By also including the information on what the original user input was, the SMT solver can produce an output with the exploit information at the desired indexes and valid input at all others. The finished product is currently something that looks like this:
exploitArray = ['\x99\x31\xc0\x52\x68\x6e\x2f\x73\x68\x68\x2f\x2f\x62\x69\x89\ xe3\x52\x53\x89\xe1\xb0\x0b\xcd\x80\x58\x58\x58\x58\x58\x58\x58\x58\x42 \x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x42 ... 42\x42\x42\x42\x42\x42\x42\x42\x42\x42\x39\x84\x04\x08\x42\x42\x42\x42 ... '] oFile = open('pyExploit.in', 'w') oFile.write(''.join(exploitArray)) oFile.close()
Which was produced from an input file containing only ‘B’s, the required shellcode and a set of potential trampoline addresses. The tool has determined that at the vulnerability point the ESP register points to a buffer containing data from the start of the user input, and thus has filled in the start of the input with the shellcode. It has also found the correct bytes that overwrite the EIP and replaced them with the address of a jmp %esp (0x08048439)