Formally verified x86-64 decompilation
The FoxDec (for Formal x86-64 Decompilation) project is investigating formally verified decompilation of x86-64 binaries that is recompilable, patchable, and validatable. A decompiler extracts higher-level code, such as an intermediate representation (IR), from a binary, and is a critical tool for enhancing software security when the source code is unavailable. Recompilability implies that the produced IR can be compiled back into a binary. Patchability refers to the ability to perform transformations. This typically requires symbolization: instruction addresses as well as the addresses of global variables, external functions, and data sections need to be replaced with labels. Validatability ensures that it is possible to check whether the produced IR is a semantically sound representation of the original binary by applying formal (e.g., theorem proving) and non-formal (e.g., fuzzing) methods.
The motivation behind decompilation with these three properties is multifold. First, it enables a decompile-patch-recompile workflow. Symbolization ensures that at recompile time, instructions and sections can be laid out by the recompiler, a prerequisite for making any modifications, such as inserting instructions or replacing functions. Second, it aids in the trustworthiness of the lifted IR. Recompilability enables the creation of an IR that is executable and can therefore be tested. We argue that even if an IR is formally proven to be a correct representation of the original binary, there is still significant value in testing.
FoxDec is currently actively being developed. The latest version uses Netwide Assembler (NASM) as the IR, while a previous version lifted C code.
FoxDec enables multiple use cases for enhancing software security: i) formally verify memory safety properties, ii) enable trustworthy binary patching, and iii) enable trustworthy binary hardening.
NEWS
The following figure illustrates FoxDec’s design, which formally verifies the lifting of a binary executable B0 to a machine-independent intermediate representation (MIIR). A key element of this design is that it enables a formal proof of correctness of the lifted MIIR. The formal verification approach is called translation validation. With this approach, a lifter is used to produce an MIIR from a binary. The MIIR is recompiled into a binary executable, Br. We now state that the lift is done trustworthily, if it can be proven that B0 and Br are semantically equivalent.
The advantage of this approach is that it removes both the lifter and the compiler from the trusted code base. That means that we are free to implement both the lifter and the compiler in any way we like. As long as the two resulting binaries can be proven to be semantically equivalent, the MIIR is a correct higher-level representation of the original binary.
To prove that binary executables B0 and Br are semantically equivalent, we use one of the state-of-the-art binary lifters: Ghidra. Ghidra converts both binaries, B0 and Br, to P-code, which operates at the same level of abstraction as assembly code. During lifting, we keep track of various observations (denoted as γ0) made over the original binary executable B0, which were used to generate the MIIR. The combination of the produced P-codes with the observations γ0 is used to build a certificate. That certificate contains a series of propositions, such that if all these propositions are true, then the two binaries are semantically equivalent. Note that this approach does not rely on Ghidra as a decompiler to produce source code but uses it solely as a disassembler.
To establish the proof, we utilize the Isabelle/HOL theorem prover. Isabelle/HOL takes propositions as input and attempts to prove that they are true by breaking down the proof into elementary reasoning steps that abide by the fundamental rules of mathematical logic. FoxDec generates the certificate in such a way that i) it is readable by Isabelle/HOL, and ii) all its true propositions can be proven fully automatically. A false proposition is unprovable and would indicate that something went wrong during lifting. A proven certificate completes the translation validation.
FoxDec’s approach for verifying a program’s memory safety properties involves reasoning about all possible control flow transitions of the program, including dynamically computed control flow transitions (e.g., indirect branches), invariants that must hold for those transitions to occur (e.g., bounds for register values), and how memory is modified during instruction execution (e.g., changes in the relationships between memory regions).
FoxDec disassembles binary code and constructs a state machine-like abstraction that enables such reasoning. A state represents a program state in the form of invariants and an abstract model of memory that must hold after an instruction is executed. The transitions model control-flow. The invariants thus naturally form an instruction’s pre- and post-conditions, and each transition therefore becomes a Hoare triple. The abstraction is consequently called a Hoare Graph, whose transitions represent all possible instruction executions and control flow transfers. The graph is constructed by symbolically executing each instruction using formal instruction semantics — our projects Chum and libLISA are investigating the formalization of instruction semantics — and keeping track of, and updating the invariants and the memory model. A join operator merges states of the same instruction (e.g., due to loops) by constructing least upper bounds, constituting a join semi-lattice.
The Hoare Graph’s correctness, i.e., it is provably overapproximative in that it represents all possible instruction executions and control flow transfers, is formally proven in a theorem prover (i.e., Isabelle/HOL theorem prover) by proving each Hoare triple. This removes the graph construction algorithm and its implementation from the trust base.
FoxDec can verify a class of memory safety properties, including return address integrity, bounded control flow, and adherence to the x86-64 System V ABI calling conventions. The verification approach is highly scalable, with an almost linear relationship between the number of states constructed (which approximates the number of computational steps) and the number of instructions in a binary, primarily due to the join operator, which prevents reasoning about the same instruction multiple times.
Decompilation to a high-level language involves multiple phases. At a high level, the phases typically include disassembly, which extracts assembly code from binaries; control flow graph (CFG) recovery, which recovers the program CFG from assembly; extraction of high-level program constructs (e.g., statements, variables, references) from assembly; and type assignment. FoxDec is investigating techniques for the formally verified decompilation phases.
FoxDec’s decompilation phases include disassembly, CFG recovery, extraction of an abstract code that models a program as a CFG of basic blocks; converting basic blocks into sequential code that models the program’s corresponding state changes over memory, registers, and flags; variable analysis that maps memory regions to variables and references; and type analysis that assigns types. Converting basic blocks into sequential code that captures program state changes requires a formal model of the underlying machine (i.e., formal semantics of x86-64 instructions). Our projects Chum and libLISA are investigating the formalization of instruction semantics.
Central to formally verified decompilation is the concept of sound decompilation. FoxDec defines soundness for each of these decompilation phases (sound disassembly is explored in our DSV project) and formally verifies them: algorithms for each phase are formalized in the Isabelle/HOL theorem prover and proven correct.
Use cases. Sound, recompilable decompilation to C has a variety of use cases. For example, patching a binary to fix errors or potential security exploits is highly complex in settings where the source code or third-party libraries are no longer available, or the build processes or tools have become outdated. Patching at the C-level is relatively more straightforward and a compelling alternative when the decompiled C code is formally proven to be functionally equivalent to the binary.
Other use cases include binary analysis, binary porting (as an alternative to software emulation), and binary optimization, each of which can now be performed at the C level (e.g., using off-the-shelf C code analysis and optimization tools).
FoxDec enables enhancing software security in three different ways:
FoxDec can verify a class of memory safety properties, including return address integrity, bounded control flow, and adherence to calling conventions. These properties are also established through formal proofs using the Isabelle/HOL theorem prover. FoxDec was demonstrated to verify these properties of the following programs:
FoxDec can be used for trustworthy binary patching, such as replacing components of a binary with enhanced versions, which may be needed due to bug fixes, replacing deprecated code/library, or fixing vulnerabilities discovered in the original component. FoxDec’s trustworthy binary patching capabilities were demonstrated on the binary executable of the World Wide Web Get (wget) program, which is used in many systems to download files from the Internet, supporting the most widely used Internet protocols (e.g., HTTP and FTP). The capabilities were demonstrated in the following ways:
wget
supports HTTP connections, despite them being considered insecure. FoxDec was used to patch wget
so that it only supports HTTPS, the secure variant of HTTP.wget
uses (i.e., OpenSSL) with a new version (oqsprovider) that uses quantum-safe cryptography.wget
’s function do_conversion()
can cause a buffer overflow. For rapid remediation, FoxDec was used to patch wget
so that it terminates whenever this security vulnerability occurs.FoxDec can be used for trustworthy binary hardening. Example use-cases include:
FoxDec’s current limitations include the following:
Download FoxDec here. This will use Docker to build and run FoxDec. Section How to use contains further instructions. The GitHub page is here.
NOTE: instructions for building without Docker can be found here (only relevant for developers).
These are instructions for a quickstart on the already supplied binary wc
.
./binary/
. The binary wc
has already been supplied as running example../foxdec.sh wc
./artifacts/
is now populated with information.The following files are generated:
$NAME.json
and $NAME.json.txt
: Contain disassembled instructions, control flow recovery, function boundaries, invariants, pointer analysis results. The two files contain the exact same information, one in JSON format and the other in humanly readable format.$NAME.metrics.txt
: a log containing metrics such as running time, number of covered instructions, accuracy of pointer analysis, etc. The two files contain the exact same information, one in JSON format and the other in humanly readable format.$NAME.callgraph.dot
: A Graphviz .dot
file containing the ACG (Annotated Call Graph), annotated with verification conditions necessary to ensure "normal" behavior (e.g., no stack overflows, calling convention adherence).nasm/$NAME.asm
: recompilable NASM code.functions/$ENTRY/$NAME.dot
: For each function entry $ENTRY
a control flow graph (CFG) and a log containing a pre- and a postcondition.NOTE: it is possible to manually add entry points, by adding a plain-text file ./binary/$NAME.entry
where each line is an entry address in hexadecimal notation.
The following instructions recompile the generated NASM code back into a binary.
nasm -felf64 -g -F dwarf -o $NAME.o $NAME.asm
gcc -c __gmon_start__.c -o __gmon_start__.o
gcc -g -m64 -nostartfiles -fgnu-tm -o a.out $NAME.o __gmon_start__.o
For the last compilation step, one must manually supply additional libraries as needed.
FoxDec is an open-source project from the Systems Software Research Group (SSRG) at Virginia Tech. It is supported by the Defense Advanced Research Projects Agency (DARPA) and Naval Information Warfare Center Pacific (NIWC Pacific) under Contract No. N66001-21-C-4028, DARPA under Agreement No. HR.00112090028, DARPA and the Army Contracting Command Aberdeen Proving Grounds (ACC-APG) under Prime Contract No. W912CG23C0024, DARPA under Prime Contract No. HR001124C0492, by the US Office of Naval Research (ONR) under grant N00014-17-1-2297, and by the US Naval Surface Warfare Center Dahlgren Division/Naval Engineering Education Consortium (NEEC) under grants N00174-20-1-0009 and N00174-16-C-0018.