Monday, August 9, 2010

MSc Computer Science Dissertation


Introduction
1.1 Introduction
In this work we will consider the problem of automatic generation of exploits for software vulnerabilities. We
provide a formal definition for the term “exploit” in Chapter 2 but, informally, we can describe an exploit
as a program input that results in the execution of malicious code1. We define malicious code as a sequence
of bytes injected by an attacker into the program that subverts the security of the targeted system. This is
typically called shellcode. Exploits of this kind often take advantage of programmer errors relating to memory
management or variable typing in applications developed in C and C++. These errors can lead to bu er
overflows in which too much data is written to a memory bu er, resulting in the corruption of unintended
memory locations. An exploit will leverage this corruption to manipulate sensitive memory locations with
the aim of hijacking the control flow of the application.
Such exploits are typically built by hand and require manual analysis of the control flow of the application
and the manipulations it performs on input data. In applications that perform complex arithmetic
modifications or impose extensive conditions on the input this is a very di cult task. The task resembles
many problems to which automated program analysis techniques have been already been successfully applied
[38, 27, 14, 43, 29, 9, 10, 15]. Much of this research describes systems that consist of data-flow analysis in
combination with a decision procedure. Our approach extends techniques previously used in the context of
other program analysis problems and also encompasses a number of new algorithms for situations unique to
exploit generation.
1.2 Motivation
Due to constraints on time and programmer e ort it is necessary to triage software bugs into those that
are serious versus those that are relatively benign. In many cases security vulnerabilities are of critical
importance but it can be di cult to decide whether a bug is usable by an attacker for malicious purposes or
not. Crafting an exploit for a bug is often the only way to reliably determine if it is a security vulnerability.
This is not always feasible though as it can be a time consuming activity and requires low-level knowledge
of file formats, assembly code, operating system internals and CPU architecture. Without a mechanism
to create exploits developers risk misclassifying bugs. Classifying a security-relevant bug incorrectly could
result in customers being exposed to the risk for an extended period of time. On the other hand, classifying
a benign bug as security-relevant could slow down the development process and cause extensive delays as it
is investigated. As a result, there has been an increasing interest into techniques applicable to Automatic
Exploit Generation (AEG).
1We consider exploits for vulnerabilities resulting from memory corruption. Such vulnerabilities are among the most common
encountered in modern software. They are typically exploited by injecting malicious code and then redirecting execution to
that code. Other vulnerabililty types, such as those relating to design flaws or logic problems, are not considered here.
3
The challenge of AEG is to construct a program input that results in the execution of shellcode. As the
starting point for our approach we have decided to use a program input that is known to cause a crash.
Modern automated testing methods routinely generate many of these inputs in a testing session, each of
which must be manually inspected in order to determine the severity of the underlying bug.
Previous research on automated exploit generation has addressed the problem of generating inputs that
corrupt the CPU’s instruction pointer. This research is typically criticised by pointing out that crashing a
program is not the same as exploiting it [1]. Therefore, we believe it is necessary to take the AEG process a
step further and generate inputs that not only corrupt the instruction pointer but result in the execution of
shellcode. The primary aim of this work is to clarify the problems that are encountered when automatically
generating exploits that fit this description and to present the solutions we have developed.
We perform data-flow analysis over the path executed as a result of supplying a crash-causing input
to the program under test. The information gathered during data-flow analysis is then used to generate
propositional formulae that constrain the input to values that result in the execution of shellcode. We
motivate this approach by the observation that at a high level we are trying to answer the question “Is it
possible to change the test input in such a way that it executes attacker specified code?”. At its core, this
problem involves analysing how data is moved through program memory and what constraints are imposed
on it by conditional statements in the code.
1.3 Related Work
Previous work can be categorised by their approaches to data-flow analysis and their final result. On one
side is research based on techniques from program analysis and verification. These projects typically use
dynamic run-time instrumentation to perform data-flow analysis and then build formulae describing the
programs execution. While several papers have discussed how to use such techniques to corrupt the CPU’s
instruction pointer they do not discuss how this corruption is exploited to execute shellcode. Significant
challenges are encountered when one attempts to take this step from crashing the program to execution of
shellcode.
Alternatives to the above approach are demonstrated in tools from the security community [37, 28] that
use ad-hoc pattern matching in memory to relate the test input to the memory layout of the program at the
time of the crash. An exploit is then typically generated by using this information to complete a template.
This approach su ers from a number of problems as it ignores modifications and constraints applied to
program input. As a result it can produce both false positives and false negatives, without any information
as to why the exploit failed to work or failed to be generated.
The following are papers that deal directly with the problem of generating exploits:
(i) Automatic Patch-Based Exploit Generation is Possible: Techniques and Implications - This paper [11]
is the closest academic paper, in terms of subject matter, to our work. An approach is proposed and
demonstrated that takes a program P and a patched version P0, and produces a sample input for P
that exercises the vulnerability patched in P0. Using the assumption that any new constraints added
by the patched version relate to the vulnerability they generate an input that violates these constraints
but passes all others along a path to the vulnerability point (e.g. the first out of bounds write). The
expected result of providing such an input to P is that it will trigger the vulnerability. Their approach
works on binary executables, using data-flow analysis to derive a path condition and then solving such
conditions using the decision procedure STP to produce a new program input.
As the generated program input is designed to violate the added constraints it will likely cause a
crash due to some form of memory corruption. The possibility of generating an exploit that results
in shellcode execution is largely ignored. In the evaluation a specific case in which the control flow
was successfully hijacked is given, but no description of how this would be automatically achieved is
described.
(ii) Convicting Exploitable Software Vulnerabilities: An E cient Input Provenance Based Approach - This
paper [35] again focuses on exploit generation but uses a “suspect input” as its starting point instead
4
of the di erences between two program binaries. Once again data-flow analysis is used to build a path
condition which is then used to generate a new input using a decision procedure. User interaction is
required to specify how to mutate input to meet certain path conditions. As in the previous case,
the challenges and benefits involved in generating an exploit that result in shellcode execution are not
discussed.
(iii) Byakugan - Byakugan [28] is an extension for the Windows debugger, WinDbg, that can search through
program memory attempt to match sequences of bytes from an input to those found in memory. It
can work with the Metasploit [39] tool to assist in generation of exploits. In terms of the desired end
result, this is similar to our approach although it su ers from the limitations of pattern matching.
When searching in memory the tool accounts for common modification to data such as converting to
upper/lower case and unicode encoding but will miss all others. It makes no attempt at tracking path
conditions and as a result can o er no guarantees on what parts of the input are safe to change and
still trigger the vulnerability.
(iv) Automated Exploit Development, The future of exploitation is here - This document [37] is a whitepaper
describing the techniques used in the Prototype-8 tool for automated exploit generation. The generation
of control flow hijacking exploits is the focus of the tool. This is achieved by attaching a debugger to
a running process and monitoring its execution for erroneous events as test cases are delivered to the
program. When such an event occurs the tool follows a static set of rules to create an exploit based
on what type of vulnerability was discovered (i.e. it distinguishes between stack and heap overflows).
These rules attempt to determine what parts of the input data overwrote what sensitive data and hence
may be used to gain control of the program execution. Once this is determined these values are used to
generate an exploit based on a template for the vulnerability type. No attempt is made to determine
constraints that may exist on this input or to customise the exploit template to pass these constraints.
(v) Automatic Discovery of API-Level Exploits - In this paper [25] a framework is presented to model the
details of the APIs provided by functions such as printf. Once the e ects of these API features have
been formalised they can be used in predicates to specifying conditions required for an exploit. These
predicates can then be automatically solved to provide API call sequences that exploit a vulnerability.
This approach is restricted to creating exploits where all required memory corruption can be introduced
via a single API, such as printf.
As well as the above papers, the BitBlaze project [50] has resulted in a number of papers that do not
deal explicitly with the generation of exploits but do solve related problems. Approaching the issue of
automatically generating signatures for vulnerabilities [9, 10] they describe a number of useful techniques
for gathering constraints up to a particular vulnerability point and using these constraints to describe data
that might constitute an exploit.
There is also extensive previous work on data-flow analysis, taint propagation, constraint solving and
symbolic execution. Combinations of these techniques to other ends, such as vulnerability discovery [27, 14],
dynamic exploit detection [43] and general program analysis [29] are now common.
1.4 Thesis
Our thesis is as follows:
Given an executable program and an input that causes it to crash there exists a sound algorithm to determine
if a control flow hijacking exploit is possible. If a control flow hijacking exploit is possible there exists
an algorithm that will automatically generate this exploit.
The purpose of this work is to investigate the above thesis and attempt to discover and implement a
satisfying algorithm. Due to the sheer number of ways in which a program may crash, and a vulnerability be
5
exploited, it is necessary to limit our research to a subset of the possible exploit types. In our investigation
we impose the following practical limits2:
1. Data derived from user input corrupts a stored instruction pointer, function pointer or the destination
location and source value of a write instruction.
2. Address space layout randomisation may be enabled on the system but no other exploit prevention
mechanisms are in place.
3. Shellcode is not automatically generated and must be provided to the exploit generation algorithm.

No comments: