PDF Archive

Easily share your PDF documents with your contacts, on the Web and Social Networks.

Share a file Manage my documents Convert Recover PDF Search Help Contact

paper11 .pdf

Original filename: paper11.pdf
Title: Microsoft Word - milcom-paper-v6.docx
Author: hira

This PDF 1.4 document has been generated by PScript5.dll Version 5.2.2 / Acrobat Distiller 8.0.0 (Windows), and has been sent on pdf-archive.com on 22/10/2015 at 00:30, from IP address 12.155.x.x. The current document download page has been viewed 407 times.
File size: 2.9 MB (6 pages).
Privacy: public file

Download original PDF file

Document preview



Detection of Global, Metamorphic Malware
Variants Using Control and Data Flow Analysis
Hira Agrawal, Lisa Bahler, Josephine Micallef, Shane Snyder, and Alexandr Virodov

Abstract—Current malware detection and classification tools
fail to adequately address variants that are generated
automatically using new polymorphic and metamorphic
transformation engines that can produce variants that bear no
resemblance to one another. Current approaches address this
problem by employing syntactic signatures that mimic the
underlying control structures such as call- and flow-graphs.
These techniques, however, are easily defeated using new
program diversification techniques. This hampers our ability to
defend against zero day attacks perpetrated by such auto
“replicating”, rapidly spreading malware variants. In this paper,
we present a new form of abstract malware signature generation
that is based on extracting semantic summaries of malware code
that is immune to most polymorphic and metamorphic
transformations. We also present results of our initial,
experimental evaluation of the proposed approach.
Index Terms—cyber security, malware detection, polymorphic
and metamorphic viruses and worms, program analysis



erpetration of self-mutating, fast-spreading malware that
continually evade most syntactic and heuristic signatures
designed specifically to detect and arrest them in their tracks is
one of the post pressing challenges in cyber security today.
We present a new form of malware abstraction analysis
technique that leverages existing control- and data-flow based
program analysis techniques to infer high level, semantic
signatures of such malware that are resistant to most
automated, polymorphic, and metamorphic transformations.
We are implementing the proposed approach into a tool called
Malware Abstraction Analysis (MAA), and have already
completed implementation of control-flow derived abstract
signatures. Initial experiments using this prototype with a
library of virus and worm variants from 12 malware families
showed that abstract signatures were able to detect 86% of
malware variants and 100% of benign programs based on
Manuscript received March 6, 2012.
This work was supported by US Army CECOM LCMC under Contract
No. W15P7T-11-C-A021. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the authors and do
not necessarily reflect the views of the CECOM LCMC.
Hira Agrawal (hagrawal@appcomsci.com), Lisa Bahler, Josephine
Micallef, and Alexandr Virodov are with Applied Communication Sciences,
Piscataway, NJ 08854, USA.
Shane Snyder is with US Army CERDEC, Aberdeen Proving Ground,

978-1-4673-3/12/$31.00 ©2013 IEEE

control-flow analysis alone. We expect that the detection rate
will only improve once we have also incorporated data-flow
analysis in this tool.
MAA is one of three subsystems of a larger Host Centric
Intrusion Detection and Reaction (HCIDR) system [1] we are
developing in collaboration with US Army CERDEC and
Purdue University [2]. The other two subsystems focus on
detection and remediation of user- and kernel-space exploits
and attacks. MAA, on the other hand, focuses solely on
detection of new, polymorphic and metamorphic variants of
existing malware that cannot be detected using existing
signature-based techniques. Given one instance of a malware,
polymorphic transformation engines can produce numerous
variants that implement the same logic but bear little
resemblance to one another or to the original instance.
Metamorphic engines, on the other hand, embed morphed
copies of themselves in each malware variant they produce, so
they can, effectively, self-mutate every time they spread from
one host to another.
Most state-of-the-practice malware tools rely on syntactic
signatures, such as code checksums and presence of specific
byte sequences, to locate and isolate malware from genuine,
legitimate code. These methods, however, are easily evaded
by polymorphic and metamorphic malware [3] that can
automatically and repeatedly morph themselves, so they can
no longer be caught using prior, existing signatures. Such
tools, therefore, are constantly required to keep their signature
databases up-to-date.
Many new, state-of-the-art techniques have been developed
for constructing higher level, semantic signatures [4]–[11] that
do not require exact matches for detecting malware instances.
They can, therefore, match multiple variants of the same
malware. These techniques, however, can address only a
subset of malware variants. Many of them, for example,
address only variants that are created using relatively simple
techniques like substituting one register for another in a block
of assembly instructions, replacing an operation such as “add”
with another equivalent operation such as “subtract” while
negating its operand, reordering certain instructions within a
block that do not interfere with one another, and inserting
redundant instructions that do not affect the outcome of the
computation involved, among others. Some of these
techniques also analyze higher-level representations of code
such as flow graphs [4] rather than raw bytes representing that
code. They can, therefore, accommodate small, local
polymorphic changes in malware code as long as they do not



Figure 1. A simple example illustrating control-flow analysis based abstract
signature derivation, comparison, and detection

significantly alter the overall structure of the flow graph
involved. They will, however, fail to spot variants that make
significant, but otherwise benign, changes to the branching
structures of that flow graph. Other techniques take a more
global view. Instead of examining flow graphs of individual
functions, they analyze their high level calling structure
[6][10]. They will, therefore, catch all variants that belong to
the same malware family as long as they do not drastically
alter the shape of the call graph involved. Creating variants
with significantly different call graphs, however, is fairly easy
[12]. The call graph based techniques too, therefore, will fail
to detect large sets of malware variants that are generated
automatically in this way. Our approach based on deriving
semantic summaries of malware, on the contrary, is resistant
to such global, large scale transformations.
We illustrate our proposed approach using a small example
in the next section. Then, in Section III, we briefly describe
the program analysis techniques we employ to derive abstract,
graphical signatures of binaries, and in Section IV, we discuss
how to compute distances between these signatures in order to
determine the nearest neighbor of a given binary in the
malware library, using an efficient technique that does not
require computing its distance from every binary in that
library. In Section V, we present the results of an experimental
evaluation of our initial, control-flow analysis based
implementation, before summarizing and outlining future
work in Section VI.
Figure 1 shows a small, illustrative example of how MAA
employs control flow analysis to derive high level signatures
of binaries. It depicts, on its top left, a simple “malware”
instance using, for ease of presentation, a high level pseudocode notation. It entails executing statements s1 and s2
whenever condition x evaluates to true. Otherwise, statements
s3 and s4 are executed. In either case, statement s5 is always
executed at the end. The right hand side of the figure depicts a
variant of this “malware”, where parts of its code have been

Figure 2. High level design and operation of MAA

broken off into four additional functions. This variant is
semantically equivalent to its original version, although,
syntactically, it is very different from that version. Current
signature based detection techniques that look, for example,
for specific byte sequences in target binaries are easily
defeated by such transformations. Techniques that compare
higher level abstractions such as flow graphs and call-graphs
of target binaries are also easily circumvented using such
transformations. Steps 1 and 3 in the figure depict derivation
of flow graph abstractions of the original version and its
variant, respectively. The two, again, look very different, and
techniques that compare flow graphs and/or call graphs of
binaries will fail to conclude that the two graphs represent the
same malware.
MAA, on the other hand, derives much higher level,
semantic representations of binaries that abstract away low
level syntactic details such as conditional and iterative control
flow as well as function calls and returns. Steps 2 and 4 in the
figure depict such derivations for the original version and its
variant, respectively. The two graphs, shown in Figures 1(c)
and 1(f), are similar to one another. The root node of each tree
asserts that statement s5 is always executed irrespective of the
outcome of condition x in each of the two versions. The two
children nodes of the root node assert that statements s1 and s2
are always executed together, and so are statements s3 and s4,
though not necessarily one after another; they may be
separated by additional control flow constructs such as
function calls and returns. The derived abstract representation
also captures the fact that execution of a statement at any node
implies execution of statements at all of its ancestor nodes in
the graph. It does not, however, imply execution of statements
at any of its descendent or sibling nodes. Execution of
statement s1 at node b, for example, implies execution of
statement s5, but not vice versa. Execution of the same



statement, s1 at node b, however, does not imply that of
statement s3 or s4 at node c, as expected. MAA automatically
derives such high level, abstract signatures form binaries and
uses them to correctly match different malware variants to one
Figure 2 shows the high level design and operation of
MAA. It entails two distinct phases: an offline analysis phase
when it computes abstract, semantic signatures of known
malware, and an online analysis phase when it computes the
analogous signature of a newly encountered binary and
compares it with known signatures to determine if it is a
variant of a known malware. The two phases largely mirror
one another in their two, initial steps:
a. Analyze and transform a given binary instance into an
abstract representation
b. Generate a high level, graph based, semantic signature
from that intermediate representation
The online analysis phase involves two additional steps:
c. Compare the graphical signature of the newly
encountered binary to those associated with previously
known categories of malware and assign a normalized
“distance” measure to each such comparison

Figure 3. A simple example illustrating derivation of control flow analysis
based abstract signatures.

d. Declare the new instance as malware if the distance
between that instance and its nearest neighbor in the
malware library is smaller than the threshold value
associated with that neighbor.

entry to the function exit includes any instruction in a super
block, then it must include all other instructions in that super
block, and all instructions in all ancestors of that super block
in the super block dominator graph. This graph represents an
abstraction of its control flow structure, as there is a many-toone relationship between flow graphs and super block
dominator graph.
Figures 3(c) to 3(e) show the intermediate steps used to
derive the super-block dominator graph of a flow graph [13].
This derivation starts with the construction of pre-dominator
tree [14], shown in Figure 3(c). This is followed by the
construction of the associated post-dominator tree, shown in
Figure 3(d), which is the same as the pre-dominator tree of the
corresponding reverse flow graph. Then the pre- and post
dominator trees are merged into a single graph, as shown in
Figures 3(e). This graph, unlike pre- and post-dominator trees,
may contain cycles. So the next step is to find all cycles in this
graph and collapse all nodes within a cycle into a single node,
that corresponds to a super block mentioned earlier.
Although the super block dominator graph is acyclic, it
may, in general, contain nodes with multiple parent nodes,
which may make graph comparisons more difficult. So MAA
derives a pure tree structure from it, called super block
dominator tree, by again building its pre-dominator tree. If the
super block dominator graph does not contain any nodes with
multiple parents, as is the case in Figure 3(f), it already
possesses a pure tree structure, and its super block dominator
tree is identical to itself.
Next, all predicate instructions in all super blocks in the
super block dominator tree are removed. If this step results in
any empty super blocks, then they are removed as well. The
resulting tree, shown in Figure 3(g), represents an abstract,
high level, semantic signature of the target function, where
most of the syntactic control flow specific artifacts have been

The next two sections describe how we derive high level
graphical signatures from binaries and how we compare such
MAA derives semantic signature of a binary instance in two
stages. First it analyzes all functions in that binary in isolation
and derives their local signatures by abstracting away all
“unnecessary” control flow artifacts from their flow graphs.
Then, it combines all local, function level signatures into a
single, global signature while abstracting away all call and
return specific artifacts.
Figure 3 depicts all steps involved in the first stage for the
simple example shown earlier in Figure 1(a) and reproduced
here in Figure 3(a). This stage starts with construction of the
flow graph of the given function, as shown in Figure 3(b).
This graph indicates that there are only two possible paths in
this function: aobod or aocod. Based on this information,
we can make multiple inferences: (i) If a is executed then d
must eventually be executed, (ii) if d is executed, a must have
been executed earlier, (iii) if a is executed then either b or c
must be executed, and (iv) if d is executed, either b or c must
have been executed earlier. These observations may be
inferred in an automated manner and captured in another
graphical structure, called super-block dominator graph,
which, for the flow graph in Figure 3(a), is shown in Figure
Each node in this graph is called a super block, and it
satisfies the following property: If a path from the function



Figure 5. Tree Edit Distance and triangle inequality

Figure 4. A simple example illustrating derivation of control flow analysis
based abstract signatures for individual functions

removed, while retaining essential semantic information about
which sets of instructions are always executed “together” as
well as some of their interdependencies. Once signature
graphs of all functions in a malware binary are constructed in
this manner, they are combined, based on the associated call
graph [15], into one global signature, as shown in Figure 1(f).
The final step in the abstract signature derivation involves
determination and elimination of nodes that represent
superfluous data flows, which do not affect the behavior of the
target binary. Figure 4(a) shows a variant of the function in
Figure 3(a), where some superfluous instructions have been
added to it. Figure 4(b) shows the signature of this function,
which is different from that in Figure 3(g), even though the
two functions involved are behaviorally equivalent. We
employ data flow analysis and program slicing techniques [16]
to determine and eliminate any superfluous instructions from
the control flow based signature derived earlier. The
highlighted instructions in Figure 4(c) represent the program
slice where all instructions that do not contribute towards the
program’s external behavior have been projected out. Figure
4(d) shows the new signature of this function, where all
instructions not in the program slice have been removed from
the control-flow based signature in Figure 4(b). Note that the
new signature matches that of the original version shown in
Figure 3(g). Global signatures are obtained in an analogous
manner by employing global, inter-procedural program slicing
techniques [17].
The techniques described in the previous section yield a graph
as the high level signature of a given binary. This graph is, in
fact, a rooted tree, where all nodes, except the root node, have
exactly one parent node, and there is exactly one distinguished
node, the root, which does not have a parent node. We have

made a deliberate choice to employ a tree, not a full-fledged
graph, as the signature of a binary, because a graph may, in
general, have multiple or no roots, contain cycles, and its
nodes may have multiple parent nodes, all of which make
comparing graphs a much harder problem.
To compare trees, we employ the well known concept of
tree edit distance [18]. Edit distance, De, between two trees, T1
and T2, is said to be the minimum number of edit operations—
delete node, insert node, and relabel node—required to
transform T1 into T2, where (a) deleting a node, u, with the
parent, v, causes children of u in to become children of v,
followed by removal of u from the tree, (b) inserting a node u
in place of ith to jth children of v, makes u the parent of those
children nodes and then makes u the ith child on v, and (c) relabeling a node u simply replaces the current label of u with
the given, new label. Consider, for example, the trees shown
in Figure 5. The edit distance between T2 and T3 is 1 as we
only need one edit operation, deleting the first child, v, of u, to
transform T2 into T3. Edit distance between T1 and T3,
however, is 2, and there are multiple edit scripts of length 2 to
achieve the same effect, (i) relabeling v to w and w to v, or (ii)
deleting v followed by inserting v as a child of w.
One problem with employing edit distance for comparing
malware signatures is that it does not take the relative sizes of
the trees involved into account. If we transform a 100 node
tree into a 101 node tree by inserting a single new node in it,
the edit distance between the two will be 1. If, on the other
hand, we transform a tree with one node into a tree with two
nodes by inserting a single node in it, the edit distance
between these two trees is also 1, although, in this case, a
single edit operation caused the tree size to be doubled,
compared to the previous case where the single operation
barely made a difference in the tree size. The standard
technique to address this problem is to employ a normalized
edit distance measure, Dn, which is obtained by dividing De by
the total number of nodes in the two trees. This means the
normalized edit distances for two cases mentioned above will
be 1/(1+2), or 0.33, and 1/(100+101), or 0.005. The
normalized edit distance, thus, does take relative sizes of the
trees involved into account, but it does so at the cost of



sacrificing another important property: it does not preserve the
triangle inequality property, which plays a crucial role in
reducing the number of tree comparisons required when
determining the nearest “neighbor” of a given binary in the
malware signature library.
We employ a different normalization measure, Dnm,
normalized metric edit distance, which preserves the triangle
inequality property; this scheme was proposed in the context
of string edit distance [19], which we have adapted for use
with tree edit distance:
Dnm(T1,T2) = 2*De(T1,T2) / (|T1| + |T2| + De(T1, T2))
Figure 5 illustrates the differences between all three
measures: De, Dn, and Dnm. Using the last measure, Dnm,
enables us to use the AESA algorithm (Approximating,
Eliminating, Search Algorithm) [20] to find the nearest
neighbor of the signature tree of a new binary instance among
signature trees of malware binaries in the library, to determine
if the new binary is a variant of one of them, as explained in
Section II, Figure 2. AESA algorithm leverages the triangle
inequality property of the distance measure to ensure that the
nearest neighbor of a new instance, among all instances in the
library, is determined using small number of tree comparisons,
and—more importantly—that number does not increase even
if the size of the library keeps growing.
To evaluate our proposed technique, we employed a library
of 76 malware variants from 12 worm/virus families,
including Bagle, Klez, Mimail, Mydoom, Netsky, and Roron,
among others. We initially assumed that malware instances
within the same family will be close variants of one another,
and hence their tree signatures will also be very similar. To
verify this assumption, we computed edit distance of each
malware binary from every other binary in the library, which
yielded a matrix of pair-wise edit distances. As we expected
variants within the same family to be similar, we converted
each distance measure, De, into a similarity measure, Se, using
the simple formula, Se = 1 – De, which converted the edit
distance matrix into a similarity matrix. We then sorted both
rows and columns of this matrix by malware family name. We
color coded entries of the similarity matrix with different
shades of gray, where darker shades represented higher
similarity. We expected entries within the sub-matrices
representing different families along the main diagonal of the
matrix to be very dark and entries outside those sub-matrices
to be very light. We, however, found that there were large
variations in color within several families, which means those
families contained distinct subfamilies, as shown in Figure 6.
For each subfamily mentioned above, we determined the
largest edit distance among all entries within that subfamily.
We call it the diameter of that subfamily. Table I lists all
families in the library along with their subfamilies and their
respective diameters. We use these diameters as thresholds
when determining if a new binary represents a variant of a
subfamily in the malware library. If the distance between the
new binary and its nearest neighbor in the malware library is

Figure 6. Similarity matrix of the test malware library

larger than the corresponding diameter, the new binary is not
considered a malware variant; otherwise, it is flagged as a
variant of that malware family.
Table I shows that six of the twelve families in the malware
library had a single instance, that is, they had no variants in
the library. Moreover, when we divided families into
subfamilies based on similarity “clusters” observed in the
similarity matrix, we found that certain subfamilies also
contained a single instance. We believe this is an artifact of
the relatively small size of our malware library.
As the diameter of a family/subfamily that contains a single
binary is always zero, no new binaries will ever match it
unless their signatures fully match its signature, that is, the
edit distance between the two signatures is zero. To avoid such
scenarios, we also employ a configurable minimum threshold
distance, with the default value of 0.2.
We then performed a five-fold cross validation experiment,
where we randomly divided our malware library into five sets.
Then we conducted five test runs, where one of these five sets
was used as the test set, and the other four made up the
malware library. We then used MAA to classify each binary in
the test set as either malware or benign, and, if it is classified
as malware, then to also identify its family/subfamily. Besides
binaries in the test sets, we also added four benign, nonmalware binaries to each test set to check if MAA ever
classified them as malware. Table II shows the results of each
test run.
Note that (i) every binary that was classified as benign was
in fact benign, that is the true-negative rate was 100%, (ii) in
no case, a benign binary was classified as malicious, that is,
the false positive rate was 0%, (iii) on an average, 12 out of
14 malicious binaries were classified as such, that is, the true
positive rate was 86%, and (iv) on an average, 2 out of 14
malicious binaries were misclassified as benign, implying a
false negative rate of 14%. Note that these results are based
solely on control-flow based abstract signatures. We expect
they will only improve when augment our signature
generation module with data-flow based analysis.







Present anti-virus tools use low level, syntactic signatures to
detect malware. These signatures are easily defeated by new,
automated polymorphic and metamorphic transformation
engines. Every time new variants that aren’t detected by
current signatures are discovered, anti-virus tool vendors have
to analyze those variants and develop new signatures to detect
them, and then distribute those signatures to all client
machines so they can detect those variants in future. High
level, semantic signatures we propose in this paper will, on the
contrary, enable many more new variants to be caught without
the need for continually updating existing signatures.
We plan to augment MAA by implementing data-flow
analysis discussed in Section III, which will lead to further
improvement in its malware detection rate. We also plan to
significantly increase the size of our malware test library. All
binaries in our current library are annotated with their
respective family names, which we regard as the ground truth.
It is, nevertheless, common to have errors in such annotations.
Another avenue for future research includes employing
automated machine learning techniques to determine natural
malware clusters, or families, and to compare them with the
associated, manual annotations; not having to rely on human
specified ground truth will substantially increase the utility of
our proposed technique.



T. Bowen and M. Little, “Integration of Diverse Sensors in the HostCentric Intrusion Detection and Reaction System,” submitted to IEEE
Military Communications Conference, MILCOM 2012.
D. M. Stanley, Z. Deng, D. Xu, R. Porter, and S. Snyder, “GuestTransparent Instruction Authentication for Self-Patching Kernels,”
submitted to IEEE Military Communications Conference, MILCOM
P. Szor and P. Ferrie, “Hunting for Metamorphic,” Virus Bulletin
Conference, 2001.
I. Briones and A. Gomez, “Graphs, Entropy and Grid Computing:
Automatic Comparison of Malware,” Virus Bulletin Conference, 2008.
D. Bruschi, L. Martignoni, and M. Monga, “Using Code Normalization
for Fighting Self-Mutating Malware,” IEEE International Symposium on
Secure Software Engineering, 2006.
E. Carrera and G. Erdélyi, “Digital Genome Mapping: Advanced Binary
Malware Analysis,” Virus Bulletin Conference, 2004.
M. R. Chouchane and A. Lakhotia, “Using Engine Signature to Detect
Metamorphic Malware,” ACM Workshop on Recurring Malcode, 2006.
T. Dullien and R. Rolles, “Graph-Based Comparison of Executable
Objects,” Actes du Symposium, SSTIC, 2005.
M. Gheorghescu, “An Automated Virus Classification System,” Virus
Bulletin Conference, 2005.
X. Hu, T. Chiueh, and K. G. Shin, “Large-scale malware indexing using
function-call graphs,” ACM Conference on Computer and
Communications Security, 2009, pp. 611-620.
C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna,
“Polymorphic Worm Detection Using Structural Information of
Executables,” Symposium on Recent Advances in Intrusion Detection,
G. R. Thompson and L. A. Flynn, “Polymorphic Malware Detection and
Identification via Context-Free Grammar Homomorphism,” Bell Labs
Technical Journal, vol. 12, no. 3, pp. 139-147, 2007.
H. Agrawal, “Dominators, Super Blocks, and Program Coverage,” ACM
Symposium on Principles of Programming Languages, 1994, pp. 25–34.
T. Lengauer and R. E. Tarjan, “ A Fast Algorithm for Finding
Dominators in a Flowgraph,” ACM Transactions on Programming
Languages and Systems, vol. 1, no. 1, pp. 121-141, 1979.
H. Agrawal, “Efficient Coverage Testing Using Global Dominator
Graphs,” ACM Workshop on Program Analysis Tools and Engineering,
1999, pp. 11–20.
M. Weiser, “Program Slicing,” IEEE Transactions on Software
Engineering, vol. 10, no. 4, 1984, pp. 352-357.
S. Horwitz, T. Reps, and D. Binkley, “Interprocedural Slicing Using
Dependence Graphs,” ACM Transactions on Programming Languages
and Systems, vol. 12, no. 1, 1990, pp. 26-60.
K. Zhang and D. Shasha, “Simple Fast Algorithms for the Editing
Distance Between Trees and Related Problems,” SIAM Journal of
Computing, vol. 18, no. 6, pp. 1245-1262, 1989.
L. Yujian and L. Bo, “A Normalized Levenshtein Distance Metric,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol,
29, no.6, pp 1091-1095, 2007.
E. Vidal, “New Formulation and Improvements of the NearestNeighbour Approximating and Eliminating Search Algorithm (AESA),”
Pattern Recognition Letters, vol. 15, no. 1, pp. 1-7, 1994.

Related documents

genetic algorithms paper
13n13 ijaet0313544 revised

Related keywords