Code Obfuscation - Part 2: Obfuscating Data Structures

Paladion
By Paladion

September 15, 2005

Continuing from the earlier parts in this article series, Sonali looks at data obfuscation, a class of obfuscation techniques that targets the data structures in a program. She presents the different methods of data obfuscation with examples and also analyzes their quality

Code Obfuscation

In the August issue of Palisade we introduced
code obfuscation
to protect software from reverse engineering. Obfuscation
is the process of transforming the source code to confuse both a person or
a program trying to reverse engineer the code. Source code is obfuscated
using obfuscation software like Dotfuscator .

In this issue we look at data obfuscation, a class of obfuscation techniques
that targets the data structures in a program. We present the different methods
of data obfuscation with examples and also analyze their quality based on the
parameters we defined earlier. These parameters were potency -- the ability
to confuse a human reverse engineer, resilience -- the ability to fool a deobfuscator,
and cost -- the increase in execution time and memory required in the obfuscated
code.

Before we go into the nitty gritty details, here's an outline of the various
techniques we will discuss:

  • Obfuscating Arrays
    • Split , Merge, Flatten, Fold, Reorder arrays
  • Obfuscating Classes
    • Split a class, Insert new class, Reorder members
  • Obfuscating Variables
    • Substitute code for static data, Merge, Promote, Ruffle values, Split
      variables

Obfuscating Arrays

To start with, let's look at the different ways to make arrays difficult to
read and make sense of. The objective is to confuse a human being or a deobfuscator
from understanding the true nature and purpose of the array.

First, an array could be split into multiple sub-arrays to
confuse the reader about its structure. For example, consider an array which
holds a list of names: if we split it into two arrays, it will be interpreted
as two different lists of names.

Array 01

A related idea is to merge multiple arrays into one, like
this:

Array 02

We could even fold an array, i.e. increase the number of
dimensions…

Array 03

…Or flatten an array, i.e. reduce the number of dimensions.

Array 04

Each of the above makes the array more difficult to interpret. Observe that
array splitting and folding increase the complexity of the code. They thus
confuse the reverse engineer, i.e. increase the potency of the transform. On
the other hand, array merging and flattening decrease the complexity. But they
increase the obscurity of a program because they introduce bogus structure
or remove structure from the original program.

Another method for obfuscating arrays is to reorder elements within
the array. A mapping function f can be defined such that the ith element
in the original array is relocated to the jth position where j
= f(i)
. This function can be used for the mapping of elements in the
original and reordered array. Reverse engineers can figure out what's going
on here, so this has low potency and resilience; but these are also free since
there is no change in memory or execution time associated with shuffling the
positions of array elements.

Obfuscating Classes

Now that we have seen how arrays can be obfuscated, let's look at obfuscating
more complex data structures: classes. Our objective is to confuse the reverse
engineer from making sense of the exact working of the class.

The complexity of a class increases with its depth, i.e. distance from the
root of the inheritance tree. Intuitively, you'd have guessed that increasing
the depth and the number of descendants for a class are the means to obfuscate
a class.

Thus to start with, a class can be split into multiple classes,
each of which is derived from the previous class. In the example below we split
class C into two classes, a base class C1 which has some of the members of
C and a class C2 which is inherited from C1 and contains the rest of the members
of C. The objects of class C in the original code would be substituted by objects
of class C2 in the obfuscated code.

Code 01

The complexity of a class also increases with the number of its direct descendants.
So we could confuse a reverse engineer by inserting new bogus classes .
In the example below we start with a class C1 and its child C2. For obfuscation,
we introduce a bogus class C3 which is inherited from C1 and then inherit C2
from this new class. The members that C3 introduces, i.e. variable v3 and method
M3() are bogus methods introduced for obfuscation.

Code 02

These methods have medium potency; but the resilience is low because a deobfuscator
can easily merge the split classes or remove the bogus classes. To increase
the resilience, splitting and insertion are often combined in practice. Also,
objects of the new class are instantiated and their methods invoked to fool
the deobfuscator from recognizing these as bogus classes. The cost associated
with these transformations is low.

Yet another method to obfuscate classes is to play with the ordering of variables.
Since source code is mostly organized to maximize its locality, logically related
items occur near each other. Looking at the order in which variables, methods,
etc. are declared can prove useful for reverse engineering. Hence, changing
the ordering
of methods and instance variables also makes a class
more obscure.

Obfuscating Variables

Let's now look at a class of techniques to obfuscate the humble variable.

The simplest technique is to substitute a variable with code .
Static data such as character strings can be substituted with a piece of code
that generates it dynamically. To confuse the reverse engineer further, this
code could also generate other extraneous, nonsense strings. The potency and
resilience increases a lot if we use multiple small functions to compute the
static strings, and if these functions are then scattered in the normal control
flow of the program.

Here's a simple example of how the literal string "abc" can be obfuscated:

Code 03

A second method is to merge multiple variables into a single
variable assuming the combined range of these variables fit into the target
variable. For example, two 32-bit integers X and Y can be merged into a 64-bit
long variable V using the formula V = (232 *Y + X). Y occupies the
first 32 bits and X the next 32 bits of the 64-bit variable. The resilience
in this case is low because a deobfuscator can also guess that the variable
V consists of two merged variables by examining the arithmetic operations.
A variation on this theme is to merge variables v1, v2, …,
vk into an array V[k] of an appropriate type (a type that can hold
each of these different variables). For example, if the variables are of type
short integer, integer and long integer, they can be merged into an array of
type long integer.

Another interesting technique is to change the scope of a variable. Transform
a local variable to a global variable, for example.

Code 04

Suppose we have functions F1() and F2() both using a local variable i and
F1 and F2 do not execute at the same time, the variable i can be promoted
to a global variable and shared by the two. It takes the reverse engineer longer
to figure out the exact meaning of that global variable, that it is actually
two independent variables hidden behind one.

Variables can also be promoted from a specialized storage
class to a more generic one to increase complexity. For example, an integer
variable can be transformed into an integer object.

Code 05

While each of the above transforms are not very effective on there own, they
can be useful when combined with other transforms.

A more powerful technique is to ruffle the values directly.
An example is replacing a variable i by a derived value c1*i +c2 .
Suppose we choose c1 = 4, c2 = 1. In this transformation again the potency,
resilience as well as cost increases with increasing complexity of the transformation
chosen.

Code 06

A very effective obfuscation technique for variables which take a fixed range
of values (like Boolean variables) is to split them into multiple
variables of different types. The idea here is to confuse the reverse engineer
by disguising the actual number of variables being used as well as their data
type - the split variables can be different in type form the original variable.
This is the last and also most complex technique we discuss today, so let's
go step by step.

To split a variable V into two variables p and q, we need to define:

  • A function that maps p and q to V, V = f(p,q)
  • A reverse function g(V) that maps V to p and q
  • New operations defined on p and q that correspond to the operations on
    V

For instance, let's split a Boolean variable V into two short integer variables
p and q. We design that p and q take only integer values 0 or 1. Further, we
define that p and q having the same value i.e. (p = 0, q = 0) or (p = 1, q
= 1) corresponds to V = false and p and q having different values i.e. (p =
0, q = 1) or (p = 1, q = 0) corresponds to V = true.

We can then define the functions f(p,q) and g(V) as follows:

g(V)

f(p,q)

V

p q
0 0 False
0 1 True
1 0 True
1 1 False

Next we define counterparts for operations on the boolean variable V in terms
of p and q. These can be specified in the form of a lookup table for each operator.
So all logical operations on the Boolean V are masked as arithmetic operations
on the integers p and q. That's a good way to confuse the reverse engineer.
The potency and resilience of the transformation increases with increase in
the number of variables we split the original variable into, but so does the
cost.

Summary

We have looked at different data obfuscation techniques in this article. Let's
quickly recap how effective each of these techniques are in terms of Potency,
Resilience and Cost.

Transform

Potency

Resilience

Cost

Obfuscating Arrays

Split array

Varies

Weak

Free

Merge array

Varies

Weak

Free

Flatten array

Varies

Weak

Free

Fold array

Varies

Weak

Free

Reorder array

Low

Weak

Cheap

Obfuscating Classes

Split class

Medium

Low

Free

Insert class

Medium

Low

Free

Reorder class members

Low

One-way

Free

Obfuscating Variables

Substitute static data with
code

Varies with complexity
of function

Merge variables

Low

Weak

Free

Promote variables

Low

Strong

Free

Ruffle the value

Varies with complexity
of transform

Split variables

Varies with number
of variables

What's next in Code Obfuscation?

Data obfuscation does not make your programs "fool-proof" against reverse
engineering; but it adds a second level of defense. The job of the reverse
engineer or deobfuscator becomes so much more difficult now. Next month, we'll
look at Control Obfuscation, another group of interesting techniques to making
your code secure from reverse engineering.


Tags: Technical

About

Paladion

SUBSCRIBE TO OUR BLOG

Buyers-Guide-Collateral

WHITEPAPER

Buyer’s Guide to Managed Detection and Response

Download
MDR

Get AI Powered

Managed Detection and Response

MDR-learmore-btn

 

MDR-Guide-Collateral

REPORT

AI-Driven Managed Detection and Response

Download Report
Episode

EPISODE-25

Red-LineAsset-6

Why Your ‘Likes’ on Facebook May Be Revealing Far More than You Thought

Click URL in the Post for the Full Podacst
  • FacebookAsset
  • LinkedinAsset
  • TwitterAsset