Computer Tech: September 2008

What is decompilation?

Decompilation is the process of turning object code into source code. This should make sense, since compilation is the process of turning source code into object code. But what is object code? Roughly defined, object code is code expressed in a language that is executed directly by a real or virtual machine. For languages like C, the object code is generally run on a hardware CPU, while Java object code generally runs on a virtual machine.

Decompilation is hard

As described above, decompilation may sound fairly simple, but it's actually quite hard -- it essentially involves inferring large-scale, high-level behavior from small-scale, low-level behavior. To understand this intuitively, think of a computer program as a complex corporate structure. High-level managers dispense commands such as "maximize technology throughput" to their underlings, who then turn these commands into more concrete actions, such as installing the new XML database.

As a new employee at this corporation, you might ask an underling what he or she was doing, and get the response, "I'm installing a new XML database." From that statement, you wouldn't be able to infer that the final goal was to maximize technology throughput. After all, the ultimate goal could be something quite different, such as discretizing the supply chain structure or aggregating consumer data.

But if you were the curious type, you might ask a few more questions, and you might spread your questions around, asking several underlings at different levels of the company. Eventually, you could put all the answers together and you might be able to guess that the larger corporate goal was to maximize technology throughput.

If you think of a computer program working like a corporate structure, then the above analogy should give you an intuitive sense of why decompiling code isn't trivial. For a more theoretical perspective, here's how Cristina Cifuentes, a leading researcher in the area, describes the process of decompilation:
Any binary reengineering project requires the disassembly of the code stored in the binary file. From a theoretical point of view, the separation of data and code in von Neumann machines is equivalent to the halting problem; hence, a complete static translation is not possible. However, in practice, different techniques can be used to increase the percentage of code that is translated statically, or hooks to dynamic translation techniques can be used at run-time.
--"Binary Reengineering of Distributed Object Technology"

Turning object code into source code isn't the only issue at hand when it comes to decompilation. A Java class file potentially contains a number of different kinds of information. Knowing what kind of information a class file might contain is important to understanding what you can do with, and to, that information. And that's where a Java disassembler comes in.

Disassembling a class file

The actual binary format of the Java class file isn't important. What is important is knowing what different pieces of information are contained in those bytes. To this end, we'll employ a tool that comes with most JDKs -- javap. javap is a Java code disassembler, which is not the same as a decompiler. A disassembler turns object code that is in a machine-readable format (as shown in Listing 1) into something human-readable (as shown in Listing 2).

Listing 1. Raw contents of a class file

0000000 feca beba 0300 2d00 4200 0008 081f 3400
0000020 0008 073f 2c00 0007 0735 3600 0007 0737
0000040 3800 0007 0a39 0400 1500 000a 0007 0a15
0000060 0800 1600 000a 0008 0a17 0800 1800 0009
...

Listing 2. Output of javap

Local variables for method void priv(int)
Foo this pc=0, length=35, slot=0
int argument pc=0, length=35, slot=1

Method void main(java.lang.String[])
0 new #4
3 invokespecial #10
6 return

Note that the output shown in Listing 2 isn't quite source code. The first half of the listing is a list of local variables for a method; the second half is assembly code, which is human-readable object code.
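Although the exact binary layout isn't important for our purposes, the first few bytes of Listing 1 are easy to read by hand. The sketch below assumes the dump came from a tool such as od, which prints each pair of bytes as a swapped 16-bit word on little-endian machines, so "feca beba" is really the byte sequence CA FE BA BE. The class name ClassHeader and its fields are made up for illustration; only the header layout (magic number, minor version, major version, constant pool count) comes from the class file format itself.

```java
import java.nio.ByteBuffer;

// Illustrative sketch: ClassHeader is a hypothetical helper, not part of any JDK tool.
final class ClassHeader {
    final int magic;
    final int minor;
    final int major;
    final int constantPoolCount;

    private ClassHeader(int magic, int minor, int major, int constantPoolCount) {
        this.magic = magic;
        this.minor = minor;
        this.major = major;
        this.constantPoolCount = constantPoolCount;
    }

    /** Reads the fixed-size fields that start every class file. */
    static ClassHeader parse(byte[] bytes) {
        // ByteBuffer is big-endian by default, which matches the class file format.
        ByteBuffer in = ByteBuffer.wrap(bytes);
        int magic = in.getInt();
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        int minor = in.getShort() & 0xFFFF;
        int major = in.getShort() & 0xFFFF;
        int constantPoolCount = in.getShort() & 0xFFFF;
        return new ClassHeader(magic, minor, major, constantPoolCount);
    }

    public static void main(String[] args) {
        // First ten bytes of Listing 1, with the assumed 16-bit word swapping undone.
        byte[] start = {
            (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE,
            0x00, 0x03, 0x00, 0x2D, 0x00, 0x42
        };
        ClassHeader h = parse(start);
        System.out.printf("magic=%08X version=%d.%d constants=%d%n",
                h.magic, h.major, h.minor, h.constantPoolCount);
    }
}
```

Under that assumption, the dump describes a class file in format version 45.3, the version written by the earliest Java compilers. Everything after these ten bytes is the constant pool and the member tables that javap prints in readable form.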

The elements of a class file

javap is used to disassemble, or unpack, a class file. Here's a quick run-down of the information contained in a Java class file, which can be disassembled using javap:
Member variables. Every class file contains all the naming and typing information for each data member of its class.
Disassembled methods. Each method of the class is represented, along with its type signature, by a string of virtual machine instructions.
Line numbers. Each section of each method is mapped to the source code line that it was generated from (when possible). This gives the run-time system and debuggers the ability to provide a stack trace for a running program.
Local variable names. The variables local to a method don't really need names once the methods are compiled, but they can be included using the -g option to the javac compiler. These, too, help the run-time system and the debugger to help you.
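You can see that the member names and types survive compilation without even reaching for javap. The sketch below uses the standard reflection API to list the fields and method signatures of a small class once it has been loaded. The nested Foo class is a hypothetical stand-in for the article's test program, and reflection is not how javap works (javap reads the file directly), but both are reporting information that had to be stored in the class file.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

final class ClassFileMembers {
    // Hypothetical stand-in for the article's Foo class.
    static class Foo {
        private int member = 10;
        int returnInteger() { return member; }
        private void priv(int argument) { }
    }

    /** Lists the declared fields and methods of a class, sorted for stable output. */
    static List<String> describe(Class<?> type) {
        List<String> lines = new ArrayList<>();
        for (Field f : type.getDeclaredFields()) {
            if (f.isSynthetic()) continue; // skip compiler-generated members
            lines.add("field " + Modifier.toString(f.getModifiers()) + " "
                    + f.getType().getSimpleName() + " " + f.getName());
        }
        for (Method m : type.getDeclaredMethods()) {
            if (m.isSynthetic()) continue;
            String params = Arrays.stream(m.getParameterTypes())
                    .map(Class::getSimpleName)
                    .collect(Collectors.joining(", "));
            lines.add("method " + m.getReturnType().getSimpleName() + " "
                    + m.getName() + "(" + params + ")");
        }
        Collections.sort(lines);
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe(Foo.class)) {
            System.out.println(line);
        }
    }
}
```

Notice that the listing reports the parameter type of priv but not the parameter name. That matches the point above: local variable names are optional extras, while member names and type signatures are always present.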

Now that we know a little bit about what's going on inside a Java class file, let's take a look at how we can tweak this information to our own ends.

Using a decompiler

Conceptually, using a decompiler is easy. It's the inverse of a compiler: you give it a .class file, and it gives you a source code file.

Some of the newer decompilers have intricate graphical interfaces, but we'll use Mocha, which was the first publicly available decompiler, for our initial examples. At the end of this article, I'll discuss JODE, one of the newer decompilers available under the GPL. (See Resources to download Mocha and for a list of Java decompilers.)

Let's say you have a class file called Foo.class sitting in your directory. Decompiling it with Mocha is as easy as typing the following command:
$ java mocha.Decompiler Foo.class

This produces a new file, called Foo.mocha (Mocha applies the name Foo.mocha to avoid overwriting any source code in the original file). The new file is a Java source file, and assuming all went well you can now compile it normally. Just rename it to Foo.java and go.

But there's a hitch: if you run Mocha on some code you have lying around, you may notice that the code it generates isn't identical to the source. I'll run an example so that you can see what I mean. The original source shown in Listing 3 is from a test program called Foo.java.

Listing 3. Snippet of original source for Foo.java

private int member = 10;

public Foo() {
int local = returnInteger();
System.out.println( "foo constructor" );
priv( local );
}

And here's the code generated by Mocha.

Listing 4. Mocha-generated source for Foo.java

private int member;

public Foo()
{
member = 10;
int local = returnInteger();
System.out.println("foo constructor");
priv(local);
}

The two snippets differ in where the member variable member is initialized to 10. In the original source, it's expressed as an initial value on the same line as the declaration; in the decompiled source, it's expressed as an assignment statement inside the constructor. The decompiled code tells us something about the way the original source was compiled; namely that its initial values were compiled as assignments in the constructor. You can learn a lot about the way Java compilers work by looking at decompilations of their output.
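You can confirm this behavior without a decompiler. In the sketch below, a superclass constructor calls a method that the subclass overrides to read its own field. If the initial value were stored with the declaration, the superclass would see 10; because the assignment is compiled into the subclass constructor after the call to the superclass constructor, it sees the default value 0 instead. The Base class and the read method are hypothetical names added for the demonstration; only the member field mirrors Listing 3.

```java
final class InitializerOrder {
    static class Base {
        final int valueSeenInConstructor;

        Base() {
            // Runs before any field initializer of a subclass has executed.
            valueSeenInConstructor = read();
        }

        int read() { return -1; }
    }

    static class Foo extends Base {
        private int member = 10; // compiled as an assignment inside Foo's constructor

        @Override
        int read() { return member; }
    }

    public static void main(String[] args) {
        Foo foo = new Foo();
        System.out.println("during super(): " + foo.valueSeenInConstructor);
        System.out.println("after constructor: " + foo.read());
    }
}
```

Running it prints 0 for the first line and 10 for the second, which is exactly what the Mocha output in Listing 4 would lead you to expect.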

Decompilation is hard: Reprise

As valiantly as Mocha works to decompile your object code, it doesn't always succeed. No decompiler is able to render the source with complete accuracy, due to the difficulty of the problem, and every decompiler handles the holes in its rendering differently. For example, Mocha sometimes has trouble figuring out the exact structure of looping constructs. When this happens, it resorts to using fake goto statements in the resulting output, as shown in Listing 5.

Listing 5. Mocha fails to correctly decompile

if (i1 == i3) goto 214 else 138;
j3 = getSegment(i3).getZOrder();
if (j1 != 1) goto 177 else 154;
if (j3 > k2 && (!k1 || j3 <> j2)) goto 203 else 196;
expression 0
if == goto 201
continue;

Obfuscation to the rescue

Code obfuscation is literally the act of obscuring your code. A Java obfuscator changes a program in subtle ways, such that it behaves identically as far as the JVM is concerned, but is more confusing to the human trying to understand it.

Let's look at a sample of what happens when a decompiler runs up against obfuscated code. Listing 6 shows the result of Mocha's attempt to decompile Java code that had been obfuscated by a tool called jmangle. Note that the snippet below is the same one we've used in the previous listings, though you certainly wouldn't think so at first glance.

Listing 6. Code obfuscated by jmangle

public Foo()
{
jm2 = 10;
int i = jm0();
System.out.println("foo constructor");
jm1(i);
}

An obfuscator such as jmangle changes many of the variable and method names (and sometimes even class and package names) into meaningless strings. This results in a program that is difficult for humans to read, but which is essentially the same as the original as far as the JVM is concerned.
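To make the renaming idea concrete, here is a toy sketch of it. It is not how jmangle is implemented: a real obfuscator rewrites the names stored in the class file, while this one, purely for illustration, substitutes identifiers in source text. The class name NameMangler and its methods are hypothetical; the jm0, jm1, jm2 naming scheme is borrowed from Listing 6.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: a real obfuscator renames symbols inside the class file.
final class NameMangler {
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*");

    private final Map<String, String> renamed = new LinkedHashMap<>();

    /** Assigns the next meaningless name (jm0, jm1, ...) to a symbol, once. */
    void register(String symbol) {
        renamed.putIfAbsent(symbol, "jm" + renamed.size());
    }

    /** Replaces every registered identifier in the text with its meaningless name. */
    String apply(String source) {
        Matcher m = IDENTIFIER.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Unregistered words (keywords, library names) are left unchanged.
            String replacement = renamed.getOrDefault(m.group(), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        NameMangler mangler = new NameMangler();
        mangler.register("returnInteger");
        mangler.register("priv");
        mangler.register("member");
        System.out.println(mangler.apply(
                "member = 10;\nint local = returnInteger();\npriv(local);"));
    }
}
```

The important property is that the mapping is applied consistently: every use of a symbol gets the same new name, so the program still links and runs, but the names no longer tell a reader anything.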

Getting nasty

All obfuscators make symbols meaningless, but that's not all they do. Crema was notorious for the many nasty ways it could foil decompilation, and many of the obfuscators that have been created since then have followed its lead.

One popular way to obscure source is to take the meaningless-string trick to the next level, by replacing a symbol from the class file with an illegal string. The replacement might be a keyword like private, or, even worse, a completely meaningless symbol such as ***. The decompiled output then cannot be recompiled, because no Java compiler will accept such names. Some virtual machines -- especially in browsers -- don't take kindly to such antics. Technically, a variable named = is contrary to the Java specification; some virtual machines will overlook it, others will not.
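If you want to check whether a given symbol could ever appear as a name in Java source, the standard javax.lang.model.SourceVersion class can answer the question. The sketch below is only a way to test the examples from the paragraph above; the SymbolCheck class is a hypothetical helper, and this is not something the obfuscators discussed here are known to use.

```java
import javax.lang.model.SourceVersion;

final class SymbolCheck {
    /** True if the symbol could be written as a name in Java source code. */
    static boolean isLegalSourceName(String symbol) {
        // isName rejects keywords as well as strings that are not identifiers.
        return SourceVersion.isName(symbol);
    }

    public static void main(String[] args) {
        String[] symbols = {"member", "jm0", "private", "***", "="};
        for (String symbol : symbols) {
            System.out.println(symbol + " -> " + isLegalSourceName(symbol));
        }
    }
}
```

The ordinary names, including the mangled jm0, pass the check, while private, *** and = do not. A class file that uses the last three as member names may still load on a lenient virtual machine, but any source a decompiler produces from it will not compile.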

Crema drops the bomb

Another one of Crema's tricks was, literally, the bomb. Crema was armed with the ability to completely shut Mocha down. It did so by adding a little "bomb" to the compiled code, which caused Mocha to crash when it attempted to decompile the code.

Sadly, Crema is no longer available, but a tool called HoseMocha, like Crema, was designed specifically to shut Mocha down. To see how HoseMocha works, we'll use our trusty disassembler, javap. Listing 7 shows the code before HoseMocha has planted its bomb.

Listing 7. Code before the bomb

Method void main(java.lang.String[])
0 new #4
3 invokespecial #10
6 return

And here's the code after HoseMocha has had its way with it.

Listing 8. Code after the bomb

Method void main(java.lang.String[])
0 new #4
3 invokespecial #10
6 return
7 pop

Do you see the bomb? Note that the routine now has a pop instruction after the return. But wait -- how can a function do something after it's returned? Obviously, it can't, and that's the trick. Placing an instruction after a return statement ensures that it will never be executed. What you see here is essentially impossible to decompile. It doesn't make any sense because it doesn't correspond to any possible Java source code.
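A tool that wanted to defend itself against this kind of bomb could look for instructions that can never be reached. The sketch below is a deliberately simplified version of that idea: it assumes straight-line code with no branches or exception handlers, so anything after the first return or throw instruction is dead. The DeadCodeScan class is hypothetical and says nothing about how Mocha or HoseMocha actually work; a real analysis would have to follow jump targets as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified sketch: assumes no branches and no exception handlers.
final class DeadCodeScan {
    // JVM instructions that unconditionally leave the current method.
    private static final List<String> TERMINATORS = Arrays.asList(
            "return", "ireturn", "lreturn", "freturn", "dreturn", "areturn", "athrow");

    /** Returns the instructions that follow the first terminator, if any. */
    static List<String> unreachable(List<String> mnemonics) {
        for (int i = 0; i < mnemonics.size(); i++) {
            if (TERMINATORS.contains(mnemonics.get(i))) {
                return new ArrayList<>(mnemonics.subList(i + 1, mnemonics.size()));
            }
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("new", "invokespecial", "return");
        List<String> after = Arrays.asList("new", "invokespecial", "return", "pop");
        System.out.println("Listing 7 dead code: " + unreachable(before));
        System.out.println("Listing 8 dead code: " + unreachable(after));
    }
}
```

For the instruction sequence in Listing 7 the scan finds nothing, and for Listing 8 it reports the stray pop. A decompiler could drop such instructions, or at least warn about them, instead of crashing.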

But why does this little glitch cause Mocha to crash? It could just as easily ignore it, or send a warning and move on. While Mocha's vulnerability to this type of bomb could be considered a bug, it's more likely that it was deliberately created by Mocha's author, Hanpeter van Vliet, in response to all the concern about Mocha.

So far, we've looked at older decompilation and obfuscation tools -- oldies but goodies. But such tools have become increasingly sophisticated over the years, particularly with regard to graphical interface. We'll close with a look at one of the newer decompilers, just to give you an idea of what's out there.

i2 = i3;

Mocha's struggles aside, decompilers generally render source fairly accurately. Once you know a decompiler's weaknesses, you can manually analyze and tweak the decompiled code to come up with a fairly accurate representation of the original. Add to that the fact that decompilers are getting better all the time, and we have a bit of a problem: What if you don't want anyone to be able to decompile your code?

New kids on the block

Not only have decompilation and obfuscation techniques become more intricate over the past five years, but the interfaces to them are becoming increasingly slick. Several of the more recent decompilers let you browse through a directory of .class files and decompile them with a single click.

JODE (Java Optimize and Decompile Environment) is one such program. Type the name of a .jar file into the command line and JODE will allow you to graphically browse its classes, automatically decompiling each class for you to see. This is particularly useful for looking through the source code to libraries supplied with your Java SDK. Simply type in the following command:
$ java jode.swingui.Main --classpath [path to your Java SDK]/jre/lib/rt.jar

And you'll get a nice, smooth rendering of the file, as shown in Figure 1.

Figure 1. JODE: A decompiler

http://www.yworks.com/en/products_yguard_about.htm

Tuesday, September 2, 2008

How to hide your Java code to prevent from being decompiled
