Pair Reverse Engineering

Two days ago I had the pleasure of participating in an informal reverse engineering session with three other guys. Between dinner and way too long after midnight we debugged a popular piece of malware that is floating around the internet right now. The first guy had already reverse-engineered an earlier version of the malware. He was the guy in charge who did most of the debugging. The second guy was the author of a program that monitors and logs the behaviour of processes, especially malware processes. The goal of the session was to find out why the malware sample worked perfectly in VMware (after we patched out the VMware check, at least) but crashed as soon as the second guy's monitoring tool was active. The third guy was very familiar with the malware too, but on a higher level (behaviour, network activity, how it spreads, its historic development and usage, ...). I was the fourth guy. Without a direct interest in the malware or the malware monitoring tool I just wanted to see what goes wrong. Furthermore I was the guy for snarky comments from the background like "see, I told you to take a VMware snapshot before stepping over that call".

Anyway, so much for the introduction. This was not the first time I debugged binaries with someone else, but in the past I always had the keyboard. This time I stayed in the background and observed what happened. Being primarily a software developer and only a hobbyist reverse engineer, I compared what I saw to pair programming, where two people sit in front of the same computer and write code together. While I believe that pair programming is at least moderately useful, I got the impression that there are serious problems with pair reverse engineering (or quad reverse engineering).

1. Very loosely specified goals

In most cases, the goal of a reverse engineering project or a reverse engineering session is defined incredibly loosely. Popular examples of reverse engineering goals, especially in the realm of malware analysis, are "unpack file X" or "find out what file Y does" or "find out how network protocol Z works". This is extremely vague. The programming equivalent would be some kind of project that has no specification beyond "implement a text editor". If my boss approached me with a project specification like that I'd probably think "awesome, another project where I can do what I want" and get coding but in reality I should ask for a much more precise specification. Anything else can only end in tears. This leads me directly towards my next point.

2. The difficulty of planning reverse engineering projects in advance

The "implement a text editor" specification is very vague. Most new software projects (should) have more precise requirements. In theory it would be possible to write a proper specification of a text editor and its requirements before writing a single line of code. Sure, surprises and exceptions always happen and the initial requirements list is never perfectly equal to what you want at the end of a project. The basic idea is simply that it is in theory possible to create a fine-grained plan about the software you want to write, what features it should have and how these features should be implemented.

Most reverse engineering projects can't be planned in advance because the target file is nearly always a black box. Reverse engineering projects often have the following structure.

1. Get the target file
2. Magic happens here
3. Success

On x86 architectures "Magic happens here" usually starts with running packer identifiers on the file and loading it into IDA Pro or your favourite debugger. Only after you have looked at the file at least once can you make a vague plan on how to proceed. Basically you invent the plan during the session. This is completely different from planning software in advance.

3. Lots of options on how to proceed

The first two points about the differences between programming and reverse engineering have nothing to do with pair programming or pair reverse engineering itself. They are merely necessary as a precursor for the next three points which highlight the problems of pair reverse engineering.

I like to believe that in software development there are clear and obvious ways to proceed because you've planned ahead. You therefore have very granular steps, and the difference between step n and step n+1 is so small that there are not a whole lot of options on how to get from one to the other. It is therefore very simple for the two programmers in a pair programming team to agree on how to proceed from step n to step n+1. This is especially true because the steps n+2, n+3, n+4 and so on are already known, so when you develop the code to get from step n to step n+1 you already know how to proceed afterwards. This additional knowledge about the future helps to shape the current code.

Reverse engineering seems to be fundamentally different. Without the small, well-defined steps to advance the project there are a lot of options on how to proceed. What I witnessed two days ago was constant disagreement about what was important and therefore how to proceed. One example was a DWORD on the stack that had a wrong value, and we had to find out where the wrong value came from. The first option was to set a memory access breakpoint on the stack. This idea was challenged because one guy wanted to know exactly what was going on. He wanted to single-step through the next (estimated) 100-200 instructions to see every detail. Other disagreements included the question whether switching from OllyDbg to SoftICE led to different results, whether the unpacked sample behaves like the packed sample, which unpacked sample we should debug (there were multiple unpacking stages in the original sample and we had several partially unpacked samples), whether the data on the stack or the stack pointer value itself was wrong, what the best way is to find out whether a piece of code is used as a decryption key for an encrypted part of the file, when to take VMware snapshots, and so on.
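
To make the breakpoint-versus-single-stepping disagreement a bit more concrete, here is a toy sketch in Python. It has nothing to do with the actual interfaces of OllyDbg or SoftICE. The trace, the addresses and the names Write, find_writers and single_step are all made up for illustration, and the "program" is just a list of simulated memory writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """One simulated instruction that stores `value` at `address`."""
    pc: int       # hypothetical instruction address
    address: int  # memory address written to
    value: int    # DWORD value stored


def find_writers(trace, watched_address):
    """Return every write that touched `watched_address`, in execution order.

    This mimics what a memory access breakpoint gives you: execution only
    stops at the instructions that touch the watched location.
    """
    return [w for w in trace if w.address == watched_address]


def single_step(trace, watched_address, expected):
    """Walk the trace one instruction at a time, as single-stepping would.

    Returns the number of steps taken and the first write that left an
    unexpected value in the watched slot (or None if there was none).
    """
    for steps, w in enumerate(trace, start=1):
        if w.address == watched_address and w.value != expected:
            return steps, w
    return len(trace), None


# Made-up trace: 150 unrelated writes, then one that corrupts the stack slot.
STACK_SLOT = 0x0012FF70
trace = [
    Write(pc=0x401000 + 4 * i, address=0x00403000 + 4 * i, value=i)
    for i in range(150)
]
trace.append(Write(pc=0x401258, address=STACK_SLOT, value=0xDEADBEEF))

hits = find_writers(trace, STACK_SLOT)
steps, culprit = single_step(trace, STACK_SLOT, expected=0)
print(f"breakpoint stops: {len(hits)}, single-step stops: {steps}")
print(f"culprit pc: {culprit.pc:#x}")
```

Both approaches find the same culprit in this toy. The breakpoint gets there in one stop, while single-stepping visits all 151 instructions but also shows everything that happened on the way, which is exactly what the other guy was after.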

Maybe it's just that our individual backgrounds were too different and therefore we had a lot of different ideas on what went on in the piece of malware and which parts we considered important. This is definitely a possibility, but I like to believe that there are severe shortcomings in pair reverse engineering caused by the guesswork you inevitably have to do and the limited knowledge about your future discoveries in the code of the binary file.

Very often during our session I wished for the possibility to fork our group into a parallel universe, with the group in universe A proceeding with option X and the group in universe B proceeding with option Y. The structure of the forked groups would resemble a pyramid that explores every single option on how to proceed in parallel. Wouldn't that be the bee's knees?

4. No standardized methods

Directly related to the multitude of available options is the lack of standardized methods in reverse engineering. Software development has lots of standardized methods. I can name several without looking them up. There are waterfall models, agile development, extreme programming, test-driven development, model-driven development, random OOA/OOD/OOP approaches, and so on. There are even ISO-certified guidelines for software development as far as I know. Some of the methods I mentioned overlap and all of them can be condensed into buzzwords like I just did and many software developers still have a vague idea what I'm talking about.

If someone asked me to name some standardized reverse engineering methods I'd draw a complete blank. Now there are several possible explanations for this. Maybe they exist and I am just not aware of them. This is a possibility, but if they do exist I doubt they are very widespread. I spend roughly 99.7% of my workday reading random blogs, so I should have read about them before. I think it's far more likely that there was never a push for researching and introducing standardized reverse engineering methods, or maybe someone did research them but it didn't work out too well because of my points 1 and 2 or because of other difficulties.

Whatever the reason for the lack of standardized methods is, I believe it is a severe impediment to pair reverse engineering. You can't just tell someone to use method X or method Y if there are no well-defined methods at all. I wonder whether companies that do lots of reverse engineering have standardized in-house methods that resemble standardized software development methods. This would make perfect sense, and maybe the lack of a standardized method was only characteristic of our informal past-midnight RE session and not of RE in general. On the other hand, I've worked on several very informal open-source programming projects with other people before, and despite the informality it was always possible to describe the project style and how to add your own code in just a few sentences (code speaks for itself too). This was probably possible because software you develop is a white box while software you reverse engineer is a black box. Standardizing methods for something you know a lot about is probably far simpler than standardizing methods for something you know nothing about.

5. No standardized tools

Another (albeit minor) point is the lack of standardized tools. With the exception of the really widespread tools like IDA Pro, SoftICE or OllyDbg, most people use very different tools for reverse engineering. Many of the tools are private in-house tools, others are so expensive that only very few people have them, and sometimes there are just so many tools to perform a particular task that everybody uses a different one (PE editors are an example). I wonder how this compares to software development and I am not totally certain. On the one hand the total number of programming-related tools probably exceeds the total number of RE-related tools by orders of magnitude. On the other hand I feel that the core software development tools are more standardized and the average programmer is more familiar with them. Depending on what kind of software you develop on what platform, most software developers probably use the popular core tools A, B, and C (where A could be Eclipse, Visual Studio, and so on). The differences between the software development tools people use mainly manifest themselves in the add-ons for the core tools. In RE I see a lot of variety in the core tools themselves (hex editors, PE editors, ...).

Like I said, this is a minor problem that is probably most visible in situations where you randomly meet up for informal RE sessions. In an actual professional RE environment you could just make in-house rules like "everybody must use core tool A" or at least "everybody must know how to use core tool A for pair RE".

Anyway, this post was just a quick brain dump to make sure that I don't forget my impressions of two days ago. I'm going to think more about the parallels and differences between software development and reverse engineering and whether proven practices of software development could be ported to reverse engineering. Let's see what I can come up with.

For the curious, we didn't solve the problem of why the malware crashed in the presence of the monitoring tool. Amusingly we even disagreed on the likely reasons for the crash. My own personal interpretation is that it's 65% likely that there is a bug in the monitoring tool, which is not totally transparent and screws up something in the target process (the other guys dismiss this option completely because we ran a test that supposedly eliminates it; in my opinion the test was run in a compromised VMware snapshot though and is therefore not valid). The odds are 30% that there's a bug in the packer and 5% that the packer intentionally screws up the stack because it detects the monitoring tool. 5% is low but if you'd seen the way the malware crashes you'd probably agree too. It's the one thing we actually agreed upon in our group.
