During the last year I have implemented quite a few file format parsers for a variety of reverse engineering tools, some in the context of malware detection and others in the context of vulnerability analysis. I wrote file parsers for complex modern file formats like SWF and PDF and for obscure file formats that are older than I and some that are nearly as old as my parents! In total I have written file format parsers for probably around 15 file formats and I have made some observations about the whole process I would like to share.
File format parsers for reverse engineering tools differ significantly from file format parsers for other programs like authoring tools or media players. Here is what I mean. Imagine you want to write a player for Flash animations. You have to parse SWF input files and display the animations defined there. You are basically working with a file format (SWF). On the other hand, if you are writing a reverse engineering tool you are not targeting a file format. You are really targeting the standard application for working with this file format (like Adobe Flash Player), be it for vulnerability research, for malware detection, or for some other reason. This has a big impact on how you have to write your parser. The official specification of a file format is not enough to implement your parser. You have to reverse engineer the targeted application and figure out how it handles all those little idiosyncrasies (error handling and recovery, for example) that are not in the spec. Then you have to implement your parser to behave like the targeted application in those edge cases.
Closely related to this idea is that your parser needs to keep parsing strictly separate from verification. It is tempting to combine these two steps, and in media players, where you can expect to work with valid files, it makes some sense. If you detect things like invalid values for important enumeration fields you can bail and tell the user that the file is invalid and cannot be played. When you are writing a parser for a reverse engineering tool, your parser is not the ultimate arbiter of correctness. Rather, the targeted application is. This is especially true if you want to combine your parser with a file fuzzing tool. If your tool rejects malformed files you will possibly miss bugs in the targeted application, and fuzzer authors will hate you.
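A minimal sketch of this separation, in Python, might look like the following. The file layout (a four-byte magic followed by a 16-bit compression field), the list of known values, and all names are made up for illustration and do not come from any real format.

```python
import struct
from dataclasses import dataclass

# Hypothetical set of values a spec might allow for the compression field.
KNOWN_COMPRESSION = {0: "none", 1: "zlib"}


@dataclass
class Header:
    magic: bytes
    compression: int


def parse_header(data: bytes) -> Header:
    """Parse only: read whatever is there and never reject on content."""
    magic = data[0:4]
    (compression,) = struct.unpack_from("<H", data, 4)
    return Header(magic, compression)


def verify_header(header: Header) -> list:
    """Verify separately: report problems but leave the parsed data alone."""
    problems = []
    if header.magic != b"DEMO":
        problems.append(f"unexpected magic {header.magic!r}")
    if header.compression not in KNOWN_COMPRESSION:
        problems.append(f"unknown compression value {header.compression}")
    return problems


# A file with a valid magic but an out-of-range compression value still parses.
header = parse_header(b"DEMO\x07\x00")
problems = verify_header(header)
```

The parser hands back the bogus value 7 untouched, and the verifier merely reports it. A fuzzer or a reverse engineer can then decide what to do with the file instead of the parser deciding for them.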
Instead of rejecting files, it is incredibly important to add recovery to your parser. Your parser needs to be able to handle all those malformed files the fuzzer is throwing at it. Here, your parser actually has to be more liberal than the targeted application. While the targeted application (such as Flash Player) can bail out at some point and just tell the user that it is not able to process the file, your reverse engineering tool must still process as much as possible and display the results to the reverse engineer. A very common situation is that your parser comes across a malformed structure and runs beyond the end of the file trying to parse it. Detecting what could be wrong with this structure and recovering from the problem is immensely important. Otherwise, everything after that structure in the file will just be a big blob of uselessness to your parser and the reverse engineer who is using it.
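One simple recovery strategy for the running-past-the-end case is to clamp a declared length to the bytes that are actually left and to flag the structure as truncated. The sketch below assumes a made-up format of consecutive records, each a little-endian 32-bit length followed by that many payload bytes. The function and key names are hypothetical, and a real tool would want to mimic whatever the targeted application does here.

```python
import struct


def parse_records(data: bytes) -> list:
    """Parse [uint32 length][payload] records without ever reading past the end."""
    records = []
    pos = 0
    # Stop when fewer than four bytes remain; they cannot hold a length field.
    while pos + 4 <= len(data):
        (declared,) = struct.unpack_from("<I", data, pos)
        start = pos + 4
        available = len(data) - start
        # Recovery step: trust the file size over the declared length.
        actual = min(declared, available)
        records.append({
            "offset": pos,
            "declared_length": declared,
            "payload": data[start:start + actual],
            "truncated": declared > available,
        })
        pos = start + actual
    return records


# Second record claims 1000 bytes, but only 3 are left in the file.
sample = struct.pack("<I", 2) + b"ab" + struct.pack("<I", 1000) + b"xyz"
records = parse_records(sample)
```

The malformed record is still returned, together with its offset and the bogus declared length, so the reverse engineer can see exactly where the file went wrong.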
Once you have thought about all this you should invest some time in writing a binary file parser framework. Let's face it, if you are ever implementing a file parser for a reverse engineering tool you are likely to implement another one. And possibly for really complex file formats too! The thing is, there are file formats out there where the specification is hundreds of pages long, if not more than a thousand. Writing parsers for these file formats sucks if you do not have a domain-specific language (DSL) for binary file parsing. You will toil away writing boilerplate code forever, especially if you are working in verbose languages like Java and C#. You can save a lot of time by first creating a DSL and then implementing the parser on top of it. This DSL should have just one goal: to allow you to implement the parser as fast as possible. Nobody is interested in writing a parser. You actually want to use the results of the parser, so you had better hurry to get there!
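To give a rough idea of how little code a declarative description can need, here is a toy sketch in Python. It is not a full DSL, only a list of field names and `struct` format codes walked by one generic function. The `HEADER` layout and the `parse_struct` helper are invented for this example.

```python
import struct

# Hypothetical layout description: (field name, struct format code).
HEADER = [
    ("magic", "4s"),        # four raw bytes
    ("version", "B"),       # unsigned 8-bit integer
    ("file_length", "I"),   # unsigned 32-bit integer
]


def parse_struct(layout, data: bytes, offset: int = 0, endian: str = "<") -> dict:
    """Walk a layout description and read each field in order."""
    result = {}
    for name, code in layout:
        fmt = endian + code
        (value,) = struct.unpack_from(fmt, data, offset)
        result[name] = value
        offset += struct.calcsize(fmt)
    return result


parsed = parse_struct(HEADER, b"DEMO\x09" + struct.pack("<I", 1234))
```

Adding a new structure then means writing another table rather than another hand-rolled reader function.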
Alright, now you can get started with your DSL. I do not want to go into the details of how to structure your DSL, but here is a tip. When you come across a UINT32 (or another primitive value) field do not parse this as 'unsigned int' or whatever the equivalent is in your language. Create a class for it. Add things like the offset in the file and the length of the structure you are parsing. Maybe even create a common super-class for all your parsed field types and think a bit about the possibilities this opens. You will thank me later.
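As a minimal sketch of that tip, assuming Python and little-endian data, a wrapped UINT32 could look like this. The class names `ParsedField` and `UInt32` and the helper `read_uint32` are hypothetical.

```python
import struct
from dataclasses import dataclass


@dataclass
class ParsedField:
    """Common base class: every parsed value knows where it came from."""
    name: str
    offset: int  # position of the field in the file
    length: int  # number of bytes the field occupies

    @property
    def end(self) -> int:
        """Offset of the first byte after this field."""
        return self.offset + self.length


@dataclass
class UInt32(ParsedField):
    value: int = 0


def read_uint32(data: bytes, offset: int, name: str) -> UInt32:
    """Read a little-endian UINT32 and keep its location alongside the value."""
    (value,) = struct.unpack_from("<I", data, offset)
    return UInt32(name=name, offset=offset, length=4, value=value)


sample = b"\x00\x00" + struct.pack("<I", 0xDEADBEEF)
frame_count = read_uint32(sample, 2, "frame_count")
```

With the offset and length attached, a GUI can highlight the field in a hex view, a fuzzer can mutate exactly those bytes, and the next field can be read from `frame_count.end`.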
Thank you for your comment. I use 010 Editor myself at work and it is great. The template engine they have is indeed quite useful and the syntax is relatively compact. If you do not need to write a stand-alone tool 010 Editor is a good alternative.