2012-09-17

Forensic Programming Rules of Thumb

I have a set of mental guidelines I use when writing digital forensics programs, and it's high time I wrote them down. This is by no means a complete list of useful programming techniques, but I think each one is worth keeping in mind.

Algorithms

1. Don't be afraid to use RAM. Most investigators' machines have several gigabytes of RAM installed, so you might as well use it.

2. Remember, though, that RAM is fixed while disk space is nearly infinite. If you can't set an upper limit on the amount of RAM used by your program, create a cache data structure, size it appropriately, and spill to disk if you exceed it (see the first sketch after this list).

3. Prefer streaming models to building up complex data structures in RAM. Streaming models generally use a fixed amount of memory and scale to large evidence sets (second sketch below).

4. Choose algorithms based on the best worst-case behavior, rather than the best average case. Investigators don't need the best possible performance from their software as much as they need predictable performance and the ability of the software to cope when faced with The Big One. An algorithm with bad worst-case performance is not mission-capable.
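
To make rule 2 concrete, here's a minimal sketch of a spill-to-disk cache in Python. The class name, the item limit, and the pickle-on-tempfile storage are assumptions for illustration, not a prescription:

    import pickle
    import tempfile

    class SpillCache:
        # Accumulate items in RAM up to a fixed limit, then spill
        # batches to a temporary file so RAM usage stays bounded.
        def __init__(self, max_items=1000000):
            self.max_items = max_items
            self.items = []
            self.spill_file = None

        def add(self, item):
            self.items.append(item)
            if len(self.items) >= self.max_items:
                self._spill()

        def _spill(self):
            if self.spill_file is None:
                self.spill_file = tempfile.TemporaryFile()
            pickle.dump(self.items, self.spill_file)
            self.items = []

        def __iter__(self):
            # Replay spilled batches from disk first, then the in-RAM remainder.
            if self.spill_file is not None:
                self.spill_file.seek(0)
                while True:
                    try:
                        batch = pickle.load(self.spill_file)
                    except EOFError:
                        break
                    for item in batch:
                        yield item
            for item in self.items:
                yield item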
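
And for rule 3, generators make the streaming model nearly free in Python. This sketch assumes a made-up fixed-size record layout; the point is that memory use stays constant no matter how big the evidence file is:

    import struct

    RECORD_SIZE = 32  # hypothetical fixed-size record layout

    def stream_records(path):
        # Yield one parsed record at a time; nothing accumulates in RAM.
        with open(path, 'rb') as f:
            while True:
                buf = f.read(RECORD_SIZE)
                if len(buf) < RECORD_SIZE:
                    return
                yield struct.unpack('<8sQQQ', buf)

    # Consumers compose without buffering everything:
    # interesting = (r for r in stream_records('evidence.bin') if r[1] != 0)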

Timestamps

1. Convert local time to UTC as soon as possible (a short sketch follows this list).

2. Always report timestamps in UTC. I'm generally not a fan of programmers telling users what to do, but I'll make an exception in this case; when users ask for local time output, firmly reply "no" and tell them to get comfortable working in UTC.

3. Do not separate times from dates in output. Timestamps are points in time; separating times from dates imposes the periodicity of the earth's rotation onto your investigation, and that's often not germane. Even when it is germane, having dates and times in separate fields makes it a huge PITA to calculate timestamp differences in, e.g., Excel.

4. Timestamps generally exist within a "context," for lack of a better word. Contexts are often cascading or recursive. For example, the context for an OLE timestamp within a Word 2003 document is "Word 2003 OLE," followed by the filesystem (e.g., UTC for NTFS, local time adjustment for FAT32), followed by the system time settings, followed by, say, the timezone of where the evidence was seized. It's important to remember these contexts when handling local time: even if you convert local time to UTC, one of these contexts could later be found to be incorrect. Consider the timestamps as existing in a database; it's better to query for the affected rows and adjust them by context than to patch things up through some ad hoc mechanism (the second sketch after this list shows the idea).
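
Here's a minimal sketch of rules 1 and 2 in Python; the UTC-5 offset stands in for whatever the evidence's local time settings actually were:

    from datetime import datetime, timedelta, timezone

    # Assumed offset, recorded from the evidence system's time settings.
    EVIDENCE_TZ = timezone(timedelta(hours=-5))

    def to_utc(naive_local):
        # Attach the evidence time zone and convert at the boundary,
        # before the timestamp goes anywhere else in the program.
        return naive_local.replace(tzinfo=EVIDENCE_TZ).astimezone(timezone.utc)

    ts = to_utc(datetime(2012, 9, 17, 14, 30, 0))
    print(ts.isoformat())  # 2012-09-17T19:30:00+00:00 -- UTC, date and time together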
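
And a sketch of the database idea from rule 4, with hypothetical table and column names: store each timestamp alongside its context, so a bad assumption can be corrected with one query instead of a re-parse:

    import sqlite3

    db = sqlite3.connect(':memory:')
    db.execute('''CREATE TABLE timestamps
                  (artifact TEXT, ts_utc TEXT,
                   filesystem TEXT, assumed_offset_min INTEGER)''')

    # Suppose we later learn the FAT32 volume's assumed offset was off
    # by an hour; fix the affected rows instead of re-running the parser.
    db.execute("""UPDATE timestamps
                  SET ts_utc = datetime(ts_utc, '+60 minutes'),
                      assumed_offset_min = assumed_offset_min + 60
                  WHERE filesystem = 'FAT32'
                    AND assumed_offset_min = -300""")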

Artifacts

1. Never assume anything about your input. Ooh, you found a magic header for your artifact? So what!? The data could have been generated by a cuckoo clock. Sanity-check everything, especially when reading variable-length buffers or strings whose sizes are denoted in other fields. Return true or false depending on whether the artifact parsed successfully; if it mostly parsed but failed some kind of test, return the partial results along with false. (The first sketch after this list covers rules 1 through 4.)

2. Keep your artifact parsing code separate from your I/O code (where possible, operate on provided memory buffers and let other code feed in the data) and separate from your reporting code. This way you foster code reuse when someone else needs to parse the same artifact in a different context.

3. Put the immediate results from parsing into a struct or, perhaps better, a name-value data structure, such as a Python dict. Don't serialize into a string or otherwise obliterate the type information you have at that moment; that's for your output code to handle. Again, this fosters code reuse.

4. Always refer to bytes on disk. When you output parsed artifacts, always give the user a reference to where each artifact existed on disk. This way the user has a chance in hell of finding it for themselves and validating that your software parsed it correctly.

5. Remember that parsing is not the same thing as analysis. The output from parsing an artifact should be in a form that makes it convenient for the examiner to analyze. This could mean, for example, that it's easy to get the output into Excel or Access, or to pipe it into different command-line utilities (a TSV sketch follows the list). It's way better to produce output that is well-structured than pretty: well-structured data can always be prettied up, while it's often hard to recreate structure from pretty data.

6. It's better to identify an artifact by searching for it than by assuming it exists within a well-structured file. If you can identify artifacts by searching, then you can parse them out of RAM dumps, virtual memory swapfiles, and unallocated space, as in the last sketch below.
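
Here are a few sketches tying the artifact rules together. First, rules 1 through 4 in one hypothetical parser; the magic value, layout, and field names are invented for illustration. It takes a caller-supplied buffer instead of doing its own I/O, sanity-checks a length field before trusting it, keeps typed values in a dict, and always records the byte offset:

    import struct

    MAGIC = b'ART\x00'  # hypothetical signature for a made-up artifact

    def parse_artifact(buf, offset):
        # Returns (ok, result). On partial failure ok is False, but
        # whatever parsed cleanly is still present in result.
        result = {'offset': offset}  # always tie output to bytes on disk
        if buf[offset:offset + 4] != MAGIC:
            return False, result
        if offset + 18 > len(buf):
            return False, result  # header would run past the buffer
        flags, filetime, name_len = struct.unpack_from('<IQH', buf, offset + 4)
        result['flags'] = flags
        result['filetime'] = filetime  # keep it typed; output code formats it
        # Never trust a length field read from the data itself.
        if name_len > 1024 or offset + 18 + name_len > len(buf):
            return False, result
        result['name'] = buf[offset + 18:offset + 18 + name_len].decode('utf-8', 'replace')
        return True, result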
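
For rule 5, a sketch of structured-over-pretty output: tab-separated values import cleanly into Excel or Access and pipe into command-line tools. The columns match the hypothetical parser above:

    import csv
    import sys

    def report(results, out=sys.stdout):
        # One row per artifact; no formatting beyond the structure itself.
        w = csv.writer(out, delimiter='\t', lineterminator='\n')
        w.writerow(['offset', 'flags', 'filetime', 'name'])
        for r in results:
            w.writerow([r['offset'], r.get('flags', ''),
                        r.get('filetime', ''), r.get('name', '')])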
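
Finally, rule 6: because parse_artifact works on an arbitrary buffer at an arbitrary offset, searching for its signature turns it into a carver that runs just as well over unallocated space or a RAM dump:

    def carve(buf):
        # Scan raw bytes for the signature and parse every hit.
        hits = []
        pos = buf.find(MAGIC)
        while pos != -1:
            ok, result = parse_artifact(buf, pos)
            if ok:
                hits.append(result)
            pos = buf.find(MAGIC, pos + 1)
        return hits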