Evaluating SANS Courses

I've only taken one DFIR training course: the EnCase Advanced class back in 2003, during my third week on the job at Guidance Software, at their old Green Street location. It was taught by Jessica Bair and Ric Stonesifer, and, since I knew nothing about forensics or EnCase at the time, it was a good, fast-paced introduction to the field. From there, I learned by doing, maintaining Guidance's core set of artifact EnScripts and then supporting the company's professional services examiners in their casework.

I've forgotten a great deal about old artifacts since then, and have not kept up on new artifacts as much lately. And, of course, there are far more systems and approaches to forensics than existed 15 years ago—macOS, iOS, Android, memory forensics, malware analysis, etc. My dimly-recalled Windows 98 artifact knowledge no longer applies.

Live training is a good option for me because I can turn my phone and email off and simply focus for a week. So, I'm thinking about taking a SANS course this year. Which one should I take? These are the possibilities:
  • FOR 500, Windows Forensics. I do know a lot about Windows forensics, so I'm worried I might get a little bored, but this would also be a good tune-up.
  • FOR 508, Advanced IR & Threat Hunting. I've assisted on a number of large-scale IR scenarios, but always in a focused, limited capacity where I've been asked to write code to do a specific thing (usually at scale). Working through the complete chain of an investigation could give me new insight into current problem areas to focus on as a developer, and especially provide me with greater intuition into the needs of IR examiners.
  • FOR 572, Network Forensics. I've focused almost exclusively on host-based forensics, and have always thought that network forensics is not well integrated into traditional forensics investigations (especially IR and IP theft scenarios). I did a bit of libpcap network programming in 2002, but there'd be much here that'd be new to me.
  • FOR 578, Cyber Threat Intelligence. I'm trying to learn much more about threat intelligence and see it as directly applicable to my job. OTOH, the course seems a bit thin; working more hands-on with threat intel data sets would be good practice, but I worry about there being too much focus on jargon (I still am skeptical that "kill chain" means anything that isn't obvious) and on high-level intelligence analysis and techniques (e.g., ACH) and less on tech. It's fine for training classes to address non-technical aspects of a subject, but I personally would get bored and then not get much out of the class.
  • FOR 610, Malware Analysis. Here's something I don't know much about. But would I be better off taking a more in-depth reverse-engineering course from a different organization? Spending time on little utilities like pescan and pdfid would not be useful to me; fundamental deep dives into PE and ELF, unpacking, reverse engineering, call graph analysis, and so on, definitely would be great.
  • FOR 498, Battlefield Forensics. I worked on bulk_extractor and a few other MEDEX projects, and a lot of this seems specific to acquiring different devices and whatnot. Probably not for me.
  • FOR 518, Mac and iOS Forensics. I know a few macOS artifacts and use a Mac as my daily computer, but there'd be a lot here for me to learn.
  • FOR 585, Smartphone Forensics. I know next to nothing about smartphone forensics. This might be inspiring, as my impression of existing mobile forensics tools has not been favorable.
  • FOR 526, Memory Forensics. I also know very little about memory forensics. I do have a pretty good sense of how operating systems and the memory hierarchy work. The fact that this covers multiple operating systems and associated techniques is promising, though.
Are there other courses I should look at, including outside of the SANS curricula? Training works best for me when it's very fast-paced and requires hands-on work. Otherwise, my ADD-addled brain will stop paying attention and I'll just daydream.


2020 Goals

My tech skill development goals for 2020:
  • Build something useful in Rust, with Rust proficiency as a side-effect
  • Develop competence with low-level optimization
    • tooling: nonius, perf, flame graphs, some dtrace/strace
    • assembly: casual familiarity reading x86, comfort with SSE/AVX and associated libraries
    • hardware: learn recent microarchitectures (Haswell–) for Intel and AMD, gain empirical expertise with contemporary memory hierarchies including NVMe
    • libraries: add a few C++ libraries to my trusted working set beyond Boost, focused on addressing particular performance issues (e.g., replacements for std::unordered_map, B-trees, a trie, a Bloom filter lib, etc.)
  • Do everything with some set of Docker, CloudFormation, Vagrant, and Terraform
  • Apply some data science/machine learning techniques to real problems
  • Become dangerously inept with ReactJS and Django as a frontend/backend webapp platform
With the exception of learning how to use ReactJS with Django in order to build a simple webapp, these goals are focused on developing working expertise, building off my strengths in batch processing/backend/systems development. None of these goals involve surface-level reading about new technologies in order to gain conceptual familiarity. I've read about a lot of these technologies for years. It's high time to incorporate them into my daily development activities.


GNU Autotools Grievances

Part 1, of Infinity

  • Adding or renaming a source file requires changing Makefile.am, which means you have to run autoreconf again, which generates a new configure script, which you then must re-run, even though nothing about your configuration has changed.
  • Pulled some changes, now out of date, and you want to run make clean? Have to run ./configure again.
  • Why, yes, my system has strlen. Thank you for checking. Again. And again. And again.
    • Is there a sane response to C development on a system without strlen?
      • NB: Folks running UNICOS don't get to vote.
  • Every function, header, and library you check for in autoconf results in generating a little C/C++ program and compiling it, all done through shell script, natch.
  • The GNU documentation advises against globbing all source files in a given directory for your Makefile and recommends, instead, listing them all individually. Who does this?
  • Autoconf is really just a bunch of m4 macros that expand into shell code, which means they introduce new shell variables. So when you see a variable referenced somewhere and you search for it to find where it's set, good luck to you.
  • libtool! Huh! Good God, y'all... What is it good for? Absolutely nothing!
    • Seriously, this is a gigantor shell script whose purpose in life, to the best of my understanding, is to supply the right GCC options for building object code for shared libraries, and then linking them together.
      • That is, it's an enormous shell script middleman to save you the hassle of knowing "-fPIC", "-shared", and ".so". No, really.
    • 20 years ago (maybe even 10; maybe), the platform portability problems libtool was “designed” (a strong word for anything written in shell) to fix might have been an issue. EVERYONE USES LINUX NOW. And those who use FreeBSD still use the same tools.
    • IT HIDES YOUR EXECUTABLES in .folders and generates shell scripts which you'd think would be your executables, but aren't; their sole job in life is to fuxor the PATH and then invoke the hidden executables. Good luck using gdb or any other debugger on them.
  • The configure script is not in any way parallelizable. Running it on a wide enough machine can actually make it the slowest part of building.
  • Want to make a debug build? Rerun configure.
  • Want to use an alternate compiler on your system? Make sure you pass CC and CXX as variables on the command-line every single time you run configure. (Props to SCons for remembering default choices in its build_variables.py file.)
  • Have you ever tried reading through the phantasmagoria that is the shell scripts generated by autoconf? It's seriously the inspiration behind Plato's Allegory of the Cave.


System Reconfig

A year and a half ago, I bought a new system. Meet Virgil:

Virgil's main specs:

  • 2x Xeon E5645 (2.4GHz, 6 cores, 12MB L3, hyperthreading—apiece)
  • the legendary EVGA SR-2 motherboard
  • 24GB RAM
  • a pair of WD RE4 2TB hard drives
  • in a Lian Li PC-V2120 case
  • all kinds of I/O ports and a ho-hum-but-CUDA-capable GPU
The SR-2 motherboard is a non-standard size and will not fit in a normal ATX case. It necessitated the mondo case. At the time, the SR-2 was about the only desktop board that allowed for dual Xeons (otherwise you needed to buy a true server; blech). A year and a half after purchase, it is still far more computer than I need and I sometimes feel bad considering how much it idles. With its cores at only 2.4GHz, some laptops can beat Virgil at single-threaded tasks, but Virgil excels at parallel compiles, data processing, and the like. With Xeon fans, Virgil sounds like a red-lined Ferrari when it's under load.

Operating Systems

I first started off running Ubuntu, which I mostly liked. However, the upgrade to 12.04 failed completely—Virgil could not boot, and I had no interest in troubleshooting the failure. Additionally, I found Ubuntu's Unity interface a little tiresome. I probably appreciated it more than most Linux users, but it involved a bit more mousing around than I wanted.

A colleague is an expert Fedora user, so I installed Fedora. I figured it'd be worthwhile to learn an alternate approach and I could always piggyback off his knowledge. However, I hated Fedora. Gnome 3 is a disaster, Xfce felt like 1998 all over again, and Fedora demanded far too much sysadmin time from me. I didn't want to re-learn how to set up Samba configuration files again, as if it were still 2002; I wanted to click a checkbox and have my home folder shared so I could access it from VMs without any fuss. This was possible on Ubuntu... not so much on Fedora.

Additionally, I gradually realized that the Fedora installer had messed up my RAID. Under Ubuntu, I'd configured the two drives to act as an LVM RAID-1. When I installed Fedora, it could not recognize the partition, which was hugely annoying, but it did seem to recognize the underlying LVM RAID. I was wrong, though: it turned the RAID-1 into a spanned volume. I no longer had a redundant 2TB drive; I had a 4TB drive with twice the probability of failure as a single drive and no redeeming performance qualities.

Of course, by the time I realized this (disclosure: I am really bad at IT/sysadmin), I'd already accumulated a little over 2TB of data, thanks in large part to VirusShare. It would not be so simple to go back to the RAID-1.


So I bought another 2TB WD RE4 hard drive and a 256GB OCZ Vertex 4 SSD. I had two main goals: first, I wanted to install Mint as the main Linux system and I wanted to do it in such a way that I could trash it, install some other system, and be up and running again in a day or less, without too much time spent sysadmin'ing; second, I wanted a RAID for my main data, and I wanted the volume somewhat separate from /home so I could simply remount it as a subdirectory in my home directory under a new system, perhaps issue a recursive chown & chgrp, and be good to go. (Fedora and Ubuntu use different numbering schemes for UIDs, of course; it should be possible to preserve one's UID from system to system with a little bit of work, but the underlying theme of this blog post is how I am too lazy to do such things).

I first installed the SSD in the system and installed the newly released Mint 14 (with the Cinnamon UI) on a single / partition. I then backed up all of my data to an external 3TB drive, installed the 3rd WD drive, and blew my old data away.

Although the SR-2 board supports various RAID options in hardware, I decided I wanted a software RAID so I could survive a motherboard failure. Virgil has compute power to spare, and these are only 7200 RPM drives, so I shouldn't suffer significant performance loss by going the software RAID route. I first constructed an unformatted partition on each of the drives, leaving a small buffer before and after in case I someday need to add another drive that differs slightly in its sector count. I then followed the instructions on the Linux RAID wiki to construct a RAID5 array from the three partitions:
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdc1 /dev/sdd1
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
This created a 4TB RAID5 across the disks, with a default-selected chunk size of 512KB.


I narrowed my choices for the filesystem to EXT4 or XFS. Since I have a mix of very large files (evidence and other archives) and very small files (source code), I wanted an extent-based filesystem. BtrFS has an advanced feature-set, but reading up on it does not fill one with confidence about its reliability. Of course, one can find data loss stories about EXT4 and XFS as well. In the end I decided to go with XFS: it's an older design that's been strenuously tested on far more advanced RAID hardware, yet it's still under active development and maintenance. This is an intuitive choice more than anything.

So I issued the command:
mkfs -t xfs /dev/md0
and immediately was presented with an error that the 512KB chunk size was too large for XFS. It seems that the default chunk size of 512KB is a recent change in the mdadm tools, likely geared towards EXT4. However, the maximum chunk size that XFS can handle is 256KB and XFS developers still recommend a 32KB chunk size in the general case. So, I deleted the RAID and recreated it:

mdadm --create --verbose /dev/md0 --level=5 --chunk=32 --raid-devices=3 /dev/sda1 /dev/sdc1 /dev/sdd1
mdadm --detail --scan >> /etc/mdadm/mdadm.conf 
mkfs -t xfs /dev/md0
mount /dev/md0 /mnt/raid
mkdir /mnt/raid/jon/data
ln -s /mnt/raid/jon/data /home/jon/data 
I now had my XFS RAID5 and had it easily accessible inside my home directory, while still having the overall volume available for other uses.

One thing I noticed in reading about Linux RAIDs is that many people use LVM on top of the raw RAID device and then create different filesystems on top of LVM. If one were managing a system with a few large RAIDs for a large number of users, I could see how LVM would be useful. However, I cannot see any utility for it when you just want a single RAID volume. There's likely no performance loss to using LVM, but it adds another layer of complexity... and I am too lazy for unnecessary complexity.

The Disposable System

The RAID solves my data problem, but what about being able to get up and running with another system in the future? This enters into the realm of configuration management. There are two new configuration management systems that are taking the open source world by storm, Puppet and Chef. Both are implemented in Ruby and both seem like good pieces of software. They're both geared towards sysadmins who need to manage many systems. Essentially what both do is take configuration files specifying what packages, files, user accounts, and other configuration details should apply to a given system, and then they make the system conform to that description.

The main differentiator between Puppet and Chef is that Puppet has its own syntax for its config files whereas Chef's config files are valid Ruby source code. What I want to do is simply track the packages I install over time, so that I can get them reinstalled quickly if I need to. After looking at the syntax of each, I decided that Puppet would be slightly simpler.

So, what I have been doing is building a directory of Puppet config files, each containing a list of software packages. The config files are roughly categorized, so I have one for basic C/C++ development, another for Python, one for system tools, one for GUI applications, etc. When I find myself in need of a package, I add it to the appropriate config file, and I can then bring my system into compliance with:
sudo puppet apply config-file.pp
Puppet figures out which, if any, packages are missing and then invokes apt-get to download and install them. If I switched back to Fedora, or to FreeBSD or some other crazy system, I would likely need to remap some of these package names slightly, but at least I'm tracking them. Additionally, I will sometimes create an EC2 instance to accomplish a specific task, and having categorized Puppet config files should help me bootstrap barebones instances in short order. I track my config files with git, natch, and mirror them on Github as I do with all my other source code.

Data has mass; I now have a RAID that should be able to survive a move to a different environment without having to make copies. And by using Puppet to reduce system management into, essentially, software development, I have reduced the mass of my operating system to almost nothing. All I need to do on a new system is get it to mount my RAID, drop my ssh keys, install Puppet, and then wait while all the rest of the software I depend on is downloaded and installed automatically.


Speaking of Git, another useful software package for anyone running Linux is etckeeper. etckeeper is a software package that interacts with your package management system and tracks changes to /etc. If you inadvertently screw up some aspect of your system via editing of various /etc config files, you can always go back in time to restore things. In this respect, it's sort of like System Restore, except that it works.


Representing Encoding Chains

Here's a problem I've been mulling over the last couple days: what is a good way to represent encoding chains?

What's an encoding chain?

When we search for text, it's always in some kind of encoding, something that specifies how logical characters are represented as "physical" bytes (a discussion of the physicality of bytes is best left for happy hour). For example, "A" is represented as byte 0x41 in the ANSI or UTF-8 encodings but as the two-byte sequence 0x41 0x00 in UTF-16LE (aka "two-byte Unicode") and UCS-2 (aka "fake two-byte Unicode"; never use UCS-2).
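The byte-level difference is easy to see in, say, Python:

```python
# The same logical character maps to different bytes in different encodings.
print("A".encode("utf-8"))     # one byte
print("A".encode("utf-16le"))  # two bytes: 0x41 0x00
```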

To search for a keyword in a file, we either need to know the encoding used for the file and convert the bytes in the file to logical characters to perform the comparison to the keyword, or we need to convert the characters in the keyword into their binary form using the file's encoding and then search for those bytes in the file. Most text editors do the former; Lightgrep does the latter, as we don't usually know the encodings used for files, especially when searching fragments in unallocated space and whatnot.

An encoding chain goes beyond a single encoding: it's what happens when we want to deal with data that's been subject to a series of transformations, including a normal characters-to-bytes encoding.


A useful artifact in forensics is the UserAssist registry key. These keys store recently run programs, recording the name of the program, when it was last run, and the number of times it was run. A bewildering aspect of values under the UserAssist key is that the names of the programs are ROT13 "encrypted". This means that "A" becomes "N", "B" becomes "O", and so on.

ROT13 seems like a silly form of obfuscation... until you need to search for UserAssist registry keys pertaining to a particular program. Many forensic tools will parse registry hives and some will even decrypt the UserAssist keys for you, but if you'd like to look for traces of keys in the pagefile or in unallocated, you are out of luck when it comes to tool support. You must manually apply ROT13 to the name of the program ("notepad" becomes "abgrcnq") and then apply the UTF-16LE encoding to it ("abgrcnq" becomes "a\x00b\x00g\x00r\x00c\x00n\x00q\x00"). Tedious, no?
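The two-step conversion is mechanical, at least; here's a Python sketch (the function name is mine, not any real tool's API):

```python
import codecs

def userassist_search_bytes(term: str) -> bytes:
    # Character transform first (ROT13 on the string),
    # then the UTF-16LE encoding (string to bytes).
    return codecs.encode(term, "rot13").encode("utf-16le")

pattern = userassist_search_bytes("notepad")
# pattern == b'a\x00b\x00g\x00r\x00c\x00n\x00q\x00'
```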

Another example (featuring more love for Didier Stevens) is when looking for obfuscated strings in malware. XORSearch is a program Mr. Stevens recently released which lets you search for strings that have been obfuscated with XOR, ROR (rotate-right), or ROL (rotate-left) transformations. These types of transforms are often used by malware authors to obfuscate strings like URLs.

When searching, it would be very nice to say "look for 'notepad' but in ROT13'd UTF-16LE" or "look for 'http://sirhaxal0t.ru/' in UTF-8 that's been XOR'ed". The permutation of "ROT13, then UTF-16LE" or "UTF-8, then XOR" is what I mean when I say "encoding chain." We want to apply a chain of encodings to our search terms.

Characters v. Bytes

ROT13 is a transformation that applies to characters, with "A" becoming "N" and so on. In contrast, XOR is a transformation that applies to bytes. Character transforms apply to the string first before it's encoded into bytes, whereas byte transforms apply to the bytes after the encoding has happened. Here's a table that clearly delineates transformations and encodings:

  • Character Transforms: ROT13, URL encoding, HTML entities
  • Encodings: ASCII, UTF-8, UTF-16LE, EBCDIC, Shift-JIS, ISO-8859-5, etc.
  • Byte Transforms: XOR, ROR, ROL, Outlook Compressible Encryption, Base64

You may have noticed that there are a few constraints to the encoding chains we'd like to apply. First, there's an ordering constraint in that character transforms must come first, followed by the encoding, and finally by byte transforms. An encoding chain like "XOR -> ROT13 -> UTF-8" is invalid and nonsensical. Second, while it's possible to have multiple character transforms or byte transforms in a chain, there can be only one encoding. An encoding changes the type of thing we're working with from characters (ideally Unicode codepoints) to bytes, and that type change can only occur once.
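Here's a little Python sketch of those constraints (the class and helper names are mine, not Lightgrep's API): character transforms are str-to-str, the single encoding is str-to-bytes, and byte transforms are bytes-to-bytes.

```python
import codecs

class EncodingChain:
    """Character transforms, then exactly one encoding, then byte transforms."""
    def __init__(self, char_transforms, encoding, byte_transforms):
        self.char_transforms = char_transforms
        self.encoding = encoding
        self.byte_transforms = byte_transforms

    def apply(self, term: str) -> bytes:
        for t in self.char_transforms:      # characters first...
            term = t(term)
        data = term.encode(self.encoding)   # ...the one chars-to-bytes type change...
        for t in self.byte_transforms:      # ...then bytes
            data = t(data)
        return data

rot13 = lambda s: codecs.encode(s, "rot13")
xor42 = lambda b: bytes(x ^ 0x42 for x in b)

chain = EncodingChain([rot13], "utf-16le", [xor42])
```

The type signatures enforce the ordering for free: an "XOR -> ROT13 -> UTF-8" chain simply won't typecheck.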

The Ultimate Problem

We've been working on adding support for Outlook Compressible Encryption to Lightgrep, and that's necessitated thinking about encoding chains and how to expose the abstraction. At this point the main technical work is done and we are arguing about how to represent the encoding chains. Two candidate notations:

  • "ROT13 | UTF-8 | Base64 | Outlook"
  • "ROT13 > UTF-8 > Base64 > Outlook"

We'd like to have a good notation for specifying these chains. One interesting thing to note is that some transformations may want parameters. For example, ROR and ROL need to know how many bits of shift to apply. It would be silly to have ROR1, ROR2, ROR3, ROR4, ROR5, ROR6, and ROR7, when we could just have "ROR(2)", for example. With parameters, we could come up with a generic rotate character transformation, and then ROT13 would be "ROT(13)". (If you were investigating a classically trained hacker, you may want to look for text obfuscated by the Caesar cipher, i.e., ROT(3).)
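For what it's worth, a pipe-delimited notation with parameters is trivial to parse; a Python sketch (both the notation and the function are hypothetical):

```python
import re

def parse_chain(spec: str):
    """Split a chain like 'ROT(13) | UTF-16LE | ROR(2)' into
    (name, [integer parameters]) steps."""
    steps = []
    for part in spec.split("|"):
        part = part.strip()
        m = re.fullmatch(r"([A-Za-z0-9_-]+)(?:\(([\d,\s]*)\))?", part)
        if not m:
            raise ValueError("bad chain step: " + part)
        name, args = m.group(1), m.group(2)
        params = [int(a) for a in args.split(",")] if args else []
        steps.append((name, params))
    return steps
```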


...which representation above do you prefer? Or can you come up with something better?

* Addendum *

You can expect Outlook Compressible Encryption support in Lightgrep fairly soon, but it will be some time before you see base64. Base64 is not so much a byte transformation as it is a bit transformation, taking 6 bits of data and turning them into 8. Lightgrep needs some work in order to work well with bits.
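A quick Python demonstration of the bit packing, for the curious:

```python
import base64

raw = b"abc"                 # 3 bytes = 24 bits
enc = base64.b64encode(raw)  # 24 bits / 6 bits per character = 4 characters
print(enc)                   # b'YWJj'
```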


Forensic Programming Rules of Thumb

I have a set of mental guidelines I use when writing digital forensics programs, and it's high time I write them down. These are by no means a complete list of useful programming techniques, but I think they're all worth keeping in mind.


1. Don't be afraid to use RAM. Most investigators' machines have several gigabytes of RAM installed, so you might as well use it.

2. Remember, though, that RAM is fixed while disk space is nearly infinite. If you can't set an upper limit on the amount of RAM used by your program, create a cache data structure, size it appropriately, then spill to disk if you exceed it.

3. Prefer streaming models to building up complex data structures in RAM. Streaming models will generally use a fixed amount of memory and scale to large evidence sets.

4. Choose algorithms based on the best worst-case behavior, rather than the best average-case. Investigators don't need the best possible performance from their software as much as they need predictable performance, and the ability for the software to cope when faced with The Big One. An algorithm that has bad worst-case performance is not mission capable.
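A toy Python sketch of rule 2 (the class name and the shelve-backed spill are my own choices here, not a prescription):

```python
import os, shelve, tempfile

class SpillingSet:
    """Keep up to max_in_ram items in memory; overflow goes to an on-disk shelf,
    so memory use is bounded no matter how big the evidence set is."""
    def __init__(self, max_in_ram=100000):
        self.max_in_ram = max_in_ram
        self.ram = set()
        path = os.path.join(tempfile.mkdtemp(), "spill")
        self.disk = shelve.open(path)

    def add(self, item: str):
        if item in self.ram or item in self.disk:
            return
        if len(self.ram) < self.max_in_ram:
            self.ram.add(item)
        else:
            self.disk[item] = True

    def __contains__(self, item: str):
        return item in self.ram or item in self.disk

    def __len__(self):
        return len(self.ram) + len(self.disk)
```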


1. Convert local time to UTC as soon as possible.

2. Always report timestamps in UTC. I'm generally not a fan of programmers telling users what to do, but I'll make an exception in this case; when users ask for local time output, firmly reply "no" and tell them to get comfortable working in UTC.

3. Do not separate times from dates in output. Timestamps are points in time; separating times from dates imposes the periodicity of the Earth's rotation onto your investigation, and that's often not germane. Even if it is germane, having dates and times separated makes it a huge PITA to calculate timestamp differences in, e.g., Excel.

4. Timestamps generally exist within a "context," for lack of a better word. Contexts are often cascading/recursive. For example, the context for an OLE timestamp within a Word 2003 document is "Word 2003 OLE", followed by the filesystem (e.g., UTC for NTFS, local time adjustment for FAT32), followed by the system time settings, followed by, say, the timezone of where the evidence was seized. It's important to remember these contexts when handling local time. Even if you convert local time to UTC, one of these contexts could later be found to be incorrect. Consider the timestamps as existing in a database; it would be better to query for the appropriate rows to adjust by context than to fix things up through some ad hoc mechanism.
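Rules 1-3 fit in a few lines of Python (the UTC-5 offset is just an assumed example):

```python
from datetime import datetime, timezone, timedelta

# Convert a local timestamp to UTC immediately, and report date and time
# together as one ISO 8601 value.
local = datetime(2012, 8, 24, 9, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
utc = local.astimezone(timezone.utc)
print(utc.isoformat())  # 2012-08-24T14:30:00+00:00
```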


1. Never assume anything about your input. Ooh, you found a magic header for your artifact? So what!? The data could have been generated by a cuckoo clock. Sanity check everything, especially when reading in variable length buffers/strings with sizes denoted in other fields. Return true or false depending on whether the artifact parsed successfully. If it mostly parsed but failed some kind of test, return the partial results along with false.

2. Keep your artifact parsing code separate from your I/O code (if possible; better to operate on provided memory buffers and let other code feed it data) and separate from your reporting code. This way you can foster code reuse when someone else needs to parse the same artifact but in a different context.

3. Put the immediate results from parsing into a struct or, perhaps better, a name-value data structure, such as a Python dict. Don't serialize into a string or otherwise obliterate the type information you have at that moment; that's for your output code to handle. Again, this fosters code reuse.

4. Always refer to bytes on disk. When you output parsed artifacts, always give the user reference to where that artifact existed on disk. This way the user has a chance in hell of finding it for themselves and validating that your software parsed it correctly.

5. Remember that parsing is not the same thing as analysis. The output from parsing an artifact should be in a form that makes it convenient for the examiner to analyze. This could mean, for example, that it's easy to get the output into Excel or Access or pipe it into different command-line utilities. It's way better to produce output that is well-structured than pretty—well-structured data can always be prettied up, while it's often hard to recreate structure from pretty data.

6. It's better to identify an artifact by searching for it than by assuming it exists within a well-structured file. If you can identify artifacts by searching, then you can parse them out of RAM dumps, virtual memory swapfiles, and unallocated space.
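A Python sketch of rules 1-4 in action, parsing a made-up artifact format (the "MAGC" record is hypothetical) from a caller-supplied buffer:

```python
import struct

def parse_record(buf: bytes, offset: int = 0):
    """Parse a hypothetical record: 4-byte magic, u32le name length, then a
    UTF-16LE name. No I/O here; the caller supplies the buffer and its
    on-disk offset. Returns (ok, fields); partial results survive failure."""
    fields = {"offset": offset}
    if len(buf) < 8 or buf[:4] != b"MAGC":
        return False, fields
    (name_len,) = struct.unpack_from("<I", buf, 4)
    fields["name_len"] = name_len
    # Never trust a length field: bound it against the buffer we actually have.
    if name_len > len(buf) - 8 or name_len % 2 != 0:
        return False, fields
    fields["name"] = buf[8:8 + name_len].decode("utf-16le", errors="replace")
    return True, fields
```

Keeping the result in a dict, rather than a formatted string, leaves the type information intact for whatever reporting code comes next.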


Happy Birthday, Mr. Pigeon

Yesterday, I asked a question on Twitter:

I will admit, at the time I was hoping that one of the respondents would be Jesse Kornblum. I was not disappointed. There was a little back and forth on Twitter, and then Jesse decided to write a blog post, "Being Unreasonable About MD5." In it, he advised me to use a SHA-2-based hash, or wait until SHA-3 is determined.

[Image: DFIR Slugfest! If using awesome Rocky IV images in blog posts is wrong, I don't want to be right.]
It being 2012, we should all know that MD5 is "broken." The first MD5 collision was announced in 2004 (some analysis [PDF]). The ability to generate collisions since then has only gotten better. Earlier this summer we were treated to the sheer awesomeness of Flame exploiting an MD5 collision which could not have been generated based on publicly known techniques (*cough* USA!, USA! *cough*).

Here's the problem, though: I'm looking for a hash function that I can use for lazy equality comparisons, not for use in digital signatures. That is, I want to see whether a new string of input I've read in from a program is the same as a previous string of input I've read in.

Why do I want to do this? Well, I want to grep hard drives for email addresses and spit out a list of all the unique ones I've seen. I also want to do this with web addresses/URLs, credit card numbers, Social Security Numbers, and so on. One possible way of doing this (a really lazy way, and it might not even be the best way; that's a separate story altogether) is to hash all the email addresses with MD5 and call it a day. MD5 will surely work for this, won't it?
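The lazy approach is only a few lines of Python:

```python
import hashlib

emails = ["a@example.com", "b@example.com", "a@example.com"]
seen = set()
unique = []
for addr in emails:
    # Track a fixed-size 16-byte digest instead of the full string.
    digest = hashlib.md5(addr.encode("utf-8")).digest()
    if digest not in seen:
        seen.add(digest)
        unique.append(addr)
# unique == ["a@example.com", "b@example.com"]
```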


In 1834, Johann Dirichlet postulated what has come to be known as the Pigeonhole Principle. It is one of those theorems in mathematics whose proof is simply common sense: if you have n pigeonholes and n+1 pigeons, you're guaranteed to have 2 pigeons shacking up.

[Image: a feral pigeon (Columba livia domestica)]

MD5 is a 128-bit hash algorithm. Thus, it has 2^128 different pigeonholes. According to the pigeonhole principle, we need to have (2^128) + 1 items before we're guaranteed to have a collision.

In case you're wondering, 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456. There certainly aren't that many unique email addresses we're ever going to encounter, so we should be good, right?

Well, actually, let's take a moment to guess how many unique email addresses there are. I don't have a central repository of all the email addresses ever used, but I can use myself as an example and estimate out from there. Here are all the email addresses I can remember having:

  1. (I don't remember the first one, but it was @nebbs.nersc.gov on a Cray X-MP, which was rad.)
  2. beefstew@aol.com (yes, really)
  3. stew1@iastate.edu
  4. stew1@avalon.net
  5. jonathan.l.stewart@uiowa.edu
  6. jstewart@cs.uiowa.edu
  7. stew1@ellipsis.cx
  8. jonathan_l_stewart@hotmail.com
  9. jonathan_l_stewart@yahoo.com
  10. jonathan.l.stewart@gmail.com
  11. jonathan.l.stewart@acm.org
  12. jonstewart17@comcast.net (DIAF, Comcast)
  13. jon@lightboxtechnologies.com
Lucky number 13. Of course, I may have used a few burner email addresses over the years and there could be a couple of legitimate ones I've forgotten, too. So, let's round that up to 50, just for sake of argument.

Also, let's assume that I'm the most average person on Earth. How many people are there? Supposedly 7.035 billion, as of today, August 24, 2012. Extrapolating out, we could be dealing with 351 billion email addresses, if every single person on the planet has had 50 email addresses. However, there are almost certainly email addresses that are associated with computers and departments and mailing lists and such, not just people. So, well, let's double that: 702 billion email addresses. That's our estimated population.

Right off the bat, we can see that those clinging to 32-bit integers are going to need a bigger boat. If I happened to have a hard drive with all the email addresses on it and I was using a 32-bit hash, the pigeonhole principle tells me I'd have a lot of collisions.

Still, though, MD5 only has to deal with 702,000,000,000 email addresses and fit the hashes from our email addresses into 340,282,366,920,938,463,463,374,607,431,768,211,456 possible values. Continuing along the primrose path of overly simplistic math, this means we should have a probability of 702 billion/2^128, which is about 2.06e-27, or roughly 1 chance in 485 septillion. That's a lot of room for MD5 to play in.

Nyet so fast, Komrade!
However, thus far I've failed to take into account the Birthday Paradox. Have you ever been in a group of people and been surprised to find that two of them share a birthday? As it turns out, the odds are even that two guests at a 23-person dinner party share a birthday, and 99% that two people in a group of just 57 do. This is because we are comparing pairs of birthdays, each person pairing off against all the rest. The number of pairs of n things is (n*(n-1))/2. As people pile up, so do the pairs, and the number of days in a year is quickly overwhelmed.
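You can verify those dinner-party numbers directly by computing the chance that all n birthdays are distinct:

```python
def p_shared_birthday(n, days=365):
    # P(at least one shared birthday) = 1 - P(all n birthdays distinct).
    # P(all distinct) = (days/days) * ((days-1)/days) * ... * ((days-n+1)/days)
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (days - i) / days
    return 1 - p_distinct

print(p_shared_birthday(23))  # ~0.507 -- just better than even odds
print(p_shared_birthday(57))  # ~0.990
```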

The Birthday Paradox leads to the Birthday Attack: test for collisions by brute force and, over time, you will get lucky and have a pair collide. The expected number of items we have to test, Q, before finding the first collision with a fair hash function is roughly sqrt((pi * H)/2), where H is the total number of available pigeonholes. For MD5, H = 2^128 and thus Q = sqrt((3.1415926 * 2^128)/2). What is that number?
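Plugging H = 2^128 into that formula:

```python
import math

H = 2 ** 128                    # pigeonholes available to a 128-bit hash
Q = math.sqrt(math.pi * H / 2)  # expected trials before the first collision
print(f"{Q:.4e}")               # ~2.3120e+19, i.e. ~23 quintillion
```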

It's ~23,119,565,135,021,060,000. That is, 23 quintillion. That number, while large, is a lot smaller than 340,282,366,920,938,463,463,374,607,431,768,211,456, and it's roughly how many strings we have to hash before we can expect our first MD5 collision. What about a mere 1% chance of a collision? Well, fortunately, Wikipedia calculated that number in the table under its Birthday Attack article. For a fair 128-bit hash, hashing about a tenth as many items gives a collision probability of 1 in 100. If we generate ~820,000,000,000,000 strings (820 trillion), the chance of a collision is 1 in a billion. If we generate 1,000 times fewer strings, ~820,000,000,000 (820 billion), the chance of a collision is 1 in a quadrillion.
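Those table entries follow from the standard small-probability approximation for a fair hash, p ≈ n^2 / (2H), which is easy to check:

```python
H = 2 ** 128

def p_collision(n):
    # Birthday-bound approximation, valid when p is small:
    # P(collision among n items) ~= n^2 / (2 * H)
    return n * n / (2 * H)

print(p_collision(820e12))  # ~1e-9  (820 trillion strings)
print(p_collision(820e9))   # ~1e-15 (820 billion strings)
```

Note that hashing 1,000 times fewer items drops the probability by a factor of a million, since p scales with n squared.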

820 billion is pretty close to the 702 billion distinct email addresses we estimated above. So, if MD5 is fair, we can expect about a 1 in a quadrillion chance of a collision.

Have you noticed how fast and loose I've played with the math here? And how much I've relied on Wikipedia? Disclaimer: I am really bad at math. When trying to figure out math like this, I generally reserve the right to be off by a factor of 1000. Therefore, you only get to correct me if you can show that I'm off by more than a factor of 1000.

Additionally, all these numbers assume that MD5 is perfectly fair. In fact, we know that it isn't, because of the collision attacks. If I reserve the right to be off by a factor of 1000, and we know that MD5 is also kind of kaput, then we need to estimate our probability as somewhat higher. So, one in a trillion? One in 100 billion? Maybe a little less?

Now we're in judgment call territory. For my application of deduping email addresses, well, meh. I'll certainly take the odds of a collision when hashing all of the email addresses ever at one in a quadrillion, and even at one in a trillion. The nice thing about using MD5 for this is that it's reasonably fast and its 16-byte digests fit exactly in an SSE2 register.
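A minimal sketch of the dedup idea, using addresses from my list above. The lowercasing step is my own assumption about normalization, not a requirement:

```python
import hashlib

def dedupe_key(email):
    # 16-byte MD5 digest as a compact dedup key.
    # Lowercasing is an assumed normalization; adjust to taste.
    return hashlib.md5(email.strip().lower().encode("utf-8")).digest()

seen = set()
for addr in ["stew1@iastate.edu",
             "STEW1@iastate.edu",
             "jon@lightboxtechnologies.com"]:
    seen.add(dedupe_key(addr))

print(len(seen))  # 2 -- the first two normalize to the same address
```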

However, what about all possible files? Are there more possible files than email addresses? Probably. What about all distinct 512-byte sectors, for those who like sector block hashing? Almost certainly. Are we starting to bump up into limits where we have to think about whether it's safe or not? Yes. Is that a bad thing? Yes. Like Rocky, we do not want to think, we do not want the judge to think, we do not want the jury to think; we want to win.

So, unless you want to think about all these things above, including whether you want to explain one in a trillion, then you should stop using MD5.

Although, to keep things in perspective, the risk of everyone dying next year due to a comet or asteroid strike is only 1 in 20,000.


Ok. Now you may correct my shoddy math in the comments.