Happy Birthday, Mr. Pigeon

Yesterday, I asked a question on Twitter:

I will admit, at the time I was hoping that one of the respondents would be Jesse Kornblum. I was not disappointed. There was a little back on forth on twitter, and then Jesse decided to write a blog post, "Being Unreasonable About MD5." In it, he advised me to use a SHA-2-based hash, or wait until SHA-3 is determined.

If using awesome Rocky IV images in blog posts is wrong, I don't want to be right.
DFIR Slugfest!
It being 2012, we should all know that MD5 is "broken." The first MD5 collision was announced in 2004 (some analysis [PDF]). The ability to generate collisions since then has only gotten better. Earlier this summer we were treated to the sheer awesomeness of Flame exploiting an MD5 collision which could not have been generated based on publicly known techniques (*cough* USA!, USA! *cough*).

Here's the problem, though: I'm looking for a hash function that I can use for lazy equality comparisons, not for use in digital signatures. That is, I want to see whether a new string of input I've read in from a program is the same as a previous string of input I've read in.

Why do I want to do this? Well, I want to grep hard drives for email addresses and spit out a list of all the unique ones I've seen. I also want to do this with web addresses/URLs, credit card numbers, Social Security Numbers, and so on. One possible way of doing this (a really lazy way, and it might not even be the best way; that's a separate story altogether) is to hash all the email addresses with MD5 and call it a day. MD5 will surely work for this, won't it?


In 1834, Johann Dirichlet postulated what has come to be known as the Pigeonhole Principle. It is one of those theorems in mathematics where the proof of it is simply common sense: if you have n pigeonholes, and n+1 pigeons, you're guaranteed to have 2 pigeons shacking up.

Feral Pigeon (Columba livia domestica)

MD5 is a 128 bit hash algorithm. Thus, it has 2^128 different pigeonholes. According to the pigeonhole principle, we need to have (2^128) + 1 items before we're guaranteed to have a collision.

In case you're wondering, 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456.  There certainly aren't that many unique email addresses we're ever going to encounter, so we should be good right?

Well, actually, let's take a moment to guess how many unique email addresses there are. I don't have a central repository of all the email addresses ever used, but I can use myself as an example and estimate out from there. Here are all the email addresses I can remember having:

  1. (I don't remember the first one, but it was @nebbs.nersc.gov on a Cray X-MP, which was rad.)
  2. beefstew@aol.com (yes, really)
  3. stew1@iastate.edu
  4. stew1@avalon.net
  5. jonathan.l.stewart@uiowa.edu
  6. jstewart@cs.uiowa.edu
  7. stew1@ellipsis.cx
  8. jonathan_l_stewart@hotmail.com
  9. jonathan_l_stewart@yahoo.com
  10. jonathan.l.stewart@gmail.com
  11. jonathan.l.stewart@acm.org
  12. jonstewart17@comcast.net (DIAF, Comcast)
  13. jon@lightboxtechnologies.com
Lucky number 13. Of course, I may have used a few burner email addresses over the years and there could be a couple of legitimate ones I've forgotten, too. So, let's round that up to 50, just for sake of argument.

Also, let's assume that I'm the most average person on Earth. How many people are there? Supposedly 7.035 billion, as of today, August 24, 2012. Extrapolating out, we could be dealing with 351 billion email addresses, if every single person on the planet has had 50 email addresses. However, there are almost certainly email addresses that are associated with computers and departments and mailing lists and such, not just people. So, well, let's double that: 702 billion email addresses. That's our estimated population.

Right off the bat, we can see that those clinging to 32-bit integers are going to need a bigger boat. If I happened to have a hard drive with all the email addresses on it and I was using a 32-bit hash, the pigeonhole principle tells me I'd have a lot of collisions.

Still, though, MD5 only has to deal with 702,000,000,000 email addresses and fit the hashes from our email addresses into 340,282,366,920,938,463,463,374,607,431,768,211,456 possible values. Continuing along the primrose path of overly simplistic math, this means we should have a probability of 702 billion/2^128, which is 2.062992585693e-27, or 1 chance in 2.06 octillion. That's a lot of room for MD5 to play in.

Nyet so fast, Komrade!
However, thus far I've failed to take into account the Birthday Paradox. Have you ever been in a group of people and been surprised to find that two of them have the same birthday? As it turns out, odds are even that two guests at a 23-person dinner party will have the same birthday and 99% that two people in a group of just 57 will have the same birthday. This is because we are comparing pairs of birthdays to each other, each person trying to pair themselves to all the rest. The number of pairs of n things is (n*(n-1))/2. As people start piling up, so do the combinations and the number of days in a year are quickly overwhelmed.

The Birthday Paradox leads to the Birthday Attack: start testing for collisions by brute force and, over time, you will get lucky and have a pair collide. The number of items we have to test, Q, before finding the first collision in a fair hash function is roughly equal to the square root of (pi * H)/2, where H is the total number of available pigeonholes. For MD5, H=2^128 and thus Q = sqrt((3.1415926 * 2^128)/2). What is that number?

It's ~23,119,565,135,021,060,000. That is, 23 quintillion. That number, while large, is a lot smaller than 340,282,366,920,938,463,463,374,607,431,768,211,456. That's the number of strings we have to hash before we have close to a 100% chance of having a hash collision on MD5. What about a 1% chance? Well, fortunately, Wikipedia calculated that number in the table that you see under the Birthday Attack article. For a fair hash with 128 bits, if we try hashing a tenth of that many items, the probability we'll have a collision is 1 in 100. If we generate ~820,000,000,000,000 strings (820 trillion), the chance we'll have a collision is 1 in a billion. If we generate 1000 times fewer strings, ~820,000,000,000 (820 billion), the chance we'll have a collision is 1 in a quadrillion.

820 billion is pretty close to the 702 billion distinct email addresses we estimated above. So, if MD5 is fair, we can expect we have about a 1 in quadrillion chance that we'll have a collision.

Have you noticed how fast and loose I've played with the math here? And how much I've relied on Wikipedia? Disclaimer: I am really bad at math. When trying to figure out math like this, I generally reserve the right to be off by a factor of 1000. Therefore, you only get to correct me if you can show that I'm off by more than a factor of 1000.

Additionally, all these numbers above assume that MD5 is perfectly fair. In fact, we know that it isn't, because of the collision attacks. If I reserve the right to be off by a factor of 1000 and we know that MD5 is also kind of caput, that means we need to estimate our probability to be somewhat higher. So, one in a trillion? One in 100 billion? Maybe a little less?

Now we're in judgment call territory. For my application of deduping email addresses, well, meh. I'll certainly take the odds of a collision when hashing all of the email addresses ever at one in quadrillion, and even at one in a trillion. The nice thing about using MD5 for this is that it's reasonably fast (Xeons even have a special instruction for calculating MD5 on the chip) and the keys will fit in 16 bytes, the size of an SSE2 register.

However, what about all possible files? Are there more possible files than email addresses? Probably. What about all distinct 512-byte sectors, for those who like sector block hashing? Almost certainly. Are we starting to bump up into limits where we have to think about whether it's safe or not? Yes. Is that a bad thing? Yes. Like Rocky, we do not want to think, we do not want the judge to think, we do not want the jury to think; we want to win.

So, unless you want to think about all these things above, including whether you want to explain one in a trillion, then you should stop using MD5.

Although, to keep things in perspective, the risk of everyone dying next year due to a comet or asteroid strike is only 1 in 20,000.


Ok. Now you may correct my shoddy math in the comments.


Random Sampling in Forensics

(NB: Some links are to PDFs.)

I had the pleasure of attending the Digital Forensics Research Workshop this past week. DFRWS is different than the other forensics conferences I've attended, in that speakers must submit formal academic papers for review by a committee. Less than one-third of submitted papers make the cut.

This is how most academic conferences work, and there are a few consequences stemming from the emphasis on peer-reviewed papers. One is that the quality of the material is generally high. This is good. Another is that many leading practitioners—who, as forensics generally demands, also double as researchers—don't present, since writing a formal academic paper is an enormous PITA. This is less good.

I enjoyed a number of presentations at DFRWS this year (chief among them was Ralf Brown's about language identification). The one that stood out for me, though, was a blend of theory and practice. This was The Use of Random Sampling in Investigations Involving Child Abuse Material, presented by Mike Wilkinson and co-authored by Brian Jones and Syd Pleno.

What Mike, Brian, and Syd did is, frankly, revolutionary. They established procedures at the New South Wales police force's State Electronic Evidence Branch so that child pornography possession cases make use of statistical sampling early in the analysis of the case. By using sampling, they have cut their case backlog from 3+ months to zero.

By using an EnScript that selects a random sample of images from suspects' hard drives, investigators can estimate the total amount of child pornography on the system with a high degree of confidence. A report based on the sampling review is created and, while situations no doubt vary, the report is typically presented to the suspect along with a plea bargain... and most take the plea. Such cases take only a few hours of processing.

Of course, such a methodology is only appropriate for possession cases, not distribution or where active child abuse is suspected. Digital investigators benefit because they don't have to sit for hours reviewing thousands (or millions) of corrosive images, and they have more time for other cases. Other investigators benefit because their computer evidence gets turned around faster. Even the falsely accused benefit, because they are quickly exonerated.

Part of why the procedure works is because it's well-adapted to the law in New South Wales, Australia. However, with care and consideration, it is hard to believe that the basic idea could not be applied in the United States.

I encourage you to read their paper and review their presentation. Kudos to Brian, Syd, and Mike, and to the New South Wales police for adapting to the increasing strain on digital investigators with an elegant and effective solution.