Alright lazy web, yes or no: can I use md5() for lazy man's equality testing of strings ranging from 3 bytes to 3GB?
— Jon Stewart (@codeslack) August 23, 2012

I will admit, at the time I was hoping that one of the respondents would be Jesse Kornblum. I was not disappointed. There was a little back and forth on Twitter, and then Jesse decided to write a blog post, "Being Unreasonable About MD5." In it, he advised me to use a SHA-2-based hash, or to wait until SHA-3 is determined.
DFIR Slugfest!
Here's the problem, though: I'm looking for a hash function I can use for lazy equality comparisons, not for digital signatures. That is, I want to see whether a new string of input my program has read is the same as a previous one.
Why do I want to do this? Well, I want to grep hard drives for email addresses and spit out a list of all the unique ones I've seen. I also want to do this with web addresses/URLs, credit card numbers, Social Security Numbers, and so on. One possible way of doing this (a really lazy way, and it might not even be the best way; that's a separate story altogether) is to hash all the email addresses with MD5 and call it a day. MD5 will surely work for this, won't it?
...
In 1834, Johann Dirichlet postulated what has come to be known as the Pigeonhole Principle. It is one of those theorems in mathematics where the proof of it is simply common sense: if you have n pigeonholes, and n+1 pigeons, you're guaranteed to have 2 pigeons shacking up.
MD5 is a 128-bit hash algorithm. Thus, it has 2^128 different pigeonholes. According to the pigeonhole principle, we'd need (2^128) + 1 items before we're guaranteed to have a collision.
In case you're wondering, 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456.
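If you want to double-check that figure, Python's arbitrary-precision integers make it trivial; here's a throwaway sketch:

```python
holes = 2 ** 128  # distinct MD5 digests, i.e., pigeonholes
print(f"{holes:,}")      # 340,282,366,920,938,463,463,374,607,431,768,211,456
print(f"{holes + 1:,}")  # the pigeon count that guarantees a shared hole
```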
Well, actually, let's take a moment to guess how many unique email addresses there are. I don't have a central repository of all the email addresses ever used, but I can use myself as an example and estimate out from there. Here are all the email addresses I can remember having:
- (I don't remember the first one, but it was @nebbs.nersc.gov on a Cray X-MP, which was rad.)
- beefstew@aol.com (yes, really)
- stew1@iastate.edu
- stew1@avalon.net
- jonathan.l.stewart@uiowa.edu
- jstewart@cs.uiowa.edu
- stew1@ellipsis.cx
- jonathan_l_stewart@hotmail.com
- jonathan_l_stewart@yahoo.com
- jonathan.l.stewart@gmail.com
- jonathan.l.stewart@acm.org
- jonstewart17@comcast.net (DIAF, Comcast)
- jon@lightboxtechnologies.com
Lucky number 13. Of course, I may have used a few burner email addresses over the years, and there could be a couple of legitimate ones I've forgotten, too. So, let's round that up to 50, just for the sake of argument.
Also, let's assume that I'm the most average person on Earth. How many people are there? Supposedly 7.035 billion, as of today, August 24, 2012. Extrapolating out, we could be dealing with 351 billion email addresses, if every single person on the planet has had 50 email addresses. However, there are almost certainly email addresses that are associated with computers and departments and mailing lists and such, not just people. So, well, let's double that: 702 billion email addresses. That's our estimated population.
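Spelled out as arithmetic, in case you want to quibble with a different set of guesses (the numbers are just my estimates above):

```python
people = 7.035e9             # world population as of August 24, 2012
addresses_per_person = 50    # rounded up from my own count of 13
humans_only = people * addresses_per_person  # ~351.75 billion
estimate = 2 * humans_only   # doubled for machines, departments, mailing lists
print(f"{estimate:,.0f}")    # ~703,500,000,000; call it 702 billion
```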
Right off the bat, we can see that those clinging to 32-bit integers are going to need a bigger boat: a 32-bit hash has only 2^32, about 4.3 billion, pigeonholes for our 702 billion pigeons. If I happened to have a hard drive with all the email addresses on it and I was using a 32-bit hash, the pigeonhole principle tells me I'd be guaranteed a lot of collisions.
Still, though, MD5 only has to deal with 702,000,000,000 email addresses and fit their hashes into 340,282,366,920,938,463,463,374,607,431,768,211,456 possible values. Continuing along the primrose path of overly simplistic math, this means we should have a probability of 702 billion/2^128, which is 2.062992585693e-27: roughly two chances in an octillion, or one chance in about 4.85e+26. That's a lot of room for MD5 to play in.
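The same simplistic math, typed into Python so you can check my work:

```python
emails = 702e9
holes = 2 ** 128
p = emails / holes
print(p)      # ~2.06e-27, roughly two chances in an octillion
print(1 / p)  # one chance in ~4.85e+26
```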
Nyet so fast, Komrade!
The Birthday Paradox leads to the Birthday Attack: start testing for collisions by brute force and, over time, you will get lucky and have a pair collide. The number of items we have to test, Q, before finding the first collision in a fair hash function is roughly equal to the square root of (pi * H)/2, where H is the total number of available pigeonholes. For MD5, H=2^128 and thus Q = sqrt((3.1415926 * 2^128)/2). What is that number?
It's ~23,119,565,135,021,060,000. That is, 23 quintillion. That number, while large, is a lot smaller than 340,282,366,920,938,463,463,374,607,431,768,211,456. It's roughly how many strings we'd have to hash before we expect our first MD5 collision, at which point the odds are a little better than even, not the near-certainty the raw pigeonhole count suggests. What about a 1% chance? Fortunately, Wikipedia has that worked out in the table in the Birthday Attack article. For a fair 128-bit hash, hashing a tenth of that many items gives a 1 in 100 probability of a collision. If we generate ~820,000,000,000,000 strings (820 trillion), the chance we'll have a collision is 1 in a billion. If we generate 1,000 times fewer strings, ~820,000,000,000 (820 billion), the chance is 1 in a quadrillion.
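Here's that birthday math in a few lines of Python. The collision-probability function uses the standard approximation p ≈ 1 - exp(-n^2/(2H)) from the Birthday Attack article; it's a sketch for a fair hash, nothing MD5-specific:

```python
import math

H = 2 ** 128  # pigeonholes for a fair 128-bit hash

# Expected number of items before the first collision: Q = sqrt(pi * H / 2)
Q = math.sqrt(math.pi * H / 2)
print(f"{Q:.4e}")  # ~2.3120e+19, i.e., 23 quintillion

def collision_probability(n, H=H):
    """Birthday approximation: p = 1 - exp(-n^2 / (2H))."""
    # expm1 keeps precision when p is absurdly small
    return -math.expm1(-n * n / (2 * H))

print(collision_probability(820e12))  # ~1e-9:  820 trillion strings
print(collision_probability(820e9))   # ~1e-15: 820 billion strings
```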
820 billion is pretty close to the 702 billion distinct email addresses we estimated above. So, if MD5 is fair, we can expect about a 1 in a quadrillion chance of a collision.
Have you noticed how fast and loose I've played with the math here? And how much I've relied on Wikipedia? Disclaimer: I am really bad at math. When trying to figure out math like this, I generally reserve the right to be off by a factor of 1000. Therefore, you only get to correct me if you can show that I'm off by more than a factor of 1000.
Additionally, all these numbers above assume that MD5 is perfectly fair. In fact, we know that it isn't, because of the collision attacks. If I reserve the right to be off by a factor of 1000, and we know that MD5 is also kind of kaput, that means we need to estimate our probability to be somewhat higher. So, one in a trillion? One in 100 billion? Maybe a little less?
Now we're in judgment-call territory. For my application of deduping email addresses, well, meh. I'll certainly take the odds of a collision when hashing all of the email addresses ever at one in a quadrillion, and even at one in a trillion. The nice thing about using MD5 for this is that it's reasonably fast, and each digest fits in 16 bytes, the size of an SSE2 register.
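For the curious, the whole lazy-dedup idea fits in a few lines. This is a sketch of the concept only; the function name and toy input are mine, not anything from a real tool:

```python
import hashlib

def unique_strings(hits):
    """Yield each distinct hit once, keyed by its 16-byte MD5 digest."""
    seen = set()
    for s in hits:
        digest = hashlib.md5(s.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield s

# e.g., deduping email-address hits from a grep of a drive image
for email in unique_strings(["stew1@avalon.net", "beefstew@aol.com", "stew1@avalon.net"]):
    print(email)
```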
However, what about all possible files? Are there more possible files than email addresses? Probably. What about all distinct 512-byte sectors, for those who like sector block hashing? Almost certainly. Are we starting to bump up into limits where we have to think about whether it's safe or not? Yes. Is that a bad thing? Yes. Like Rocky, we do not want to think, we do not want the judge to think, we do not want the jury to think; we want to win.
So, unless you want to think about all these things above, including whether you want to explain "one in a trillion" to a jury, you should stop using MD5.
Although, to keep things in perspective, the risk of everyone dying next year due to a comet or asteroid strike is only 1 in 20,000.
...
Ok. Now you may correct my shoddy math in the comments.
Actually, I think you could get by with a much faster Jenkins hash.
I'm not sure how much memory your hash table is chewing up, but I'm curious why not use a Bloom filter? Is it because you are primarily CPU bound?
I've found this to be a handy comparison of non-cryptographic hash functions:
http://www.strchr.com/hash_functions
Sadly, it doesn't include cryptographic hashes. It would be interesting to see graphically just how much slower MD5 is than, say, Jenkins.
For this application, I want to assign a unique ID to each distinct string I get from a grep search, e.g., an email address. MD5 can serve as that unique ID, and the user is free to use that as they wish. For example, they can make a histogram of email addresses.
The other, and potentially better, way to do it is for the user-level code above the grep search to use a PATRICIA trie or some such for recording metadata about the distinct strings. I don't think there'd be so many unique instances that a contemporary machine would run into memory issues.
Bloom filters are great for filtering out queries that'd otherwise hit disk or other systems, but they don't allow for key-value lookup. Also, tuning a Bloom filter to have a 1 in a quadrillion chance of being wrong seems like a losing proposition.
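To put a number on that, here's the standard Bloom filter sizing formula, m = -n*ln(p)/(ln 2)^2, run against the estimates from the post; a rough sketch, but it shows why I call it a losing proposition:

```python
import math

n = 702e9  # estimated distinct email addresses
p = 1e-15  # target false-positive rate: one in a quadrillion

bits = -n * math.log(p) / (math.log(2) ** 2)  # m = -n*ln(p)/(ln 2)^2
k = (bits / n) * math.log(2)                  # optimal number of hash functions

print(f"{bits / 8 / 2**40:.1f} TiB of filter")  # ~5.7 TiB
print(f"{k:.0f} hash functions per lookup")     # ~50
```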
Somehow I always knew it was you, beefstew@aol.com...
You may, of course, use whichever algorithm you feel like. But if you're looking for best practices, US-CERT declared MD5 to be a "do not use" algorithm back in 2008 (http://www.kb.cert.org/vuls/id/836068). Regardless of whether the algorithm can be safely used in this context or not (my opinion is that it probably can be), the algorithm should still be avoided and other, better ones pursued.