There's a filter that ships with EnCase which does the same thing. Ever have a little mistake slip by you, and then find you have to live with it forever? That's what happened with that filter, and I'm sorry: it's my fault. As Oliver and others point out, the filter maintains its set of hash values as a flat NameListClass. For every file, it scans through that list looking for the file's hash. If you have n files and m hash values, the filter runs in O(nm) time. If n ≈ m, that means it runs in O(n²) time; it's quadratic.
While the filter is simple, that's not enough to excuse the fact that it takes an unacceptable—and needlessly so—amount of time to run. So, a long time ago, I wrote a different version that uses a binary search tree, also built out of NameListClass, to do the lookups. This reduces the running time of the algorithm to O(n log n). As Lance points out, the log of 1,000,000 is approximately 20 (if the base of the logarithm is 2, not 10). If you have a million files, it will take roughly 20,000,000 operations to run the filter using the binary search tree. That's a lot better than 1,000,000,000,000 (1 trillion) operations.
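To make that difference concrete, here's a rough sketch in Python rather than EnScript (illustration only; the hash values are placeholders): a linear scan does up to m comparisons per file, while a lookup against a sorted list does about log2(m).

import bisect

# Hypothetical hash set; real values would come from your hash library.
hash_list = ["aaaa1111", "bbbb2222", "cccc3333"]

# What the shipped filter effectively does: a linear scan per file, O(n*m) overall.
def matches_linear(file_hash):
    for h in hash_list:            # up to m comparisons for every file
        if h == file_hash:
            return True
    return False

# The binary-search idea: sort once, then ~log2(m) comparisons per file.
sorted_hashes = sorted(hash_list)
def matches_sorted(file_hash):
    i = bisect.bisect_left(sorted_hashes, file_hash)
    return i < len(sorted_hashes) and sorted_hashes[i] == file_hash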
In the comments, someone posted the code from the filter. It relies on the binary search tree implementation in GSI_Basic.EnScript, which is in the default Include/ folder. As it turns out, the better filter had once shipped with EnCase (I think; I may be wrong), but the filters were accidentally replaced during v6 development, prior to its release. The sad thing is that I knew about it and should have spent the 30 minutes to fix it, but it was always lower priority than other problems. Unfortunately, what would have cost me 30 minutes has ended up costing everyone weeks (months? years?) of processing time.
(As an aside, this is a good example of why life shouldn't be handled as a priority queue. Sometimes there's low-hanging fruit that you can't get to if you always work on the highest priority problem, and that low-hanging, but lower priority, fruit might be tasty and nourishing.)
While it's true that the binary search tree approach is faster than Oliver's filter, I'd encourage you to use Oliver's filter instead. A database will almost never be the fastest approach, but it won't be bad. Compared to hours and hours of processing time for the quadratic filter, the difference between 3 minutes and 20 seconds is negligible. And in exchange for that additional 2 minutes and 40 seconds, you'll get the ability to use more hash values than you have RAM, something the binary search tree filter can't handle, and you'll be able to manage your hash values conveniently in a database.
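If you're curious what the database approach looks like in the abstract, here's a minimal sketch in Python and SQLite. It is not Oliver's actual filter, and the table name, layout, and values are my own assumptions; the point is that an indexed hash column turns each lookup into an index search rather than a scan, and the hash set can be far larger than RAM.

import sqlite3

con = sqlite3.connect("hashes.db")
con.execute("CREATE TABLE IF NOT EXISTS hashes (md5 TEXT PRIMARY KEY)")
con.executemany("INSERT OR IGNORE INTO hashes VALUES (?)",
                [("aaaa1111aaaa1111aaaa1111aaaa1111",),   # placeholder values
                 ("bbbb2222bbbb2222bbbb2222bbbb2222",)])
con.commit()

def known(md5_hex):
    # The PRIMARY KEY index makes this an indexed lookup, not a table scan.
    return con.execute("SELECT 1 FROM hashes WHERE md5 = ?",
                       (md5_hex,)).fetchone() is not None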
(If you have billions and billions of hash values, more exotic solutions may be necessary.)
The binary search tree filter isn't even the best possible way. The code on Lance's blog converts every hash value to a string and relies on lexicographical sorting of those strings. This adds a lot of overhead. It performs a lot of memory allocations, since a new NameListClass object must be allocated for every hash value. These objects may be located in different parts of RAM, so you lose the benefits of L1 and L2 cache. What's more, you ignore some of the properties of hash values themselves... you end up spending too much time maintaining the order of the binary search tree when you could take advantage of how a hash works.
For example, what you could do is take a guess at the number of hash values you'll have. For Unique Files by Hash, you know the number of files in the case is an upper bound. So, create a MemoryFileClass buffer: multiply 16 bytes (the size of an MD5) by the total number of files, and then add an extra 50%. Now divide that buffer into 65,536 chunks. Instead of creating a binary search tree, put all the hash values that share the same first two bytes into the corresponding chunk. Some chunks will hold more hashes than others, but since we added our fudge factor, you'll still have room, and the variation won't be large because of the cryptographic properties of MD5 (if things were lopsided, it'd be a bad hash function). At the beginning of each chunk, store the number of hashes in that chunk. Then sort the hashes in each chunk.
Now, as you go, you take the first two bytes of any hash value. Seek directly to that chunk of the MemoryFileClass buffer. Congratulations! You have now brought all those sorted, closely related hash values into L2 cache (if not L1). Additionally, you've eliminated roughly 99.998% of the possible hash values you have to consider. You now only need to binary search the hashes in the relevant chunk, and that will be fast because you're dealing with contiguous bytes in RAM, instead of objects that store hashes as strings. This would be much, much faster than the binary search tree approach.
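Here's a sketch of that bucketing idea. It's in Python rather than EnScript, and it uses 65,536 separate sorted lists instead of one MemoryFileClass buffer, so it only illustrates the lookup logic, not the contiguous memory layout:

import bisect, os

buckets = [[] for _ in range(65536)]          # one bucket per possible first two bytes

def add(md5_digest):                          # md5_digest: the raw 16-byte MD5
    buckets[(md5_digest[0] << 8) | md5_digest[1]].append(md5_digest)

def finalize():
    for b in buckets:
        b.sort()                              # sort each bucket once, up front

def contains(md5_digest):
    b = buckets[(md5_digest[0] << 8) | md5_digest[1]]   # ~1/65,536 of the hash set
    i = bisect.bisect_left(b, md5_digest)
    return i < len(b) and b[i] == md5_digest

# Hypothetical usage with random stand-in digests:
for _ in range(1000):
    add(os.urandom(16))
finalize()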
Or you can just use Oliver's filter, which I encourage you to do.
What you are describing at the end bit there sounds like a poor man's Bloom Filter. I highly recommend using the real thing: faster & uses less memory.
Vassil Roussev and Golden Richard had a paper in 2006 about using Bloom filters for hash set analysis:
http://www.dfrws.org/2006/proceedings/11-Roussev.pdf
Dave White at NIST put together a presentation on using Bloom filters with the NSRL:
http://www.nsrl.nist.gov/Documents/aafs2008/dw-2-AAFS-2008-bloom.pdf
Some of NIST's research on this topic is available here:
http://www.nsrl.nist.gov/RDS/rds_2.13/bloom/
However, it doesn't seem like they're providing a Bloom filter with the current shipping NSRL sets.
Paul Farrell, Simson Garfinkel, and Dave White explored these ideas further in this 2008 paper:
http://www.simson.net/clips/academic/2008.ACSAC.Bloom.pdf
One thing about forensics is that we want the false positive rate to be very low. This requires a pretty large Bloom filter. A reasonably-sized Bloom filter used in concert with a database (for double-checking matches) might be a good trade-off.
Of course, writing a Bloom filter implementation in EnScript might be, uh, suboptimal.
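For anyone who wants to see the mechanics anyway, here's a toy Bloom filter in Python (deliberately not EnScript). The bit-array size and hash count are arbitrary choices for illustration; real sizing should follow the papers above.

import hashlib

M = 1 << 24                  # bits in the filter (2 MB); arbitrary for illustration
K = 7                        # number of hash functions; arbitrary for illustration
bits = bytearray(M // 8)

def _positions(md5_digest):
    # Derive K bit positions by re-hashing the digest with a counter byte.
    for i in range(K):
        h = hashlib.sha256(bytes([i]) + md5_digest).digest()
        yield int.from_bytes(h[:4], "big") % M

def add(md5_digest):
    for p in _positions(md5_digest):
        bits[p // 8] |= 1 << (p % 8)

def maybe_contains(md5_digest):
    # False means definitely not in the set; True means possibly in the set,
    # which is where the database double-check comes in.
    return all(bits[p // 8] & (1 << (p % 8)) for p in _positions(md5_digest))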