2010-11-23

Android Overview

Tim Bray's current job with Google centers around being an evangelist/catalyst for the Android platform. He's often accused of being a shill; his title is "Developer Advocate."

I'd argue that there is no better way to advocate to developers than to provide them with clear, concise overview documentation. Whether you're a coder, a forensicator, both, or neither (what are you doing reading this blog if you're neither?), you should check out his recent blog post, What Android Is.

2010-11-19

A Cheap Trick for quadratic loops

The other day I blogged about the perils of a hash-value-related filter that runs in quadratic time. Sometimes there's a hack that lets you keep the simplicity of a quadratic nested loop while making its performance a bit more palatable. Here's a script I wrote a couple of weeks ago to give me the unique list of folders containing selected files:


class MainClass {
  void Main(CaseClass c) {
    if (c) {
      EntryClass root = c.EntryRoot();
      NameListClass list();            // flat list of unique folder paths
      String folder,
             last;                     // folder of the previous entry we saw
      forall (EntryClass e in root) {
        if (!e.IsFolder() && e.IsSelected()) {
          folder = e.FullPath().GetFilePath();
          // Only scan the list when the folder differs from the last one seen.
          if (last != folder && !list.Find(folder)) {
            new NameListClass(list, folder);
          }
          last = folder;
        }
      }
      foreach (NameListClass f in list) {
        Console.WriteLine(f.Name());
      }
    }
  }
}


This code uses a flat NameListClass to keep a unique list of the folders, and NameListClass::Find() does a sequential scan of that list. Even so, it runs in reasonable time.

The trick is that the script remembers the previous entry's folder. Chances are the current entry is in the same folder, so the script doesn't need to scan the list at all. In general, when you're looping over a tree with forall(), there's probably an ordering principle to the data in the tree. It's easy to cache a variable or two from the last item processed, and doing so can save you a lot of work. If the cache misses, then, yes, your code is slow... but it's also simple and fast to write. Obviously it's not appropriate for every situation, but it's a good trick to keep in mind.

2010-11-17

Fast Hash Matching

There's a new EnScript filter that Lance blogged about, written by Oliver Höpli. The filter connects to a relational database and then shows only the files in EnCase whose hash values match what's in the database. It's pretty useful; you should grab it.

There's a filter that ships with EnCase that does the same thing. Ever have a little mistake slip by you, and then find you have to live with it forever? That's what happened with that filter, and I'm sorry: it's my fault. As Oliver and others point out, the filter maintains its set of hash values as a flat NameListClass. For every file, it looks through the list for that file's hash. If you have n files and m hash values, that means the filter runs in O(nm) time. If m is roughly equal to n, it runs in O(n²) time; it's quadratic.
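
To make the shape of the problem concrete, here's a minimal C++ sketch of that kind of lookup (not the EnScript filter itself, and the names are mine): a flat list of hash strings, scanned from the beginning once per file.

#include <string>
#include <vector>

// With m hashes in the list, each call is O(m); calling it once for each of
// the n files in a case costs O(nm) comparisons overall.
bool inFlatList(const std::vector<std::string>& hashList,
                const std::string& fileHash) {
  for (const std::string& h : hashList) {  // sequential scan, the same kind
    if (h == fileHash) {                   // of work NameListClass::Find() does
      return true;
    }
  }
  return false;
}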

While the filter is simple, that's not enough to account for the fact that it takes an unacceptably, and needlessly, long time to run. So, a long time ago, I wrote a different version that uses a binary search tree, also built out of NameListClass, to do the lookups. This reduces the running time of the algorithm to O(n log n). As Lance points out, the logarithm of 1,000,000 is approximately 20 (if 2 is the base of the logarithm, not 10). If you have a million files, it will take 20,000,000 operations to run the filter using the binary search tree. That's a lot better than 1,000,000,000,000 (1 trillion) operations.
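
For a rough illustration of why the lookups get so much cheaper, here's the same idea in C++ using a sorted array and a binary search instead of a tree (again a sketch with made-up names, not the shipping filter): sort the hash list once, and each lookup then touches only about log m entries.

#include <algorithm>
#include <string>
#include <vector>

// Sort the m hash values once, up front: O(m log m).
void prepareHashList(std::vector<std::string>& hashList) {
  std::sort(hashList.begin(), hashList.end());
}

// Each lookup is then a binary search over the sorted list: O(log m)
// comparisons (roughly 20 for a million hashes) instead of a full scan.
bool inSortedList(const std::vector<std::string>& hashList,
                  const std::string& fileHash) {
  return std::binary_search(hashList.begin(), hashList.end(), fileHash);
}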

In the comments, someone posted the code from the filter. It relies on the binary search tree implementation in GSI_Basic.EnScript, which is in the default Include/ folder. As it turns out, the better filter had once shipped with EnCase (I think; I may be wrong), but the filters were accidentally replaced during v6 development, prior to its release. The sad thing is that I knew about it and should have spent the 30 minutes to fix it, but it was always lower priority than other problems. Unfortunately, what would have cost me 30 minutes has ended up costing everyone weeks (months? years?) of processing time.

(As an aside, this is a good example of why life shouldn't be handled as a priority queue. Sometimes there's low-hanging fruit that you can't get to if you always work on the highest priority problem, and that low-hanging, but lower priority, fruit might be tasty and nourishing.)

While it's true that the binary search tree approach is faster than Oliver's filter, I'd encourage you to use Oliver's filter instead. While a database will almost never be the fastest approach, it won't be bad. Compared to hours and hours of processing time for the quadratic filter, the difference between 3 minutes and 20 seconds is negligible. And for that additional 2 minutes and 40 seconds of trade-off, you'll get the ability to use more hash values than you have RAM, something the binary search tree filter can't handle, and you'll be able to manage your hash values conveniently in a database.

(If you have billions and billions of hash values, more exotic solutions may be necessary.)

The binary search tree filter isn't even the best possible approach. The code on Lance's blog converts every hash value to a string and relies on lexicographic sorting of those strings. This adds a lot of overhead. It performs a lot of memory allocations, since a new NameListClass object must be allocated for every hash value. These objects may be located in different parts of RAM, so you lose the benefits of the L1 and L2 caches. What's more, it ignores some of the properties of hash values themselves... you end up spending too much time maintaining the order of the binary search tree when you could take advantage of how a hash works.

For example, you could take a guess at the number of hash values you'll have. For Unique Files by Hash, the number of files in the case gives you an upper bound. So, create a MemoryFileClass buffer: multiply 16 (the size of an MD5 hash, in bytes) by the total number of files, and then add an extra 50%. Now divide that buffer into 65,536 chunks. Instead of creating a binary search tree, put all the hash values that share the same first two bytes into the corresponding chunk. Some chunks will hold more than others, but since we added our fudge factor you'll still have room, and it won't vary too much because of the cryptographic properties of MD5 (if things were lopsided, it'd be a bad hash function). At the beginning of each chunk, store the number of hashes in that chunk. Then sort the hashes within each chunk.

Now, as you go, take the first two bytes of each hash value you need to look up and seek directly to that chunk of the MemoryFileClass buffer. Congratulations! You have now brought all those sorted, closely related hash values into L2 cache (if not L1). Additionally, you've eliminated roughly 99.998% of the hash values you'd otherwise have to consider. You now only need to binary search the hashes in the relevant chunk, and that will be fast because you're dealing with contiguous bytes in RAM instead of objects that store hashes as strings. This would be much, much faster than the binary search tree approach.
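
Here's a rough C++ sketch of that layout and lookup. The names are mine, and it uses a vector of vectors in place of the single pre-sized MemoryFileClass buffer with per-chunk counts described above; it's the same access pattern with less bookkeeping.

#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::array<std::uint8_t, 16> Md5;  // a raw 16-byte MD5 value

struct BucketedHashes {
  // One bucket for every possible value of the first two bytes of a hash.
  std::vector<std::vector<Md5>> buckets;

  BucketedHashes() : buckets(65536) {}

  // The first two bytes of the hash select the bucket.
  static std::size_t keyOf(const Md5& h) {
    return (std::size_t(h[0]) << 8) | h[1];
  }

  void add(const Md5& h) {
    buckets[keyOf(h)].push_back(h);
  }

  // Sort each bucket once, before any lookups.
  void finish() {
    for (std::vector<Md5>& b : buckets) {
      std::sort(b.begin(), b.end());
    }
  }

  // Lookup: jump straight to the bucket for the first two bytes, then binary
  // search the small, contiguous, sorted run of hashes stored there.
  bool contains(const Md5& h) const {
    const std::vector<Md5>& b = buckets[keyOf(h)];
    return std::binary_search(b.begin(), b.end(), h);
  }
};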

Or you can just use Oliver's filter, which I encourage you to do.