2012-05-19

Lightgrep for EnCase at CEIC 2012

I'll be at CEIC 2012 next week, arriving Monday and leaving mid-morning on Thursday. Geoff Black and I will be talking again about the use of statistical sampling in eDiscovery matters, based on a matter we worked on in which the Hon. Andrew J. Peck was the presiding judge. Our talk is at 4:30pm on Monday.

We are also putting the finishing touches on Lightgrep for EnCase and we'd love to show it to folks attending CEIC. Lightgrep is a new regular expression search engine we wrote that searches evidence for many keywords very quickly. We've created an EnScript wrapper around the low-level search engine that allows you to use it instead of EnCase's own keyword search facility. Among other features, it will bookmark search hits and create an Excel report with details about the search.

Lightgrep shines when you have many keywords or need more advanced grep functionality than EnCase provides. Last year we had some investigators use an early version on a case with over a million keywords, so we think it sets a new standard for scalability. Lightgrep also supports more grep features, and we try to make them behave as much like Perl's as possible. We don't have all of Perl's functionality just yet, but we'll keep adding more grep operators over time (and we do a lot of testing to make sure we get the same hits Perl does).

Unicode support is what we've been working on most recently, and I'm pretty excited about it. Out of the gate, we'll have support for ASCII (really, Windows code page 1252), UTF-8, UTF-16LE (two-byte Unicode on Windows), UTF-16BE, and UTF-32LE and BE. We are able to look for every character in the Unicode standard, including brand new ones in Unicode version 6.1.0, like the infamous U+1F4A9. We won't support other code pages just yet, but the code is mostly developed and we simply need our test suites to catch up.
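To make the encoding point concrete, here is a small Python sketch (not Lightgrep itself, and not how Lightgrep works internally) showing that one keyword becomes a different byte sequence under each of the encodings listed above. The function name encoded_forms and the list ENCODINGS are made up for this illustration; the codec names are Python's standard ones.

```python
# Illustration only: the same keyword is a different byte pattern in each
# encoding, so a search engine has to look for every form you care about.
# ENCODINGS and encoded_forms are hypothetical names for this sketch.
ENCODINGS = ["cp1252", "utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]


def encoded_forms(keyword):
    """Map each encoding name to the keyword's bytes, or None if the
    keyword cannot be represented in that encoding."""
    forms = {}
    for enc in ENCODINGS:
        try:
            forms[enc] = keyword.encode(enc)
        except UnicodeEncodeError:
            # e.g. U+1F4A9 has no representation in code page 1252
            forms[enc] = None
    return forms


if __name__ == "__main__":
    for enc, raw in encoded_forms("\U0001F4A9").items():
        print(enc, raw.hex(" ") if raw is not None else "(not representable)")
```

Note that U+1F4A9 lies outside the Basic Multilingual Plane, so its UTF-16 forms are four bytes long (a surrogate pair) rather than two.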

The cool thing about Lightgrep's Unicode support is that you can use Unicode properties in patterns. You can specify \p{Digit} or \p{Letter} and look for only valid digits or letters, but in any language. You can also specify different scripts, which is amazingly useful when looking for text that's not in the Latin alphabet. For example, \p{Cyrillic}+ and \p{Arabic}+ will find runs of text in the Cyrillic script (e.g., Russian) and the Arabic script, respectively.
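If you want a feel for what property-based matching means, here is a rough Python sketch. Python's built-in re module does not understand \p{...}, so this emulates \p{Letter}+ and \p{Digit}+ with the standard unicodedata module instead. I'm assuming here that \p{Letter} corresponds to the general categories starting with "L" and \p{Digit} to category "Nd", as in Perl; the helper names (runs, is_letter, is_digit) are invented for the example and have nothing to do with Lightgrep's API.

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    # Assumption: \p{Letter} means general categories Lu, Ll, Lt, Lm, Lo
    return unicodedata.category(ch).startswith("L")


def is_digit(ch):
    # Assumption: \p{Digit} means general category Nd (decimal digit)
    return unicodedata.category(ch) == "Nd"


def runs(text, predicate):
    """Return maximal runs of consecutive characters satisfying predicate,
    roughly what a pattern like \\p{Letter}+ would match."""
    return ["".join(group) for ok, group in groupby(text, key=predicate) if ok]


if __name__ == "__main__":
    # Latin letters, Cyrillic letters, ASCII digits, Arabic-Indic digits
    sample = "abc привет 123 \u0661\u0662\u0663"
    print(runs(sample, is_letter))
    print(runs(sample, is_digit))
```

Script properties such as \p{Cyrillic} are not available in Python's standard library, so they are left out of this sketch.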

If you'd like to see Lightgrep in action at CEIC, just email me at jon@lightboxtechnologies.com and we'll find a good time to chat. I'll have a few dozen thumb drives with a Lightgrep trial version installer, so don't be shy.