2012-12-02

System Reconfig

A year and a half ago, I bought a new system. Meet Virgil:



Virgil's main specs:

  • 2x Xeon E5645 (2.4GHz, 6 cores, 12MB L3, hyperthreading—apiece)
  • the legendary EVGA SR-2 motherboard
  • 24GB RAM
  • a pair of WD RE4 2TB hard drives
  • in a Lian Li PC-V2120 case
  • all kinds of I/O ports and a ho-hum-but-CUDA-capable GPU

The SR-2 motherboard is a non-standard size and will not fit in a normal ATX case; it necessitated the mondo case. At the time, the SR-2 was about the only desktop board that allowed for dual Xeons (otherwise you needed to buy a true server; blech). A year and a half after purchase, it is still far more computer than I need, and I sometimes feel bad considering how much it idles. With its cores at only 2.4GHz, some laptops can beat Virgil at single-threaded tasks, but Virgil excels at parallel compiles, data processing, and the like. With its Xeon fans, Virgil sounds like a red-lined Ferrari when it's under load.

Operating Systems

I started off running Ubuntu, which I mostly liked. However, the upgrade to 12.04 failed completely—Virgil could not boot, and I had no interest in troubleshooting the failure. Additionally, I found Ubuntu's Unity interface a little tiresome. I probably appreciated it more than most Linux users, but it involved a bit more mousing around than I wanted.

A colleague is an expert Fedora user, so I installed Fedora. I figured it'd be worthwhile to learn an alternate approach, and I could always piggyback off his knowledge. However, I hated Fedora. Gnome 3 is a disaster, Xfce felt like 1998 all over again, and Fedora demanded far too much sysadmin time from me. I didn't want to relearn how to set up Samba configuration files, as if it were still 2002; I wanted to click a checkbox and have my home folder shared so I could access it from VMs without any fuss. This was possible on Ubuntu... not so much on Fedora.

Additionally, I gradually realized that the Fedora installer had messed up my RAID. Under Ubuntu, I'd configured the two drives to act as an LVM RAID-1. When I installed Fedora, it could not recognize the partition, which was hugely annoying, but it did seem to recognize the underlying LVM RAID. I was wrong, though: it had turned the RAID-1 into a spanned volume. I no longer had a redundant 2TB drive; I had a 4TB volume with twice the failure probability of a single drive and no redeeming performance qualities.

Of course, by the time I realized this (disclosure: I am really bad at IT/sysadmin), I'd already accumulated a little over 2TB of data, thanks in large part to VirusShare. It would not be so simple to go back to the RAID-1.

Reconfig

So I bought another 2TB WD RE4 hard drive and a 256GB OCZ Vertex 4 SSD. I had two main goals. First, I wanted to install Mint as the main Linux system, and in such a way that I could trash it, install some other system, and be up and running again in a day or less, without too much time spent sysadmin'ing. Second, I wanted a RAID for my main data, somewhat separate from /home, so that under a new system I could simply remount it as a subdirectory in my home directory, perhaps issue a recursive chown & chgrp, and be good to go. (Fedora and Ubuntu use different numbering schemes for UIDs, of course; it should be possible to preserve one's UID from system to system with a little work, but the underlying theme of this blog post is that I am too lazy to do such things.)

I first installed the SSD in the system and installed the newly released Mint 14 (with the Cinnamon UI) on a single / partition. I then backed up all of my data to an external 3TB drive, installed the 3rd WD drive, and blew my old data away.

Although the SR-2 board supports various RAID options in hardware, I decided I wanted a software RAID so I could survive a motherboard failure. Virgil has compute power to spare, and these are only 7200 RPM drives, so I shouldn't suffer significant performance loss by going the software RAID route. I first created an unformatted partition on each of the drives, leaving a small buffer before and after in case I someday need to add another drive that differs slightly in its sector count. I then constructed a RAID5 array from the three partitions, following the instructions on the Linux RAID wiki:
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdc1 /dev/sdd1
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
This created a 4TB RAID5 across the disks, with a default-selected chunk size of 512KB.

XFS

I narrowed my choices for the filesystem to EXT4 or XFS. Since I have a mix of very large files (evidence and other archives) and very small files (source code), I wanted an extent-based filesystem. BtrFS has an advanced feature set, but reading up on it does not fill one with confidence about its reliability. Of course, one can find data loss stories about EXT4 and XFS as well. In the end I decided to go with XFS: it is an older design that has been strenuously tested on far more advanced RAID hardware, but it is still under active development and maintenance. This is an intuitive choice more than anything.

So I issued the command:
mkfs -t xfs /dev/md0
and immediately was presented with an error that the 512KB chunk size was too large for XFS. It seems that the default chunk size of 512KB is a recent change in the mdadm tools, likely geared towards EXT4. However, the maximum chunk size that XFS can handle is 256KB and XFS developers still recommend a 32KB chunk size in the general case. So, I deleted the RAID and recreated it:

mdadm --create --verbose /dev/md0 --level=5 --chunk=32 --raid-devices=3 /dev/sda1 /dev/sdc1 /dev/sdd1
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
mkfs -t xfs /dev/md0
mkdir -p /mnt/raid
mount /dev/md0 /mnt/raid
mkdir -p /mnt/raid/jon/data
ln -s /mnt/raid/jon/data /home/jon/data
I now had my XFS RAID5 and had it easily accessible inside my home directory, while still having the overall volume available for other uses.

One thing I noticed in reading about Linux RAIDs is that many people use LVM on top of the raw RAID device and then create different filesystems on top of LVM. If one were managing a system with a few large RAIDs for a large number of users, I could see how LVM would be useful. However, I cannot see any utility for it when you just want a single RAID volume. There's likely no performance loss to using LVM, but it adds another layer of complexity... and I am too lazy for unnecessary complexity.

The Disposable System

The RAID solves my data problem, but what about being able to get up and running with another system in the future? This enters into the realm of configuration management. Two new configuration management systems are taking the open source world by storm: Puppet and Chef. Both are implemented in Ruby, both seem like good pieces of software, and both are geared towards sysadmins who need to manage many systems. Essentially, each takes configuration files specifying which packages, files, user accounts, and other configuration details should apply to a given system, and then makes the system conform to that description.

The main differentiator between Puppet and Chef is that Puppet has its own syntax for its config files, whereas Chef's config files are valid Ruby source code. What I want to do is simply track the packages I install over time, so that I can get them reinstalled quickly if I need to. After looking at the syntax of each, I decided that Puppet's would be slightly simpler.

So, what I have been doing is building a directory of Puppet config files, each containing a list of software packages. The config files are roughly categorized, so I have one for basic C/C++ development, another for Python, one for system tools, one for GUI applications, etc. When I find myself in need of a package, I add it to the appropriate config file, and I can then bring my system into compliance with:
sudo puppet apply config-file.pp
Puppet figures out which, if any, packages are missing and then invokes apt-get to download and install them. If I switched back to Fedora, or to FreeBSD or some other crazy system, I would likely need to remap some of these package names slightly, but at least I'm tracking them. Additionally, I will sometimes create an EC2 instance to accomplish a specific task, and having categorized Puppet config files should help me bootstrap barebones instances in short order. I track my config files with git, natch, and mirror them on Github as I do with all my other source code.
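For illustration, one of these categorized manifests might look something like this (the file name and package selection are hypothetical, a sketch rather than my actual config):

```puppet
# python.pp -- packages for Python development
package { ['python', 'python-dev', 'python-pip', 'ipython']:
  ensure => installed,
}
```

Applying the manifest with `sudo puppet apply python.pp` installs whichever of those packages are missing and leaves the rest alone.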

Data has mass; I now have a RAID that should be able to survive a move to a different environment without having to make copies. And by using Puppet to reduce system management into, essentially, software development, I have reduced the mass of my operating system to almost nothing. All I need to do on a new system is get it to mount my RAID, drop my ssh keys, install Puppet, and then wait while all the rest of the software I depend on is downloaded and installed automatically.

etckeeper

Speaking of git, another useful package for anyone running Linux is etckeeper. etckeeper hooks into your package management system and keeps /etc under version control. If you inadvertently screw up some aspect of your system by editing various /etc config files, you can always go back in time to restore things. In this respect, it's sort of like System Restore, except that it works.

2012-11-14

Representing Encoding Chains

Here's a problem I've been mulling over the last couple days: what is a good way to represent encoding chains?

What's an encoding chain?

When we search for text, it's always in some kind of encoding, something that specifies how logical characters are represented as "physical" bytes (a discussion of the physicality of bytes is best left for happy hour). For example, "A" is represented as byte 0x41 in the ANSI or UTF-8 encodings but as the two-byte sequence 0x41 0x00 in UTF-16LE (aka "two-byte Unicode") and UCS-2 (aka "fake two-byte Unicode"; never use UCS-2).

To search for a keyword in a file, we either need to know the file's encoding and convert the file's bytes into logical characters to compare against the keyword, or we need to convert the keyword's characters into their binary form using the file's encoding and then search for those bytes in the file. Most text editors do the former; Lightgrep does the latter, as we don't usually know the encodings used for files, especially when searching fragments in unallocated space and whatnot.

An encoding chain goes beyond a single encoding: it's what happens when we want to deal with data that's been subject to a series of transformations, including a normal characters-to-bytes encoding.

Examples

A useful artifact in forensics is the UserAssist registry key. These keys store recently run programs, recording the name of the program, when it was last run, and the number of times it was run. A bewildering aspect of values under the UserAssist key is that the names of the programs are ROT13 "encrypted". This means that "A" becomes "N", "B" becomes "O", and so on.

ROT13 seems like a silly form of obfuscation... until you need to search for UserAssist registry keys pertaining to a particular program. Many forensic tools will parse registry hives and some will even decrypt the UserAssist keys for you, but if you'd like to look for traces of keys in the pagefile or in unallocated, you are out of luck when it comes to tool support. You must manually apply ROT13 to the name of the program ("notepad" becomes "abgrcnq") and then apply the UTF-16LE encoding to it ("abgrcnq" becomes "a\x00b\x00g\x00r\x00c\x00n\x00q\x00"). Tedious, no?
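The manual steps above are easy to script. Here's a minimal Python sketch (the function name is my own) that builds the searchable byte pattern:

```python
import codecs

def userassist_pattern(name):
    """ROT13 the program name, then encode as UTF-16LE, yielding the
    raw bytes to search for in a pagefile or unallocated space."""
    rot13 = codecs.encode(name, "rot13")   # "notepad" -> "abgrcnq"
    return rot13.encode("utf-16-le")       # -> b"a\x00b\x00g\x00..."

print(userassist_pattern("notepad"))
```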

Another example (featuring more love for Didier Stevens) is looking for obfuscated strings in malware. XORSearch is a program Mr. Stevens recently released that lets you search for strings that have been obfuscated with XOR, ROR (rotate-right), or ROL (rotate-left) transformations. These types of transforms are often used by malware authors to obfuscate strings like URLs.

When searching, it would be very nice to say "look for 'notepad' but in ROT13'd UTF-16LE" or "look for 'http://sirhaxal0t.ru/' in UTF-8 that's been XOR'ed". The permutation of "ROT13, then UTF-16LE" or "UTF-8, then XOR" is what I mean when I say "encoding chain." We want to apply a chain of encodings to our search terms.

Characters v. Bytes

ROT13 is a transformation that applies to characters, with "A" becoming "N" and so on. In contrast, XOR is a transformation that applies to bytes. Character transforms apply to the string first before it's encoded into bytes, whereas byte transforms apply to the bytes after the encoding has happened. Here's a table that clearly delineates transformations and encodings:

Character Transforms: ROT13, URL encoding, HTML entities
Encodings: ASCII, UTF-8, UTF-16LE, EBCDIC, Shift-JIS, ISO-8859-5, etc.
Byte Transforms: XOR, ROR, ROL, Outlook Compressible Encryption, Base64

You may have noticed that there are a few constraints to the encoding chains we'd like to apply. First, there's an ordering constraint in that character transforms must come first, followed by the encoding, and finally by byte transforms. An encoding chain like "XOR -> ROT13 -> UTF-8" is invalid and nonsensical. Second, while it's possible to have multiple character transforms or byte transforms in a chain, there can be only one encoding. An encoding changes the type of thing we're working with from characters (ideally Unicode codepoints) to bytes, and that type change can only occur once.
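One way to model these constraints in code (helper names are hypothetical; this is a sketch of the type discipline, not Lightgrep's implementation):

```python
import codecs

def apply_chain(s, char_transforms, encoding, byte_transforms):
    """Apply character transforms, then exactly one encoding, then
    byte transforms -- enforcing the ordering constraint by construction."""
    for t in char_transforms:
        s = t(s)                   # str -> str
    data = s.encode(encoding)      # the single characters -> bytes step
    for t in byte_transforms:
        data = t(data)             # bytes -> bytes
    return data

rot13 = lambda s: codecs.encode(s, "rot13")
xor = lambda key: (lambda b: bytes(x ^ key for x in b))

# "ROT13, then UTF-16LE" (the UserAssist chain):
pattern = apply_chain("notepad", [rot13], "utf-16-le", [])
# "UTF-8, then XOR" (an obfuscated URL):
url = apply_chain("http://sirhaxal0t.ru/", [], "utf-8", [xor(0x42)])
```

An invalid chain like "XOR -> ROT13 -> UTF-8" simply cannot be expressed: the byte transforms never see characters.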

The Ultimate Problem

We've been working on adding support for Outlook Compressible Encryption to Lightgrep, and that's necessitated thinking about encoding chains and how to expose the abstraction. At this point the main technical work is done and we are arguing about how to represent the encoding chains.

Whither:

"outlook(base64(utf8(rot13())))"

or

"ROT13 | UTF-8 | Base64 | Outlook"

or

"ROT13 > UTF-8 > Base64 > Outlook"

?

We'd like to have a good notation for specifying these chains. One interesting thing to note is that some transformations may want parameters. For example, ROR and ROL need to know how many bits of shift to apply. It would be silly to have ROR1, ROR2, ROR3, ROR4, ROR5, ROR6, and ROR7, when we could just have "ROR(2)", for example. With parameters, we could come up with a generic rotate character transformation, and then ROT13 would be "ROT(13)". (If you were investigating a classically trained hacker, you may want to look for text obfuscated by the Caesar cipher, i.e., ROT(3).)
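Parameterized transforms fall out naturally. A sketch (my own function names; ROT here only touches ASCII letters, and ROR works per byte):

```python
def rot(n):
    """Generic character rotation: rot(13) is ROT13, rot(3) the Caesar cipher."""
    def transform(s):
        out = []
        for c in s:
            if "a" <= c <= "z":
                out.append(chr((ord(c) - ord("a") + n) % 26 + ord("a")))
            elif "A" <= c <= "Z":
                out.append(chr((ord(c) - ord("A") + n) % 26 + ord("A")))
            else:
                out.append(c)
        return "".join(out)
    return transform

def ror(n):
    """Generic byte transform: rotate each byte right by n bits."""
    return lambda data: bytes(((b >> n) | (b << (8 - n))) & 0xFF for b in data)
```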

So...

...which representation above do you prefer? Or can you come up with something better?

* Addendum *

You can expect Outlook Compressible Encryption support in Lightgrep fairly soon, but it will be some time before you see Base64. Base64 is not so much a byte transformation as a bit transformation, taking 6 bits of data and turning them into 8. Lightgrep needs some changes before it can work well with bits.


2012-09-17

Forensic Programming Rules of Thumb

I have a set of mental guidelines I use when writing digital forensics programs, and it's high time I write them down. These are by no means a complete list of useful programming techniques, but I think they're all worth keeping in mind.

Algorithms

1. Don't be afraid to use RAM. Most investigators' machines have several gigabytes of RAM installed, so you might as well use it.

2. Remember, though, that RAM is fixed while disk space is nearly infinite. If you can't set an upper limit on the amount of RAM used by your program, create a cache data structure, size it appropriately, then spill to disk if you exceed it.

3. Prefer streaming models to building up complex data structures in RAM. Streaming models will generally use a fixed amount of memory and scale to large evidence sets.

4. Choose algorithms based on the best worst-case behavior, rather than the best average-case. Investigators don't need the best possible performance from their software as much as they need predictable performance, and the ability for the software to cope when faced with The Big One. An algorithm that has bad worst-case performance is not mission capable.
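Rule 2 can be sketched as a bounded map that spills overflow to disk (a hypothetical class, using shelve as the spill store for simplicity):

```python
import os
import shelve
import tempfile

class SpillDict:
    """A bounded in-memory map that spills overflow to disk."""

    def __init__(self, max_items=1_000_000):
        self.max_items = max_items
        self.mem = {}
        self.disk = None   # opened lazily, only if we overflow

    def __setitem__(self, key, value):
        if key in self.mem or len(self.mem) < self.max_items:
            self.mem[key] = value
        else:
            if self.disk is None:
                self.disk = shelve.open(
                    os.path.join(tempfile.mkdtemp(), "spill"))
            self.disk[key] = value

    def __getitem__(self, key):
        if key in self.mem:
            return self.mem[key]
        if self.disk is not None:
            return self.disk[key]
        raise KeyError(key)
```

RAM usage is capped by max_items; everything past that lands on nearly infinite disk.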

Timestamps

1. Convert local time to UTC as soon as possible.

2. Always report timestamps in UTC. I'm generally not a fan of programmers telling users what to do, but I'll make an exception in this case; when users ask for local time output, firmly reply "no" and tell them to get comfortable working in UTC.

3. Do not separate times from dates in output. Timestamps are points in time; separating times from dates imposes the periodicity of the earth's revolution onto your investigation, and that's often not germane. Even if it is germane, having dates and times separated makes it a huge PITA to calculate timestamp differences in, e.g., Excel.

4. Timestamps generally exist within a "context," for lack of a better word. Contexts are often cascading/recursive. For example, the context for an OLE timestamp within a Word 2003 document is "Word 2003 OLE", followed by the filesystem (e.g., UTC for NTFS, local time adjustment for FAT32), followed by the system time settings, followed by, say, the timezone of where the evidence was seized. It's important to remember these contexts when handling local time. Even if you convert local time to UTC, one of these contexts could later be found to be incorrect. Consider the timestamps as existing in a database; it would be better to query for the appropriate rows and adjust them by context than to patch things up through some ad hoc mechanism.

Artifacts

1. Never assume anything about your input. Ooh, you found a magic header for your artifact? So what!? The data could have been generated by a cuckoo clock. Sanity check everything, especially when reading in variable length buffers/strings with sizes denoted in other fields. Return true or false depending on whether the artifact parsed successfully. If it mostly parsed but failed some kind of test, return the partial results along with false.

2. Keep your artifact parsing code separate from your I/O code (if possible; better to operate on provided memory buffers and let other code feed it data) and separate from your reporting code. This way you can foster code reuse when someone else needs to parse the same artifact but in a different context.

3. Put the immediate results from parsing into a struct or, perhaps better, a name-value data structure, such as a Python dict. Don't serialize into a string or otherwise obliterate the type information you have at that moment; that's for your output code to handle. Again, this fosters code reuse.

4. Always refer to bytes on disk. When you output parsed artifacts, always give the user reference to where that artifact existed on disk. This way the user has a chance in hell of finding it for themselves and validating that your software parsed it correctly.

5. Remember that parsing is not the same thing as analysis. The output from parsing an artifact should be in a form that makes it convenient for the examiner to analyze. This could mean, for example, that it's easy to get the output into Excel or Access or pipe it into different command-line utilities. It's way better to produce output that is well-structured than pretty—well-structured data can always be prettied up, while it's often hard to recreate structure from pretty data.

6. It's better to identify an artifact by searching for it than by assuming it exists within a well-structured file. If you can identify artifacts by searching, then you can parse them out of RAM dumps, virtual memory swapfiles, and unallocated space.
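To make rules 1 through 3 concrete, here is a sketch (the artifact format, magic value, and field names are made up for illustration):

```python
import struct

MAGIC = b"ARTF"   # hypothetical artifact signature

def parse_artifact(buf):
    """Parse a (made-up) artifact: 4-byte magic, 4-byte little-endian
    name length, then the name. Operates on a provided buffer (no I/O),
    sanity checks every field, and returns (ok, results) -- with partial
    results and ok=False if any check fails."""
    results = {}
    if len(buf) < 8 or buf[:4] != MAGIC:
        return False, results
    (name_len,) = struct.unpack_from("<I", buf, 4)
    if name_len > len(buf) - 8:   # the length field lies; don't trust it
        return False, results
    results["name"] = buf[8:8 + name_len]
    return True, results
```

The caller handles I/O and reporting; this function only turns bytes into a typed dict.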

2012-08-24

Happy Birthday, Mr. Pigeon

Yesterday, I asked a question on Twitter:

I will admit, at the time I was hoping that one of the respondents would be Jesse Kornblum. I was not disappointed. There was a little back and forth on Twitter, and then Jesse decided to write a blog post, "Being Unreasonable About MD5." In it, he advised me to use a SHA-2-based hash, or to wait until SHA-3 is determined.

If using awesome Rocky IV images in blog posts is wrong, I don't want to be right.
DFIR Slugfest!
It being 2012, we should all know that MD5 is "broken." The first MD5 collision was announced in 2004 (some analysis [PDF]), and the ability to generate collisions has only improved since then. Earlier this summer we were treated to the sheer awesomeness of Flame exploiting an MD5 collision which could not have been generated based on publicly known techniques (*cough* USA!, USA! *cough*).


Here's the problem, though: I'm looking for a hash function that I can use for lazy equality comparisons, not for use in digital signatures. That is, I want to see whether a new string of input I've read in from a program is the same as a previous string of input I've read in.

Why do I want to do this? Well, I want to grep hard drives for email addresses and spit out a list of all the unique ones I've seen. I also want to do this with web addresses/URLs, credit card numbers, Social Security Numbers, and so on. One possible way of doing this (a really lazy way, and it might not even be the best way; that's a separate story altogether) is to hash all the email addresses with MD5 and call it a day. MD5 will surely work for this, won't it?

...

In 1834, Johann Dirichlet postulated what has come to be known as the Pigeonhole Principle. It is one of those theorems in mathematics where the proof of it is simply common sense: if you have n pigeonholes, and n+1 pigeons, you're guaranteed to have 2 pigeons shacking up.

Feral Pigeon (Columba livia domestica)

MD5 is a 128-bit hash algorithm. Thus, it has 2^128 different pigeonholes. According to the pigeonhole principle, we need to have (2^128) + 1 items before we're guaranteed to have a collision.

In case you're wondering, 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456. There certainly aren't that many unique email addresses we're ever going to encounter, so we should be good, right?

Well, actually, let's take a moment to guess how many unique email addresses there are. I don't have a central repository of all the email addresses ever used, but I can use myself as an example and estimate out from there. Here are all the email addresses I can remember having:


  1. (I don't remember the first one, but it was @nebbs.nersc.gov on a Cray X-MP, which was rad.)
  2. beefstew@aol.com (yes, really)
  3. stew1@iastate.edu
  4. stew1@avalon.net
  5. jonathan.l.stewart@uiowa.edu
  6. jstewart@cs.uiowa.edu
  7. stew1@ellipsis.cx
  8. jonathan_l_stewart@hotmail.com
  9. jonathan_l_stewart@yahoo.com
  10. jonathan.l.stewart@gmail.com
  11. jonathan.l.stewart@acm.org
  12. jonstewart17@comcast.net (DIAF, Comcast)
  13. jon@lightboxtechnologies.com

Lucky number 13. Of course, I may have used a few burner email addresses over the years and there could be a couple of legitimate ones I've forgotten, too. So, let's round that up to 50, just for sake of argument.

Also, let's assume that I'm the most average person on Earth. How many people are there? Supposedly 7.035 billion, as of today, August 24, 2012. Extrapolating out, we could be dealing with 351 billion email addresses, if every single person on the planet has had 50 email addresses. However, there are almost certainly email addresses that are associated with computers and departments and mailing lists and such, not just people. So, well, let's double that: 702 billion email addresses. That's our estimated population.

Right off the bat, we can see that those clinging to 32-bit integers are going to need a bigger boat. If I happened to have a hard drive with all the email addresses on it and I was using a 32-bit hash, the pigeonhole principle tells me I'd have a lot of collisions.

Still, though, MD5 only has to deal with 702,000,000,000 email addresses and fit their hashes into 340,282,366,920,938,463,463,374,607,431,768,211,456 possible values. Continuing along the primrose path of overly simplistic math, this gives a probability of 702 billion/2^128, which is about 2.06e-27, or roughly 1 chance in 485 septillion. That's a lot of room for MD5 to play in.

Nyet so fast, Komrade!
However, thus far I've failed to take into account the Birthday Paradox. Have you ever been in a group of people and been surprised to find that two of them have the same birthday? As it turns out, the odds are even that two guests at a 23-person dinner party will have the same birthday, and it's 99% likely that two people in a group of just 57 will share one. This is because we are comparing pairs of birthdays, each person pairing off against all the rest. The number of pairs of n things is (n*(n-1))/2. As people pile up, so do the pairs, and the number of days in a year is quickly overwhelmed.

The Birthday Paradox leads to the Birthday Attack: start testing for collisions by brute force and, over time, you will get lucky and have a pair collide. The number of items we have to test, Q, before finding the first collision in a fair hash function is roughly equal to the square root of (pi * H)/2, where H is the total number of available pigeonholes. For MD5, H=2^128 and thus Q = sqrt((3.1415926 * 2^128)/2). What is that number?



It's ~23,119,565,135,021,060,000. That is, 23 quintillion. That number, while large, is a lot smaller than 340,282,366,920,938,463,463,374,607,431,768,211,456, the number of strings we'd have to hash before a collision becomes guaranteed. What about a 1% chance? Fortunately, Wikipedia calculated that number in the table under the Birthday Attack article. For a fair 128-bit hash, hashing a tenth of those 23 quintillion items gives a collision probability of about 1 in 100. If we generate ~820,000,000,000,000 strings (820 trillion), the chance we'll have a collision is 1 in a billion. If we generate 1000 times fewer strings, ~820,000,000,000 (820 billion), the chance we'll have a collision is 1 in a quadrillion.
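These figures check out with a few lines of arithmetic (same back-of-envelope approximations as above):

```python
import math

H = 2 ** 128                      # pigeonholes in a 128-bit hash

# Items to hash before the first expected collision:
Q = math.sqrt(math.pi * H / 2)    # ~2.31e19, i.e. ~23 quintillion

# Approximate collision probability for n items: p ~= n^2 / (2H)
n = 820e9                         # 820 billion strings
p = n * n / (2 * H)               # ~1e-15: about 1 in a quadrillion

print(f"Q = {Q:.3e}, p = {p:.1e}")
```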

820 billion is pretty close to the 702 billion distinct email addresses we estimated above. So, if MD5 is fair, we can expect about a 1 in a quadrillion chance that we'll have a collision.

Have you noticed how fast and loose I've played with the math here? And how much I've relied on Wikipedia? Disclaimer: I am really bad at math. When trying to figure out math like this, I generally reserve the right to be off by a factor of 1000. Therefore, you only get to correct me if you can show that I'm off by more than a factor of 1000.

Additionally, all these numbers above assume that MD5 is perfectly fair. In fact, we know that it isn't, because of the collision attacks. If I reserve the right to be off by a factor of 1000, and we know that MD5 is also kind of kaput, then we need to estimate our probability as somewhat higher. So, one in a trillion? One in 100 billion? Maybe a little less?

Now we're in judgment call territory. For my application of deduping email addresses, well, meh. I'll certainly take the odds of a collision when hashing all of the email addresses ever at one in a quadrillion, and even at one in a trillion. The nice thing about using MD5 for this is that it's reasonably fast and the digests fit in 16 bytes, the size of an SSE2 register.
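The lazy dedup scheme itself is only a few lines with Python's hashlib (a sketch; the function name is my own):

```python
import hashlib

seen = set()

def is_new(email):
    """Record a 16-byte MD5 digest per address instead of the address
    itself; return True the first time we see an address."""
    digest = hashlib.md5(email.encode("utf-8")).digest()
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

Keeping only digests bounds memory per entry at 16 bytes no matter how long the addresses get.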

However, what about all possible files? Are there more possible files than email addresses? Probably. What about all distinct 512-byte sectors, for those who like sector block hashing? Almost certainly. Are we starting to bump up into limits where we have to think about whether it's safe or not? Yes. Is that a bad thing? Yes. Like Rocky, we do not want to think, we do not want the judge to think, we do not want the jury to think; we want to win.

So, unless you want to think about all these things above, including whether you want to explain one in a trillion, then you should stop using MD5.

Although, to keep things in perspective, the risk of everyone dying next year due to a comet or asteroid strike is only 1 in 20,000.

 ...

Ok. Now you may correct my shoddy math in the comments.


2012-08-11

Random Sampling in Forensics

(NB: Some links are to PDFs.)

I had the pleasure of attending the Digital Forensics Research Workshop this past week. DFRWS is different than the other forensics conferences I've attended, in that speakers must submit formal academic papers for review by a committee. Less than one-third of submitted papers make the cut.

This is how most academic conferences work, and there are a few consequences stemming from the emphasis on peer-reviewed papers. One is that the quality of the material is generally high. This is good. Another is that many leading practitioners—who, as forensics generally demands, also double as researchers—don't present, since writing a formal academic paper is an enormous PITA. This is less good.

I enjoyed a number of presentations at DFRWS this year (chief among them was Ralf Brown's about language identification). The one that stood out for me, though, was a blend of theory and practice. This was The Use of Random Sampling in Investigations Involving Child Abuse Material, presented by Mike Wilkinson and co-authored by Brian Jones and Syd Pleno.

What Mike, Brian, and Syd did is, frankly, revolutionary. They established procedures at the New South Wales police force's State Electronic Evidence Branch so that child pornography possession cases make use of statistical sampling early in the analysis of the case. By using sampling, they have cut their case backlog from 3+ months to zero.

By using an EnScript that selects a random sample of images from suspects' hard drives, investigators can estimate the total amount of child pornography on the system with a high degree of confidence. A report based on the sampling review is created and, while situations no doubt vary, the report is typically presented to the suspect along with a plea bargain... and most take the plea. Such cases take only a few hours of processing.
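For intuition only (this is a textbook normal-approximation sketch, not the procedure from the paper), estimating the total from a random sample looks something like:

```python
import math

def estimate_total(population, hits_in_sample, sample_size, z=1.96):
    """Estimate how many items of interest a population contains,
    given a simple random sample, with a ~95% confidence interval
    (normal approximation to the sample proportion)."""
    p = hits_in_sample / sample_size
    se = math.sqrt(p * (1 - p) / sample_size)
    low = max(p - z * se, 0.0)
    high = min(p + z * se, 1.0)
    return population * p, population * low, population * high

# E.g., if 120 of 600 randomly sampled images are contraband, out of
# 250,000 images on the drive:
est, low, high = estimate_total(250_000, 120, 600)
```

A few hundred sampled images pin down the drive-wide count tightly enough for a report, which is exactly why the review takes hours instead of months.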

Of course, such a methodology is only appropriate for possession cases, not distribution or where active child abuse is suspected. Digital investigators benefit because they don't have to sit for hours reviewing thousands (or millions) of corrosive images, and they have more time for other cases. Other investigators benefit because their computer evidence gets turned around faster. Even the falsely accused benefit, because they are quickly exonerated.

Part of why the procedure works is because it's well-adapted to the law in New South Wales, Australia. However, with care and consideration, it is hard to believe that the basic idea could not be applied in the United States.

I encourage you to read their paper and review their presentation. Kudos to Brian, Syd, and Mike, and to the New South Wales police for adapting to the increasing strain on digital investigators with an elegant and effective solution.

2012-06-14

Lightgrep 1.0

I've just released Lightgrep for EnCase 1.0. Lightgrep is a new multipattern regular expression search engine that I've been developing with my colleagues over the last two years, and our EnCase integration makes it the fastest way to conduct searches over digital evidence.

I'm hoping I'll be writing more blogposts now, as I allow my mind to wander a bit more.

2012-05-19

Lightgrep for EnCase at CEIC 2012

I'll be at CEIC 2012 next week, arriving Monday and leaving mid-morning on Thursday. Geoff Black and I will be talking again about the use of statistical sampling in eDiscovery matters, based on a matter we worked on in which the Hon. Andrew J. Peck was the presiding judge. Our talk is at 4:30pm on Monday.

We are also putting the finishing touches on Lightgrep for EnCase, and we'd love to show it to folks attending CEIC. Lightgrep is a new regular expression search engine we wrote that searches evidence for many keywords very quickly. We've created an EnScript wrapper around the low-level search engine that allows you to use it instead of EnCase's own keyword search facility. It will bookmark search hits and create an Excel report with details about the search, in addition to several other features.

Lightgrep shines when you have many keywords or need more advanced grep functionality than EnCase provides. Last year we had some investigators use an early version on a case with over a million keywords, so we think it sets a new standard for scalability. Lightgrep also supports more grep features, and we try to make them behave as closely to Perl's as possible. We don't have all of Perl's functionality just yet, but we'll keep adding more grep operators over time (and we do a lot of testing to make sure we get the same hits Perl does).

Unicode support is what we've been working on most recently, and I'm pretty excited about it. Out of the gate, we'll have support for ASCII (really, Windows code page 1252), UTF-8, UTF-16LE (two-byte Unicode on Windows), UTF-16BE, and UTF-32LE and BE. We are able to look for every character in the Unicode standard, including brand new ones in Unicode version 6.1.0, like the infamous U+1F4A9. We won't support other code pages just yet, but the code is mostly developed and we simply need our test suites to catch up.

The cool thing about Lightgrep's Unicode support is that you can use Unicode properties in patterns. You can specify \p{Digit} or \p{Letter} and look for only valid digits or letters, but in any language. You can also specify different scripts, which is amazingly useful when looking for text that's not in the Latin alphabet. For example, \p{Cyrillic}+ and \p{Arabic}+ will find all words in Cyrillic (i.e., Russian) and Arabic, respectively.

If you'd like to see Lightgrep in action at CEIC, just email me, at jon@lightboxtechnologies.com, and we'll find a good time to chat. I'll have a few dozen thumb drives with a Lightgrep trial version installer, so don't be shy.