Random Sampling in Forensics

(NB: Some links are to PDFs.)

I had the pleasure of attending the Digital Forensics Research Workshop this past week. DFRWS is different from the other forensics conferences I've attended, in that speakers must submit formal academic papers for review by a committee. Fewer than one-third of submitted papers make the cut.

This is how most academic conferences work, and there are a few consequences stemming from the emphasis on peer-reviewed papers. One is that the quality of the material is generally high. This is good. Another is that many leading practitioners—who, as forensics generally demands, also double as researchers—don't present, since writing a formal academic paper is an enormous PITA. This is less good.

I enjoyed a number of presentations at DFRWS this year (chief among them was Ralf Brown's about language identification). The one that stood out for me, though, was a blend of theory and practice. This was The Use of Random Sampling in Investigations Involving Child Abuse Material, presented by Mike Wilkinson and co-authored by Brian Jones and Syd Pleno.

What Mike, Brian, and Syd did is, frankly, revolutionary. They established procedures at the New South Wales police force's State Electronic Evidence Branch so that child pornography possession cases make use of statistical sampling early in the analysis of the case. By using sampling, they have cut their case backlog from 3+ months to zero.

By using an EnScript that selects a random sample of images from suspects' hard drives, investigators can estimate the total amount of child pornography on the system with a high degree of confidence. A report based on the sampling review is created and, while situations no doubt vary, the report is typically presented to the suspect along with a plea bargain... and most take the plea. Such cases take only a few hours of processing.
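The paper's EnScript itself isn't reproduced here, but the underlying statistics are standard: draw a simple random sample of files, classify each one, and use the sample proportion to estimate the total with a confidence interval. Here's a minimal sketch in Python (the function name, the toy "drive," and the normal-approximation interval are my own illustration, not the authors' implementation):

```python
import math
import random

def estimate_contraband(files, sample_size, classify, z=1.96):
    """Estimate how many files on a drive are contraband by classifying
    a simple random sample. Uses a normal-approximation (Wald) 95%
    confidence interval; `classify` returns True for a contraband file."""
    sample = random.sample(files, min(sample_size, len(files)))
    hits = sum(1 for f in sample if classify(f))
    p = hits / len(sample)
    margin = z * math.sqrt(p * (1 - p) / len(sample))
    total = len(files)
    return {
        "proportion": p,
        "estimated_total": round(p * total),
        "ci_95": (max(0.0, p - margin) * total, min(1.0, p + margin) * total),
    }

# Toy demo: a synthetic "drive" of 10,000 files, 20% of them flagged.
random.seed(1)
drive = [i < 2000 for i in range(10_000)]
random.shuffle(drive)
result = estimate_contraband(drive, sample_size=400, classify=lambda f: f)
```

With a sample of only 400 files out of 10,000, the estimate lands close to the true count of 2,000, which is the whole point: a few hundred classifications in place of a full manual review.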

Of course, such a methodology is only appropriate for possession cases, not distribution or where active child abuse is suspected. Digital investigators benefit because they don't have to sit for hours reviewing thousands (or millions) of corrosive images, and they have more time for other cases. Other investigators benefit because their computer evidence gets turned around faster. Even the falsely accused benefit, because they are quickly exonerated.

Part of why the procedure works is that it's well adapted to the law in New South Wales, Australia. However, it is hard to believe that, with care and consideration, the basic idea could not be applied in the United States.

I encourage you to read their paper and review their presentation. Kudos to Brian, Syd, and Mike, and to the New South Wales police for adapting to the increasing strain on digital investigators with an elegant and effective solution.


  1. Hey Jon, thanks for the write-up; revolutionary indeed! Brian is the one who did all the EnScripting, and many of his scripts can be found in the Guidance forums here: https://support.guidancesoftware.com/forum/downloads.php?do=cat&id=185. If people are interested in his sampling script, he can be contacted through the Guidance forum, or contact me and I will pass the request along.

    To my mind, with the vast volumes we are now dealing with, we need to move away from the model of imaging everything and start being far more targeted in what we collect and analyze. In order to do this effectively, we need to develop techniques that will allow us to quickly identify storage devices containing evidence and eliminate those that do not.

    1. Correction: I should have said that both Brian and Syd did the coding. At the time this system was being developed they were our R&D team, until Syd moved on to the AFP.

  2. In general, I disagree on targeted collections for forensics (targeted collections for eDiscovery are a no-brainer), for several reasons:

    * The pain investigators feel from large evidence files generally stems more from outmoded imaging practices (raw DD images vs. fast-compressed EWF), outmoded hardware (USB 2), or outmoded processing tools than from the inherent difficulty of imaging large drives. Folks need to embrace fast compression, get better hardware, and, especially, get better tools. Multicore processors have been around for nearly a decade, but a number of processing tools sadly make you forget that.

    * Unallocated space holds so much awesome stuff! I know there are legal challenges in Australia with ICAC cases, but lots of other cases can be broken with artifacts in unallocated space.

    * The more data the computer has to work with, the _less_ the investigator should have to review. This is counterintuitive and I know it isn't true right now, but someday it will be. With more evidence, software should be able to build more sophisticated models of ham vs spam, and highlight only the most relevant artifacts for investigators.

    One of the main reasons the statistical sampling method works is that there is so much contraband material in ICAC cases. In other cases, though, the relevant artifacts tend to be much rarer. To have a chance of finding them, we'll need to have as much data as possible for computers to take into consideration.
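    That last point can be made concrete with a little sampling math. If relevant items occur at some prevalence, the smallest sample that gives you a decent chance of seeing even one grows roughly as the inverse of that prevalence. A quick back-of-the-envelope sketch (my own illustration, using a simple binomial approximation that treats each sampled file as independent):

    ```python
    import math

    def sample_size_for_detection(prevalence, confidence=0.95):
        """Smallest simple-random-sample size n such that the probability
        of seeing at least one relevant item is >= confidence, assuming
        independent draws at the given prevalence (binomial approximation).
        Solves 1 - (1 - prevalence)**n >= confidence for n."""
        return math.ceil(math.log(1 - confidence) / math.log(1 - prevalence))

    # An ICAC-style case where 20% of files are contraband: a tiny
    # sample suffices. A case with one relevant file in 100,000:
    # the required sample approaches the size of the drive itself.
    n_icac = sample_size_for_detection(0.2)
    n_rare = sample_size_for_detection(0.00001)
    ```

    This is why sampling shines in contraband-rich possession cases but can't replace comprehensive collection when the needle-to-haystack ratio is tiny.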