2012-11-14

Representing Encoding Chains

Here's a problem I've been mulling over the last couple days: what is a good way to represent encoding chains?

What's an encoding chain?

When we search for text, it's always in some kind of encoding, something that specifies how logical characters are represented as "physical" bytes (a discussion of the physicality of bytes is best left for happy hour). For example, "A" is represented as byte 0x41 in the ANSI or UTF-8 encodings but as the two-byte sequence 0x41 0x00 in UTF-16LE (aka "two-byte Unicode") and UCS-2 (aka "fake two-byte Unicode"; never use UCS-2).

To search for a keyword in a file, we either need to know the encoding used for the file and convert the bytes in the file to logical characters to perform the comparison to the keyword or we need to convert the characters in the keywords into their binary form using the file's encoding and then search for those bytes in the file. Most text editors do the former; Lightgrep does the latter, as we don't usually know the encodings used for files, especially when searching fragments in unallocated space and whatnot.

An encoding chain goes beyond a single encoding: it's what happens when we want to deal with data that's been subject to a series of transformations, including a normal characters-to-bytes encoding.

Examples

A useful artifact in forensics is the UserAssist registry key. These keys store recently run programs, recording the name of the program, when it was last run, and the number of times it was run. A bewildering aspect of values under the UserAssist key is that the names of the programs are ROT13 "encrypted". This means that "A" becomes "N", "B" becomes "O", and so on.

ROT13 seems like a silly form of obfuscation... until you need to search for UserAssist registry keys pertaining to a particular program. Many forensic tools will parse registry hives and some will even decrypt the UserAssist keys for you, but if you'd like to look for traces of keys in the pagefile or in unallocated, you are out of luck when it comes to tool support. You must manually apply ROT13 to the name of the program ("notepad" becomes "abgrcnq") and then apply the UTF-16LE encoding to it ("abgrcnq" becomes "a\x00b\x00g\x00r\x00c\x00n\x00q\x00"). Tedious, no?

Another example (featuring more love for Didier Stevens) is when looking for obfuscated strings in malware. XORSearch is a program Messrs. Stevens recently released which lets you search for strings that have been obfuscated with XOR, ROR (rotate-right), or ROL (rotate-left) transformations. These types of transforms are often used by malware authors to obfuscate strings like URLs.

When searching, it would be very nice to say "look for 'notepad' but in ROT13'd UTF-16LE" or "look for 'http://sirhaxal0t.ru/' in UTF-8 that's been XOR'ed". The permutation of "ROT13, then UTF-16LE" or "UTF-8, then XOR" is what I mean when I say "encoding chain." We want to apply a chain of encodings to our search terms.

Characters v. Bytes

ROT13 is a transformation that applies to characters, with "A" becoming "N" and so on. In contrast, XOR is a transformation that applies to bytes. Character transforms apply to the string first before it's encoded into bytes, whereas byte transforms apply to the bytes after the encoding has happened. Here's a table that clearly delineates transformations and encodings:

Character TransformsEncodingsByte Transforms
ROT13, URL encoding, HTML entities ASCII, UTF-8, UTF-16LE, EBCDIC, Shift-JIS, ISO-8859-5, etc. XOR, ROR, ROL, Outlook Compressible Encryption, Base64

You may have noticed that there are a few constraints to the encoding chains we'd like to apply. First, there's an ordering constraint in that character transforms must come first, followed by the encoding, and finally by byte transforms. An encoding chain like "XOR -> ROT13 -> UTF-8" is invalid and nonsensical. Second, while it's possible to have multiple character transforms or byte transforms in a chain, there can be only one encoding. An encoding changes the type of thing we're working with from characters (ideally Unicode codepoints) to bytes, and that type change can only occur once.

The Ultimate Problem

We've been working on adding support for Outlook Compressible Encryption to Lightgrep, and that's necessitated thinking about encoding chains and how to expose the abstraction. At this point the main technical work is done and we are arguing about how to represent the encoding chains.

Whither:

"outlook(base64(utf8(rot13())))"

or

"ROT13 | UTF-8 | Base64 | Outlook"

or

"ROT13 > UTF-8 > Base64 > Outlook"

?

We'd like to have a good notation for specifying these chains. One interesting thing to note is that some transformations may want parameters. For example, ROR and ROL need to know how many bits of shift to apply. It would be silly to have ROR1, ROR2, ROR3, ROR4, ROR5, ROR6, and ROR7, when we could just have "ROR(2)", for example. With parameters, we could come up with a generic rotate character transformation, and then ROT13 would be "ROT(13)". (If you were investigating a classically trained hacker, you may want to look for text obfuscated by the Caesar cipher, i.e., ROT(3).)

So...

...which representation above do you prefer? Or can you come up with something better?

* Addendum *

You can expect Outlook Compressible Encryption support in Lightgrep fairly soon, but it will be some time before you see base64. Base64 is not so much a byte transformation as it is a bit transformation, taking 6 bits of data and turning them into 8. Lightgrep needs some work in order to work well with bits.