Codeslack

2023-09-20

Running Rules — A Rough Ranking

A caveat: this post is more brain dump than research. I would not be surprised if multiple academic papers have laid out competing hierarchies and frameworks for classifying rules engines. However, doing research is hard. It takes me days of searching to find good papers, and then it takes me weeks of re-reading to grok them. So what follows is entirely NIH.

Problem definition

Say you have a big pile of data. There's a file, it has some records in it. The records are serialized in some format, like json, protobufs, pcaps, or, more ridiculously, binary xml records inside of a chunked binary file format (how I miss LfLe!). Maybe there's some compression involved. And what your boss/department/company/industry would like to know is whether individual records in that data set match any number of rules in a set and, if so, which ones.

Note how uncomplicated of a task this seems. We don't need to sum any of these fields across records, compute statistics on them, join them with lots of other data sets (well, we do and will; cyber calls this phase "enrichment" but it happens much later), or such. Our problem is, for each of n records, which of our m rules matches on the record?

(Note also how this is actually a join, but since we're so often processing complex records in complicated formats as it lies and using rules with a rich and complicated set of predicates, refining the computation to the point where it can be performed as a join is damn near impossible.)

Rules

Ok, records, sure, records are easy. Tuples of fields, maybe there's some typing, maybe the fields have names, maybe there are complex types, but, still, we all understand these records: they're logs, CSV, JSON, maybe protobufs/Avro/Thrift or pcaps. What are the rules?

A rule could be very simple, e.g., "Value = 5". A simple evaluation rule like this can be called a predicate; it's performing a single evaluation against a single field in the record and the result of the predicate is either true or false. Predicates can be joined together into complex rules with Boolean logic. The operators in the predicates will likely permit equality checks and comparisons. Often there will be functions (which are also just operators without fancy syntax) like "contains", "in", "like", "matches", which help with strings and sets. Perhaps if you're unlucky, you'll be working in an extensible rules language where a predicate can involve parsing kilo/megabytes of raw data to compute some fields on the fly. Then you'll tire of such complexity, and the accompanying slowdown in runtime and development time that comes with it, and you'll adopt the hot new rules language that seems simpler and friendlier, the kind assembled by field practitioners who can't be bothered with knowing that describing computer language syntax was solved 64 years ago.

Rules evaluate different fields in the records, often comparing them against literal values. It may be possible for rules to compare different fields in the same record, e.g., "record.created_timestamp < record.lastmodified_timestamp", though this is rare. What we will not consider is the ability for rules to compare/evaluate multiple/different records with possible aggregations. That's what SQL is for.

Implementation types, ranked

So, how do we write a program to run the rules against our data? What follows is a ranking of implementation approaches I can envision, from least optimized to most. This ranking primarily considers throughput, i.e., how many records can evaluated against a given set of rules in a measure of time.

Ad hoc

Rules seem hard and abstract. Unrestricted Boolean logic? Comparisons? Code Monkey think this is hard! (Code Monkey output not functional or elegant.)

If rules can be restricted to simpler and specialized forms, then an ad hoc rules engine may only need to consider hard-coded logic for looking up hash values in a hash set, parsing PE file headers, or conducting string searches. Perhaps a few different types of rules can express a small number of commonly desired Boolean forms, so a rule could simply OR constituent ad hoc evaluation functions, or maybe AND then, or negate them. I did this with EnCase eDiscovery/EnCase Cybersecurity, where the user could choose from three different Boolean forms for combining logic on filesystem metadata, file content search terms, and file hashes. Those three options handled all the practical use cases.

Parse trees

Ad hoc engines are pretty limited, however, and parsers aren't that hard, once you know how to write one. So, if you can parse formal rules, with Boolean logic, then you can build a parse tree data structure. And if you can build a parse tree, then you can execute it according to the types of nodes in the parse tree. The engine will generally involve a stack data structure and one or more switch statements for executing what to do for each type of node in the parse tree (e.g., an OR operator). Sometimes the parse tree may be turned into an abstract form, with enums and objects; other times it'll just be character literals that are being compared directly.

Performance will vary amongst implementations, and especially depend on the implementation language/platform. An engine written in Python will be a lot slower than one written in Rust or C. Still, throughput may only be 10MB/s or less, and will degrade considerably as the rules add up — each rule will need to be parsed and evaluated, and there's usually not any work done to consolidate logic or even just useful setup work between rules. If parsing is not a separate step from evaluation, then the parsing logic itself will need to be re-evaluated with each record.

A possible way out of writing a proper parser and evaluation engine is to recast the rules into some other language that can be embedded and executed itself; I also did this in EnCase eDiscovery/EnCase Cybersecurity where the EnScript engine itself was exposed as an API to EnScript, and then I could just generate EnScript snippets on the fly as EnCase Filters and execute that code against objects. And when I first started thinking about evaluating rules, in 2005, I realized that if the industry was smart it would simply write its rules in Prolog and life would be easy.

Did the industry adopt Prolog?-

(That was a Prolog joke.)

Instead the industry first adopted OpenIOC (yay xml) and then whatever-the-hell YARA is. And now there's sigma, which is even less well-defined than these two predecessors.

Examples: Hayabusa, chainsaw (a bit more elegant)

Virtual machines

If you can generate a parse tree and you refactor your code enough, you'll often find you have a loop that pops nodes off a stack, evaluates those nodes, and then pushes them back onto the stack until the stack contains only one element and there's nothing left in the rule to parse: that one element then is the result.

If you have that code, then it is not a big leap to represent those nodes as invented byte code, or p-code, and then to execute that byte code. As it happens, this leap can make the evaluation engine astonishingly straightforward — there's a loop around decoding the current instruction where the opcode's evaluated by a switch statement and the cases perform the operation. The complexity of direct parse tree evaluation can be spread out across a separate compilation step, which transforms the parse tree into the byte code.

Generally these virtual machines fall into two classes: stack-based and stackless. Stack-based virtual machines are typically simple and low-level, where the operations are extremely simple and evaluation follows the parse tree representation. Stackless VMs may use a byte code representation that is higher-level and domain specific, with instructions that correspond more directly to complicated functions (e.g., "contains" resulting in a string find). This is not a precise distinction, of course; the precise distinction is that stackless VMs lack a stack... but lacking a stack often results in greater complexity being put into the operations.

Virtual machines have some nice properties that bring performance benefits. Decoding and evaluation results in simple code, and optimizing compilers love simple code, which they can tease out into sequences of instructions that pipelined CPUs can evaluate efficiently. There's often far less branching in the resulting code, so there are fewer stalls. And critically, the byte code tends to be compact and stored altogether in memory, resulting in excellent L1 cache hit characteristics.

With a separate compilation step, the virtual machine approach can also be amenable to optimization passes. For example, across a set of rules there may be duplicate predicates. A rule engine that understands all the rules at once could deduplicate evaluation of these predicates. It could also reorder evaluation of predicates to try to prove rules don't match (the common case) with early exits.

Lightgrep uses the virtual machine, in a stackless complicated way. It's based in no small part on Russ Cox's excellent blog post Regular Expression Matching: the Virtual Machine Approach. Whole programming languages can be implemented with virtual machines, too, like Python. Another notable example that's generated considerable work and blog posts over the years is Lua. One could do a lot worse than embedding Lua and calling it a day.

Other examples: jq, PCRE

JITs

If you're parsing rules and compiling them to byte code, well, why not compile them to machine code instead and make it the CPU's problem? This is called a JIT, for just-in-time.

Do you like speed? JITs can be blazing fast. They lower evaluation to the level of the machine, where there is no engine, i.e., getting your crappy Code Monkey software out of the way of the CPU is probably the best way to go fast.

But JITs also have a cost — welcome to tradeoffs! First, rules tend to be fairly compact, a few dozen characters. Those rules can unroll to hundreds or thousands of machine code instructions. Invented virtual machine byte code can still be compact, but machine code deals with registers, raw arithmetic, and loads/stores. So, typically, JITs will need more memory to represent the rules, and this can turn into a problem when you have a lot of rules.

Additionally, compiling the rules to machine code can take time, so there are larger startup costs. This is especially the case if your JIT has optimization passes. This often leads to faster evaluation, but could wind up being a false economy, where a simpler engine would get through processing all the data before a JIT even starts running.

What's more, an off-the-shelf engine with optimizers can itself be dozens of megabytes in size. You're going to embed LLVM into your tool? Good luck fitting that on a floppy and/or downloading to point-of-sale endpoints running on a cheap Comcast link.

Portability also becomes a concern. If you prudently decide not to embed LLVM, now you need to find a JIT that's portable if you want to support more than x86. Good luck! I'm sure there are solutions, but I don't know of a clear answer off the top of my head.

JITs also have security concerns. A normal executable gets compiled and it runs. Modern operating systems will treat its code and data separately, with the code protected by the virtual memory manager as read-only and the data treated as non-executable. With JITs, data and code are the same thing! The JIT will need to write out machine code to memory and then execute it directly; this requires some low-level operating system trickery.

Examples: PCREjit, luajit, redgrep, Ken Thompson's original regular expression search engine for IBM 7049

SIMDized

Modern CPUs have a lot of capability. Pipelined CPUs became common forty years ago. Thirty years ago CPUs went super-scalar. Twenty years ago CPUs went multi-core.

A rule engine that uses a JIT, especially a simple one without optimization passes, will struggle with pipelining as it's likely to embed many branches that mispredict with some regularity (a 90% branch prediction rate is still pretty bad). It will likely not execute with IPC rates > 1.0 (unless you're running on one of Apples magic M chips, that put jillions of transistors towards a monstrous ROB). And for sure you will not get multiple cores utilized, as multithreading is always an exercise left to the user (and rightfully so).

These three limitations need to be addressed in a rules engine if you want to fly as fast as the hardware can go. Good luck! I don't have good answers for you other than lots and lots of hard work.

And even if you are able to write a fantastic rules engine that JITs out code with high branch prediction and IPC rates, and uses threads smartly to process [batches of, you'll need to amortize your locking to avoid stalling] records in parallel, you still won't be using your CPU to its fullest potential.

18 years ago Intel released SSE3, and SSE3 quietly revolutionized SIMD optimization. It took awhile for folks to figure out the magic new pshufb instruction, but each new SIMD ISA since has added lots of capability (though ARM's NEON ISA is more comparable to SSE 4.2 or AVX, with 128 bit registers while just about everyone with an Intel CPU has at least 256 bit registers and many have 512 bit registers). Contemporary CPUs can now often retire multiple SIMD instructions per cycle, if they're scheduled correctly.

Predicates in rules evaluate to true and false. Rules combine predicates with Boolean logic. If you're thinking what I'm thinking, then finding ways to represent predicates and rules in SIMD bit vectors is likely to provide a lot of benefit.

However, it's easy to forget the MD in SIMD. A good SIMD engine will likely only gain high throughput by biting off a chunk of records, decomposing them into their fields, and evaluating those fields with SIMD operations in parallel, then, many instructions later, taking a look at the result and separating matching records.

One major trouble with adopting SIMD is portability. Even amongst x86 CPUs, there are multiple targets ISAs (AVX-512 is a family [and a Tolstoy-esque unhappy one at that] of differing ISAs, then there's AVX-2, AVX, SSE 4.2 for the Sandy Bridge diehards like me...), and different CPU microarchitectures may have very different performance characteristics for the instructions in those ISAs. And then there's ARM NEON. Oy.

Another major trouble with using SIMD operations in a rules engine is that the predicates in a modern cybersecurity ruleset are many, but most evaluate to false a great deal of the time. So a bit vector representation of all the predicates across all the rules will still be large (thousands of bits), much too large for a single register, and most of the bits will be unset. So, there's natural sparsity to cope with — maybe a bit vector representation isn't worth it? Another major trouble will be the dependencies between predicates in the rules; how do you benefit from early exit with a SIMD approach?

The only answer when it comes to SIMD optimization is empirical bloodymindedness.

Examples: none that I know of, naïve SIMDization is an uncanny valley of death

Multi-algorithm SIMD with ad hoc components

Bloodymindedness it is then. Relying on simple SIMD operations to evaluate sparse bit vectors with thousands of bits could lose out to a normal JIT engine, maybe even a VM. The way to apply SIMD is to look for amenable types of predicates and rules and target those with specialized SIMD algorithms.

Here Hyperscan reigns supreme and alone, and with its integration into Suricata pcap scanning has benefited from it for a number of years. Noseyparker also integrates Hyperscan for secrets scanning. But Hyperscan, as an Intel project, is unsurprisingly Intel-only. There's been an attempt or two to port Hyperscan to ARM, but Hyperscan's development itself seems tenuous and the porting attempts can be characterized as "academic" in comparison.

DuckDB is also worthy of mention here. As a columnar database, it is a good fit for SIMD operations, and it benefits greatly from them. Any Python programmer considering writing a rules engine should do everyone a favor and instead focus on ingesting their data into DuckDB. (I would also extend this recommendation to many C programmers who've written popular engines, as well as to those that naïvely bundle such engines into larger scanners.)

Further Considerations

Record parsing

Rules don't exist in a vacuum. After all, what we care about are the records. And records are inevitably represented in their own form (CSV, json, in-memory data structures, etc.). Dealing with those records and getting their necessary fields out to where they can be evaluated by predicates in our ruleset is not free. In fact, it may dominate the runtime, perhaps by orders of magnitude, especially if we're not careful with how the data's encoded (ugh cybox) and which libraries are used (🤡).

An interesting approach is simdjson. Simdjson uses SIMD operations in parsing json records in parallel, but it also embeds some [very] rudimentary rules capability with JSON Pointer support. That allows it to evaluate json records without first doing a full parse.

Avoiding full parsing has of course been a key aspect in regular expression searching. Fast regular expressions don't parse text into lines, parse lines into Unicode codepoints, and run codepoints through their Unicode-aware regexp engines; no, fast regular expression engines look for byte literals in megabyte-sized blocks of bytes and only when there are hits—and some refuse to believe there are hits without repeated evidence—do they bother to look for line breaks and consider the full regular expression(s).

Heavyweight processing components

Additionally, some predicates may be call-outs to entire subcomponents doing complicated things (looking at you, YARA). Searching strings or files, hashing, parsing parts of file formats, these are operations that go well beyond evaluating records of tuples/dicts in csv or json. The rules engine must call out to these subcomponents, and a considerable portion of runtime will be spent in those components and not in the evaluation of rules.

Predicate cost weighting

To repeat, runtime may be dominated in components evaluating complicated predicates completely outside the normal ken of a rules engine. Therefore, the ideal rules engine will have some notion of which types of predicates are costlier than others, and the engine will then analyze the rules and the predicate costs to evaluate the cheapest predicates first that permits early-exit.

For an example of a trick I've used for decades, though not in a general purpose rules engine, consider cryptographic hashing for file matching; if I know the file sizes of the set of [contraband/IP/malware] files I'm trying to match against then I can first check to see whether the input file has a matching file size; only if it does will the input file need to be read and hashed. This will get you several orders of magnitude of speedup when scanning disk images or live endpoints, so long as you have all the original file sizes and their cardinality is not too great.

A Dismal Conclusion

Do I have a conclusion? No, not really. Rule engines exist, but few are separate from their rule language, with tight coupling that consigns them to the same fate as that of the rule language and the commercial entities that sponsor them. Most of the best engines tap out with a virtual machine approach, avoiding the hard engineering involved in JITs and SIMD, but also missing out on the 10+ GB/s of throughput that's possible with contemporary CPUs and DRAM. A few crazy people have made it to that promised land but for specialized domains. And going crazy is meaningless if you don't consider the input format of the records you want to process.

Developing and maintaining an ideal rules engine is a hard problem, requiring genius, expertise, bloodymindedness, and the long-term focus/dedication (with concomitant financing) to scrap existing code and build new implementations from scratch in response to new hardware capabilities. Am I working on one? No. Would I like to? Sort of. Will I? Maybe, but likely without success, because genius, expertise, bloodymindedness, and long-term focus are a lot to ask, and I am lazy, tired, and distracted.

So, continue along with your YARA and SIGMA rules, and their 386-era throughput.

2022-10-30

Learning Timely & Differential Dataflow, Part I of K

Differential Dataflow is both a Rust library and a dataflow computation paradigm. This blogpost is about the former.

Except... it's not?

Differential Dataflow seems to be an advanced construct built on top of Timely Dataflow, which is also both a Rust library and a dataflow computation paradigm. The rough idea is that Differential Dataflow can accept new input items after having processed a bunch of data and update derived result sets/output with a minimum of computation; it does this by paying attention to changes/deltas/differentials between nodes in a dataflow graph. (NB: I do not understand it beyond this rough idea; some day I'll get around to reading the paper, but there are programs to write.)

In order to compute differences in a dataflow graph, though, one first needs a dataflow graph system, and that's what Timely Dataflow provides.

So, here's how to learn Timely.

Step 1: Do not read the Differential Dataflow docs.

(This may seem alternately obvious and counterintuitive.) They're less than fantastic at distinguishing what Differential Dataflow provides above and beyond Timely, yet any use of the differential dataflow crate will have one quickly using the timely dataflow API, and likely more heavily than the former.

Step 2: Read the Timely Dataflow docs.

Timely Dataflow online book (pamphlet)
Blog posts — An Introduction to Timely Dataflow

https://docs.rs/timely/latest/timely/

Step 3: Investigate magic in examples

wtf does timely::execute_from_args do?
inspect — this seems more as a debugging aid? or does it force computation?
is aggregate just reduce by a different name? is it more? is it less?

Hash is presumably used for grouping values from the different keys together, but read the source to confirm this is correct, kind of akin to Hadoop's RawComparator.
But is Hash collision resistant? Does it just use for rough partitioning and distinguish between different keys? Or do collisions (on a 64 bit int) signify equality and good luck working with strings and such?

Step 4: Write some toy programs.

Many (most?) of the example programs in the docs have to do with iterative graph computations. I can see how they're the motivating class of problems for Timely, as older systems fall down here (ahhhh, remember the days of cascading MapReduce jobs?), but it would be nice to have simpler examples with clearer analogues to other systems (by which I mean, WordCount).

In particular, Timely and Differential Dataflow are Rust libraries, so they are a readymade data processing engine that play nicely in the Rust ecosystem. The downsides of Hadoop/Spark/Flink/et al have always been their cost of scaling down and being so firmly mired in the muck of Java, requiring enormous shell scripts to manage classpaths and the like.

Rust scales down to a single machine quite nicely, and Timely purports to do so as well. Therefore, it's nice to use for data processing even if all the advanced time/windowing features are used.

Programs to write:

WordCount
WordCount w/ Combiner (or does Rust just do this?)
Map-side join
Reduce-side join
Composite aggregation, of count and sum (how many files do we have, and what are their total size?)

Step 5: Questions

Does Timely actually support distributed processing, with multiple nodes???
How hard is it to submit patches to the docs?
How much attention is it getting now that Materialize exists and is focused on a higher-level database product?

2020-01-07

Evaluating SANS Courses

I've only taken one DFIR training course, the EnCase Advanced class back in 2003, the third week into my new job at Guidance Software at their old Green Street location. It was taught by Jessica Bair and Ric Stonesifer, and, knowing nothing about forensics or EnCase, it was a good fast-paced introduction to the field. From there, I learned by doing, maintaining Guidance's core set of artifact EnScripts and then supporting the company's professional services examiners in their casework.

I've forgotten a great deal about old artifacts since then, and have not kept up on new artifacts as much lately. And, of course, there are far more systems and approaches to forensics than existed 15 years ago—macOS, iOS, Android, memory forensics, malware analysis, etc. My dimly-recalled Windows 98 artifact knowledge no longer applies.

Live training is a good option for me because I can turn my phone and email off and simply focus for a week. So, I'm thinking about taking a SANS course this year. Which one should I take? These are the possibilities:

FOR500, Windows Forensics. I do know a lot about Windows forensics, so I'm worried I might get a little bored, but this would also be a good tune-up.
FOR 508, Advanced IR & Threat Hunting. I've assisted on a number of large-scale IR scenarios, but always in a focused, limited capacity where I've been asked to write code to do a specific thing (usually at scale). Working through the complete chain of an investigation could give me new insight into current problem areas to focus on as a developer, and especially provide me with greater intuition into the needs of IR examiners.
FOR 572, Network Forensics. I've focused almost exclusively on host-based forensics, and have always thought that network forensics is not well integrated into traditional forensics investigations (especially IR and IP theft scenarios). I did a bit of libpcap network programming in 2002, but there'd be much here that'd be new to me.
FOR 578, Cyber Threat Intelligence. I'm trying to learn much more about threat intelligence and see it as directly applicable to my job. OTOH, the course seems a bit thin; working more hands-on with threat intel data sets would be good practice, but I worry about there being too much focus on jargon (I still am skeptical that "kill chain" means anything that isn't obvious) and on high-level intelligence analysis and techniques (e.g., ACH) and less on tech. It's fine for training classes to address non-technical aspects of a subject, but I personally would get bored and then not get much out of the class.
FOR 610, Malware Analysis. Here's something I don't know much about. But would I be better off taking a more in-depth reverse-engineering course from a different organization? Spending time on little utilities like pescan and pdfid would not be useful to me; fundamental deep dives into PE and ELF, unpacking, reverse engineering, call graph analysis, and so on, definitely would be great.
FOR 498, Battlefield Forensics. I worked on bulk_extractor and a few other MEDEX projects, and a lot of this seems specific to acquiring different devices and whatnot. Probably not for me.
FOR 518, Mac and iOS Forensics. I know a few macOS artifacts and use a Mac as my daily computer, but there'd be a lot here for me to learn.
FOR 585, Smartphone Forensics. I know next to nothing about smartphone forensics. This might be inspiring, as my impression of existing mobile forensics tools has not been favorable.
FOR 526, Memory Forensics. I also know very little about memory forensics. I do have a pretty good sense of how operating systems and the memory hierarchy work. The fact that this covers multiple operating systems and associated techniques is promising, though.

Are there other courses I should look at, including outside of the SANS curricula? Training works best for me when it's very fast-paced and requires hands-on work. Otherwise, my ADD-addled brain will stop paying attention and I'll just daydream.

2019-12-31

2020 Goals

My tech skill development goals for 2020:

Build something useful in Rust, with Rust proficiency as a side-effect
Develop competence with low-level optimization

tooling: nonius, perf, flame graphs, some dtrace/strace
assembly: casual familiarity reading x86, comfort with SSE/AVX and associated libraries
hardware: learn recent microarchitectures (Haswell–) for Intel and AMD, gain empirical expertise with contemporary memory hierarchies including NVMe
libraries: add a few C++ libraries to my trusted working set beyond Boost, focused on addressing particular performance issues (e.g., replacements for std::unordered_map, B-trees, a trie, a Bloom filter lib, etc.)

Do everything with some set of Docker, CloudFormation, Vagrant, and Terraform
Apply some data science/machine learning techniques to real problems
Become dangerously inept with ReactJS and Django as a frontend/backend webapp platform

With the exception of learning how to use ReactJS with Django in order to build a simple webapp, these goals are focused on developing working expertise, building off my strengths in batch processing/backend/systems development. None of these goals involve surface-level reading about new technologies in order to gain conceptual familiarity. I've read about a lot of these technologies for years. It's high time to incorporate them into my daily development activities.

2014-05-30

GNU Autotools Grievances

Part 1, of Infinity

Adding or renaming a source file requires changing Makefile.am, which means you have to run autoreconf again, which generates a new configure script, which you then must re-run, even though nothing about your configuration has changed.
Pulled some changes, now out of date, and you want to run make clean? Have to run ./configure again.
Why, yes, my system has strlen. Thank you for checking. Again. And again. And again.

Is there a sane response to C development on a system without strlen?

NB: Folks running UNICOS don't get to vote.

Every function, header, and library you check for in autoconf results in generating a C/C++ program and compiling, all done through shell script, natch.
The GNU documentation advises against globbing all source files in a given directory for your Makefile and recommends, instead, listing them all individually. Who does this?
Autoconf is really just a bunch of shell macros, which means they introduce new shell variables. So when you see a variable referenced somewhere and you search for it to find where it's set, good luck to you.
libtool! Huh! Good God, y'all... What is it good for? Absolutely nothing!

Seriously, this is a gigantor shell script whose purpose in life, to the best of my understanding, is to supply the right GCC options for building object code for shared libraries, and then linking them together.

That is, it's an enormous shell script middleman to save you the hassle of knowing "-fPIC", "-shared", and ".so". No, really.

20 years ago (maybe even 10; maybe), the platform portability problems libtool was “designed” (a strong word for anything written in shell) to fix might have been an issue. EVERYONE USES LINUX NOW. And those who use FreeBSD still use the same tools.
IT HIDES YOUR EXECUTABLES in .folders and generates shell scripts which you think would be your executables, but aren't, and their sole job in life is to fuxor the PATH and then invoke the hidden executables. Good luck using some debugger other than gcc.

The configure script is not in any way parallelizable. Running it on a wide enough machine can actually make it the slowest part of building.
Want to make a debug build? Rerun configure.
Want to use an alternate compiler on your system? Make sure you pass CC and CXX as variables on the command-line every single time you run configure. (Props to SCons for remembering default choices in its build_variables.py file.)
Have you ever tried reading through the phantasmagoria that is the shell scripts generated by autoconf? It's seriously the inspiration behind Plato's Allegory of the Cave.

2012-12-02

System Reconfig

A year and a half ago, I bought a new system. Meet Virgil:

Virgil's main specs:

2x Xeon E5645 (2.4GHz, 6 cores, 12MB L3, hyperthreading—apiece)
the legendary EVGA SR-2 motherboard
24GB RAM
a pair of WD RE4 2TB hard drives
in a Lian Li PC-V2120 case
all kinds of I/O ports and a ho-hum-but-CUDA-capable GPU

The SR-2 motherboard is a non-standard size and will not fit in a normal ATX case. It necessitated the mondo case. At the time, the SR-2 was about the only desktop board that allowed for dual Xeons (otherwise you needed to buy a true server; blech). A year and a half after purchase, it is still far more computer than I need and I sometimes feel bad considering how much it idles. With its cores at only 2.4GHz, some laptops can beat Virgil at single-threaded tasks, but Virgil excels at parallel compiles, data processing, and the like. With Xeon fans, Virgil sounds like a red-lined Ferrari when its under load.

Operating Systems

I first started off running Ubuntu, which I mostly liked. However, the upgrade to 12.04 failed completely—Virgil could not boot, and I had no interest in trouble-shooting the failure. Additionally, I found Ubuntu's Unity interface a little tiresome. I probably appreciated it more than most Linux users, but it involved a bit more mousing around than I wanted.

A colleague is an expert Fedora user, so I installed Fedora. I figure it'd be worthwhile to learn an alternate approach and I could always piggyback off his knowledge. However, I hated Fedora. Gnome 3 is a disaster, Xfce felt like 1998 all over again, and Fedora demanded far too much sysadmin time from me. I didn't want to re-learn how to setup Samba configuration files again, as if it was still 2002; I wanted to click a checkbox and have my home folder shared so I could access it from VMs without any fuss. This was possible on Ubuntu... not so much on Fedora.

Additionally, I gradually realized that the Fedora installer had messed up my RAID. Under Ubuntu, I'd configured the two drives to act as an LVM RAID-1. When I installed Fedora, it could not recognize the partition, which was hugely annoying, but it did seem to recognize the underlying LVM RAID. I was wrong, though: it turned the RAID-1 into a spanned volume. I no longer had a redundant 2TB drive; I had a 4TB drive with twice the probability of failure as a single drive and no redeeming performance qualities.

Of course, by the time I realized this (disclosure: I am really bad at IT/sysadmin), I'd already accumulated a little over 2TB of data, thanks in large part of VirusShare. It would not be so simple to go back to the RAID-1.

Reconfig

So I bought another 2TB WD RE4 hard drive and a 256GB OCZ Vertex 4 SSD. I had two main goals: first, I wanted to install Mint as the main Linux system and I wanted to do it in such a way that I could trash it, install some other system, and be up and running again in a day or less, without too much time spent sysadmin'ing; second, I wanted a RAID for my main data, and I wanted the volume somewhat separate from /home so I could simply remount it as a subdirectory in my home directory under a new system, perhaps issue a recursive chown & chgrp, and be good to go. (Fedora and Ubuntu use different numbering schemes for UIDs, of course; it should be possible to preserve one's UID from system to system with a little bit of work, but the underlying theme of this blog post is how I am too lazy to do such things).

I first installed the SSD in the system and installed the newly released Mint 14 (with the Cinnamon UI) on a single / partition. I then backed up all of my data to an external 3TB drive, installed the 3rd WD drive, and blew my old data away.

Although the SR-2 board supports various RAID options in hardware, I decided I wanted a software RAID so I could survive a motherboard failure. Virgil has compute power to spare, and these are only 7200 RPM drives so I shouldn't suffer significant performance loss by going the software RAID route. I first constructed an unformatted partition on each of the drives, leaving a small buffer before and after in case I someday need to add another drive that differs slightly in its sector count. I followed the instructions to construct a RAID5 array from the three partitions, according the Linux RAID wiki:

mdadm --create --verbose /dev/md0 --level=5 --devices=3 /dev/sda1 /dev/sdc1 /dev/sdd1

mdadm --detail --scan >> /etc/mdadm/mdadm.conf

This created a 4TB RAID5 across the disks, with a default-selected chunk size of 512KB.

XFS

I narrowed my choices for the filesystem to EXT4 or XFS. Since I have a mix of very large files (evidence and other archives) and very small files (source code), I wanted an extent-based filesystem. Despite its advanced feature-set, reading up on BtrFS does not fill one with confidence about its reliability. Of course, one can find data loss stories about EXT4 and XFS as well. In the end I decided to go with XFS as it is an older design that's been strenuously tested on far more advanced RAID hardware, but it's still under active development and maintenance. This is an intuitive choice more than anything.

So I issued the command:

mkfs -t xfs /dev/md0

and immediately was presented with an error that the 512KB chunk size was too large for XFS. It seems that the default chunk size of 512KB is a recent change in the mdadm tools, likely geared towards EXT4. However, the maximum chunk size that XFS can handle is 256KB and XFS developers still recommend a 32KB chunk size in the general case. So, I deleted the RAID and recreated it:

mdadm --create --verbose /dev/md0 --level=5 --devices=3 /dev/sda1 /dev/sdc1 /dev/sdd1

mdadm --detail --scan >> /etc/mdadm/mdadm.conf

mkfs -t xfs /dev/md0

mount /dev/md0 /mnt/raid

mkdir /mnt/raid/jon/data

ln -s /mnt/raid/jon/data /home/jon/data

I now had my XFS RAID5 and had it easily accessible inside my home directory, while still having the overall volume available for other uses.

One thing I noticed in reading about Linux RAIDs is that many people use LVM on top of the raw RAID device and then create different filesystems on top of LVM. If one were managing a system with a few large RAIDs for a large number of users, I could see how LVM would be useful. However, I cannot see any utility for it when you just want a single RAID volume. There's likely no performance loss to using LVM, but it adds another layer of complexity... and I am too lazy for unnecessary complexity.

The Disposable System

The RAID solves my data problem, but what about being able to get up and running with another system in the future? This enters into the realm of configuration management. There are two new configuration management systems that are taking the open source world by storm, Puppet and Chef. Both are implemented in Ruby and both seem like good pieces of software. They're both geared towards sysadmins who need to manage many systems. Essentially what both do is take configuration files specifying what packages, files, user accounts, and other configuration details should apply to a given system, and then they make the system conform to that description.

The main differentiator between Puppet and Chef is that Puppet has its own syntax for its config files whereas Chef's config files are valid Ruby source code. What I want to do is simply track the packages I install over time, so that I can get them reinstalled quickly if I need to. After looking at the syntax for each of Puppet and Chef, I decided that Puppet would be slightly more simple.

So, what I have been doing is building a directory of Puppet config files, each containing a list of software packages. The config files are roughly categorized, so I have one for basic C/C++ development, another for Python, one for system tools, one for GUI applications, etc. When I find myself in need of a package, I add it to the appropriate config file, and I can then bring my system into compliance with:

sudo puppet apply config-file.pp

Puppet figures out which, if any, packages are missing and then invokes apt-get to download and install them. If I switched back to Fedora, or to FreeBSD or some other crazy system, I would likely need to remap some of these package names slightly, but at least I'm tracking them. Additionally, I will sometimes create an EC2 instance to accomplish a specific task, and having categorized Puppet config files should help me bootstrap barebones instances in short order. I track my config files with git, natch, and mirror them on Github as I do with all my other source code.

Data has mass; I now have a RAID that should be able to survive a move to a different environment without having to make copies. And by using Puppet to reduce system management into, essentially, software development, I have reduced the mass of my operating system to almost nothing. All I need to do on a new system is get it to mount my RAID, drop my ssh keys, install Puppet, and then wait while all the rest of the software I depend on is downloaded and installed automatically.

etckeeper

Speaking of Git, another useful software package for anyone running Linux is etckeeper. etckeeper is a software package that interacts with your package management system and tracks changes to /etc. If you inadvertently screw up some aspect of your system via editing of various /etc config files, you can always go back in time to restore things. In this respect, it's sort of like System Restore, except that it works.
`

2012-11-14

Representing Encoding Chains

Here's a problem I've been mulling over the last couple days: what is a good way to represent encoding chains?

What's an encoding chain?

When we search for text, it's always in some kind of encoding, something that specifies how logical characters are represented as "physical" bytes (a discussion of the physicality of bytes is best left for happy hour). For example, "A" is represented as byte 0x41 in the ANSI or UTF-8 encodings but as the two-byte sequence 0x41 0x00 in UTF-16LE (aka "two-byte Unicode") and UCS-2 (aka "fake two-byte Unicode"; never use UCS-2).

To search for a keyword in a file, we either need to know the encoding used for the file and convert the bytes in the file to logical characters to perform the comparison to the keyword or we need to convert the characters in the keywords into their binary form using the file's encoding and then search for those bytes in the file. Most text editors do the former; Lightgrep does the latter, as we don't usually know the encodings used for files, especially when searching fragments in unallocated space and whatnot.

An encoding chain goes beyond a single encoding: it's what happens when we want to deal with data that's been subject to a series of transformations, including a normal characters-to-bytes encoding.

Examples

A useful artifact in forensics is the UserAssist registry key. These keys store recently run programs, recording the name of the program, when it was last run, and the number of times it was run. A bewildering aspect of values under the UserAssist key is that the names of the programs are ROT13 "encrypted". This means that "A" becomes "N", "B" becomes "O", and so on.

ROT13 seems like a silly form of obfuscation... until you need to search for UserAssist registry keys pertaining to a particular program. Many forensic tools will parse registry hives and some will even decrypt the UserAssist keys for you, but if you'd like to look for traces of keys in the pagefile or in unallocated, you are out of luck when it comes to tool support. You must manually apply ROT13 to the name of the program ("notepad" becomes "abgrcnq") and then apply the UTF-16LE encoding to it ("abgrcnq" becomes "a\x00b\x00g\x00r\x00c\x00n\x00q\x00"). Tedious, no?

Another example (featuring more love for Didier Stevens) is when looking for obfuscated strings in malware. XORSearch is a program Messrs. Stevens recently released which lets you search for strings that have been obfuscated with XOR, ROR (rotate-right), or ROL (rotate-left) transformations. These types of transforms are often used by malware authors to obfuscate strings like URLs.

When searching, it would be very nice to say "look for 'notepad' but in ROT13'd UTF-16LE" or "look for 'http://sirhaxal0t.ru/' in UTF-8 that's been XOR'ed". The permutation of "ROT13, then UTF-16LE" or "UTF-8, then XOR" is what I mean when I say "encoding chain." We want to apply a chain of encodings to our search terms.

Characters v. Bytes

ROT13 is a transformation that applies to characters, with "A" becoming "N" and so on. In contrast, XOR is a transformation that applies to bytes. Character transforms apply to the string first before it's encoded into bytes, whereas byte transforms apply to the bytes after the encoding has happened. Here's a table that clearly delineates transformations and encodings:

Character Transforms	Encodings	Byte Transforms
ROT13, URL encoding, HTML entities	ASCII, UTF-8, UTF-16LE, EBCDIC, Shift-JIS, ISO-8859-5, etc.	XOR, ROR, ROL, Outlook Compressible Encryption, Base64

You may have noticed that there are a few constraints to the encoding chains we'd like to apply. First, there's an ordering constraint in that character transforms must come first, followed by the encoding, and finally by byte transforms. An encoding chain like "XOR -> ROT13 -> UTF-8" is invalid and nonsensical. Second, while it's possible to have multiple character transforms or byte transforms in a chain, there can be only one encoding. An encoding changes the type of thing we're working with from characters (ideally Unicode codepoints) to bytes, and that type change can only occur once.

The Ultimate Problem

We've been working on adding support for Outlook Compressible Encryption to Lightgrep, and that's necessitated thinking about encoding chains and how to expose the abstraction. At this point the main technical work is done and we are arguing about how to represent the encoding chains.

Whither:

"outlook(base64(utf8(rot13())))"

or

"ROT13 | UTF-8 | Base64 | Outlook"

or

"ROT13 > UTF-8 > Base64 > Outlook"

?

We'd like to have a good notation for specifying these chains. One interesting thing to note is that some transformations may want parameters. For example, ROR and ROL need to know how many bits of shift to apply. It would be silly to have ROR1, ROR2, ROR3, ROR4, ROR5, ROR6, and ROR7, when we could just have "ROR(2)", for example. With parameters, we could come up with a generic rotate character transformation, and then ROT13 would be "ROT(13)". (If you were investigating a classically trained hacker, you may want to look for text obfuscated by the Caesar cipher, i.e., ROT(3).)

So...

...which representation above do you prefer? Or can you come up with something better?

* Addendum *

You can expect Outlook Compressible Encryption support in Lightgrep fairly soon, but it will be some time before you see base64. Base64 is not so much a byte transformation as it is a bit transformation, taking 6 bits of data and turning them into 8. Lightgrep needs some work in order to work well with bits.