2023-09-20

Running Rules — A Rough Ranking

A caveat: this post is more brain dump than research. I would not be surprised if multiple academic papers have laid out competing hierarchies and frameworks for classifying rules engines. However, doing research is hard. It takes me days of searching to find good papers, and then it takes me weeks of re-reading to grok them. So what follows is entirely NIH.

Problem definition

Say you have a big pile of data. There's a file, it has some records in it. The records are serialized in some format, like json, protobufs, pcaps, or, more ridiculously, binary xml records inside of a chunked binary file format (how I miss LfLe!). Maybe there's some compression involved. And what your boss/department/company/industry would like to know is whether individual records in that data set match any number of rules in a set and, if so, which ones.

Note how uncomplicated a task this seems. We don't need to sum any of these fields across records, compute statistics on them, join them with lots of other data sets (well, we do and will; cyber calls this phase "enrichment" but it happens much later), or anything like that. Our problem is, for each of n records, which of our m rules matches on the record?

(Note also how this is actually a join, but since we're so often processing complex records in complicated formats as they lie, and using rules with a rich and complicated set of predicates, refining the computation to the point where it can be performed as a join is damn near impossible.)
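In code, the whole problem is a nested loop. Here's a naive Rust sketch, with made-up Record and Rule types standing in for whatever your format and rule language actually provide:

    /// Hypothetical record and rule types; the real ones come from your
    /// serialization format and rule language.
    struct Record {
        event_id: u32,
        path: String,
    }

    struct Rule {
        name: &'static str,
        predicate: fn(&Record) -> bool,
    }

    /// For each of n records, report which of m rules match. That's the whole problem.
    fn scan(records: &[Record], rules: &[Rule]) -> Vec<(usize, &'static str)> {
        let mut hits = Vec::new();
        for (i, rec) in records.iter().enumerate() {
            for rule in rules {
                if (rule.predicate)(rec) {
                    hits.push((i, rule.name));
                }
            }
        }
        hits
    }

Everything below is about making that inner loop less dumb.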

Rules

Ok, records, sure, records are easy. Tuples of fields, maybe there's some typing, maybe the fields have names, maybe there are complex types, but, still, we all understand these records: they're logs, CSV, JSON, maybe protobufs/Avro/Thrift or pcaps. What are the rules?

A rule could be very simple, e.g., "Value = 5". A simple evaluation rule like this can be called a predicate; it performs a single evaluation against a single field in the record and the result of the predicate is either true or false. Predicates can be joined together into complex rules with Boolean logic. The operators in the predicates will likely permit equality checks and comparisons. Often there will be functions (which are also just operators without fancy syntax) like "contains", "in", "like", "matches", which help with strings and sets. Perhaps, if you're unlucky, you'll be working in an extensible rules language where a predicate can involve parsing kilo/megabytes of raw data to compute some fields on the fly. Then you'll tire of such complexity, and of the slowdown in runtime and development time that comes with it, and you'll adopt the hot new rules language that seems simpler and friendlier, the kind assembled by field practitioners who can't be bothered with knowing that describing computer language syntax was solved 64 years ago.

Rules evaluate different fields in the records, often comparing them against literal values. It may be possible for rules to compare different fields in the same record, e.g., "record.created_timestamp < record.lastmodified_timestamp", though this is rare. What we will not consider is the ability for rules to compare/evaluate multiple/different records with possible aggregations. That's what SQL is for.
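To make that concrete, here's a Rust sketch of roughly what such rules reduce to — a small expression type, with invented field names and a deliberately tiny operator set:

    /// A predicate compares one field of a record against a literal
    /// (or, more rarely, against another field).
    enum Predicate {
        Eq { field: String, value: i64 },
        Lt { field: String, value: i64 },
        Contains { field: String, needle: String },
    }

    /// A rule combines predicates with Boolean logic.
    enum Rule {
        Pred(Predicate),
        And(Vec<Rule>),
        Or(Vec<Rule>),
        Not(Box<Rule>),
    }

    // e.g. Value = 5 AND NOT Name contains "tmp"
    fn example_rule() -> Rule {
        Rule::And(vec![
            Rule::Pred(Predicate::Eq { field: "Value".into(), value: 5 }),
            Rule::Not(Box::new(Rule::Pred(Predicate::Contains {
                field: "Name".into(),
                needle: "tmp".into(),
            }))),
        ])
    }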

Implementation types, ranked

So, how do we write a program to run the rules against our data? What follows is a ranking of implementation approaches I can envision, from least optimized to most. This ranking primarily considers throughput, i.e., how many records can be evaluated against a given set of rules per unit of time.

Ad hoc

Rules seem hard and abstract. Unrestricted Boolean logic? Comparisons? Code Monkey think this is hard! (Code Monkey output not functional or elegant.)

If rules can be restricted to simpler and specialized forms, then an ad hoc rules engine may only need to consider hard-coded logic for looking up hash values in a hash set, parsing PE file headers, or conducting string searches. Perhaps a few different types of rules can express a small number of commonly desired Boolean forms, so a rule could simply OR constituent ad hoc evaluation functions, or maybe AND them, or negate them. I did this with EnCase eDiscovery/EnCase Cybersecurity, where the user could choose from three different Boolean forms for combining logic on filesystem metadata, file content search terms, and file hashes. Those three options handled all the practical use cases.
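A minimal Rust sketch of what I mean, with hard-coded evaluation functions and invented thresholds (an illustration, not the EnCase logic):

    struct File {
        size: u64,
        name: String,
        hash: [u8; 32],
    }

    /// The only Boolean forms the user can pick from.
    enum Form {
        AnyMatch,  // metadata OR content OR hash
        AllMatch,  // metadata AND content AND hash
        HashOnly,  // just the hash lookup
    }

    fn metadata_hit(f: &File) -> bool { f.size > 1_000_000 }          // hard-coded
    fn content_hit(f: &File) -> bool { f.name.contains("invoice") }   // hard-coded
    fn hash_hit(f: &File, wanted: &[[u8; 32]]) -> bool { wanted.contains(&f.hash) }

    fn evaluate(form: &Form, f: &File, wanted: &[[u8; 32]]) -> bool {
        match form {
            Form::AnyMatch => metadata_hit(f) || content_hit(f) || hash_hit(f, wanted),
            Form::AllMatch => metadata_hit(f) && content_hit(f) && hash_hit(f, wanted),
            Form::HashOnly => hash_hit(f, wanted),
        }
    }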

Parse trees

Ad hoc engines are pretty limited, however, and parsers aren't that hard, once you know how to write one. So, if you can parse formal rules, with Boolean logic, then you can build a parse tree data structure. And if you can build a parse tree, then you can execute it according to the types of nodes in the parse tree. The engine will generally involve a stack data structure and one or more switch statements for executing what to do for each type of node in the parse tree (e.g., an OR operator). Sometimes the parse tree may be turned into an abstract form, with enums and objects; other times it'll just be character literals that are being compared directly.
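In its simplest form that's just a recursive walk with a match/switch over node types, the call stack doing the work of the explicit stack. A toy Rust sketch, with an invented Node type and records reduced to named integer fields:

    use std::collections::HashMap;

    /// Parse tree nodes for a toy rule language.
    enum Node {
        Literal(i64),
        Field(String),               // look the field up in the record
        Eq(Box<Node>, Box<Node>),
        And(Box<Node>, Box<Node>),
        Or(Box<Node>, Box<Node>),
        Not(Box<Node>),
    }

    /// A record is just named integer fields here.
    type Record = HashMap<String, i64>;

    /// Walk the tree for every record, re-dispatching on node type each time.
    fn eval(node: &Node, rec: &Record) -> i64 {
        match node {
            Node::Literal(v) => *v,
            Node::Field(name) => *rec.get(name).unwrap_or(&0),
            Node::Eq(a, b) => (eval(a, rec) == eval(b, rec)) as i64,
            Node::And(a, b) => ((eval(a, rec) != 0) && (eval(b, rec) != 0)) as i64,
            Node::Or(a, b) => ((eval(a, rec) != 0) || (eval(b, rec) != 0)) as i64,
            Node::Not(a) => (eval(a, rec) == 0) as i64,
        }
    }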

Performance will vary amongst implementations, and will especially depend on the implementation language/platform. An engine written in Python will be a lot slower than one written in Rust or C. Still, throughput may only be 10MB/s or less, and will degrade considerably as the rules add up — each rule will need to be parsed and evaluated, and there's usually no work done to consolidate logic between rules, or even just to share setup work. If parsing is not a separate step from evaluation, then the parsing logic itself will need to be re-run for each record.

A possible way out of writing a proper parser and evaluation engine is to recast the rules into some other language that can be embedded and executed itself; I also did this in EnCase eDiscovery/EnCase Cybersecurity where the EnScript engine itself was exposed as an API to EnScript, and then I could just generate EnScript snippets on the fly as EnCase Filters and execute that code against objects. And when I first started thinking about evaluating rules, in 2005, I realized that if the industry was smart it would simply write its rules in Prolog and life would be easy.

Did the industry adopt Prolog?- 

no

(That was a Prolog joke.)

Instead the industry first adopted OpenIOC (yay xml) and then whatever-the-hell YARA is. And now there's sigma, which is even less well-defined than these two predecessors.

Examples: Hayabusa, chainsaw (a bit more elegant)

Virtual machines

If you can generate a parse tree and you refactor your code enough, you'll often find you have a loop that pops nodes off a stack, evaluates those nodes, and pushes the results back onto the stack, until the stack contains only one element and there's nothing left of the rule to evaluate: that one element is the result.

If you have that code, then it is not a big leap to represent those nodes as invented byte code, or p-code, and then to execute that byte code. As it happens, this leap can make the evaluation engine astonishingly straightforward — there's a loop around decoding the current instruction where the opcode's evaluated by a switch statement and the cases perform the operation. The complexity of direct parse tree evaluation can be spread out across a separate compilation step, which transforms the parse tree into the byte code.
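Here's a toy Rust sketch of the idea: an invented byte code, and an evaluation engine that's nothing but a loop, a stack, and a switch:

    /// Invented byte code for a toy stack-based rule VM.
    enum Op {
        PushLit(i64),
        PushField(usize),  // field index resolved at compile time
        Eq,
        And,
        Or,
        Not,
    }

    /// The whole evaluation engine: a loop, a stack, and a switch.
    fn run(program: &[Op], fields: &[i64]) -> bool {
        let mut stack: Vec<i64> = Vec::with_capacity(16);
        for op in program {
            match op {
                Op::PushLit(v) => stack.push(*v),
                Op::PushField(i) => stack.push(fields[*i]),
                Op::Eq => {
                    let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                    stack.push((a == b) as i64);
                }
                Op::And => {
                    let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                    stack.push(((a != 0) && (b != 0)) as i64);
                }
                Op::Or => {
                    let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                    stack.push(((a != 0) || (b != 0)) as i64);
                }
                Op::Not => {
                    let a = stack.pop().unwrap();
                    stack.push((a == 0) as i64);
                }
            }
        }
        stack.pop().unwrap() != 0
    }

    // e.g. "fields[0] == 5" compiles to: [PushField(0), PushLit(5), Eq]

The compile step (not shown) just walks the parse tree and emits these ops in post-order.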

Generally these virtual machines fall into two classes: stack-based and stackless. Stack-based virtual machines are typically simple and low-level, where the operations are extremely simple and evaluation follows the parse tree representation. Stackless VMs may use a byte code representation that is higher-level and domain specific, with instructions that correspond more directly to complicated functions (e.g., "contains" resulting in a string find). This is not a precise distinction, of course; the precise distinction is that stackless VMs lack a stack... but lacking a stack often results in greater complexity being put into the operations.

Virtual machines have some nice properties that bring performance benefits. Decoding and evaluation results in simple code, and optimizing compilers love simple code, which they can tease out into sequences of instructions that pipelined CPUs can evaluate efficiently. There's often far less branching in the resulting code, so there are fewer stalls. And critically, the byte code tends to be compact and stored altogether in memory, resulting in excellent L1 cache hit characteristics.

With a separate compilation step, the virtual machine approach can also be amenable to optimization passes. For example, across a set of rules there may be duplicate predicates. A rule engine that understands all the rules at once could deduplicate evaluation of these predicates. It could also reorder evaluation of predicates to try to prove rules don't match (the common case) with early exits.
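One way that deduplication might look, sketched in Rust: a compile step that interns each distinct predicate into a shared table, with rules reduced (for illustration only) to ANDs of predicate indices, so each predicate gets evaluated at most once per record:

    use std::collections::HashMap;

    /// A predicate in some canonical, hashable form (field index, operator, literal).
    #[derive(Clone, PartialEq, Eq, Hash)]
    struct Predicate {
        field: usize,
        op: &'static str,
        literal: i64,
    }

    /// Rules reference predicates by index into a shared, deduplicated table.
    struct CompiledRules {
        predicates: Vec<Predicate>,
        rules: Vec<Vec<usize>>,  // each rule = AND of predicate indices (a simplification)
    }

    fn compile(raw_rules: &[Vec<Predicate>]) -> CompiledRules {
        let mut table: Vec<Predicate> = Vec::new();
        let mut index: HashMap<Predicate, usize> = HashMap::new();
        let mut rules = Vec::new();
        for rule in raw_rules {
            let mut ids = Vec::new();
            for p in rule {
                let id = *index.entry(p.clone()).or_insert_with(|| {
                    table.push(p.clone());
                    table.len() - 1
                });
                ids.push(id);
            }
            rules.push(ids);
        }
        CompiledRules { predicates: table, rules }
    }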

Lightgrep uses the virtual machine approach, in a stackless, complicated way. It's based in no small part on Russ Cox's excellent blog post Regular Expression Matching: the Virtual Machine Approach. Whole programming languages can be implemented with virtual machines, too, like Python. Another notable example that's generated considerable work and blog posts over the years is Lua. One could do a lot worse than embedding Lua and calling it a day.

Other examples: jq, PCRE

JITs

If you're parsing rules and compiling them to byte code, well, why not compile them to machine code instead and make it the CPU's problem? This is called a JIT, for just-in-time.

Do you like speed? JITs can be blazing fast. They lower evaluation to the level of the machine, where there is no engine, i.e., getting your crappy Code Monkey software out of the way of the CPU is probably the best way to go fast.

But JITs also have a cost — welcome to tradeoffs! First, rules tend to be fairly compact, a few dozen characters. Those rules can unroll to hundreds or thousands of machine code instructions. Invented virtual machine byte code can still be compact, but machine code deals with registers, raw arithmetic, and loads/stores. So, typically, JITs will need more memory to represent the rules, and this can turn into a problem when you have a lot of rules.

Additionally, compiling the rules to machine code can take time, so there are larger startup costs. This is especially the case if your JIT has optimization passes. Optimization often leads to faster evaluation, but it could wind up being a false economy, where a simpler engine would get through processing all the data before the JIT even starts running.

What's more, an off-the-shelf engine with optimizers can itself be dozens of megabytes in size. You're going to embed LLVM into your tool? Good luck fitting that on a floppy and/or downloading it to point-of-sale endpoints running on a cheap Comcast link.

Portability also becomes a concern. If you prudently decide not to embed LLVM, now you need to find a JIT that's portable if you want to support more than x86. Good luck! I'm sure there are solutions, but I don't know of a clear answer off the top of my head.

JITs also have security concerns. A normal executable gets compiled and it runs. Modern operating systems will treat its code and data separately, with the code protected by the virtual memory manager as read-only and the data treated as non-executable. With JITs, data and code are the same thing! The JIT will need to write out machine code to memory and then execute it directly; this requires some low-level operating system trickery.
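For the curious, here's about the smallest honest Rust sketch of that trickery I can write, assuming x86-64 on a Unix-like OS and the libc crate; the "rule" is machine code that just returns 42, because the point is the write-then-execute dance, not the compiler:

    // Real JITs do far more: W^X policies, relocation, instruction-cache
    // flushing on some architectures, etc.
    use std::{mem, ptr};

    fn main() {
        // mov eax, 42 ; ret  -- a "rule" that always returns 42
        let code: [u8; 6] = [0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3];

        unsafe {
            // Allocate a writable page...
            let page = libc::mmap(
                ptr::null_mut(),
                4096,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                -1,
                0,
            );
            assert_ne!(page, libc::MAP_FAILED);

            // ...copy the generated code in...
            ptr::copy_nonoverlapping(code.as_ptr(), page as *mut u8, code.len());

            // ...flip it to executable (the low-level operating system trickery)...
            assert_eq!(libc::mprotect(page, 4096, libc::PROT_READ | libc::PROT_EXEC), 0);

            // ...and call it like a function.
            let f: extern "C" fn() -> i32 = mem::transmute(page);
            println!("JITted rule says: {}", f());

            libc::munmap(page, 4096);
        }
    }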

Examples: PCRE's JIT, LuaJIT, redgrep, Ken Thompson's original regular expression search engine for the IBM 7094

SIMDized

Modern CPUs have a lot of capability. Pipelined CPUs became common forty years ago. Thirty years ago CPUs went super-scalar. Twenty years ago CPUs went multi-core.

A rule engine that uses a JIT, especially a simple one without optimization passes, will struggle with pipelining, as it's likely to embed many branches that mispredict with some regularity (a 90% branch prediction rate is still pretty bad). It will likely not execute with IPC rates > 1.0 (unless you're running on one of Apple's magic M chips, which put jillions of transistors towards a monstrous ROB). And for sure you will not get multiple cores utilized, as multithreading is always an exercise left to the user (and rightfully so).

These three limitations need to be addressed in a rules engine if you want to fly as fast as the hardware can go. Good luck! I don't have good answers for you other than lots and lots of hard work.

And even if you are able to write a fantastic rules engine that JITs out code with high branch prediction and IPC rates, and uses threads smartly to process [batches of, you'll need to amortize your locking to avoid stalling] records in parallel, you still won't be using your CPU to its fullest potential.
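For what it's worth, the batching part is the easy bit — a Rust sketch using the rayon crate (an assumption; any thread pool will do), where matches are collected per chunk instead of contending on a shared lock per record:

    use rayon::prelude::*;

    struct Record {
        value: i64,
    }

    fn matches(rule_id: usize, rec: &Record) -> bool {
        // stand-in for the real rule evaluation
        rec.value % (rule_id as i64 + 2) == 0
    }

    fn scan_parallel(records: &[Record], num_rules: usize) -> Vec<(usize, usize)> {
        records
            .par_chunks(1024) // one batch per task, to amortize synchronization
            .enumerate()
            .flat_map_iter(|(chunk_idx, chunk)| {
                let base = chunk_idx * 1024;
                let mut hits = Vec::new();
                for (i, rec) in chunk.iter().enumerate() {
                    for rule_id in 0..num_rules {
                        if matches(rule_id, rec) {
                            hits.push((base + i, rule_id));
                        }
                    }
                }
                hits
            })
            .collect()
    }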

Seventeen years ago Intel released SSSE3, and SSSE3 quietly revolutionized SIMD optimization. It took a while for folks to figure out the magic new pshufb instruction, but each new SIMD ISA since has added lots of capability (though ARM's NEON ISA is more comparable to SSE 4.2 or AVX, with 128-bit registers, while just about everyone with an Intel CPU has at least 256-bit registers and many have 512-bit registers). Contemporary CPUs can now often retire multiple SIMD instructions per cycle, if they're scheduled correctly.

Predicates in rules evaluate to true and false. Rules combine predicates with Boolean logic. If you're thinking what I'm thinking, then finding ways to represent predicates and rules in SIMD bit vectors is likely to provide a lot of benefit.

However, it's easy to forget the MD in SIMD. A good SIMD engine will likely only gain high throughput by biting off a chunk of records, decomposing them into their fields, and evaluating those fields with SIMD operations in parallel, then, many instructions later, taking a look at the result and separating matching records.
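A scalar stand-in for the idea, in Rust: one bit per record across a batch of 64, with u64 playing the part of a SIMD register and the rules' Boolean logic becoming plain bitwise ops over predicate masks (the predicates are invented, obviously):

    // A real engine would use intrinsics and wider batches; the shape is the same.
    const BATCH: usize = 64;

    struct Record {
        event_id: u32,
        size: u64,
    }

    /// One mask per predicate: bit i is set if the predicate holds for record i.
    fn predicate_masks(batch: &[Record]) -> (u64, u64) {
        let mut eq_4624 = 0u64;
        let mut big_file = 0u64;
        for (i, rec) in batch.iter().enumerate().take(BATCH) {
            eq_4624 |= ((rec.event_id == 4624) as u64) << i;
            big_file |= ((rec.size > 10_000_000) as u64) << i;
        }
        (eq_4624, big_file)
    }

    /// Rules become pure bitwise ops over the predicate masks.
    fn rule_masks(batch: &[Record]) -> [u64; 2] {
        let (eq_4624, big_file) = predicate_masks(batch);
        [
            eq_4624 & big_file,  // rule 0: both predicates
            eq_4624 | big_file,  // rule 1: either predicate
        ]
    }

Sixty-four records and two predicates is laughably small; the point is just that the Boolean combination step becomes branch-free bitwise arithmetic.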

One major trouble with adopting SIMD is portability. Even amongst x86 CPUs, there are multiple target ISAs (AVX-512 is a family [and a Tolstoy-esque unhappy one at that] of differing ISAs, then there's AVX2, AVX, SSE 4.2 for the Sandy Bridge diehards like me...), and different CPU microarchitectures may have very different performance characteristics for the instructions in those ISAs. And then there's ARM NEON. Oy.

Another major trouble with using SIMD operations in a rules engine is that the predicates in a modern cybersecurity ruleset are many, but most evaluate to false a great deal of the time. So a bit vector representation of all the predicates across all the rules will still be large (thousands of bits), much too large for a single register, and most of the bits will be unset. So, there's natural sparsity to cope with — maybe a bit vector representation isn't worth it? Another major trouble will be the dependencies between predicates in the rules; how do you benefit from early exit with a SIMD approach?

The only answer when it comes to SIMD optimization is empirical bloodymindedness.

Examples: none that I know of, naïve SIMDization is an uncanny valley of death

Multi-algorithm SIMD with ad hoc components

Bloodymindedness it is then. Relying on simple SIMD operations to evaluate sparse bit vectors with thousands of bits could lose out to a normal JIT engine, maybe even a VM. The way to apply SIMD is to look for amenable types of predicates and rules and target those with specialized SIMD algorithms.

Here Hyperscan reigns supreme and alone, and with its integration into Suricata, pcap scanning has benefited from it for a number of years. Noseyparker also integrates Hyperscan for secrets scanning. But Hyperscan, as an Intel project, is unsurprisingly Intel-only. There's been an attempt or two to port Hyperscan to ARM, but Hyperscan's development itself seems tenuous and the porting attempts can be characterized as "academic" in comparison.

DuckDB is also worthy of mention here. As a columnar database, it is a good fit for SIMD operations, and it benefits greatly from them. Any Python programmer considering writing a rules engine should do everyone a favor and instead focus on ingesting their data into DuckDB. (I would also extend this recommendation to many C programmers who've written popular engines, as well as to those that naïvely bundle such engines into larger scanners.)

Further Considerations

Record parsing

Rules don't exist in a vacuum. After all, what we care about are the records. And records are inevitably represented in their own form (CSV, json, in-memory data structures, etc.). Dealing with those records and getting their necessary fields out to where they can be evaluated by predicates in our ruleset is not free. In fact, it may dominate the runtime, perhaps by orders of magnitude, especially if we're not careful with how the data's encoded (ugh cybox) and which libraries are used (🤡).

An interesting approach is simdjson. Simdjson uses SIMD operations to parse json records in parallel, and it also embeds some [very] rudimentary rules capability with JSON Pointer support, which allows it to evaluate json records without first doing a full parse.

Avoiding full parsing has of course been a key aspect in regular expression searching. Fast regular expressions don't parse text into lines, parse lines into Unicode codepoints, and run codepoints through their Unicode-aware regexp engines; no, fast regular expression engines look for byte literals in megabyte-sized blocks of bytes and only when there are hits—and some refuse to believe there are hits without repeated evidence—do they bother to look for line breaks and consider the full regular expression(s).
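The shape of that, sketched in Rust — with a naive windows() search standing in for the memchr/SIMD literal scan a real engine would use, and the full regex reduced to a hypothetical closure:

    /// Naive literal scan; assumes a non-empty needle. Real engines do this
    /// with memchr/SIMD over megabyte-sized blocks.
    fn find_literal(haystack: &[u8], needle: &[u8], from: usize) -> Option<usize> {
        haystack[from..]
            .windows(needle.len())
            .position(|w| w == needle)
            .map(|p| p + from)
    }

    fn scan_block(block: &[u8], literal: &[u8], full_match: impl Fn(&[u8]) -> bool) -> usize {
        let mut hits = 0;
        let mut pos = 0;
        while let Some(at) = find_literal(block, literal, pos) {
            // Only now do we care about lines: expand the hit to its line...
            let start = block[..at].iter().rposition(|&b| b == b'\n').map_or(0, |p| p + 1);
            let end = block[at..].iter().position(|&b| b == b'\n').map_or(block.len(), |p| at + p);
            // ...and hand just that line to the full regex engine.
            if full_match(&block[start..end]) {
                hits += 1;
            }
            pos = end.max(at + 1);
        }
        hits
    }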

Heavyweight processing components

Additionally, some predicates may be call-outs to entire subcomponents doing complicated things (looking at you, YARA). Searching strings or files, hashing, parsing parts of file formats, these are operations that go well beyond evaluating records of tuples/dicts in csv or json. The rules engine must call out to these subcomponents, and a considerable portion of runtime will be spent in those components and not in the evaluation of rules.

Predicate cost weighting

To repeat, runtime may be dominated by components evaluating complicated predicates completely outside the normal ken of a rules engine. Therefore, the ideal rules engine will have some notion of which types of predicates are costlier than others, and the engine will then analyze the rules and the predicate costs so that it evaluates the cheapest predicates first and exits early whenever it can.
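A Rust sketch of what that can look like, with invented cost numbers: sort an AND-rule's predicates by estimated cost once at compile time, then evaluate cheapest-first and bail at the first false, so the expensive predicates usually never run:

    /// A predicate plus a rough cost estimate assigned by the engine
    /// (e.g., integer compare: 1, string search: 100, hash a file: 1_000_000).
    struct CostedPredicate<R> {
        cost: u64,
        eval: Box<dyn Fn(&R) -> bool>,
    }

    /// An AND-rule: sorted by estimated cost once, evaluated cheapest-first.
    struct AndRule<R> {
        predicates: Vec<CostedPredicate<R>>,
    }

    impl<R> AndRule<R> {
        fn new(mut predicates: Vec<CostedPredicate<R>>) -> Self {
            predicates.sort_by_key(|p| p.cost);
            AndRule { predicates }
        }

        fn matches(&self, record: &R) -> bool {
            // `all` short-circuits, so the costly predicates run only when
            // every cheaper one has already passed.
            self.predicates.iter().all(|p| (p.eval)(record))
        }
    }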

For an example of a trick I've used for decades, though not in a general purpose rules engine, consider cryptographic hashing for file matching; if I know the file sizes of the set of [contraband/IP/malware] files I'm trying to match against then I can first check to see whether the input file has a matching file size; only if it does will the input file need to be read and hashed. This will get you several orders of magnitude of speedup when scanning disk images or live endpoints, so long as you have all the original file sizes and their cardinality is not too great.
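In Rust, the trick is barely a dozen lines (with a stand-in hash so the sketch stays std-only; a real matcher uses a cryptographic digest and compares against the known hash set):

    use std::collections::HashSet;
    use std::fs;
    use std::io;
    use std::path::Path;

    /// Only files whose size appears in the known set get read and hashed;
    /// everything else is rejected on metadata alone.
    fn candidate_for_hashing(path: &Path, known_sizes: &HashSet<u64>) -> io::Result<bool> {
        let size = fs::metadata(path)?.len();
        Ok(known_sizes.contains(&size))
    }

    fn scan_file(path: &Path, known_sizes: &HashSet<u64>) -> io::Result<Option<u64>> {
        if !candidate_for_hashing(path, known_sizes)? {
            return Ok(None); // no read, no hash -- the common case
        }
        // Stand-in hash; swap in SHA-256 or the like for real matching.
        use std::hash::{Hash, Hasher};
        let bytes = fs::read(path)?;
        let mut h = std::collections::hash_map::DefaultHasher::new();
        bytes.hash(&mut h);
        Ok(Some(h.finish()))
    }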

A Dismal Conclusion

Do I have a conclusion? No, not really. Rule engines exist, but few are separate from their rule language, with tight coupling that consigns them to the same fate as that of the rule language and the commercial entities that sponsor them. Most of the best engines tap out with a virtual machine approach, avoiding the hard engineering involved in JITs and SIMD, but also missing out on the 10+ GB/s of throughput that's possible with contemporary CPUs and DRAM. A few crazy people have made it to that promised land but for specialized domains. And going crazy is meaningless if you don't consider the input format of the records you want to process.

Developing and maintaining an ideal rules engine is a hard problem, requiring genius, expertise, bloodymindedness, and the long-term focus/dedication (with concomitant financing) to scrap existing code and build new implementations from scratch in response to new hardware capabilities. Am I working on one? No. Would I like to? Sort of. Will I? Maybe, but likely without success, because genius, expertise, bloodymindedness, and long-term focus are a lot to ask, and I am lazy, tired, and distracted.

So, continue along with your YARA and SIGMA rules, and their 386-era throughput.
