2022-10-30

Learning Timely & Differential Dataflow, Part I of K

Differential Dataflow is both a Rust library and a dataflow computation paradigm. This blogpost is about the former.

Except... it's not?

Differential Dataflow seems to be an advanced construct built on top of Timely Dataflow, which is also both a Rust library and a dataflow computation paradigm. The rough idea is that Differential Dataflow can accept new input items after having processed a bunch of data and update derived result sets/output with a minimum of computation; it does this by paying attention to changes/deltas/differentials between nodes in a dataflow graph. (NB: I do not understand it beyond this rough idea; some day I'll get around to reading the paper, but there are programs to write.)
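A toy caricature of that rough idea, in plain Rust with no Timely at all: represent updates as (record, diff) pairs and fold only the new deltas into a maintained count, rather than recomputing everything from scratch. (The real thing is vastly more sophisticated; this just illustrates the shape of it.)

```rust
use std::collections::HashMap;

// Fold a batch of (key, diff) deltas into maintained counts; a new
// batch only costs processing its own deltas.
fn apply_deltas(counts: &mut HashMap<String, i64>, deltas: &[(&str, i64)]) {
    for &(key, diff) in deltas {
        *counts.entry(key.to_string()).or_insert(0) += diff;
    }
    // Drop fully-retracted records.
    counts.retain(|_, c| *c != 0);
}

fn main() {
    let mut counts = HashMap::new();
    apply_deltas(&mut counts, &[("apple", 1), ("banana", 1), ("apple", 1)]);
    assert_eq!(counts["apple"], 2);
    // Later, a retraction arrives; only this delta gets processed.
    apply_deltas(&mut counts, &[("apple", -1)]);
    assert_eq!(counts["apple"], 1);
    println!("{:?}", counts);
}
```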

In order to compute differences in a dataflow graph, though, one first needs a dataflow graph system, and that's what Timely Dataflow provides.

So, here's how to learn Timely.

Step 1: Do not read the Differential Dataflow docs.

(This may seem alternately obvious and counterintuitive.) They're less than fantastic at distinguishing what Differential Dataflow provides above and beyond Timely, yet any use of the differential dataflow crate will quickly have one using the Timely Dataflow API, likely more heavily than Differential's own.

Step 2: Read the Timely Dataflow docs.

Step 3: Investigate magic in examples

  • wtf does timely::execute_from_args do?
  • inspect — this seems more like a debugging aid? or does it force computation?
  • is aggregate just reduce by a different name? is it more? is it less?
    • Hash is presumably used for grouping values with the same key together, kind of akin to Hadoop's RawComparator, but read the source to confirm this is correct.
    • But is Hash collision resistant? Is it used only for rough partitioning, with actual key comparisons distinguishing different keys? Or do collisions (on a 64 bit int) signify equality, in which case good luck working with strings and such?
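For what it's worth, std's Hash trait just feeds a value into a Hasher and yields a 64-bit digest, so equal hashes cannot prove equal keys; collisions are always possible. A stdlib-only sketch (this says nothing about what aggregate actually does internally):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash any Hash-able value down to a u64 digest.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Equal values are guaranteed equal hashes...
    assert_eq!(hash_of(&"apple"), hash_of(&"apple"));
    // ...but the converse doesn't hold: distinct values can collide in
    // a 64-bit space, so a correct engine must still compare the keys.
    println!("apple  -> {:016x}", hash_of(&"apple"));
    println!("banana -> {:016x}", hash_of(&"banana"));
}
```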

Step 4: Write some toy programs.

Many (most?) of the example programs in the docs have to do with iterative graph computations. I can see how they're the motivating class of problems for Timely, as older systems fall down here (ahhhh, remember the days of cascading MapReduce jobs?), but it would be nice to have simpler examples with clearer analogues to other systems (by which I mean, WordCount).
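For reference, the plain-Rust analogue of WordCount is just a HashMap fold; a sketch like this is the baseline one would hope a Timely version roughly resembles:

```rust
use std::collections::HashMap;

// Count whitespace-separated words in a string.
fn word_count(text: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = word_count("to be or not to be");
    println!("{:?}", counts);
    assert_eq!(counts["to"], 2);
    assert_eq!(counts["not"], 1);
}
```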

In particular, Timely and Differential Dataflow are Rust libraries, so they form a readymade data processing engine that plays nicely in the Rust ecosystem. The downside of Hadoop/Spark/Flink et al has always been the cost of scaling down, plus being so firmly mired in the muck of Java, requiring enormous shell scripts to manage classpaths and the like.

Rust scales down to a single machine quite nicely, and Timely purports to do so as well. That makes it appealing for data processing even when none of the advanced time/windowing features are needed.

Programs to write:

  • WordCount
  • WordCount w/ Combiner (or does Rust just do this?)
  • Map-side join
  • Reduce-side join
  • Composite aggregation, of count and sum (how many files do we have, and what are their total size?)
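The last item is easy to prototype in plain Rust first (file sizes below are made up for illustration): carry both aggregates in a single accumulator and do one pass over the data.

```rust
// One pass, one accumulator carrying both aggregates.
fn count_and_sum(sizes: &[u64]) -> (u64, u64) {
    sizes
        .iter()
        .fold((0, 0), |(count, total), &s| (count + 1, total + s))
}

fn main() {
    // Hypothetical file sizes in bytes.
    let sizes = [1024u64, 2048, 512, 4096];
    let (count, total) = count_and_sum(&sizes);
    println!("{} files, {} bytes total", count, total);
    assert_eq!((count, total), (4, 7680));
}
```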

Step 5: Questions

  • Does Timely actually support distributed processing, with multiple nodes???
  • How hard is it to submit patches to the docs?
  • How much attention is it getting now that Materialize exists and is focused on a higher-level database product?