Next Generation of Hadoop

The last couple of weeks have seen some new information trickle out from Yahoo! about their efforts to improve scalability on Hadoop. (For a sense of scale, for the uninitiated: by "improve scalability" I mean "scale beyond clusters of 4,000 servers.") Yahoo! is calling this effort Next Generation Hadoop, or, Hadoop .Next.

To paraphrase V.S. Naipaul on the New Yorker and fiction, Yahoo! knows nothing about enterprise software marketing, nothing.

To those paying attention, it's clear there's been tension between Yahoo!, Hadoop's first and most important patron, and Cloudera, the usurper. It's not hard to play armchair psychologist and speculate about the forces behind the tension, but I certainly don't know enough to comment on it intelligently. What is undeniable, though, is that release momentum hiccuped at a critical period in Hadoop's adoption by the industry at large, and that momentum has only recently been restored. Yahoo!'s recent statement about abandoning their own Hadoop distro and working to improve Apache trunk was very good news.

Still, remember that Hadoop is not yet that magical 1.0. It's changed enormously over the past year or two, and for the better. It's silly that it's not considered 1.0 already, and it's clear that a 1.0 designation is coming down the pike.

THEREFORE: It makes no sense whatsoever to talk about "Next Generation Hadoop." All the technical reasons are sound, but here's what this sounds like to me: "We haven't hit 1.0 yet, but we've already come down with a terminal case of Second System Effect." Moreover, this sense of foreboding is not at all helped by the fact that the presentations and documents about Hadoop .Next have not included any information about the most important facet of a Hadoop refactoring: the API. So, not only do I have to worry about deciding between implementing Mapper or inheriting from Mapper, I'm now worried that I'll have to abandon Mapper for ConfigurableGenericJobTask and ConfigurableGenericJobTaskFactoryInterface and rewrite all my code to suit. Viva la Revolution.

I'm not really that worried. Mostly, Hadoop has been evolving according to a Teilhardian roadmap. However, it'd be fantastic if the community released 1.0 in 2011 and phased in Hadoop .Next incrementally, without making it seem so disruptive.


Fast Unique Files filter for EnCase

Related to my Fast Hash Matching post from November and Lance's original post, here's the code to an EnCase filter for the Entries view that will show you only the first occurrence of each file by hash value:

include "GSI_Basic"

class MainClass {
  NameListClass  HashList;
  BinaryTreeClass Tree;

  MainClass() :
    Tree(HashList, new NodeStringCompareClass())
  {
    if (SystemClass::CANCEL == SystemClass::Message(
        SystemClass::ICONINFORMATION | SystemClass::MBOKCANCEL,
        "Unique Files By Hash",
        "Note:\nFiles must be hashed prior to running this filter."))
    {
      // Informational only: a filter's Main() runs for every entry
      // regardless of the user's choice here.
    }
  }

  bool Main(EntryClass entry) {
    HashClass hash = entry.HashValue();
    if (Tree.FastFind(hash)) {
      return false; // hash already seen; hide this entry
    }
    else {
      Tree.FastInsert(new NameListClass(null, hash), hash);
      return true; // first occurrence; show it
    }
  }
}

The code in the comments on Lance's blog is close, but not quite correct, possibly mangled by the comment form. You need to hash all files before you run this filter. As discussed in my earlier post, this is by no means the fastest possible way to do this, but I recently had someone ping me about needing exactly this, and it makes sense to put it up for everyone.

This filter is utterly dependent on BinaryTreeClass in GSI_Basic.EnScript, a support file that comes with EnCase and can be found at Program Files\EnCase 6\EnScript\Include\GSI_Basic.EnScript.

I've also put up an ini file of the filter, which you can import directly into EnCase (right-click in the filter tree and choose Import...), available here on Google Docs.

Dynamic Features in EnScript

EnScript is first and foremost a static language. It's statically typed and statically compiled, which means that EnScript figures out everything in your script before it starts to execute it; if it cannot determine which code should be called, or finds any other syntactic problem, it generates a compilation error. It also means that EnScript executes reasonably efficiently, since the compiler has figured out in advance which functions to call, and does so in a relatively lightweight manner.

(It should be noted that EnScript does not do much in the way of code-optimization, however, so dynamic languages that make use of just-in-time compilation and other aggressive techniques may yet execute faster than EnScript.)

Despite being squarely in the static camp of languages, EnScript does have some features which can be considered "dynamic." By dynamic, I mean that you can write an EnScript that is able to interact with the EnScript engine and script code, at least in some ways.

The dynamic features in EnScript are:
  • Typecasting
  • Class reflection
  • Property accessors
  • Subordinate execution through ProgramClass
These dynamic features make EnScript far more powerful than it initially seems. I'll be writing about these features over the next few weeks. Since typecasting is short and sweet, I'll cover that now.


Let's say you're working with a NameListClass object. Since NameListClass inherits from NodeClass, you can always treat a NameListClass object as a NodeClass object without doing anything special:

NameListClass list();
list.Parse("one two three", " ");

NodeClass nodeRef = list; // up-cast

This is an example of upcasting, where we treat a derived object as its parent type by manipulating it through a parent-type reference. EnScript knows that NameListClass inherits from NodeClass, so there's no need for you to do anything special, and this cannot possibly fail. There is no ambiguity.


What if we want to do the reverse? Say we have a function that takes a NodeClass object as a parameter, and we'd like to treat it specially if it's a NameListClass object. Here's the answer:

void foo(NodeClass node) {
  NameListClass listRef = NameListClass::TypeCast(node);
  if (listRef) {
    // ...
  }
}

As it turns out, every class in EnScript has a static function named TypeCast(). TypeCast() takes an ObjectClass reference and returns a reference to an object for its particular type, e.g., NameListClass::TypeCast() returns a NameListClass object reference. Because every object inherits from ObjectClass implicitly (even NodeClass), you can pass just about anything into TypeCast(), and because NameListClass::TypeCast() returns a reference to a NameListClass object, EnScript's compiler doesn't complain about assigning the result to the listRef reference here. TypeCast() lets us bridge the gap, allowing us to treat our NodeClass as a NameListClass.

Note here that we then check listRef to see whether it's null. Why? Well, what if someone called foo() and passed in an EntryClass object? An EntryClass object is not a NameListClass object, so NameListClass::TypeCast() will return null. By returning null, TypeCast() lets you safely test your downcast. Of course, if you skipped the check, the code would still compile, but you'd end up with a null reference error whenever someone called foo() with anything other than a NameListClass object.

One more thing about TypeCast(): Don't pass it a null reference. TypeCast() itself will generate a null reference error if you pass it null. Therefore, to be truly safe, foo() should look like this:

void foo(NodeClass node) {
  if (node) {
    NameListClass listRef = NameListClass::TypeCast(node);
    if (listRef) {
      // ...
    }
  }
}

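If you're coming from C++, TypeCast() plays much the same role as dynamic_cast on pointers, which likewise yields null when the cast fails. Here's a rough analogy, with hypothetical Node/NameList/Entry stand-ins of my own invention mirroring the EnScript classes:

```cpp
#include <string>

// Hypothetical stand-ins for the EnScript classes; only the casting
// behavior is the point here.
struct Node { virtual ~Node() = default; };   // polymorphic base
struct NameList : Node { std::string names; };
struct Entry : Node {};

void foo(Node* node) {
  if (node) {  // guard against null before casting
    // dynamic_cast returns null if node is not actually a NameList
    if (auto* list = dynamic_cast<NameList*>(node)) {
      list->names = "one two three";
    }
  }
}
```

The null-guard ordering is the same as in the EnScript version: check the incoming reference first, then check the result of the cast.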
Single Object Inheritance is the Root of All Evil

If you make your own class hierarchies, you will find yourself having to make use of TypeCast() quite a bit. Let's say you have an AnimalClass as a base class and two derived classes, DogClass and CatClass. Further, we want to have a method to check whether two AnimalClass objects are equal (...uh, whatever that means...). So, we would have to create a virtual function in AnimalClass, isEqual(), that looks like this:

class AnimalClass {
  pure bool isEqual(AnimalClass other);
}

and then override the function in DogClass and CatClass. Clearly, a dog is not ever going to be equal to a cat. But not all cats are equal to each other, either. We will need to check their favorite brand of kitty food, kitty litter, markings, meowing habits, and whatnot to decide whether two cats are equal. So, even though isEqual() receives an AnimalClass object, we'll need to treat it as a CatClass to inspect all the member variables that are specific to cats and make this determination.

class CatClass: AnimalClass {
  String FaveWetFood, FaveDryFood, FaveLitter;

  virtual bool isEqual(AnimalClass other) {
    CatClass cat = CatClass::TypeCast(other);
    if (cat) {
      return FaveWetFood == cat.FaveWetFood
        && FaveDryFood == cat.FaveDryFood
        && FaveLitter == cat.FaveLitter;
    }
    return false; // must be a dog or a ferret
  }
}

The TypeCast() ends up being necessary because there's something common we want to do for all animals (e.g., make equality comparisons), so the functionality must go in the base class and the function signature must involve an AnimalClass object. In languages like Java, Smalltalk, and EnScript, programmers end up having to do this a lot.

In C++, which has templates, you can often avoid the downcast altogether with the Curiously Recurring Template Pattern.
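A sketch of that pattern, mirroring the cat example (the class and member names here are my own, not from any real library): the base template knows the derived type at compile time, so same-type comparison needs no runtime cast, and comparing a Cat to a Dog simply fails to compile instead of returning false at runtime.

```cpp
#include <string>
#include <tuple>

// CRTP: the base class is parameterized on the derived type, so
// isEqual() can compare Derived objects without a runtime downcast.
template <typename Derived>
struct Animal {
  bool isEqual(const Derived& other) const {
    const Derived& self = static_cast<const Derived&>(*this);
    return self.fields() == other.fields();
  }
};

struct Cat : Animal<Cat> {
  std::string faveWetFood, faveDryFood, faveLitter;
  // Tie the members into a tuple so equality is one comparison.
  auto fields() const { return std::tie(faveWetFood, faveDryFood, faveLitter); }
};

struct Dog : Animal<Dog> {
  std::string faveBone;
  auto fields() const { return std::tie(faveBone); }
};
// cat.isEqual(dog) is now a compile-time error, not a runtime false.
```

The trade-off: you lose the ability to compare two animals through a plain AnimalClass reference, which is exactly what the EnScript version preserves at the cost of the TypeCast().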


sed quis custodiet ipsos custodes?

Fifty years on, Eisenhower's farewell address seems quite relevant to the times, and the readers of this blog.

A quibble:
Yet, in holding scientific research and discovery in respect, as we should, we must also be alert to the equal and opposite danger that public policy could itself become the captive of a scientific-technological elite.

In 2011, when any village idiot/realtor can comment on which NSF grants do not have merit, it is safe to say that public policy is in no danger of becoming captive to the scientific-technological elite. (By the way, Jacques Barzun's The House of Intellect is a great exploration of anti-intellectualism in the U.S. [I hope I look that good when I'm 102, and I'm unsurprised the Presbyterians played host.])

However, a core tenet of the contemporary homeland defender's creed is "we don't set policy." The scientific-technological elite, to which you, dear reader, no doubt belong, strives to uphold the Unix ideal, providing mechanism, not policy... capability, not culpability. This is a cop-out. You can't sit Grandma down in front of a bash prompt and tell her she'll be just fine with man man.

Our clients, those whom we serve, likely do not understand what it is we do, what we know, what we can do, and, most importantly, what should be done. If we in the Cyber Industrial Complex* do not act with restraint, if we do not "conduct our struggle on the high plane of dignity and discipline," then we risk legitimizing the paranoid delusionals, and the lawless actions of their defenders. "Thy will be done" may be the easy interpretation our masters desire, but when our masters are not omniscient, or lack sufficient wisdom, then, ever mindful of our own weaknesses and ever penitent, we should strive to be of good counsel. Machine learning, SNA, disinformation, and old school psyops are good tools, but only if we have good targets.

* Military Cyber Complex is maybe more apt, but less poetic.

Update: Penitence requires more than an apology, but it's a good start.

Update 2: The paper of record 

Comment: What about severing ties to Hunton & Williams? Severing ties to a tiny firm like HBGary Federal seems like kicking a dead dog.