2011-02-23

Next Generation of Hadoop

The last couple of weeks have seen some new information trickle out from Yahoo! about their efforts to improve scalability on Hadoop. (For a sense of scale for the uninitiated: by "improve scalability" I mean "scale beyond clusters of 4,000 servers.") Yahoo! is calling this effort Next Generation Hadoop, or Hadoop .Next.

To paraphrase V.S. Naipaul on the New Yorker and fiction, Yahoo! knows nothing about enterprise software marketing, nothing.

To those paying attention, it's clear there's been tension between Yahoo!, Hadoop's first and most important patron, and Cloudera, the usurper. It's not hard to play armchair psychologist and speculate about the forces behind the tension, but I certainly don't know enough to comment on it intelligently. What is undeniable, though, is that release momentum hiccuped at a critical period in Hadoop's adoption by the industry at large, and that momentum has only recently been restored. Yahoo!'s recent statement about abandoning their own Hadoop distro and working to improve Apache trunk was very good news.

Still, remember that Hadoop is not yet that magical 1.0. It's changed enormously over the past year or two, and for the better. It's silly that it isn't considered 1.0 already, and it's clear that a 1.0 designation is coming down the pike.

THEREFORE: It makes no sense whatsoever to talk about "Next Generation Hadoop." All the technical reasons are sound, but here's what this sounds like to me: "We haven't hit 1.0 yet, but we've already come down with a terminal case of Second System Effect." Moreover, this sense of foreboding is not at all helped by the fact that the presentations and documents about Hadoop .Next have not included any information about the most important facet of a Hadoop refactoring: the API. So, not only do I have to worry about deciding between implementing Mapper or inheriting from Mapper, I'm now worried that I'll have to abandon Mapper for ConfigurableGenericJobTask and ConfigurableGenericJobTaskFactoryInterface and rewrite all my code to suit. Viva la Revolution.

I'm not really that worried. Mostly, Hadoop has been evolving according to a Teilhardian roadmap. However, it'd be fantastic if the community released 1.0 in 2011 and phased in Hadoop .Next incrementally, without making it seem so disruptive.
