Hadoop 2 Pushes Big Data Further Into The Mainstream


The new Apache Hadoop promises a world where Big Data players work together, says Doug Cutting

The Big Data world is on a much firmer footing: the Apache Foundation’s Hadoop version 2 makes the platform more usable for business, and vendors are exploring ways to deliver easier-to-use software, according to the founder of the Hadoop movement.

Version 2 of the open source project includes YARN (Yet Another Resource Negotiator), a resource manager and job scheduler which allows different kinds of workloads to run on the same Hadoop cluster. Future work on the standard is also going to be more organised, with less risk of individual projects splitting off from the mainstream, according to Doug Cutting of Cloudera – and his company has a change of licensing structure on the way, too.

Hadoop 2 – a social milestone


“Socially, Hadoop 2 is a milestone. We are now all working on the same thing,” Cutting told TechWeekEurope at O’Reilly’s Strata event in London – a major Big Data fest. “We have methodologies where we can collaborate effectively.”

There is a plethora of open source projects around Hadoop, handling different data stores and different ways to access and search data. Those will continue – and Cutting sees the variety as a major strength of the Hadoop community – but he said there is now a “much more unified horizon” for the technical work: “People were doing things piecemeal, now we are all clicking along together.”

New features and branches can be created and developed before being merged into the Hadoop mainstream, and this process is now better agreed upon, he told us: “There was a lot of suspicion about groups of developers feeling others might break things or saddle them with something that is not ready.” Now, this is done in a well-organised manner.

“We are executing as a set of competitors and moving it forward,” he said.

Apache prefers projects that are diverse, with developers from multiple vendors, he says, because there is less risk of decisions being made offline, inside the one company that is effectively running the project.

That doesn’t stop some of these projects succeeding: for instance Cloudera’s Impala real-time query system is open source, and has been used by its rivals, such as MapR, despite being effectively under Cloudera’s control. Projects like this can become more widely owned in future if rivals decide to get involved in building them, Cutting said.

It’s more worrying when things in the Hadoop space fork or spin off outside the Apache Foundation’s oversight. There are some things which might never go into Apache, and some are even proprietary code, but this is “not inherently a bad thing,” said Cutting. It does create risk, but it also allows vendors to create things that are beyond what is available in the central Hadoop projects.

Cutting doesn’t warn about anything specific, but does point out that proprietary tools come with an element of risk.

In Cloudera’s case, the company has created proprietary management tools, and they will stay proprietary: they are specific to Cloudera, and Cutting doesn’t want to incur the overhead of making them open source.

But Cutting emphasises that the fundamentals remain open source: “Anything that is storing or processing data, that will continue to be open source. We don’t want people to be locked in at the application level. They need to own their data and the processing.”

Spin us a Yarn

The biggest thing in Hadoop 2 is YARN, he explained: “That’s a scheduler, which lets you have different loads sharing CPU and I/O more effectively.” Thus MapReduce and Cloudera’s Impala can receive dynamic allocations, and co-ordinate what each one is using at a given time.

Cloudera is already making use of YARN to support service level agreements (SLAs) and to apportion cluster resources between departments within an organisation. Where previously it was only possible to dedicate, say, 50 percent of the resources to batch processing, it is now possible to guarantee the IT department 50 percent of the cluster for its mission-critical jobs.
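As an illustration (not Cloudera’s actual configuration), a departmental split like the one Cutting describes can be expressed in YARN’s CapacityScheduler, where each queue is guaranteed a share of the cluster; the queue names here are hypothetical:

```xml
<!-- capacity-scheduler.xml: a sketch of a 50/50 departmental split -->
<configuration>
  <property>
    <!-- Define two top-level queues under the root -->
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>it,batch</value>
  </property>
  <property>
    <!-- Guarantee the IT department half the cluster for mission-critical jobs -->
    <name>yarn.scheduler.capacity.root.it.capacity</name>
    <value>50</value>
  </property>
  <property>
    <!-- The remaining half goes to batch processing -->
    <name>yarn.scheduler.capacity.root.batch.capacity</name>
    <value>50</value>
  </property>
</configuration>
```

Jobs submitted to the `it` queue are then assured their share even when batch workloads would otherwise saturate the cluster.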

There are signs of a new direction from Cloudera, away from the traditional open source model, where customers simply buy support for a complex collection of packages, towards one where they buy an off-the-shelf product.

“We are trying to define the thing which institutions want,” he said. “One place to keep their data, with a wide variety of functions, management tools, tools to track lineage, search and disaster recovery. All these things together in a single offering.”

Until now, Cloudera was simply a Hadoop offering with add-ons, he said: “That’s not what people are really looking for. What they really want is this thing they can install.”

People want Hadoop off-the-shelf, and they don’t want to have to choose tools, he said: “They want the full scope that is available, so they don’t have to pick and choose. You are going to need all of it at some point. That’s a promise!”

Does this mean a change in business model, and a change in pricing? That’s not a question for the architect: he tells me there are changes coming, but he can’t pre-announce them. “We are trying to raise our game,” is all he would say about that.
