Hadoop goes beyond batch to real time with Impala
The open source Big Data engine Hadoop has gained real-time abilities with Impala, which lets users query Hadoop data more quickly, either through the HBase store or directly in HDFS (the Hadoop Distributed File System).
Although Hadoop has allowed much bigger data sets to be handled, it has been a batch-oriented system until now. The arrival of Impala and the Real Time Query (RTQ) language changes was announced by Mike Olson, CEO of the Hadoop specialist Cloudera, in a keynote at the O’Reilly Strata + Hadoop World event in New York today.
“In analytics, getting the answer back in seconds versus minutes can be revolutionary,” Cloudera COO Kirk Dunn told TechWeekEurope. “We can do that on large data sets.”
Crazy for Hadoop
Cloudera has been promoting Hadoop with the HBase add-on, a table overlay which makes it look more like a relational database. Until now, though, users have been limited to batch-like operation, albeit on vastly larger sets of complex data.
Impala is a speedy parallel query engine which operates in real time, so users can query either the native HDFS file system or the HBase store, using the new RTQ. It uses a fairly straightforward SQL-like syntax, similar to that of the Apache Hive tool.
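To give a flavour of the kind of SQL-like aggregate query described here, the following is a minimal sketch only. It does not use Impala: Python's built-in sqlite3 module stands in as the engine so the example runs anywhere, and the page_views table, its columns and the helper names top_pages and demo_connection are hypothetical, invented for illustration.

```python
import sqlite3

# Stand-in only: sqlite3 is NOT Impala. It is used so the sketch runs
# anywhere. Table, column and function names are hypothetical.


def demo_connection():
    """Create an in-memory database holding a tiny sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (page TEXT, user_id INTEGER)")
    rows = [
        ("/home", 1), ("/home", 2), ("/pricing", 1),
        ("/home", 3), ("/docs", 2), ("/pricing", 4),
    ]
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
    return conn


def top_pages(conn, limit=3):
    """Return (page, hits) rows for the most-viewed pages, busiest first."""
    query = """
        SELECT page, COUNT(*) AS hits
        FROM page_views
        GROUP BY page
        ORDER BY hits DESC, page ASC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    for page, hits in top_pages(demo_connection()):
        print(page, hits)
```

The point is the shape of the query rather than the engine: an analyst types a familiar SELECT with GROUP BY and ORDER BY and expects the answer while still at the keyboard, rather than submitting a batch job and waiting.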
Impala is open source of course, and will be available as a public beta download from Cloudera’s site. From January, subscription-based support and management tools will be available for it, as they are for the existing CDH set of Cloudera tools, Dunn told us.
The idea is that real-time ability will let Cloudera compete outside the specialised world of Big Data, taking the competition back to where people use traditional relational databases and data warehouses, and offering a lower-cost alternative to both.
Impala was developed by lead architect Marcel Kornacker, who previously built the F1 query engine at Google. Cloudera says Impala works with a more flexible data model using more complex information than a data warehouse while still using more-or-less standard SQL.
Analysts have been waiting for this development. “We have already seen high levels of interest in and adoption of Hadoop by enterprises for low-cost storage and transformational processing of large volumes of data, but have argued that for Hadoop to gain more adoption for analytic workloads we need to see analytic tools taking full advantage of Hadoop’s scalable parallel processing architecture,” said Matt Aslett, research manager, data management and analytics, 451 Research.
“Cloudera Enterprise RTQ and Cloudera Impala look to be a significant step in enabling enterprises to take advantage of existing SQL skills and tools to realise the potential of real-time analytics against large volumes of structured and unstructured data stored in Hadoop.”
“Enterprises have grown accustomed to interactive querying and on-the-spot analytics with their existing data warehousing and BI infrastructures and will expect no less of Hadoop,” added Tony Baer, principal analyst for Ovum. “With a real-time query capability powered by its new Impala engine, Cloudera is striving to level the playing field in performance and accessibility with massively parallel SQL platforms.”
The new function already has partners and users. Capgemini Financial Services, Karmasphere, MicroStrategy, Pentaho, QlikView, and Tableau have all validated their solutions with Impala, and are pleased with what they see. Jojy Mathew, vice president and chief solution strategist at Capgemini Financial Services, said: “We are currently integrating the product with a wide array of proven Big Data use cases that we have developed.”
Meanwhile, analytics vendor MicroStrategy has optimised its queries for Impala and will be offering it to customers.
Partners such as Oracle are keen to work with Impala as well, said Dunn, adding that the Hadoop World event is a sell-out with more than 2000 delegates and plenty of announcements from other Hadoop partners.