Pentaho Open-Sources ‘Big Data’ Tools

Darryl K. Taft covers IBM, big data and a number of other topics for TechWeekEurope and eWeek

Pentaho has open-sourced its big data analytics tools, known as the Pentaho Kettle project

‘Big data’ need not be expensive: Pentaho has announced that it has open-sourced its big data analytics tools, known as the Pentaho Kettle project.

Pentaho said it has made all of its big data capabilities freely available as open source in the new Pentaho Kettle 4.3 release, and has moved the entire Pentaho Kettle project to the Apache License, version 2.0.

Because the Apache License is the licence under which Hadoop and several of the leading NoSQL databases are published, the move will further accelerate the rapid adoption of Pentaho Kettle for Big Data by developers, analysts and data scientists as the go-to tool for taking advantage of big data, the company said.

Big Data

Big data capabilities available under open-source Pentaho Kettle 4.3 include the ability to input, output, manipulate and report on data using the following Hadoop and NoSQL stores: Cassandra, Hadoop HDFS, Hadoop MapReduce, Hadapt, HBase, Hive, HPCC Systems and MongoDB.

“In order to obtain broader market adoption of big data technology including Hadoop and NoSQL, Pentaho is open sourcing its data integration product under the free Apache license,” said Matt Casters, founder and chief architect of the Pentaho Kettle Project, in a statement. “This will foster success and productivity for developers, analysts and data scientists giving them one tool for data integration and access to discovery and visualisation.”

With regard to Hadoop, Pentaho Kettle makes available job orchestration steps for Hadoop, Amazon Elastic MapReduce, Pentaho MapReduce, HDFS File Operations and Pig scripts. All major Hadoop distributions are supported, including Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, Hortonworks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition. Pentaho Kettle can execute ETL transforms outside the Hadoop cluster or within the nodes of the cluster, taking advantage of Hadoop’s distributed processing and reliability.

Productivity Increase

Pentaho officials said Pentaho Kettle for Big Data delivers at least a tenfold boost in productivity for developers through visual tools that eliminate the need to write code such as Hadoop MapReduce Java programs, Pig scripts, Hive queries, or NoSQL database queries and scripts.

In addition, Pentaho officials said Pentaho Kettle also:

  • Makes big data platforms usable for a huge breadth of developers, whereas previously big data platforms were usable only by the geekiest of geeks with deep developer skills such as the ability to write Java MapReduce jobs and Pig scripts;
  • Enables easy visual orchestration of big data tasks such as Hadoop MapReduce jobs, Pentaho MapReduce jobs, Pig scripts, Hive queries and HBase queries, as well as traditional IT tasks such as data mart/warehouse loads and operational data extract-transform-load jobs;
  • Leverages the full capabilities of each big data platform through Pentaho Kettle’s native integration with each one, while enabling easy co-existence and migration between big data platforms and traditional relational databases; and
  • Provides an easy on-ramp to the full data discovery and visualisation capabilities of Pentaho Business Analytics, including reporting, dashboards, interactive data analysis, data mining and predictive analysis.