Google Launches Service For Managing Hadoop, Spark Clusters

Cloud Dataproc will make it easier to administer and manage clusters, the company says

Big data analytics technologies such as Hadoop and Spark can help organizations extract business value from massive data sets, but they can be very complex to administer and to manage.

Hoping to help reduce some of that complexity, Google Wednesday announced the launch of a new service dubbed Cloud Dataproc for customers of its cloud platform. The service is currently available only in beta and is designed to minimize the time businesses spend on administering and managing computing clusters in Hadoop and Spark environments.

The company described Cloud Dataproc as a managed Spark and Hadoop service that lets customers of Google’s Cloud Platform create clusters more quickly, manage them more efficiently and save money by allowing them to turn clusters on and off as needed.

Google Hadoop

In a blog post Wednesday, Google Product Manager James Malone listed several features of Cloud Dataproc that he claimed make the service better than on-premises products and competing services.

Cloud Dataproc, for instance, makes it much faster for enterprises to create and run Spark and Hadoop clusters compared to doing the same thing with on-premises clusters and rival infrastructure-as-a-service (IaaS) platforms, Malone said.

The average time it takes with Cloud Dataproc to start, scale or shut down Hadoop and Spark clusters is 90 seconds or less per operation, compared with between 5 and 30 minutes with on-premises technologies and other IaaS vendors, he claimed.

Cloud Dataproc is also tightly integrated with other Google cloud services such as BigQuery, Cloud Logging, Cloud Monitoring and Cloud Storage, making it a comprehensive data platform, Malone said. “For example, you can use Cloud Dataproc to effortlessly ETL [extract, transform and load] terabytes of raw log data directly into BigQuery for business reporting,” he noted.

Malone touted Cloud Dataproc’s pricing model as another advantage over alternative options. Google, for instance, currently charges only 1 cent per hour per CPU in a cluster, he said. That price can go down even further if a business chooses Google’s recently announced pre-emptible virtual machines option for running its workloads, he said.

Cloud tools

Google’s pre-emptible VMs allow enterprises to rent extra infrastructure capacity from the company cheaply for short-duration workloads, on the condition that the capacity can be pre-empted at any time to run regular workloads. “Instead of rounding your usage up to the nearest hour, Cloud Dataproc charges you only for what you really use with minute-by-minute billing and a low, ten-minute-minimum billing period,” Malone said.
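To illustrate the difference Malone describes, the short Python sketch below compares per-minute billing with a ten-minute minimum against billing rounded up to the nearest hour, using the quoted rate of 1 cent per CPU per hour. The function names, the 24-CPU cluster and the 95-minute run are hypothetical, and rounding partial minutes up is an assumption; the sketch covers only the quoted per-CPU rate and ignores any other charges a customer might incur.

```python
import math

RATE_CENTS_PER_CPU_HOUR = 1  # rate quoted in the article
MINIMUM_MINUTES = 10         # ten-minute minimum billing period


def per_minute_charge_cents(cpus, runtime_minutes):
    """Charge in cents under per-minute billing with a ten-minute minimum.

    Assumption: partial minutes are rounded up to the next whole minute.
    """
    billed_minutes = max(MINIMUM_MINUTES, math.ceil(runtime_minutes))
    return cpus * billed_minutes * RATE_CENTS_PER_CPU_HOUR / 60


def hourly_rounded_charge_cents(cpus, runtime_minutes):
    """Charge in cents if usage were rounded up to the nearest hour."""
    billed_hours = max(1, math.ceil(runtime_minutes / 60))
    return cpus * billed_hours * RATE_CENTS_PER_CPU_HOUR


# Hypothetical job: a 24-CPU cluster running for 95 minutes.
print(per_minute_charge_cents(24, 95))      # 38.0 cents
print(hourly_rounded_charge_cents(24, 95))  # 48 cents
```

Under these assumptions, a job that ends 35 minutes into its second hour is billed for 95 minutes rather than 120. A four-minute job would still be billed for the ten-minute minimum.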

Cloud administrators do not need to learn any new tools or APIs to use Cloud Dataproc. Google’s Developer Console allows administrators to interact with Spark and Hadoop clusters without any handholding, Malone added.

Cloud Dataproc adds to a rapidly growing portfolio of tools from Google for working with large datasets and workloads in the cloud. In August, for example, the company boosted performance of its BigQuery data analytics service with new user-defined functions and usability improvements.

Earlier this year, the company announced a new Cloud Monitoring service designed to let enterprises monitor performance, availability and capacity of key Google services like App Engine, Cloud SQL and Compute Engine.

Originally published on eWeek.