Tugdual Grall, technical evangelist at MapR Technologies, looks at the rise of Apache Spark and why everyone’s talking about it
There is a plethora of new technologies entering the big data landscape, but perhaps the most avidly discussed in 2015 was Apache Spark.
Some view this tool as a more accessible and powerful alternative to Hadoop, while others argue Spark can be used as a powerful complement to Hadoop, with its particular strengths and quirks. But what are the facts? Who is using Spark and how does it differ from other data processing engines?
What is Spark?
An all-purpose data processing engine, Spark can be used for a wide variety of operations. Data scientists and application developers can integrate Spark into their applications to query, analyse and transform data quickly and at scale. Spark is optimised to operate in-memory, allowing it to process data much faster than solutions such as Hadoop’s MapReduce, which reads and writes data to disk between each stage of processing. Some claim that, when running in-memory, Spark can be up to 100 times faster than MapReduce. But this is not a like-for-like comparison: raw speed tends to matter more to Spark’s typical use cases than it does to batch processing, at which MapReduce-style solutions still excel.
What can it do?
Spark can manage several petabytes of data at once, distributed over a cluster of thousands of cooperating physical or virtual servers. Support for languages such as Python, R, Scala and Java adds to its flexibility, making it suitable for a wide range of use cases. It is often deployed alongside a distributed file system such as HDFS (Hadoop’s storage layer) or the MapR File System, but it also integrates well with other popular data stores, including MapR-DB, MongoDB, HBase, Cassandra and Amazon’s S3.
Some common use cases include:
Machine learning: Machine learning approaches become more feasible and accurate as data volume grows. Software can be trained to identify and act upon triggers within well-understood data sets, and the resulting models can then be applied to new and unknown data. Spark’s ability to hold data in memory and rapidly run repeated queries has made it a popular choice for training machine learning algorithms.
Interactive analytics: Data scientists and business analysts explore data by asking a question, looking at the result, and then fine-tuning the question or drilling down further into the results. Whether they are looking at production line productivity or stock prices, Spark allows analysts to interact with the data to gain greater insight.
Data integration: Analysing and reporting on data produced by different systems across a business is particularly hard, as the data is rarely clean or consistent enough to be simply combined. Spark, like Hadoop, is now being deployed in many organisations to reduce the time and cost of cleaning and standardising data before it is loaded into a separate system for analysis.
Stream processing: Increasingly, the use cases described above must be executed in real time, so data has to be captured by the system as it arrives. Spark allows application developers to “stream” data into the system and process it in real time. Financial transaction “streams,” for example, can be processed in real time to combat fraud by identifying – and subsequently refusing – fraudulent transactions. Spark Streaming can capture data from a variety of sources, including file systems, network sockets and other components of the big data ecosystem such as Flume, Apache Kafka and MapR Streams.
What sets it apart?
Spark’s potential for interactive querying and machine learning has caught the attention of a range of vendors looking at it as an opportunity to extend their existing big data solutions. Significant sums are being invested in this technology, and an increasing number of startups are building entire businesses dependent wholly or in part on Spark.
What is more, all major Hadoop vendors now support Spark alongside their existing products, with each working to help customers harness its valuable qualities. The number of Spark training courses is also growing, building skills for businesses and individuals alike.
Why are people choosing Spark?
Simplicity: All of Spark’s capabilities are accessible via a set of rich APIs, available in multiple programming languages and designed specifically for quick and easy interaction with data at scale. These well-documented, well-structured APIs make it straightforward for application developers and data scientists to put Spark to work quickly.
Support: Spark supports a range of programming languages and integrates tightly with many leading storage solutions in the Hadoop ecosystem and beyond. Furthermore, the Spark community is large, active and international.
Speed: Spark is ultimately designed for speed, achieved through operating both in-memory and on disk.
Without a doubt, Spark has vast potential and will continue to gain momentum as data scientists and business analysts increasingly recognise its capabilities. However, like many big data offerings, Spark is no miracle cure: it is not a “one size fits all” solution, nor is it the best choice for every data processing task. Over the next year, we are likely to see even greater discussion of Spark’s future, and more examples of its potential as further uses are brought to market.