Twitter Plots To Storm Growing Big Data Market


Storm, from Twitter’s BackType, will provide new ways of tracking reputation in realtime – and more

Twitter is preparing to announce a competitor to Apache Hadoop in the open-source Big Data world.

The micro-blogging company bought BackType Technology for its data analytics product Storm at the beginning of July. Like Hadoop, Storm will be developed as open source – a process that was already underway before the acquisition. It will also compete with products such as Yahoo’s open-sourced S4 and DarkStar from Cloud Event Processing.

Brand Reputation Tracking And More

BackType’s main product to date has been BackTweets, which helps companies understand the influence of their tweets on the wider world. Storm, by contrast, is a general data-processing tool that Nathan Marz, lead engineer for BackType, has called the “Hadoop of realtime processing” – a comparison that has reportedly prompted the threat of court action over the Hadoop name. Storm has been devised to search the Web to find out what people are saying about a topic of interest.

Products like Hadoop and Storm are being promoted as targeted replacements for RSS feeds. Where RSS generally pushes every new entry in a specific website’s news or blog feed to subscribers, these data analytics products are developing the ability to stream specific topics into a unified feed, or cluster. In this area, BackType claims to have beaten Hadoop to the punch.

Announcing the imminent release of Twitter’s Storm, Marz blogged, “A Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run ‘MapReduce jobs’, on Storm you run ‘topologies’. ‘Jobs’ and ‘topologies’ themselves are very different – one key difference is that a MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).”
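Although Storm’s code has yet to be released, a topology along the lines Marz describes would be wired up in a few lines of Java. The sketch below is illustrative only – it assumes the backtype.storm package and the TestWordSpout test source from Storm’s early documentation – and simply appends exclamation marks to an endless stream of words, running until the topology is killed.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ExclamationTopology {

    // A trivial bolt: appends "!!!" to each incoming word and re-emits it.
    public static class ExclamationBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            collector.emit(new Values(tuple.getString(0) + "!!!"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // The spout emits an endless stream of words; the bolt subscribes to it.
        builder.setSpout("words", new TestWordSpout(), 2);
        builder.setBolt("exclaim", new ExclamationBolt(), 3).shuffleGrouping("words");

        Config conf = new Config();
        conf.setDebug(true);

        // Unlike a MapReduce job, the topology keeps processing until it is killed.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("exclamation-demo", conf, builder.createTopology());
        Thread.sleep(10000);
        cluster.shutdown();
    }
}
```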

Marz outlines three uses for Storm in his blog but adds that these are just examples and that many more will be revealed at the launch during the Strange Loop developer and software architect conference in September.

Storm is fault-tolerant and scalable, he wrote, and can be used to process a stream of new data and update databases in realtime. It can also run a continuous query and stream the results to clients as they are computed. The third use case he mentioned is distributed Remote Procedure Calls (RPC): Storm can parallelise an intense query on the fly. Here the Storm topology acts as a distributed function that waits for invocation messages; when one arrives, it computes the query and sends back the results. Examples of distributed RPC include parallelising search queries or performing set operations on large numbers of large data sets.
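As a rough sketch of the distributed RPC idea, the hypothetical example below assumes the LinearDRPCTopologyBuilder and LocalDRPC classes from Storm’s backtype.storm packages; the “exclamation” function and the bolt are made up for illustration. The DRPC machinery tags each request with an id, fans the argument out across the topology, and returns whatever is emitted back under that id.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class DrpcDemo {

    // In a DRPC topology the first tuple field is the request id and the
    // second is the caller's argument; results are emitted keyed by that id.
    public static class ExclaimBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            Object requestId = tuple.getValue(0);
            String arg = tuple.getString(1);
            collector.emit(new Values(requestId, arg + "!"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "result"));
        }
    }

    public static void main(String[] args) throws Exception {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
        builder.addBolt(new ExclaimBolt(), 3);

        LocalDRPC drpc = new LocalDRPC();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("drpc-demo", new Config(), builder.createLocalTopology(drpc));

        // The call blocks while the query fans out across the topology, then returns.
        System.out.println("hello -> " + drpc.execute("exclamation", "hello"));

        cluster.shutdown();
        drpc.shutdown();
    }
}
```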

An example given of the streaming capability is directing a search at trending topics on Twitter through a browser. The browser would have a realtime view of the trending topics as they emerge. This could easily be adapted to search through terabytes of data for mentions of a specific product, alerting the browser user as each mention occurs.
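A hypothetical bolt along the following lines could sit in such a topology and forward only the tuples that mention a given product. The class name and the “tweet” field are assumptions for illustration, not part of Storm itself.

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical filter bolt: passes along only tuples whose "tweet" field
// mentions the product name it was constructed with.
public class MentionFilterBolt extends BaseBasicBolt {
    private final String keyword;

    public MentionFilterBolt(String keyword) {
        this.keyword = keyword.toLowerCase();
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String tweet = tuple.getStringByField("tweet");
        if (tweet.toLowerCase().contains(keyword)) {
            collector.emit(new Values(tweet)); // forward matching mentions downstream
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("mention"));
    }
}
```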

Marz’s claim that Storm is easy to use could attract those who find Hadoop’s environment rather difficult to fathom and need a simpler, realtime offering.