Using The Public Web For Big Data

Large-scale data as a service will let businesses combine the public web with private data, says Mark Little of Red Hat

Data, both public and private, is growing at an unfathomably rapid rate. In the time it takes you to read this sentence, the volume of data in existence worldwide will have grown by more than the volume of data contained in all of the books ever written. For instance, on YouTube alone, 35 hours of video are uploaded every minute.

Unstructured data is growing at a rate of 80 percent year-on-year. More of us live our lives online, and companies large and small capture massive volumes of customer information. As the regulatory framework around data retention tightens, data storage has evolved from a routine IT process to a major business issue.

In some industries, such as meteorology, seismology, power distribution and financial services, gathering and curating huge volumes of data are integral features of the business. These businesses have grown into Big Data and they have pockets deep enough to cover the cost of the necessary IT infrastructure. But businesses across all sectors are now engaging with their customers across multiple channels: on the web, across mobile devices, through social media and face to face. They are increasingly looking to unify all these different sources of data to analyse the ‘customer journey’ and target individuals proactively with tailored messaging.

Big Data is too big to get hold of

The benefits of capturing and securely analysing data should be obvious to all, yet according to a recent IDC report less than one percent of data is analysed and more than 80 percent of it remains unprotected. Why then is so little analysed and why does so much go unprotected? Put simply, Big Data is too big for all but the largest companies. Traditional storage models are not scalable enough, while the processing power required to quickly make sense of an almost limitless long tail of potentially useful information is beyond all but those with the deepest pockets.

Much has been written on the subject of Big Data. In fact, if you search for the term using Google, it returns nearly 19 million results. Assuming it takes your browser about five seconds to open each link, that’s over 26,000 man hours just to glimpse each page, never mind take in, analyse and act upon the contents in an intelligent way that benefits your business. This should put into some sort of context the scale of the task of applying Big Data logic to both public and private sources, which is the idea behind a European Union initiative to create a Large-scale Elastic Architecture for Data-as-a-Service (LEADS) that businesses can use to mine and analyse data published on the entire public web.
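The man-hours figure above is easy to check. The short Python sketch below simply reproduces the arithmetic from the two numbers quoted in the text (roughly 19 million results, about five seconds per link); the function and constant names are illustrative only.

```python
# Back-of-envelope check of the figure quoted in the article.
SEARCH_RESULTS = 19_000_000   # "nearly 19 million results"
SECONDS_PER_LINK = 5          # "about five seconds to open each link"


def hours_to_open_all(results: int, seconds_each: float) -> float:
    """Return the total time, in hours, needed to open every result once."""
    return results * seconds_each / 3600


total_hours = hours_to_open_all(SEARCH_RESULTS, SECONDS_PER_LINK)
total_years = total_hours / 24 / 365  # non-stop, around the clock

print(f"{total_hours:,.0f} hours, or about {total_years:.1f} years without a break")
```

On these assumptions the total comes to a little over 26,000 hours, which is around three years of opening links around the clock.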

The objective of LEADS is to build a decentralised DaaS framework that runs on an elastic collection of micro-clouds. LEADS will provide a means to gather, store, and query publicly available data, as well as process this data in real time. In addition, the public data can be enriched with private data maintained on behalf of a client, and the processing of the real-time data can be augmented with historical versions of the public and private data.
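To make the idea of enriching public data with private data concrete, here is a minimal Python sketch. It is not the LEADS API: the function name, the record fields and the sample values are all hypothetical, and it assumes the public and private records share a common key (here a product name).

```python
from typing import Dict, List


def enrich_public_with_private(
    public_records: List[dict],
    private_by_key: Dict[str, dict],
    key: str = "product",
) -> List[dict]:
    """Merge each public record with the client's private record for the same key.

    Public records with no private counterpart are passed through unchanged.
    """
    enriched = []
    for record in public_records:
        private = private_by_key.get(record[key], {})
        enriched.append({**record, **private})
    return enriched


# Hypothetical data: mentions gathered from the public web...
public_mentions = [
    {"product": "running-shoe", "web_mentions": 1200},
    {"product": "football-boot", "web_mentions": 450},
]
# ...and sales figures that only the client holds.
private_sales = {"running-shoe": {"units_sold": 300}}

combined = enrich_public_with_private(public_mentions, private_sales)
for row in combined:
    print(row)
```

The point of the sketch is the shape of the operation: public data is collected once and shared, while each client's private data stays separate and is joined in only for that client's queries.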

Data as a service solves Big Data

Cloud-based or DaaS models look most likely to provide the answer to our Big Data challenges. Rather than maintaining data in-house in dedicated data centres, it is both economically and ecologically smarter to store it in a shared, open source infrastructure.

The LEADS consortium is composed of universities and research centres (UniNE, TUD, and TSI), whose members have proved able to push new ideas rapidly and effectively to the scientific community, and large companies (Red Hat, Yahoo! and Adidas), which have the power and legitimacy to propose new technologies and methods for their own use, for their clients and as new standards of operation. Most importantly, Yahoo! and Red Hat are leaders in the development and distribution of open-source solutions, and the expected impact of providing the LEADS infrastructure to the community is enormous.

Clearly, the financial investment needed for crawling, storing, and processing even a small portion of the internet is very high, making such a task prohibitively expensive for small and medium-sized companies and start-ups. Currently, only the biggest IT players have access to the infrastructure for storing huge amounts of data and the computing facilities needed to process it. Small and medium-sized companies often have no choice but to rely on larger companies with dedicated data centres to provide them with the data and processing resources.

The monetary cost of the infrastructure is among the critical factors determining how to store Big Data. As noted, this problem is especially acute for small and medium-sized companies that have limited resources. Therefore, any new solution should offer pricing competitive with, or lower than, conventional data centres to be attractive.

Break down the data warehouses

In unpredictable markets, data warehousing will no longer suffice. Here, organisations process large volumes of data into certain forms so that predefined types of analyses can be performed. This is too restrictive: who can foresee what analysis might be needed next month, let alone next year? Businesses must have the freedom to keep their options open. This is where the model of Big Data comes into its own. It allows unstructured data such as log files, virtual machines, email, audio, video and documents to be interrogated in new ways to extract fresh insight, thereby maximising its value to the business over the longer term.

Using data this way helps companies to make better informed forecasts and decisions. Companies taking advantage of Big Data are equipped to understand their business at a deeper level. They can respond nimbly to changing conditions, take innovative products and services to market more quickly and steal a march on the competition.

The LEADS platform will be designed from the ground up to account for privacy, security, energy-efficiency, availability, elastic scalability, and performance considerations. The project will be validated on use-cases involving the crawling of web data and its exploitation in different application domains.

LEADS builds on the open source heritage in the Big Data field. In selecting technology partners the European Union needed to avoid proprietary providers from the outset. Choosing between products that purportedly offer a Big Data solution can be a large enough hurdle to scare even the most robust IT managers. All of the main proprietary storage vendors have a solution in the Big Data space, typically a package comprising their own hardware with preconfigured software. However, open source software offers an alternative way of creating a cost-effective storage solution. Software built using standard software components based on standard protocols is layered on top of commodity hardware. This opens up a vendor-neutral route, whereby user organisations have a wide choice of affordable hardware and portable open source software available to them.

Using open source for LEADS kept the project from becoming locked into a particular hardware vendor and avoided the high software licensing costs associated with proprietary operating systems, middleware and applications.

Infinite pools of data

The LEADS system uses a scale-out solution for unstructured data that can grow as required, creating an infinitely large pool of data. It’s a solution that spans seamlessly into the cloud and can be deployed in places including Amazon’s public cloud services, without any need to rewrite code. By accessing cloud-based storage resources, capacity can then be turned on or off according to demand. This is particularly useful where customer demand is challenging to predict.

Of course, predicting demand, particularly in recent years, is a major headache. Make no mistake, data growth is becoming the biggest challenge for enterprises that manage their own data centre hardware infrastructure. The economic downturn forced many IT managers to defer infrastructure and technology upgrade cycles. LEADS offers an economical approach to processing large amounts of data by sharing the collection, storage and querying of public and private data. Combining privately held data with publicly available ‘free’ data is a logical next step for Big Data. Google managed to evolve the simple action of searching publicly available information into a $50 billion business. Just imagine what LEADS could do for your business.

Mark Little is vice president of middleware engineering, and Tom Llewellyn is business development manager for storage at Red Hat.