Big Data: Ecosystems and Infrastructure

As data proliferates, creating robust, integrated and secure infrastructures is essential for all businesses, no matter their size. Understanding how Big Data ecosystems evolve ensures the value of data is always revealed.

Big Data infrastructures are multi-faceted. Removing data silos and integrating these datasets offers a range of new possibilities that have, until now, been hidden from view.

The traditional approach to managing large datasets has been to create layers within the infrastructure. Today, as the hybrid cloud has taken shape and begun to dominate how businesses approach their data management challenges, Big Data ecosystems have evolved to offer new scalable systems that can expand as needed.

These new Big Data ecosystems are also making use of rapidly expanding technologies, most notably AI. Machine learning systems can unlock the value within Big Data through new analytical platforms. Ecosystems, though, will offer much more than just better data analysis. Because consumers use several parallel channels when making purchases, ecosystems with Big Data at their foundation must be flexible, integrated and secure at every touchpoint.

Says McKinsey: “Ecosystem relationships are making it possible to better meet rising customer expectations. The mobile Internet, the data-crunching power of advanced analytics, and the maturation of AI have led consumers to expect fully personalized solutions, delivered in milliseconds. Ecosystem orchestrators use data to connect the dots—by, for example, linking all possible producers with all possible customers, and, increasingly, by predicting the needs of customers before they are articulated.”

David Gonzalez, Group Head of Big Data and AI for Vodafone Business, told Silicon: “The explosion of Big Data in recent years has paved the way for the widespread use of data analytics across a range of industries. The real value will be drawn from the intelligence it offers, especially at scale. In terms of a supporting ecosystem, the main characteristics can be summarised by describing where data is stored, i.e. on-prem or in the cloud, and in secure data lakes which store all data sources and structures. The processes involved in delivering intelligent insights tend to be fully automated in order to win in speed and efficiency and are able to deliver AI at scale.”

The data landscape is now a complex ecosystem. Businesses that can develop infrastructures to manage these vast datasets will have a competitive advantage. These new data environments must be built with strategic planning. Supporting the customer experience, enhancing the abilities of workforces and evolving the IT spaces businesses now inhabit are all possible when Big Data is placed within a clearly defined development roadmap.

Big Data Infrastructures

The composition of any given data ecosystem has several key drivers. Says Susan Bowen, CEO of Aptum: “Budget constraints are always a challenge for any business. As a company obtains more data, they need to invest in more infrastructure to facilitate this expansion. But by outsourcing to hyper-scalers through SaaS solutions, scalability demands can be met and large costs reduced.”

Bowen continued: “Security is also at the forefront of CTOs’ and CIOs’ scaling strategies, as the overall number of devices and users who have access to ecosystems begins to rise. The main challenge lies in being able to audit, control and manage the different access points and endpoints. Hyper-scalers are extremely secure platforms, but the configuration of an effective security strategy rests on an enterprise’s ability to reflect upon its vulnerabilities.”

Also, Matt Yonkovit, Chief Experience Officer at Percona, explained to Silicon: “CIOs and CTOs want to avoid getting locked into a single provider. They don’t want to repeat the scenario of the ’90s and early ’00s, where Microsoft or Oracle owned enterprise IT. However, multi-cloud is still on the drawing board for most companies outside the largest enterprises today. That will develop as more people get skilled with more than one cloud platform. Multi-database environments are the norm, but getting them to “share” data between them is a challenge.”

The digital transformation of businesses has continued. Coupled with the adoption of new tools is the desire to build agile enterprises. This agility is today based on a company’s ability to manage the data lakes it has created. Not merely an IT exercise, managing and then extracting meaningful, actionable information is a core driver of the levels of agility businesses are striving to achieve.

McKinsey makes the point that a flexible approach to Big Data is vital to ensure its true value can be observed: “As data lakes move from being pilot projects to core elements of the data architecture, business and technology leaders will need to reconsider their governance strategies. Specifically, they must learn to balance the rigidity of traditional data oversight against the need for flexibility as data are rapidly collected and used in a digital world.”

Big Data ecosystems will also embrace new technologies, including 5G and edge computing. The days of vast centralized data stores are over. Decentralized pockets of information at the edge of a network, itself connected via high-speed 5G, will create an ecosystem for Big Data to thrive within. Analytics and applications will also live in this data ecosystem. Platforms such as Spark and MapR are expanding to offer insights into the vast datasets businesses are collecting.
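
To make the analytics layer concrete, here is a minimal, illustrative PySpark sketch of the kind of aggregation such platforms run over event data collected at the edge; the bucket path and column names are hypothetical:

    # Minimal PySpark sketch: roll up hypothetical edge events per device per hour.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("edge-event-rollup").getOrCreate()

    # Assume edge nodes land newline-delimited JSON events in this (hypothetical) bucket.
    events = spark.read.json("s3://example-bucket/edge-events/")

    # Count events per device per hour.
    rollup = (events
              .withColumn("hour", F.date_trunc("hour", F.col("event_time")))
              .groupBy("device_id", "hour")
              .count())

    rollup.show()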

Vodafone’s David Gonzalez offered this advice to CIOs and CTOs: “There are multiple challenges facing CIOs and CTOs, most of which are related to managing expectations around timing. Sometimes projects can take several months and, in some cases, even a year to complete. It’s essential that these timeframes are communicated and aligned with the business, prioritizing high-value use cases.

“Key processes, such as data ingestion from multiple sources, must be fully automated and scalable. Additionally, defining a successful operating model, with new positions like cloud DevOps and support or data engineers, is critical. Ensuring that key principles like data governance and data quality underpin all processes is essential, and all data must be clearly documented.”
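
As a rough illustration of Gonzalez’s point about automated ingestion underpinned by data quality, the sketch below validates each record against a schema at the point of ingestion; the schema and field names are invented for the example:

    # Illustrative sketch: validate records on ingestion so bad data is
    # quarantined rather than silently propagated. Schema is hypothetical.
    import json
    from jsonschema import validate, ValidationError

    ORDER_SCHEMA = {
        "type": "object",
        "required": ["order_id", "amount", "currency"],
        "properties": {
            "order_id": {"type": "string"},
            "amount": {"type": "number"},
            "currency": {"type": "string"},
        },
    }

    def ingest(raw_record: str):
        """Return the parsed record if it passes validation, else None."""
        record = json.loads(raw_record)
        try:
            validate(instance=record, schema=ORDER_SCHEMA)
        except ValidationError as err:
            # In a real pipeline this would be routed to a dead-letter
            # store for the data owner to inspect.
            print(f"rejected: {err.message}")
            return None
        return record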

Creating new data infrastructures that shape the Big Data ecosystem means understanding multiple, parallel information streams, all of which have a symbiotic relationship with each other. Big Data demands deliberate approaches to the infrastructures that support what are increasingly massive datasets.

The resulting Big Data ecosystems are far-reaching, diverse and contextual. Businesses that create robust, secure and flexible environments for their data will be able to construct new and highly lucrative relationships with their customers and business partners, and empower their workforces to innovate continually.

Silicon in Focus

Ben Stopford, Lead Technologist, Office of the CTO at Confluent.

Ben Stopford is Lead Technologist, Office of the CTO at Confluent, working on and around Apache Kafka. He has more than two decades of experience in the field and is the author of the book “Designing Event-Driven Systems.”

What are the key characteristics of today’s Big Data ecosystems, and how are businesses building their data infrastructures to support their continued use of Big Data?

The Big Data world has undoubtedly changed over the past decade. Early projects typically involved collecting large datasets, then searching for some pot of gold at the end of the proverbial rainbow. Today’s projects are more discerning, often trading ‘collecting everything’ for a focus on mission-criticality. This often leads to comparatively smaller datasets being analyzed, but with outputs that are more closely aligned to business value.

Beyond this, we see several trends emerging:

Real-time datasets:

The traditional approach of extracting batches of data from relational databases produces static snapshots typically on a daily cadence. While these snapshots are useful for end-of-day reporting and financial reports, they are not well suited to the vast majority of real-time use cases that drive most businesses today.

A second advantage is that real-time event streams provide an accurate time series of what occurred in the real world rather than a “daily summary” of what the source system considered to be relevant. By transmitting events as they occur, more information is available, allowing analytics systems the flexibility to manipulate this data in new ways.
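
A minimal sketch of the idea, using the Python client for Apache Kafka, publishing each business event as it occurs rather than waiting for a nightly batch; the broker address, topic and event fields are hypothetical:

    # Illustrative sketch: emit events as they happen, preserving the
    # real-world time series instead of a daily snapshot.
    import json
    import time
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "localhost:9092"})  # hypothetical broker

    def emit(event: dict) -> None:
        # Each event carries its own timestamp, so downstream analytics
        # can reconstruct exactly what happened and when.
        producer.produce("orders", value=json.dumps(event).encode("utf-8"))

    emit({"order_id": "o-123", "amount": 42.5, "ts": time.time()})
    producer.flush()  # Block until queued messages are delivered.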

Architectures that promote data quality:

Early big data projects were often plagued with data quality issues. If these aren’t fixed at the source, the result is a cycle of inadequacy where data is ‘fixed’ downstream, coupling the downstream systems to the idiosyncrasies of the infected datasets. Companies then lose trust in their data, leading to brittle data systems that can’t keep up with the pace of the business.

Three important steps have helped companies alleviate this pain: fixing data at the source; adopting an organizational policy where data owners behave responsibly; and, finally, implementing hub-and-spoke event delivery through an Event Streaming Platform that avoids dataflows chained from system to system.
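
To illustrate the hub-and-spoke pattern Stopford describes, here is a sketch of a downstream ‘spoke’ consuming directly from a central topic with its own consumer group, rather than receiving data relayed through another system; the broker, group and topic names are hypothetical:

    # Illustrative sketch: each downstream team subscribes to the hub topic
    # directly, so no system-to-system chaining distorts the data.
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # hypothetical broker
        "group.id": "reporting-team",           # each spoke uses its own group
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["orders"])

    def handle(payload: bytes) -> None:
        print(payload)  # stand-in for the team's own processing logic

    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        handle(msg.value())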

Fostering both ‘raw’ and ‘curated’ datasets:

In a company, there are many different users of data, and their needs vary. Some want raw datasets and are willing to perform the data wrangling necessary to get them into shape. Others want curated datasets, so they don’t have to put in the work to understand the many different event sources that contribute to them. We see organizations starting to combine both approaches in a single Event Streaming Platform that caters to both groups.
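
One hedged sketch of how the two can coexist on a single platform: a small curation job reads a raw topic, applies cleaning rules, and republishes to a curated topic. The topic names and rules are invented for the example:

    # Illustrative sketch: derive a curated stream from a raw one so both
    # groups of users are served from the same platform.
    import json
    from confluent_kafka import Consumer, Producer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "curation-job",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["orders.raw"])
    producer = Producer({"bootstrap.servers": "localhost:9092"})

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Hypothetical curation rules: keep well-formed paid orders only,
        # and normalize the currency code.
        if event.get("status") == "PAID" and "currency" in event:
            event["currency"] = event["currency"].upper()
            producer.produce("orders.curated", json.dumps(event).encode("utf-8"))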

A preference for line-of-business “views” over large shared clusters:

The popularity of one-cluster-to-rule-them-all has waned. Today, big data is increasingly an application-level concern, where individual teams incorporate data analysis into their applications rather than relying on a central data team.

Cloud-based services accentuate this trend by making previously expensive data wrangling tools more accessible to application teams. Event Streaming Platforms then provision data to these different tools, letting these teams make use of real-time data exactly where they need it.

Are businesses continuing to use Hadoop, coupled with services like AWS, to build the data ecosystems they need to make the best use of the data they are collecting?

There is a marked trend away from Hadoop, particularly as companies move to the cloud. For early, on-premise use cases in the 2000s, Hadoop made sense. The barrier to entry was high – too high for many individual application teams to cope with. This led to centralized Hadoop infrastructure that followed the pattern introduced by the corporate data warehouses of old. But with the onset of cloud-based services, this has changed.

Project teams no longer need to face the eye-watering upfront cost of these multi-layer data processing systems. Instead, if they can get access to real-time event streams that decentralize organizational data, cloud-native services provide a far more accessible path. This opens up big data not only to these large centralized projects but also to individual application teams. The downfall of MapR and troubles at Cloudera/Hortonworks are good evidence of this trend.

What are the current challenges CIOs and CTOs face when building Big Data ecosystems and infrastructure?

Two of the biggest challenges we see are getting access to high-quality data sources and securing data effectively. Both are hard problems for many businesses because they become exponentially more difficult as the complexity of the software estate grows. The more moving parts there are, the more likely data quality issues are to compound as data flows through the architecture, and the more interfaces there are that must be secured effectively.

Many CIOs are turning to event streaming as a solution to both of these issues because it simplifies the architecture from a pipeline, which can behave a bit like a game of Chinese Whispers, to a hub-and-spoke approach that is easier to secure and keeps endpoints closer to the various data sources.
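
As a sketch of why the hub is easier to secure, a single ACL can scope a team’s access to one topic. This assumes confluent-kafka 1.9+ against a broker with an authorizer enabled, and the principal and topic names are hypothetical:

    # Illustrative sketch: grant the reporting team read-only access to the
    # hub topic - one interface to audit instead of a chain of pipelines.
    from confluent_kafka.admin import (AdminClient, AclBinding, AclOperation,
                                       AclPermissionType, ResourcePatternType,
                                       ResourceType)

    admin = AdminClient({"bootstrap.servers": "localhost:9092"})

    binding = AclBinding(ResourceType.TOPIC, "orders",
                         ResourcePatternType.LITERAL,
                         "User:reporting", "*",
                         AclOperation.READ, AclPermissionType.ALLOW)

    for acl, future in admin.create_acls([binding]).items():
        future.result()  # Raises if the ACL could not be created.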

Are there any trends you can identify driving the creation of today’s Big Data ecosystems?

There is an interesting trend we observe as companies transition from on-premise offerings to the cloud. In the first phase of this transition, open-source infrastructure was lifted and shifted to the cloud, where it was either self-hosted by companies or provided as-a-service by cloud service providers.

We’re now seeing the second phase of this transition, where the interface to these open-source technologies remains, but the software that backs them is being augmented, rewritten and replaced. Best-in-breed managed services have fully serverless backends that are more functional and use economies of scale to significantly outperform, or undercut on price, their open-source ancestors.

For end-users, this has two implications. Firstly, it drives down costs but, more importantly, it makes these technologies accessible to application teams with smaller use cases and smaller budgets while giving them the potential to grow, should their use cases become more data-intensive in the future.