the leader in Unified Analytics and founded by the original creators of
Apache Spark™, today announced a new open source project called Delta
Lake to deliver reliability to data lakes. Delta Lake is the first
production-ready open source technology to provide data lake reliability
for both batch and streaming data. This new open source project will
enable organizations to transform their existing messy data lakes into
clean Delta Lakes with high quality data, thereby accelerating their
data and machine learning initiatives.
While attractive as an initial sink for data, data lakes suffer from
data reliability challenges. Unreliable data in data lakes prevents
organizations from deriving business insights quickly and significantly
slows down strategic machine learning initiatives. Data reliability
challenges derive from failed writes, schema mismatches and data
inconsistencies when mixing batch and streaming data, and from supporting
multiple writers and readers simultaneously.
“Today, nearly every company has a data lake they are trying to gain
insights from, but data lakes have proven to lack data reliability.
Delta Lake has eliminated these challenges for hundreds of enterprises.
By making Delta Lake open source, developers will be able to easily
build reliable data lakes and turn them into ‘Delta Lakes’,” said Ali
Ghodsi, cofounder and CEO at Databricks.
Delta Lake delivers reliability by managing transactions across
streaming and batch data and across multiple simultaneous readers and
writers. Delta Lakes can be easily plugged into any Apache Spark job as
a data source, enabling organizations to gain data reliability with
minimal change to their data architectures. With Delta
Lake, organizations no longer need to spend resources building
complex and fragile data pipelines to move data across systems. Instead,
developers can have hundreds of applications reliably upload and query
data at scale.
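As an illustration of plugging Delta Lake into a Spark job as a data source, the following is a minimal PySpark sketch rather than an official example. It assumes `spark` is an existing SparkSession whose classpath already includes the Delta Lake package; the helper name `append_and_read` and the `path` argument are hypothetical.

```python
def append_and_read(spark, df, path):
    """Append a DataFrame to a Delta table, then read the table back.

    Assumptions: `spark` is a SparkSession with the Delta Lake package
    available, `df` is a Spark DataFrame, and `path` is a hypothetical
    storage location chosen by the caller.
    """
    # Same DataFrame writer API used for Parquet; only the format name changes.
    df.write.format("delta").mode("append").save(path)

    # The table is read back as an ordinary DataFrame.
    return spark.read.format("delta").load(path)
```

Because only the format name differs from a Parquet job, an existing pipeline would in principle need few changes to adopt this pattern.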
With Delta Lake, developers will be able to undertake local development
and debugging on their laptops to quickly develop data pipelines. They
will be able to access earlier versions of their data for audits,
rollbacks or reproducing machine learning experiments. They will also be
able to convert their existing Parquet files (Parquet is a commonly used
data format for storing large datasets) to Delta Lakes in place, thus
avoiding the need for substantial reading and rewriting.
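Access to an earlier version of a table might look like the sketch below. It relies on the `versionAsOf` reader option of the Delta Lake data source and assumes a SparkSession with the Delta Lake package available; the helper name `read_version` and its arguments are hypothetical.

```python
def read_version(spark, path, version):
    """Load the Delta table at `path` as it existed at commit `version`.

    Assumptions: `spark` is a SparkSession with the Delta Lake package
    available; `path` and `version` are supplied by the caller.
    """
    return (
        spark.read.format("delta")
        # Select a specific table version instead of the latest snapshot.
        .option("versionAsOf", version)
        .load(path)
    )
```

For an audit or a reproduced experiment, a caller could load version 0 of a table this way and compare it with the current data.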
The Delta Lake project can be found at delta.io and is released under
the permissive Apache 2.0 license. This technology is
deployed in production by organizations such as Viacom, Edmunds, Riot
Games and McGraw Hill.
“We’ve believed right from the onset that innovation happens in
collaboration – not isolation. This belief led to the creation of the
Spark project and MLflow. Delta Lake will foster a thriving community of
developers collaborating to improve data lake reliability and accelerate
machine learning initiatives,” added Ghodsi.
For more information on Delta Lake, follow @DeltaLakeOSS
Databricks' mission is to accelerate innovation for its customers by unifying Data
Science, Engineering and Business. Founded by the original creators of
Apache Spark, Databricks provides a Unified Analytics Platform for data
science teams to collaborate with data engineering and lines of business
to build data products. Users achieve faster time-to-value with
Databricks by creating analytic workflows that go from ETL and
interactive exploration to production. The company also makes it easier
for its users to focus on their data by providing a fully managed,
scalable, and secure cloud infrastructure that reduces operational
complexity and total cost of ownership. Databricks has secured
investments from Andreessen Horowitz, Coatue Management, Microsoft, New
Enterprise Associates (NEA), Battery Ventures, Green Bay Ventures, and
Geodesic, among others, and has a global customer base that includes
Viacom, Shell and HP.
Apache, Apache Spark and Spark are trademarks of the Apache Software