A massive amount of data (an estimated 2.5+ quintillion bytes per day) is produced by sources such as social media platforms, websites, IoT devices, corporate databases, and many more. Millions of users are connected around the clock, sharing information and uploading images and videos to social media platforms and other websites. So the question arises: how can this huge amount of data be managed and leveraged for business decisions? That’s where Databricks comes into the picture.

Databricks is a unified data analytics platform for data engineering, data science, machine learning, and analytics. It allows business analysts and data engineers to build models and ETL pipelines and to deploy business process workflows on a single platform. Apache Spark, which is widely used in the industry for big data projects, is the core of Databricks. Databricks is available as a service on Microsoft Azure, Amazon Web Services, and Google Cloud Platform.

 

Big Data

In the modern world, data is much larger and more complex than it used to be. Companies rely on data to identify challenges, capitalize on opportunities, and make timely decisions. Traditional databases struggle to manage data at this scale and to solve the business problems that depend on it. The problem of big data is commonly described with five Vs:

Volume: Large amounts of data are being created.

Velocity: The speed at which data is being created.

Variety: Many types of data (structured, semi-structured, unstructured).

Veracity: Inconsistencies and uncertainty in the data.

Value: Data should be transformed into a comprehensible format.

 

Apache Spark

Apache Spark is an open-source, distributed data processing engine for big data and machine learning, widely used in the industry to process petabytes of data. Spark runs on a distributed computing platform: data is divided into partitions, and each partition is processed by a node of the cluster. Databricks provides the management tools needed to work with Apache Spark, and it supports SQL queries, streaming data, machine learning, graph processing, and more. These features help developers work with high volumes and different types of data.
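The partition idea can be illustrated without a cluster. The sketch below is a toy in plain Python, not Spark code: worker threads stand in for cluster nodes, and the function names (split_into_partitions, process_partition, distributed_sum_of_squares) are made up for this example. It only shows the split, process in parallel, then combine pattern described above.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def process_partition(partition):
    """Work done by one 'node': a partial sum of squares for its partition."""
    return sum(x * x for x in partition)


def distributed_sum_of_squares(records, num_partitions=4):
    """Split the data, process each partition in parallel, combine the results."""
    partitions = split_into_partitions(records, num_partitions)
    # Each worker thread stands in for a node of the cluster (an assumption
    # of this toy example; a real engine schedules work across machines).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(process_partition, partitions))
    # The coordinating process combines the per-partition results.
    return sum(partial_results)


if __name__ == "__main__":
    print(distributed_sum_of_squares(list(range(1, 11))))
```

Because each partition is processed independently, adding more workers lets the same job handle more data, which is the property that makes this model suitable for very large datasets.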

 

 

Data Warehouse, Data Lake and Lakehouse

A data warehouse is a relational database that stores historical data for business purposes. Companies started using data warehouses in the 1980s for data storage and analysis, and for many years they helped companies identify challenges and solve business problems. But as the variety and volume of data increased, data warehouses could no longer keep up. In addition, they offer no support for unstructured data or ML/AI workloads. To solve these problems, data lakes were introduced around 2011 to support ML/AI workloads and a wider variety of data. A data lake is low cost, makes data easy to ingest and transform, and works with a wide range of data types and formats. These features helped businesses make decisions using modern data. But data engineers and analysts pointed out a few drawbacks that called the scope of the data lake into question: it is not ACID compliant, failed jobs can leave partial files behind, data cannot be rolled back, and it is very slow at generating interactive BI reports. Considering these issues, the data lakehouse was introduced as a combination of the data warehouse and the data lake.

The data lakehouse is a modern data management architecture that is cost efficient and flexible, separates compute from storage, provides high data throughput, enables the implementation of data structures, and offers virtually unlimited storage. Delta Lake is an open-source storage layer built on top of a data lake and designed to work with Apache Spark to build robust data lakes. A data lake built with Delta Lake is part of the data lakehouse architecture. Delta Lake provides features such as ACID transactions that guarantee consistency, unified streaming and batch processing, time travel, and more. Time travel is one of the main features of Delta Lake: it lets developers roll back earlier transactions or updates using the versions recorded in the Delta log. Databricks provides management tools to create the lakehouse using Delta Lake.
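To make the idea of time travel concrete, here is a minimal sketch in plain Python. It is a toy model and not the Delta Lake API: the VersionedTable class and its methods are hypothetical names invented for this example. It assumes only what the paragraph above describes, namely that every commit is recorded as a new version and that earlier versions remain readable and restorable.

```python
class VersionedTable:
    """Toy append-only commit log (hypothetical; not the Delta Lake API)."""

    def __init__(self):
        # Version 0 is the empty table; each commit appends a new snapshot.
        self._log = [[]]

    @property
    def latest_version(self):
        return len(self._log) - 1

    def commit(self, rows):
        """Add rows as a new version, leaving all earlier versions intact."""
        self._log.append(self._log[-1] + list(rows))
        return self.latest_version

    def read(self, version=None):
        """Read the latest version, or an earlier one ('time travel')."""
        if version is None:
            version = self.latest_version
        if not 0 <= version <= self.latest_version:
            raise ValueError(f"unknown version: {version}")
        return list(self._log[version])

    def restore(self, version):
        """Roll back by committing an old snapshot as the newest version."""
        self._log.append(self.read(version))
        return self.latest_version


if __name__ == "__main__":
    table = VersionedTable()
    good = table.commit([{"id": 1, "status": "ok"}])
    table.commit([{"id": 2, "status": "bad load"}])
    table.restore(good)   # undo the bad load without deleting history
    print(table.read())   # only the row from the first commit remains
```

Note that the rollback is itself recorded as a new version rather than erasing history, so the bad load can still be inspected afterwards. In Delta Lake itself, earlier versions are typically read with the versionAsOf reader option or the SQL clause VERSION AS OF.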

Logan Data is your Databricks business partner and can guide your team’s Lakehouse strategy, return-on-investment assessment, and implementation, and provide ongoing support!