Get Started with Databricks for Data Engineering

April 24, 2026

Learning Objectives

List the four key product lines from Databricks
Describe how each product line serves its respective audience
Describe the primary responsibility and core skill of a Data Engineer
Explain the stages of a traditional Data Engineering architecture, from data ingestion to processing, storage and consumption
Define the core concepts of the Databricks Lakeflow offering, and describe how it benefits Data Engineering practitioners
Describe the Medallion Architecture for data transformation in Databricks
Describe the benefits of Delta Lake and Delta Tables in the Databricks Data Intelligence Platform
List the available data ingestion techniques in the Databricks Data Intelligence Platform
Describe how Lakeflow Spark Declarative Pipelines and Lakeflow Jobs facilitate unified orchestration in Databricks

Data Engineering

Data Engineers are responsible for extracting data from disparate sources into a centralised storage location, as well as incrementally and progressively improving the quality of the data, as it moves through each layer of the data platform.

Their responsibilities often include:

Designing, building and maintaining data pipelines, which manage data flow from source to storage to consumer tools
Transforming raw data to clean reliable data
Ensuring the quality and integrity of data

Data Engineers are faced with a number of challenges, such as:

Complex data ingestion methods
Supporting key engineering principles, such as:
- Agile
- CI/CD
- Version Control
- Isolating the development, test and production environments
Managing third-party orchestration tools

Databricks

Databricks is a centralised, managed platform for data. In a typical Databricks architecture, data is ingested from a variety of sources to a storage system. It is then processed, cleaned and made-ready for consumers. The process is orchestrated and includes governance and security.

There are four key product lines in the Databricks offering:

1. Lakeflow

Lakeflow handles the Extract, Transform, Load process (ETL) and is likely the most important product for Data Engineers within the Databricks platform. It consists of three components:

Lakeflow Connect

Connect provisions connectors for ingesting data from a variety of data sources, such as applications, databases and cloud. There are three core features:

Manual File Upload

This feature allows for the upload of a file from a users local file system, directly to Databricks. Databricks supports all file types, structured, semi-structured and unstructured, when uploaded to a Volume, which represents a logical space for storage in a cloud object storage location. Volumes are intended for path based data access to the file only. For tabular data, a data table is preferred. The following file types can be manually uploaded to a data table:

CSV
TSV
JSON
Avro
Parquet
TXT

Standard connectors

This feature is used to ingest data from remote sources, such as object storage or message queues, among others.

Managed connectors

This feature is used to ingest data from enterprise sources such as Software as a Service (SaaS) or Databases, among others. Managed connectors leverage incremental read/writes and are fast, scalable and cost-effective.

Lakeflow Spark Declarative Pipelines

Once the data in ingested to the Databricks platform, Spark Declarative Pipelines can be used to propagate the data through the platform, from raw to consumer-ready. Traditioanl data platforms require imperative code, which explicitly states how to do things, step-by-step:

Read data
Filter rows
Join tables
Write output

With declarative pipelines, Engineers can use SQL or Python to write declarative code, for batch or streaming pipelines, and the Spark Declarative Pipelines handles:

Execution Plan
Error Handling
Dependency Management
Intelligent optimisation
- As you data grows, automated scaling and recovery from failures improve reliability and reduced maintenance
Unified batch and streaming
- Adapts to workload type, from near realt-time to batch, optimising fro performance and cost efficiency.

Lakeflow Jobs

Jobs is the orchestrator for not just for Lakeflow Connect, but for the entire Databricks platform, including:

Notebooks
Jobs for SQL
Machine Learning models

and more. It allows automated workflows to be authored with real-time monitoring and supports a number of different smart triggers, such as:

Event based
Continuous execution
Complex Directed Acyclic Graph’s (DAG)

Jobs provides real-time monitoring with drill-down dashboards, allowing users to easily:

Visulise critical path for troubleshooting and optimisation
Live query profile for performance insights with query history
Searchable query profile

Jobs can be administered with an easy to use point-and-click interface, meaning no special skills are required.

2. Databricks SQL

Databricks SQL is the Data Warehousing component, providing foundational warehouse functionality, coupled with governance and administrations tooling. It is an important component for both Data Engineers and Analysts alike. Databricks uses Deltalake, an open source protocol for reading and writing files to cloud storage. It turns a Data Lake into a Lakehouse, combining traits from a standard Data Lake (flexibility and scalability) with a Data Warehouse (reliability and performance). Deltalake sits on-top of any data lake stack, such as:

AWS
GCP
Azure
etc

The Medallion Architecture

Databricks supports the medallion architecture, a design pattern sometimes referred to as a “multi-hop architecture”. It consists of three layers:

Bronze

Storage layer for raw, append-only and minimally transformed data, this is where data from external sources lands when it’s first ingested. Data is mostly stored as it originally existed, with at most Personally Identifiable Information (PII) removed, and some metadata columns added, for example:

Load
Datetime
ProcessID
etc

Data in the bronze layer is often has a long retention period, often persisted for many years.

Silver

Storage layer for filtered, cleaned and enriched data. Data is taken from the bronze layer, any errors are fixed, data is joined, deduplicated, validated, with business rules applied and schemas are either enforced, or evolved as necessary. Some silver tables may be crated from other, intermediate silver tables, as well as from the bronze layer. Tables are often subject themed, for example:

Customers
Transactions
etc

Gold

Storage layer for business-level aggregated data. Data from the Silver layer is modeled (commonly into marts using star schemas) so as to be ready for consumption. Data may be delivered to downstream users and applications, or it may be replicated to external systems.

Delta Tables

Delta Tables are the default format for tables in Databricks. A Delta Tables is a directory within the Data Lake, containing:

Data in Parquet files
Transaction log (JSON + checkpoint Parquet files)

Data is stored in states, with new entries added for each transaction. These states are managed by the log, which ascribes a version to each changed state. Data can be queried in any state, using either a timestamp or the version number. This process is known as Time Travel.

In addition to Time Travel, Data Tables have a number of key features:

ACID Compliance

Atomicity
- Entire transaction completes
Consistency
- Data follows rules or will be rolled back
Isolation
- One transaction completes before the start of another
Durability
- Data is saved in a persistent state once completed

Data Manipulation Language (DML):

INSERT
- Add new rows
UPDATE
- Update existing values
DELETE
- Delete rows
MERGE
- Select records from a source tables and perform multiple DML operations on a target table.

Schema Evolution and Enforcement

Schema evolution allows controlled adjustments to table schema (such as adding columns and handling certain schema changes) as data evolves.
Schema enforcement ensures that any data written to a Delta Table matches the tables defined schema, preventing schema mismatches and rejecting data that doesn’t conform, thereby ensuring data integrity.

3. AI/BI

The Business Intelligence space for Databricks, this is important for Data Analysts and Business Users. It consists of two components:

Dashboards

Space for running queries and visualising data

Genie

Space for asking natural language queries, perfect for uses who are not comfortable running queries with SQL.

4. Mosaic AI

A suite of tools for Generative AI application development, it is important for Data Scientists, Machine Learning Engineers, and AI Engineers. Mosaic AI provides Data Intelligence, AI agents that reason on your data and solve domain specific problems (distinct from General Intelligence, which consists of large models trained on the entire web).