Databricks

References

Reference - SQL Language reference (opens in a new tab)

Reference - SQL Data types (opens in a new tab)

Reference - Databricks REST API (opens in a new tab)

Databricks Runtime release notes versions and compatibility (opens in a new tab)

Databricks Architecture (opens in a new tab)

Lakehouse Medallion Architecture (opens in a new tab)

Databricks Medallion Architecture

Bronze

  • Ingests raw data in its original format
  • No data cleanup or validation
  • Is appended incrementally and grows over time.
  • Serves as the single source of truth, preserving the data's fidelity.
  • Enables reprocessing and auditing by retaining all historical data.
  • Can be any combination of streaming and batch transactions from sources
  • Metadata columns might be added

Silver

  • Represents validated, cleaned, and enriched versions of the data.
  • Data cleansing, deduplication, and normalization.
  • Enhances data quality by correcting errors and inconsistencies.
  • Structures data into a more consumable format for downstream processing.
  • Start modeling data

Gold

  • Oriented toward business users; aligns with business logic and requirements.
  • Consists of aggregated data tailored for analytics and reporting.
  • Optimized for performance in queries and dashboards
  • Contains fewer datasets than silver and bronze
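
A minimal SQL sketch of the three layers; the table names, the source path, and the columns (id, ts, event) are hypothetical:

-- Bronze: raw data plus an ingestion-time metadata column, no cleanup
CREATE OR REPLACE TABLE events_bronze AS
SELECT *, current_timestamp() AS ingested_at
FROM read_files('s3://bucket/raw/events/', format => 'json');

-- Silver: validated, typed, deduplicated
CREATE OR REPLACE TABLE events_silver AS
SELECT DISTINCT id, CAST(ts AS TIMESTAMP) AS ts, event
FROM events_bronze
WHERE id IS NOT NULL;

-- Gold: aggregated for analytics and reporting
CREATE OR REPLACE TABLE events_gold AS
SELECT event, COUNT(*) AS event_count
FROM events_silver
GROUP BY event;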

Databricks Account

Workspace

  • The Databricks workspace is a web application that serves as the management console and cloud-based integrated development environment.
  • One account can have multiple workspaces in one region.

Workspace - Architecture (opens in a new tab)

Workspace - Control Plane

  • Backend services that Databricks manages in your Databricks account.
  • Web application

Workspace - Compute Plane

Classic compute (opens in a new tab)

Workspace - Classic compute architecture

  • Databricks compute resources run in your AWS account.

  • Types

    • all-purpose (opens in a new tab)

    • jobs

    • Lakeflow SDP

    • SQL warehouses

      • Classic

        • Gist
          • Most basic SQL compute, with lower performance than Pro and Serverless compute.
          • In customer cloud
          • Supports Photon engine
      • Pro

        • More advanced features than Classic (Predictive IO)

        • In customer cloud

        • Supports Photon engine

        • Does not support Intelligent Workload Management

        • Less responsive, slower auto scaling than serverless SQL warehouse

        • Use cases

          • Serverless SQL warehouses are not available in a region.
          • You have custom-defined networking and want to connect to databases in your network in the cloud or on-premises for federation or a hybrid-type architecture.
  • Access modes

    • Standard (shared)
    • Dedicated (single user)
  • Instance pools

    • Prefer serverless compute to instance pools whenever possible.
    • Instance pools reduce cluster start and scaling times by keeping idle, ready-to-use instances on standby.

Serverless compute

Workspace - Serverless compute

  • Managed by Databricks
  • Customer data isolated by network and security boundary
  • Types
    • Notebooks

    • jobs

    • Lakeflow SDP

    • SQL warehouses

      • Cutting edge features (Predictive IO and Intelligent Workload Management)
      • Most cost performant
      • Use cases
        • ETL
        • Business intelligence
        • Exploratory analysis

SQL warehouses

  • Optimized compute for SQL queries, analytics, and business intelligence workloads with serverless or classic options.

  • Only support SQL cells in Notebooks.

  • Auto Stop

    Whether the warehouse stops if it's idle for the specified number of minutes

  • Cluster size

    The size of the driver node and number of worker nodes associated with the cluster

  • Scaling

    The minimum and maximum number of clusters that will be used for a query

    Databricks recommends a cluster for every 10 concurrent queries.

Warehouse type | Photon Engine | Predictive IO | Intelligent Workload Management
Serverless     | X             | X             | X
Pro            | X             | X             |
Classic        | X             |               |

Workspace - Access Control

Workspace - Account admin (opens in a new tab)
  • Can create metastores, and by default become the initial metastore admin.
  • Can link metastores to workspaces.
  • Can assign the metastore admin role.
  • Can grant privileges on metastores.
  • Can enable Delta Sharing for a metastore.
  • Can configure storage credentials.
  • Can enable system tables and delegate access to them.
Workspace - Workspace admin (opens in a new tab)
  • Can add users, service principals, and groups to a workspace.
  • Can delegate other workspace admins.
  • Can manage job ownership.
  • Can manage the job Run as setting.
  • Can view and manage notebooks, dashboards, queries, and other workspace objects.
Workspace - User
  • Represented by an email address
Workspace - Service principal
  • Represented by an application ID
Workspace - Group

Delta tables (opens in a new tab)

Delta table types

Table type | Description
Unity Catalog managed table | Always backed by Delta Lake. The default and recommended table type on Databricks. Provides many built-in optimizations.
Unity Catalog external table | Optionally backed by Delta Lake. Supports some legacy integration patterns with external Delta Lake clients.
Unity Catalog foreign table | Might be backed by Delta Lake, depending on the foreign catalog. Foreign tables backed by Delta Lake do not have many optimizations present in Unity Catalog managed tables.
Streaming table | A Lakeflow Spark Declarative Pipelines dataset backed by Delta Lake that includes an append or AUTO CDC ... INTO flow definition for incremental processing.
Hive metastore table | Foreign tables in an internal or external federated Hive metastore and tables in the legacy workspace Hive metastore. Both managed and external Hive metastore tables can optionally be backed by Delta Lake.
Materialized view | A Lakeflow Spark Declarative Pipelines dataset backed by Delta Lake that materializes the results of a query using managed flow logic.
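
A minimal sketch of the first two rows: a managed table omits a storage path, while an external table pins one (catalog, schema, and path are hypothetical):

CREATE TABLE main.default.events_managed (id INT, event STRING);      -- managed: storage handled by the catalog's managed location

CREATE TABLE main.default.events_external (id INT, event STRING)
LOCATION 's3://bucket/external/events';                               -- external: data stays at a path you manage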

Delta tables - feature compatibility and protocols (opens in a new tab)

Interface

REST API

REST - Call API endpoint with CLI (opens in a new tab)

# Example
databricks api get $ENDPOINT_PATH

CLI (opens in a new tab)

CLI - Setup - Workspace-level

UI -> Settings -> Identity and access -> Service principals

# $HOME/.databrickscfg
# DEFAULT profile doesn't need to be specified explicitly
[DEFAULT]
host          = $workspace_url
client_id     = $service_principal-Application_ID
client_secret = $secret

CLI - Show auth info

databricks auth env

CLI - Verify profiles

databricks auth profiles

CLI - Show clusters info

databricks clusters spark-versions

CLI - Troubleshoot CLI command execution

Will display underlying HTTP requests and responses.

databricks $COMMAND --debug

Example:

> databricks workspace list /Workspace/Users/orange.lavender.9@gmail.com --debug
15:00:25 Info: start pid=48008 version=0.272.0 args="databricks, workspace, list, /Workspace/Users/orange.lavender.9@gmail.com, --debug"
15:00:25 Debug: Loading DEFAULT profile from C:\Users\Takechiyo/.databrickscfg pid=48008 sdk=true
15:00:25 Debug: Failed to configure auth: "pat" pid=48008 sdk=true
15:00:25 Debug: Failed to configure auth: "basic" pid=48008 sdk=true
15:00:26 Debug: GET /oidc/.well-known/oauth-authorization-server
< HTTP/2.0 200 OK
< {
<   "authorization_endpoint": "https://dbc-80546867-03ea.cloud.databricks.com/oidc/v1/authorize",
<   "claims_supported": [
<     "iss",
<     "sub",
<     "aud",
<     "iat",
<     "exp",
<     "jti",
<     "name",
<     "family_name",
<     "given_name",
<     "preferred_username"
<   ],
<   "code_challenge_methods_supported": [
<     "S256"
<   ],
<   "grant_types_supported": [
<     "client_credentials",
<     "authorization_code",
<     "refresh_token"
<   ],
<   "id_token_signing_alg_values_supported": [
<     "RS256"
<   ],
<   "issuer": "https://dbc-80546867-03ea.cloud.databricks.com/oidc",
<   "jwks_uri": "https://ohio.cloud.databricks.com/oidc/jwks.json",
<   "request_uri_parameter_supported": false,
<   "response_modes_supported": [
<     "query",
<     "fragment",
<     "form_post"
<   ],
<   "response_types_supported": [
<     "code",
<     "id_token"
<   ],
<   "scopes_supported": [
<     "all-apis",
<     "email",
<     "offline_access",
<     "openid",
<     "profile",
<     "sql"
<   ],
<   "subject_types_supported": [
<     "public"
<   ],
<   "token_endpoint": "https://dbc-80546867-03ea.cloud.databricks.com/oidc/v1/token",
<   "token_endpoint_auth_methods_supported": [
<     "client_secret_basic",
<     "client_secret_post",
<     "none"
<   ]
< } pid=48008 sdk=true
15:00:26 Debug: Generating Databricks OAuth token for Service Principal (c9fa7fd8-0982-4430-89f3-a8723af7441e) pid=48008 sdk=true
15:00:26 Debug: POST /oidc/v1/token
> <http.RoundTripper>
< HTTP/2.0 200 OK
< {
<   "access_token": "**REDACTED**",
<   "expires_in": 3600,
<   "scope": "all-apis",
<   "token_type": "Bearer"
< } pid=48008 sdk=true
15:00:27 Debug: GET /api/2.0/workspace/list?path=/Workspace/Users/orange.lavender.9@gmail.com
< HTTP/2.0 200 OK
< {} pid=48008 sdk=true
ID  Type  Language  Path
15:00:27 Info: completed execution pid=48008 exit_code=0
15:00:27 Debug: no telemetry logs to upload pid=48008

Workspace UI (opens in a new tab)

UI - Get resource URL/path

Workspace UI - Get resource URL/path

PySpark

Lists the contents of a directory

dbutils.fs.ls('/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/')

Use Ctrl + SPACE to trigger autocomplete.

Read a CSV file

SELECT * FROM csv.`dbfs:/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/market_data.csv`

Read a text file

SELECT * FROM text.`dbfs:/databricks-datasets/data.gov/farmers_markets_geographic_data/README.md`

Read a file in binary format (HEX encoded)

SELECT * FROM binaryFile.`dbfs:/databricks-datasets/travel_recommendations_realtime/README.txt`

Databricks Connect

  • Like a JDBC driver, Databricks Connect is a library that lets you connect to a Databricks cluster from your own application.
  • Built on Spark Connect

Lakeflow Connect (opens in a new tab)

Lakeflow Connect - Manual File Uploads (opens in a new tab)

Lakeflow Connect - Managed connectors (opens in a new tab)

  • Google Analytics
  • Salesforce
  • Workday
  • SharePoint
  • SQL Server
  • ServiceNow

Lakeflow Connect - Standard connectors (opens in a new tab)

  • Sources

    • Cloud object storage

      • Amazon S3 (s3://)
      • Azure Data Lake Storage (ADLS, abfss://)
      • Google Cloud Storage (GCS, gs://)
      • Azure Blob Storage (wasbs://)
      • Databricks File System (DBFS, dbfs:/)
    • SFTP servers

    • Apache Kafka

    • Amazon Kinesis

    • Google Pub/Sub

    • Apache Pulsar

  • Ingestion methods

    • Batch

      • Gist

        • Load data as batches of rows into Databricks, often based on a schedule.
        • Traditional batch ingestion processes all records each time it runs
      • Usage

        • CREATE TABLE AS (CTAS)
        • spark.read.load()
    • Incremental Batch

      • Gist

        • Only new data is ingested, previously loaded records are skipped automatically.
      • Usage

        • COPY INTO
        • Auto Loader
    • Streaming

      • Gist

        • Continuously load rows or batches of rows as the data is generated, so you can query it as it arrives in near real time.
        • Micro-batch processing handles small batches at very short, frequent intervals.
      • Usage

        • spark.readStream with the cloudFiles (Auto Loader) source
        • STREAM read_files(...) in Lakeflow SDP streaming tables

Standard connectors - Cloud object storage - CREATE TABLE AS (CTAS)

  • Batch method

SQL Reference - CREATE TABLE [USING] (opens in a new tab)
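
A sketch of a CTAS batch load from cloud object storage (table name and path are hypothetical):

CREATE TABLE events_bronze AS
SELECT * FROM read_files(
  's3://bucket/raw/events/',
  format => 'csv',
  header => true
);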

Standard connectors - Cloud object storage - COPY INTO (opens in a new tab)

  • Incremental batch

  • Idempotent

  • Easily configurable file or directory filters from cloud storage

    S3, ADLS, ABFS, GCS, and Unity Catalog volumes

  • Support for multiple source file formats

    CSV, JSON, XML, Avro, ORC, Parquet, text, and binary files

  • Target table schema inference, mapping, merging, and evolution

  • Loads data from a file location into a Delta table

SQL Reference - COPY INTO (opens in a new tab)
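
A sketch of an incremental batch load with COPY INTO (table name and path are hypothetical); re-running the statement skips files that were already loaded:

CREATE TABLE IF NOT EXISTS events_bronze;     -- empty Delta table; schema inferred on load

COPY INTO events_bronze
FROM 's3://bucket/raw/events/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');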

Standard connectors - Cloud object storage - Auto Loader (opens in a new tab)

  • Built on Structured Streaming, providing a Structured Streaming source called cloudFiles.

  • Declarative Pipeline

    • Python

      spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("s3://bucket/path")
    • SQL

      Example:

      CREATE OR REFRESH STREAMING TABLE csv_data (
        id int,
        ts timestamp,
        event string
      )
      AS SELECT * FROM STREAM read_files(
        's3://bucket/path',
        format => 'csv',
        schema => 'id int, ts timestamp, event string'
      );
  • Incremental batch or streaming, alternative and successor to COPY INTO

Auto Loader - Schema Inference and Evolution
  • Specifying a target directory for the option cloudFiles.schemaLocation enables schema inference and evolution.

  • If you use Lakeflow SDP, Databricks manages schema location and other checkpoint information automatically.

  • Schema Inference

    • Uses the _schemas directory at the configured cloudFiles.schemaLocation to track schema changes to the input data over time.
    • For formats that don't encode data types (JSON, CSV, and XML), Auto Loader infers all columns as strings (including nested fields in JSON files).
    • For formats with typed schema (Parquet and Avro), Auto Loader samples a subset of files and merges the schemas of individual files.
  • Schema Evolution

    • cloudFiles.schemaEvolutionMode

      addNewColumns and failOnNewColumns stop the stream and require intervention (a restart) before processing continues.

      • addNewColumns (default)

        • Stream fails.

        • New columns are added to the schema.

        • Existing columns do not evolve data types.

      • rescue

        • Stream doesn't fail.

        • Schema is not evolved.

        • New columns are recorded in the rescued data column.

      • failOnNewColumns

        • Stream fails.

        • Stream does not restart unless the provided schema is updated, or the offending data file is removed.

      • none

        • Stream doesn't fail.

        • Schema is not evolved.

        • New columns are ignored.
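
A hedged sketch of choosing an evolution mode in SQL; this assumes read_files accepts the schemaEvolutionMode and rescuedDataColumn options (the path and columns reuse the earlier csv_data example):

CREATE OR REFRESH STREAMING TABLE csv_data AS
SELECT * FROM STREAM read_files(
  's3://bucket/path',
  format => 'csv',
  schemaEvolutionMode => 'rescue',        -- stream keeps running; schema is not evolved
  rescuedDataColumn => '_rescued_data'    -- new or unparsed columns land here
);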

Lakeflow Spark Declarative Pipelines (SDP, previously DLT)

Lakeflow Spark Declarative Pipelines

  • Benefits

  • DLT

    • DLT does not support shared clusters. You need to configure the cluster (a jobs cluster) via the DLT UI (under Workflows).
    • You would still write your notebook with DLT using Python, but the pipeline itself has to be configured and run via the UI.
  • From DLT to LDP

  • Gist

    • Same CREATE LIVE TABLE DSL, now built on Spark Declarative Pipelines, an open standard you can run anywhere Spark runs (opens in a new tab).
    • New IDE for Data Engineering: side-by-side code & DAG, inline data previews, Git integration and an AI copilot.
    • AUTO CDC ... in SQL / create_auto_cdc_flow() in Python replace manual APPLY CHANGES.
    • Lakeflow SDP provides a declarative approach to defining relationships between datasets and transformations.
  • SQL

    SQL code that creates pipeline datasets uses the CREATE OR REFRESH syntax to define materialized views and streaming tables against query results.

    The STREAM keyword indicates that the data source referenced in a SELECT clause should be read with streaming semantics.

    • LIVE

      Legacy publishing mode uses the LIVE virtual schema (opens in a new tab) to achieve similar behavior. In the default publishing mode (used by all new pipelines), the LIVE keyword is ignored.

      In legacy publishing mode pipelines, you can use the LIVE keyword to reference other datasets in the current pipeline for reads, e.g. SELECT * FROM LIVE.bronze_table. In the default publishing mode for new Lakeflow Spark Declarative Pipelines, this syntax is silently ignored, meaning that unqualified identifiers use the current schema.

  • Resources

Lakeflow SDP - Pipeline Mode (opens in a new tab)

Key questions | Triggered | Continuous
When does the update stop? | Automatically once complete. | Runs continuously until manually stopped.
What data is processed? | Data available when the update starts. | All data as it arrives at configured sources.
What data freshness requirements is this best for? | Data updates run every 10 minutes, hourly, or daily. | Data updates are desired between every 10 seconds and a few minutes.
  • triggered (opens in a new tab)

    • Lakeflow SDP stops processing after successfully refreshing all tables or selected tables, ensuring each table in the update is refreshed based on the data available when the update starts.
    • You can schedule triggered Lakeflow SDP to run as a task in a job.
  • continuous (opens in a new tab)

    Lakeflow SDP processes new data as it arrives in data sources to keep tables throughout the pipeline fresh.

Lakeflow SDP - Pricing (opens in a new tab)

Lakeflow SDP - Flows (opens in a new tab)

  • Lakeflow SDP unique flow types

    • AUTO CDC (opens in a new tab)

      • Streaming flow
      • Handles out of order CDC events and supports both SCD Type 1 and 2 (Auto CDC is not available in Apache Spark Declarative Pipelines).

      Example:

      CREATE OR REFRESH STREAMING TABLE atp_silver;               -- Target table to update with SCD Type 1 or 2

      CREATE FLOW atp_cdc_flow AS AUTO CDC INTO atp_silver
      FROM STREAM atp_bronze                                      -- Source records to determine updates, deletes and inserts
        KEYS (player)                                             -- Primary key for identifying records
        APPLY AS DELETE WHEN operation = "DELETE"                 -- Handle deletes from source to the target
        SEQUENCE BY ranking_date                                  -- Defines order of operations for applying changes
        COLUMNS * EXCEPT (ranking_date, operation, sequenceNum)   -- Select columns and exclude metadata fields
        STORED AS SCD TYPE 1                                      -- Use SCD Type 1
    • Materialized View

      • Batch flow

Lakeflow SDP - Streaming Tables (opens in a new tab)

  • Each input row is handled only once.

  • The STREAM keyword before read_files tells the query to treat the dataset as a stream.

  • Trigger Policy

    • Continuous
    • Once
    • Timed
  • Example

    CREATE OR REFRESH STREAMING TABLE atp_bronze AS
    SELECT
      *
    FROM
      STREAM read_files(
        '/Volumes/workspace/experiment/atp/*.csv',
        format => "csv",
        sep => ",",
        header => true
      );
     
    CREATE OR REFRESH STREAMING TABLE atp_silver AS
    SELECT
      player,
      points
    FROM
      STREAM atp_bronze;
     
    CREATE OR REFRESH MATERIALIZED VIEW atp_gold AS
    SELECT
      player,
      SUM(points) AS total_points
    FROM
      atp_silver
    GROUP BY
      player
    ORDER BY
      player;

Lakeflow SDP - Materialized Views (opens in a new tab)

Lakeflow SDP - Data Quality

Recommendation | Impact
Store expectation definitions separately from pipeline logic. | Easily apply expectations to multiple datasets or pipelines. Update, audit, and maintain expectations without modifying pipeline source code.
Add custom tags to create groups of related expectations. | Filter expectations based on tags.
Apply expectations consistently across similar datasets. | Use the same expectations across multiple datasets and pipelines to evaluate identical logic.

Lakeflow SDP - Expectations - On Violation

  • warn

    Invalid records are written to the target.

    EXPECT
  • drop

    Invalid records are dropped before data is written to the target.

    EXPECT ... ON VIOLATION DROP ROW
  • fail

    Invalid records prevent the update from succeeding.

    EXPECT ... ON VIOLATION FAIL UPDATE
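
A sketch combining the three behaviors on a streaming table; atp_bronze and the columns reuse the earlier example, and the constraint names are made up:

CREATE OR REFRESH STREAMING TABLE atp_silver (
  CONSTRAINT player_not_null EXPECT (player IS NOT NULL),                                  -- warn: violations are recorded, rows kept
  CONSTRAINT points_non_negative EXPECT (points >= 0) ON VIOLATION DROP ROW,               -- drop: violating rows are removed
  CONSTRAINT has_ranking_date EXPECT (ranking_date IS NOT NULL) ON VIOLATION FAIL UPDATE   -- fail: violations stop the update
)
AS SELECT player, points, ranking_date FROM STREAM atp_bronze;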

Lakeflow SDP - Compute

Lakeflow SDP - Compute - Serverless

Lakeflow SDP - Compute - Classic

Lakeflow SDP - Compute - Classic - Core

  • Declarative Pipelines in Python and SQL

Lakeflow SDP - Compute - Classic - Pro

  • Declarative Pipelines in Python and SQL
  • CDC

Lakeflow SDP - Compute - Classic - Advanced

  • Declarative Pipelines in Python and SQL
  • CDC
  • Data Quality Control

Lakeflow SDP - DLT - SQL

Lakeflow Jobs (previously Workflows)

Lakeflow Jobs - Jobs

  • A job is used to schedule and orchestrate tasks on Databricks in a workflow.

Lakeflow Jobs - Tasks (opens in a new tab)

  • You can schedule triggered Lakeflow SDP to run as a task in a job.

  • Types of tasks

    • Notebook
    • Python script
    • Python wheel
    • SQL
    • Pipeline
    • Dashboards
    • Power BI
    • dbt
    • dbt platform (Beta)
    • JAR
    • Spark Submit
    • Run Job
    • If/else
    • For each

Unity Catalog (opens in a new tab)

  • Data Governance (opens in a new tab)

    • Data Access Control (opens in a new tab)

      Layer | Purpose | Mechanisms
      Workspace-level restrictions | Limit which workspaces can access specific catalogs, external locations, and storage credentials | Workspace-level bindings
      Privileges and ownership | Control access to catalogs, schemas, tables, and other objects | Privilege grants to users and groups, object ownership
      Attribute-based policies (ABAC) | Use tags and policies to dynamically apply filters and masks | ABAC policies and governed tags
      Table-level filtering and masking | Control what data users can see within tables | Row filters, column masks, dynamic views
    • Data Access Audit

      Capture and record all access to data

    • Data Lineage (opens in a new tab)

      Tracks and visualizes the origin and usage of data assets, providing transparency and traceability.

      Demo (opens in a new tab)

    • Data Discovery

      Ability to search for and discover authorized assets

  • Gist

    • Each cloud region has its own Unity Catalog instance.
  • Managed storage location (opens in a new tab)

    Associated Unity Catalog object | How to set | Relation to external locations
    Metastore | Configured by account admin during metastore creation. | Cannot overlap an external location.
    Standard catalog | Specified during catalog creation using the MANAGED LOCATION keyword. | Must be contained within an external location.
    Foreign catalog | Specified after catalog creation using Catalog Explorer. | Must be contained within an external location.
    Schema | Specified during schema creation using the MANAGED LOCATION keyword. | Must be contained within an external location.
    • You can associate a managed storage location with a metastore, catalog, or schema. Managed storage locations at lower levels in the hierarchy override storage locations defined at higher levels when managed tables or managed volumes are created.
    • You can choose to store data at the metastore level, providing a default storage location for catalogs that don't have a managed storage location of their own.
    • Databricks recommends that you assign managed storage at the catalog level for logical data isolation, with metastore-level and schema-level as options.
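
    A sketch of assigning managed storage with the MANAGED LOCATION keyword (names and paths are hypothetical; the paths must fall inside an existing external location):

    CREATE CATALOG sales MANAGED LOCATION 's3://bucket/managed/sales';

    CREATE SCHEMA sales.emea MANAGED LOCATION 's3://bucket/managed/sales/emea';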

Unity Catalog - Metastore (opens in a new tab)

  • Top-level container for metadata in Unity Catalog

  • For a workspace to use Unity Catalog, it must have a Unity Catalog metastore attached.

  • You can link a single metastore to multiple workspaces in the same region, giving each workspace the same data view.

  • You should have one metastore for each region in which you have workspaces.

  • There can be multiple metastores in a single account.

  • Ownership

    • Metastore admins are the owners of the metastore.
    • Metastore admins can reassign ownership of the metastore by transferring the metastore admin role.

Unity Catalog - Catalog (opens in a new tab)

  • Each catalog typically has its own managed storage location to store managed tables and volumes, providing physical data isolation at the catalog level.
  • Because your Databricks account has one metastore per region, catalogs are inherently isolated by region.
  • By default, a catalog is shared with all workspaces attached to the current metastore.
  • Workspace-catalog binding (opens in a new tab) can be used to limit catalog access to specific workspaces in your account.

Unity Catalog - Views

View types

Unity Catalog - Volumes (opens in a new tab)

  • Use cases

    • Register landing areas for raw data produced by external systems to support its processing in the early stages of ETL pipelines and other data engineering activities.
    • Register staging locations for ingestion. e.g. using Auto Loader, COPY INTO, or CTAS (CREATE TABLE AS) statements.
    • Provide file storage locations for data scientists, data analysts, and machine learning engineers to use as parts of their exploratory data analysis and other data science tasks.
    • Give Databricks users access to arbitrary files produced and deposited in cloud storage by other systems. For example, large collections of unstructured data (such as image, audio, video, and PDF files) captured by surveillance systems or IoT devices, or library files (JARs and Python wheel files) exported from local dependency management systems or CI/CD pipelines.
    • Store operational data, such as logging or checkpointing files.
    • Recommended for managing all access to non-tabular data in cloud object storage and for storing workload support files.
  • catalog.schema.volume

  • Access pattern

    /Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>

    or

    dbfs:/Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>
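
    A sketch of creating a managed volume and accessing files through it; the catalog/schema/volume names follow the /Volumes/workspace/experiment/atp path used in the streaming table example:

    CREATE VOLUME workspace.experiment.atp;

    LIST '/Volumes/workspace/experiment/atp/';

    SELECT * FROM read_files('/Volumes/workspace/experiment/atp/*.csv', format => 'csv', header => true);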

Separation of metastore management

  • Databricks workspaces management before and after Unity Catalog

Centralized user management and access control

  • Identities

    Managed at 2 levels: account level and workspace level

    • Users
    • Groups
    • Service principals
  • Identity federation

    Allow identities to be created once at the account level and then assigned to multiple workspaces as needed.

  • Each catalog typically has its own managed storage location to store managed tables and volumes, providing physical data isolation at the catalog level.

    The Unity Catalog object model

    3 Level hierarchy: catalog.schema.table

    • Catalog

      Object privileges

      Catalog - Object privileges

    • Schema (Database)

      Object privileges

      Schema - Object privileges

    • Table / View / Function

      Object privileges

      Table - Object privileges

Securable objects (opens in a new tab)

Privilege types by securable object
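
A sketch of granting privileges at each level of the catalog.schema.table hierarchy (object names and the group are hypothetical):

GRANT USE CATALOG ON CATALOG sales TO `data_engineers`;
GRANT USE SCHEMA ON SCHEMA sales.emea TO `data_engineers`;
GRANT SELECT ON TABLE sales.emea.orders TO `data_engineers`;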

Docs - UI (opens in a new tab)

In the GitHub repo, Unity Catalog UI (opens in a new tab)

Docs - CLI (opens in a new tab)

In the GitHub repo, CLI - uc (opens in a new tab)

Unity Catalog - Integration with Confluent stack

Integration with Confluent stack

Unity Catalog - Delta Sharing (opens in a new tab)

Unity Catalog - Predictive Optimization (opens in a new tab)

  • Identifies tables that would benefit from ANALYZE, OPTIMIZE, and VACUUM operations and queues them to run using serverless compute for jobs.
  • Only applies to managed tables
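
A sketch of controlling predictive optimization per object, assuming the ALTER ... PREDICTIVE OPTIMIZATION syntax and a hypothetical catalog/schema:

ALTER CATALOG sales ENABLE PREDICTIVE OPTIMIZATION;

ALTER SCHEMA sales.emea INHERIT PREDICTIVE OPTIMIZATION;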

Notebooks

Notebooks - Examples - Pivot in SQL (opens in a new tab)

DevOps

Provision Databricks Workspace

Provision Databricks projects and resources

  • Databricks Asset Bundles (opens in a new tab)

    high-level view of a development and CI/CD pipeline with DAB

    • Infrastructure-as-code definition file in YAML format to define resources and configuration
    • Use Databricks CLI to operate
    • Use another CI/CD runner (e.g. Jenkins) to schedule and manage the pipeline for provisioning jobs

Databricks Git Folders (Repos) (opens in a new tab)

  • Gist

    • Clone, push to, and pull from a remote Git repository.
    • Create and manage branches for development work, including merging, rebasing, and resolving conflicts.
    • Create notebooks, including IPYNB notebooks, and edit them and other files.
    • Visually compare differences upon commit and resolve merge conflicts.
  • Git operations (opens in a new tab)

Data Access

Object | Object identifier | File path | Cloud URI
External location | no | no | yes
Managed table | yes | no | no
External table | yes | no | yes
Managed volume | no | yes | no
External volume | no | yes | yes

File access (opens in a new tab)

  • Locations

    • Unity Catalog volumes

    • Workspace files

    • Cloud object storage

    • DBFS mounts and DBFS root

    • Ephemeral storage attached to the driver node of the cluster

  • Access patterns

    URI-style paths

    POSIX-style paths
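
    A sketch of the two styles against read_files, assuming an external location for the URI form and a Unity Catalog volume for the POSIX form (both paths are hypothetical):

    SELECT * FROM read_files('s3://bucket/raw/events/', format => 'csv', header => true);                  -- URI-style path
    SELECT * FROM read_files('/Volumes/workspace/experiment/atp/*.csv', format => 'csv', header => true);  -- POSIX-style path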

Resources