Databricks

References

Reference - SQL Language reference (opens in a new tab)

Reference - SQL Data types (opens in a new tab)

Reference - Databricks REST API (opens in a new tab)

Databricks Runtime release notes versions and compatibility (opens in a new tab)

Databricks Architecture (opens in a new tab)

Lakehouse Medallion Architecture (opens in a new tab)

Databricks Medallion Architecture

Bronze

  • Ingests raw data in its original format
  • No data cleanup or validation
  • Is appended incrementally and grows over time.
  • Serves as the single source of truth, preserving the data's fidelity.
  • Enables reprocessing and auditing by retaining all historical data.
  • Can be any combination of streaming and batch transactions from sources
  • Metadata columns might be added

Silver

  • Represents validated, cleaned, and enriched versions of the data.
  • Data cleansing, deduplication, and normalization.
  • Enhances data quality by correcting errors and inconsistencies.
  • Structures data into a more consumable format for downstream processing.
  • Start modeling data

Gold

  • Oriented toward business users; aligns with business logic and requirements.
  • Consists of aggregated data tailored for analytics and reporting.
  • Optimized for performance in queries and dashboards
  • Contains fewer datasets than silver and bronze
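
A minimal SQL sketch of the three layers; the table names, the source path, and the columns (id, ts, event) are hypothetical:

-- Bronze: raw data plus an ingestion-time metadata column, no cleanup
CREATE OR REPLACE TABLE events_bronze AS
SELECT *, current_timestamp() AS ingested_at
FROM read_files('s3://bucket/raw/events/', format => 'json');

-- Silver: validated, typed, deduplicated
CREATE OR REPLACE TABLE events_silver AS
SELECT DISTINCT id, CAST(ts AS TIMESTAMP) AS ts, event
FROM events_bronze
WHERE id IS NOT NULL;

-- Gold: aggregated for analytics and reporting
CREATE OR REPLACE TABLE events_gold AS
SELECT event, COUNT(*) AS event_count
FROM events_silver
GROUP BY event;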

Databricks Account

Workspace

  • The Databricks workspace is a web application that serves as the management console and cloud-based integrated development environment.
  • One account can have multiple workspaces in one region.

Workspace - Architecture (opens in a new tab)

Workspace - Control Plane

  • Backend services that Databricks manages in your Databricks account.
  • Web application

Workspace - Compute Plane

Classic compute (opens in a new tab)

Workspace - Classic compute architecture

  • Databricks compute resources run in your AWS account.

  • Types

    • all-purpose (opens in a new tab)

    • jobs

    • Lakeflow SDP

    • SQL warehouses

      • Classic

        • Gist
          • Most basic SQL compute, with lower performance than Pro and Serverless compute.
          • In customer cloud
          • Supports Photon engine
      • Pro

        • More advanced features than Classic (Predictive IO)

        • In customer cloud

        • Supports Photon engine

        • Does not support Intelligent Workload Management

        • Less responsive, slower auto scaling than serverless SQL warehouse

        • Use cases

          • Serverless SQL warehouses are not available in a region.
          • You have custom-defined networking and want to connect to databases in your network in the cloud or on-premises for federation or a hybrid-type architecture.
  • Access modes

    • Standard (shared)
    • Dedicated (single user)
  • Instance pools

    • Prefer serverless compute to instance pools whenever possible.
    • Instance pools reduce cluster start and scaling times by keeping idle, ready-to-use instances on standby.

Serverless compute

Workspace - Serverless compute

  • Managed by Databricks
  • Customer data isolated by network and security boundary
  • Types
    • Notebooks

    • jobs

    • Lakeflow SDP

    • SQL warehouses

      • Cutting edge features (Predictive IO and Intelligent Workload Management)
      • Most cost performant
      • Use cases
        • ETL
        • Business intelligence
        • Exploratory analysis

SQL warehouses

  • Optimized compute for SQL queries, analytics, and business intelligence workloads with serverless or classic options.

  • Only support SQL cells in Notebooks.

  • Auto Stop

    Whether the warehouse stops if it's idle for the specified number of minutes

  • Cluster size

    The size of the driver node and number of worker nodes associated with the cluster

  • Scaling

    The minimum and maximum number of clusters that will be used for a query

    Databricks recommends a cluster for every 10 concurrent queries.

Warehouse type | Photon Engine | Predictive IO | Intelligent Workload Management
Serverless     | X             | X             | X
Pro            | X             | X             |
Classic        | X             |               |

Workspace - Access Control

Workspace - Account admin (opens in a new tab)
  • Can create metastores, and by default become the initial metastore admin.
  • Can link metastores to workspaces.
  • Can assign the metastore admin role.
  • Can grant privileges on metastores.
  • Can enable Delta Sharing for a metastore.
  • Can configure storage credentials.
  • Can enable system tables and delegate access to them.
Workspace - Workspace admin (opens in a new tab)
  • Can add users, service principals, and groups to a workspace.
  • Can delegate other workspace admins.
  • Can manage job ownership.
  • Can manage the job Run as setting.
  • Can view and manage notebooks, dashboards, queries, and other workspace objects.
Workspace - User
  • Represented by an email address
Workspace - Service principal
  • Represented by an application ID
Workspace - Group

Delta tables (opens in a new tab)

Delta table types

Table type | Description
Unity Catalog managed table | Always backed by Delta Lake. The default and recommended table type on Databricks. Provides many built-in optimizations.
Unity Catalog external table | Optionally backed by Delta Lake. Supports some legacy integration patterns with external Delta Lake clients.
Unity Catalog foreign table | Might be backed by Delta Lake, depending on the foreign catalog. Foreign tables backed by Delta Lake do not have many optimizations present in Unity Catalog managed tables.
Streaming table | A Lakeflow Spark Declarative Pipelines dataset backed by Delta Lake that includes an append or AUTO CDC ... INTO flow definition for incremental processing.
Hive metastore table | Foreign tables in an internal or external federated Hive metastore and tables in the legacy workspace Hive metastore. Both managed and external Hive metastore tables can optionally be backed by Delta Lake.
Materialized view | A Lakeflow Spark Declarative Pipelines dataset backed by Delta Lake that materializes the results of a query using managed flow logic.
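
A minimal sketch of the first two rows: a managed table omits a storage path, while an external table pins one (catalog, schema, and path are hypothetical):

CREATE TABLE main.default.events_managed (id INT, event STRING);      -- managed: storage handled by the catalog's managed location

CREATE TABLE main.default.events_external (id INT, event STRING)
LOCATION 's3://bucket/external/events';                               -- external: data stays at a path you manage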

Delta tables - feature compatibility and protocols (opens in a new tab)

Interface

REST API

REST - Call API endpoint with CLI (opens in a new tab)

# Example
databricks api get $ENDPOINT_PATH

CLI (opens in a new tab)

CLI - Setup - Workspace-level

UI -> Settings -> Identity and access -> Service principals

# $HOME/.databrickscfg
# DEFAULT profile doesn't need to be specified explicitly
[DEFAULT]
host          = $workspace_url
client_id     = $service_principal-Application_ID
client_secret = $secret

CLI - Show auth info

databricks auth env

CLI - Verify profiles

databricks auth profiles

CLI - Show clusters info

databricks clusters spark-versions

CLI - Troubleshoot CLI command execution

Will display underlying HTTP requests and responses.

databricks $COMMAND --debug

Example:

> databricks workspace list /Workspace/Users/orange.lavender.9@gmail.com --debug
15:00:25 Info: start pid=48008 version=0.272.0 args="databricks, workspace, list, /Workspace/Users/orange.lavender.9@gmail.com, --debug"
15:00:25 Debug: Loading DEFAULT profile from C:\Users\Takechiyo/.databrickscfg pid=48008 sdk=true
15:00:25 Debug: Failed to configure auth: "pat" pid=48008 sdk=true
15:00:25 Debug: Failed to configure auth: "basic" pid=48008 sdk=true
15:00:26 Debug: GET /oidc/.well-known/oauth-authorization-server
< HTTP/2.0 200 OK
< {
<   "authorization_endpoint": "https://dbc-80546867-03ea.cloud.databricks.com/oidc/v1/authorize",
<   "claims_supported": [
<     "iss",
<     "sub",
<     "aud",
<     "iat",
<     "exp",
<     "jti",
<     "name",
<     "family_name",
<     "given_name",
<     "preferred_username"
<   ],
<   "code_challenge_methods_supported": [
<     "S256"
<   ],
<   "grant_types_supported": [
<     "client_credentials",
<     "authorization_code",
<     "refresh_token"
<   ],
<   "id_token_signing_alg_values_supported": [
<     "RS256"
<   ],
<   "issuer": "https://dbc-80546867-03ea.cloud.databricks.com/oidc",
<   "jwks_uri": "https://ohio.cloud.databricks.com/oidc/jwks.json",
<   "request_uri_parameter_supported": false,
<   "response_modes_supported": [
<     "query",
<     "fragment",
<     "form_post"
<   ],
<   "response_types_supported": [
<     "code",
<     "id_token"
<   ],
<   "scopes_supported": [
<     "all-apis",
<     "email",
<     "offline_access",
<     "openid",
<     "profile",
<     "sql"
<   ],
<   "subject_types_supported": [
<     "public"
<   ],
<   "token_endpoint": "https://dbc-80546867-03ea.cloud.databricks.com/oidc/v1/token",
<   "token_endpoint_auth_methods_supported": [
<     "client_secret_basic",
<     "client_secret_post",
<     "none"
<   ]
< } pid=48008 sdk=true
15:00:26 Debug: Generating Databricks OAuth token for Service Principal (c9fa7fd8-0982-4430-89f3-a8723af7441e) pid=48008 sdk=true
15:00:26 Debug: POST /oidc/v1/token
> <http.RoundTripper>
< HTTP/2.0 200 OK
< {
<   "access_token": "**REDACTED**",
<   "expires_in": 3600,
<   "scope": "all-apis",
<   "token_type": "Bearer"
< } pid=48008 sdk=true
15:00:27 Debug: GET /api/2.0/workspace/list?path=/Workspace/Users/orange.lavender.9@gmail.com
< HTTP/2.0 200 OK
< {} pid=48008 sdk=true
ID  Type  Language  Path
15:00:27 Info: completed execution pid=48008 exit_code=0
15:00:27 Debug: no telemetry logs to upload pid=48008

Workspace UI (opens in a new tab)

UI - Get resource URL/path

Workspace UI - Get resource URL/path

PySpark

Lists the contents of a directory

dbutils.fs.ls('/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/')

Use Ctrl + SPACE to trigger autocomplete.

Read a CSV file

SELECT * FROM csv.`dbfs:/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/market_data.csv`

Read a text file

SELECT * FROM text.`dbfs:/databricks-datasets/data.gov/farmers_markets_geographic_data/README.md`

Read a file in binary format (HEX encoded)

SELECT * FROM binaryFile.`dbfs:/databricks-datasets/travel_recommendations_realtime/README.txt`

Databricks Connect

  • Like a JDBC driver, Databricks Connect is a library that lets you connect to a Databricks cluster from your own application.
  • Built on Spark Connect

Lakeflow Connect (opens in a new tab)

Lakeflow Connect - Manual File Uploads (opens in a new tab)

Lakeflow Connect - Managed connectors (opens in a new tab)

  • Google Analytics
  • Salesforce
  • Workday
  • SharePoint
  • SQL Server
  • ServiceNow

Lakeflow Connect - Standard connectors (opens in a new tab)

  • Sources

    • Cloud object storage

      • Amazon S3 (s3://)
      • Azure Data Lake Storage (ADLS, abfss://)
      • Google Cloud Storage (GCS, gs://)
      • Azure Blob Storage (wasbs://)
      • Databricks File System (DBFS, dbfs:/)
    • SFTP servers

    • Apache Kafka

    • Amazon Kinesis

    • Google Pub/Sub

    • Apache Pulsar

  • Ingestion methods

    • Batch

      • Gist

        • Load data as batches of rows into Databricks, often based on a schedule.
        • Traditional batch ingestion processes all records each time it runs
      • Usage

        • CREATE TABLE AS (CTAS)
        • spark.read.load()
    • Incremental Batch

      • Gist

        • Only new data is ingested, previously loaded records are skipped automatically.
      • Usage

        • COPY INTO
        • Auto Loader
    • Streaming

      • Gist

        • Continuously load rows or batches of rows as the data is generated, so you can query it as it arrives in near real time.
        • Micro-batch processing handles small batches at very short, frequent intervals.
      • Usage

        • spark.readStream with the cloudFiles (Auto Loader) source
        • STREAM read_files(...) in Lakeflow SDP streaming tables

Standard connectors - Cloud object storage - CREATE TABLE AS (CTAS)

  • Batch method

SQL Reference - CREATE TABLE [USING] (opens in a new tab)
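
A sketch of a CTAS batch load from cloud object storage (table name and path are hypothetical):

CREATE TABLE events_bronze AS
SELECT * FROM read_files(
  's3://bucket/raw/events/',
  format => 'csv',
  header => true
);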

Standard connectors - Cloud object storage - COPY INTO (opens in a new tab)

  • Incremental batch

  • Idempotent

  • Easily configurable file or directory filters from cloud storage

    S3, ADLS, ABFS, GCS, and Unity Catalog volumes

  • Support for multiple source file formats

    CSV, JSON, XML, Avro, ORC, Parquet, text, and binary files

  • Target table schema inference, mapping, merging, and evolution

  • Loads data from a file location into a Delta table

SQL Reference - COPY INTO (opens in a new tab)
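
A sketch of an incremental batch load with COPY INTO (table name and path are hypothetical); re-running the statement skips files that were already loaded:

CREATE TABLE IF NOT EXISTS events_bronze;     -- empty Delta table; schema inferred on load

COPY INTO events_bronze
FROM 's3://bucket/raw/events/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');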

Standard connectors - Cloud object storage - Auto Loader (opens in a new tab)

  • Built on Structured Streaming, providing a Structured Streaming source called cloudFiles.

  • Declarative Pipeline

    • Python

      spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("s3://bucket/path")
    • SQL

      Example:

      CREATE OR REFRESH STREAMING TABLE csv_data (
        id int,
        ts timestamp,
        event string
      )
      AS SELECT * FROM STREAM read_files(
        's3://bucket/path',
        format => 'csv',
        schema => 'id int, ts timestamp, event string'
      );
  • Incremental batch or streaming, alternative and successor to COPY INTO

Auto Loader - Schema Inference and Evolution
  • Specifying a target directory for the option cloudFiles.schemaLocation enables schema inference and evolution.

  • If you use Lakeflow SDP, Databricks manages schema location and other checkpoint information automatically.

  • Schema Inference

    • Uses the _schemas directory at the configured cloudFiles.schemaLocation to track schema changes to the input data over time.
    • For formats that don't encode data types (JSON, CSV, and XML), Auto Loader infers all columns as strings (including nested fields in JSON files).
    • For formats with typed schema (Parquet and Avro), Auto Loader samples a subset of files and merges the schemas of individual files.
  • Schema Evolution

    • cloudFiles.schemaEvolutionMode

      addNewColumns and failOnNewColumns stop the stream and require intervention (a restart) before processing continues.

      • addNewColumns (default)

        • Stream fails.

        • New columns are added to the schema.

        • Existing columns do not evolve data types.

      • rescue

        • Stream doesn't fail.

        • Schema is not evolved.

        • New columns are recorded in the rescued data column.

      • failOnNewColumns

        • Stream fails.

        • Stream does not restart unless the provided schema is updated, or the offending data file is removed.

      • none

        • Stream doesn't fail.

        • Schema is not evolved.

        • New columns are ignored.
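
A hedged sketch of choosing an evolution mode in SQL; this assumes read_files accepts the schemaEvolutionMode and rescuedDataColumn options (the path and columns reuse the earlier csv_data example):

CREATE OR REFRESH STREAMING TABLE csv_data AS
SELECT * FROM STREAM read_files(
  's3://bucket/path',
  format => 'csv',
  schemaEvolutionMode => 'rescue',        -- stream keeps running; schema is not evolved
  rescuedDataColumn => '_rescued_data'    -- new or unparsed columns land here
);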

Lakeflow Spark Declarative Pipelines (SDP, previously DLT)

Lakeflow Spark Declarative Pipelines

  • Benefits

  • DLT

    • DLT does not support shared clusters. You need to configure the cluster (a jobs cluster) via the DLT UI (under Workflows).
    • You would still write your notebook with DLT using Python, but the pipeline itself has to be configured and run via the UI.
  • From DLT to LDP

  • Gist

    • Same CREATE LIVE TABLE DSL, now built on Spark Declarative Pipelines, an open standard you can run anywhere Spark runs (opens in a new tab).
    • New IDE for Data Engineering: side-by-side code & DAG, inline data previews, Git integration and an AI copilot.
    • AUTO CDC ... in SQL / create_auto_cdc_flow() in Python replace manual APPLY CHANGES.
    • Lakeflow SDP provides a declarative approach to defining relationships between datasets and transformations.
  • SQL

    SQL code that creates pipeline datasets uses the CREATE OR REFRESH syntax to define materialized views and streaming tables against query results.

    The STREAM keyword indicates that the data source referenced in a SELECT clause should be read with streaming semantics.

    • LIVE

      Legacy publishing mode uses the LIVE virtual schema (opens in a new tab) to achieve similar behavior. In the default publishing mode (used by all new pipelines), the LIVE keyword is ignored.

      In legacy publishing mode pipelines, you can use the LIVE keyword to reference other datasets in the current pipeline for reads, e.g. SELECT * FROM LIVE.bronze_table. In the default publishing mode for new Lakeflow Spark Declarative Pipelines, this syntax is silently ignored, meaning that unqualified identifiers use the current schema.

  • Resources

Lakeflow SDP - Pipeline Mode (opens in a new tab)

Key questions | Triggered | Continuous
When does the update stop? | Automatically once complete. | Runs continuously until manually stopped.
What data is processed? | Data available when the update starts. | All data as it arrives at configured sources.
What data freshness requirements is this best for? | Data updates run every 10 minutes, hourly, or daily. | Data updates are desired between every 10 seconds and a few minutes.
  • triggered (opens in a new tab)

    • Lakeflow SDP stops processing after successfully refreshing all tables or selected tables, ensuring each table in the update is refreshed based on the data available when the update starts.
    • You can schedule triggered Lakeflow SDP to run as a task in a job.
  • continuous (opens in a new tab)

    Lakeflow SDP processes new data as it arrives in data sources to keep tables throughout the pipeline fresh.

Lakeflow SDP - Pricing (opens in a new tab)

Lakeflow SDP - Flows (opens in a new tab)

  • Lakeflow SDP unique flow types

    • AUTO CDC (opens in a new tab)

      • Streaming flow
      • Handles out of order CDC events and supports both SCD Type 1 and 2 (Auto CDC is not available in Apache Spark Declarative Pipelines).

      Example:

      CREATE OR REFRESH STREAMING TABLE atp_silver;               -- Target table to update with SCD Type 1 or 2

      CREATE FLOW atp_cdc_flow AS AUTO CDC INTO atp_silver
      FROM STREAM atp_bronze                                      -- Source records to determine updates, deletes and inserts
        KEYS (player)                                             -- Primary key for identifying records
        APPLY AS DELETE WHEN operation = "DELETE"                 -- Handle deletes from source to the target
        SEQUENCE BY ranking_date                                  -- Defines order of operations for applying changes
        COLUMNS * EXCEPT (ranking_date, operation, sequenceNum)   -- Select columns and exclude metadata fields
        STORED AS SCD TYPE 1                                      -- Use SCD Type 1
    • Materialized View

      • Batch flow

Lakeflow SDP - Streaming Tables (opens in a new tab)

  • Each input row is handled only once.

  • The STREAM keyword before read_files tells the query to treat the dataset as a stream.

  • Trigger Policy

    • Continuous
    • Once
    • Timed
  • Example

    CREATE OR REFRESH STREAMING TABLE atp_bronze AS
    SELECT
      *
    FROM
      STREAM read_files(
        '/Volumes/workspace/experiment/atp/*.csv',
        format => "csv",
        sep => ",",
        header => true
      );
     
    CREATE OR REFRESH STREAMING TABLE atp_silver AS
    SELECT
      player,
      points
    FROM
      STREAM atp_bronze;
     
    CREATE OR REFRESH MATERIALIZED VIEW atp_gold AS
    SELECT
      player,
      SUM(points) AS total_points
    FROM
      atp_silver
    GROUP BY
      player
    ORDER BY
      player;

Lakeflow SDP - Materialized Views (opens in a new tab)

Lakeflow SDP - Data Quality

Recommendation | Impact
Store expectation definitions separately from pipeline logic. | Easily apply expectations to multiple datasets or pipelines. Update, audit, and maintain expectations without modifying pipeline source code.
Add custom tags to create groups of related expectations. | Filter expectations based on tags.
Apply expectations consistently across similar datasets. | Use the same expectations across multiple datasets and pipelines to evaluate identical logic.

Lakeflow SDP - Expectations - On Violation

  • warn

    Invalid records are written to the target.

    EXPECT
  • drop

    Invalid records are dropped before data is written to the target.

    EXPECT ... ON VIOLATION DROP ROW
  • fail

    Invalid records prevent the update from succeeding.

    EXPECT ... ON VIOLATION FAIL UPDATE
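
A sketch combining the three behaviors on a streaming table; atp_bronze and the columns reuse the earlier example, and the constraint names are made up:

CREATE OR REFRESH STREAMING TABLE atp_silver (
  CONSTRAINT player_not_null EXPECT (player IS NOT NULL),                                  -- warn: violations are recorded, rows kept
  CONSTRAINT points_non_negative EXPECT (points >= 0) ON VIOLATION DROP ROW,               -- drop: violating rows are removed
  CONSTRAINT has_ranking_date EXPECT (ranking_date IS NOT NULL) ON VIOLATION FAIL UPDATE   -- fail: violations stop the update
)
AS SELECT player, points, ranking_date FROM STREAM atp_bronze;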

Lakeflow SDP - Compute

Lakeflow SDP - Compute - Serverless

Lakeflow SDP - Compute - Classic

Lakeflow SDP - Compute - Classic - Core

  • Declarative Pipelines in Python and SQL

Lakeflow SDP - Compute - Classic - Pro

  • Declarative Pipelines in Python and SQL
  • CDC

Lakeflow SDP - Compute - Classic - Advanced

  • Declarative Pipelines in Python and SQL
  • CDC
  • Data Quality Control

Lakeflow SDP - DLT - SQL

Lakeflow Jobs (previously Workflows)

Lakeflow Jobs - Jobs

  • A job is used to schedule and orchestrate tasks on Databricks in a workflow.

Lakeflow Jobs - Tasks (opens in a new tab)

  • You can schedule triggered Lakeflow SDP to run as a task in a job.

  • Types of tasks

    • Notebook
    • Python script
    • Python wheel
    • SQL
    • Pipeline
    • Dashboards
    • Power BI
    • dbt
    • dbt platform (Beta)
    • JAR
    • Spark Submit
    • Run Job
    • If/else
    • For each

Unity Catalog (opens in a new tab)

  • Data Governance (opens in a new tab)

    • Data Access Control (opens in a new tab)

      Layer | Purpose | Mechanisms
      Workspace-level restrictions | Limit which workspaces can access specific catalogs, external locations, and storage credentials | Workspace-level bindings
      Privileges and ownership | Control access to catalogs, schemas, tables, and other objects | Privilege grants to users and groups, object ownership
      Attribute-based policies (ABAC) | Use tags and policies to dynamically apply filters and masks | ABAC policies and governed tags
      Table-level filtering and masking | Control what data users can see within tables | Row filters, column masks, dynamic views
    • Data Access Audit

      Capture and record all access to data

    • Data Lineage (opens in a new tab)

      Tracks and visualizes the origin and usage of data assets, providing transparency and traceability.

      Demo (opens in a new tab)

    • Data Discovery

      Ability to search for and discover authorized assets

  • Gist

    • Each cloud region has its own Unity Catalog instance.
  • Managed storage location (opens in a new tab)

    Associated Unity Catalog object | How to set | Relation to external locations
    Metastore | Configured by account admin during metastore creation. | Cannot overlap an external location.
    Standard catalog | Specified during catalog creation using the MANAGED LOCATION keyword. | Must be contained within an external location.
    Foreign catalog | Specified after catalog creation using Catalog Explorer. | Must be contained within an external location.
    Schema | Specified during schema creation using the MANAGED LOCATION keyword. | Must be contained within an external location.
    • You can associate a managed storage location with a metastore, catalog, or schema. Managed storage locations at lower levels in the hierarchy override storage locations defined at higher levels when managed tables or managed volumes are created.
    • You can choose to store data at the metastore level, providing a default storage location for catalogs that don't have a managed storage location of their own.
    • Databricks recommends that you assign managed storage at the catalog level for logical data isolation, with metastore-level and schema-level as options.
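
    A sketch of assigning managed storage with the MANAGED LOCATION keyword (names and paths are hypothetical; the paths must fall inside an existing external location):

    CREATE CATALOG sales MANAGED LOCATION 's3://bucket/managed/sales';

    CREATE SCHEMA sales.emea MANAGED LOCATION 's3://bucket/managed/sales/emea';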

Unity Catalog - Metastore (opens in a new tab)

  • Top-level container for metadata in Unity Catalog

  • For a workspace to use Unity Catalog, it must have a Unity Catalog metastore attached.

  • You can link a single metastore to multiple workspaces in the same region, giving each workspace the same data view.

  • You should have one metastore for each region in which you have workspaces.

  • There can be multiple metastores in a single account.

  • Ownership

    • Metastore admins are the owners of the metastore.
    • Metastore admins can reassign ownership of the metastore by transferring the metastore admin role.

Unity Catalog - Catalog (opens in a new tab)

  • Each catalog typically has its own managed storage location to store managed tables and volumes, providing physical data isolation at the catalog level.
  • Because your Databricks account has one metastore per region, catalogs are inherently isolated by region.
  • By default, a catalog is shared with all workspaces attached to the current metastore.
  • Workspace-catalog binding (opens in a new tab) can be used to limit catalog access to specific workspaces in your account.

Unity Catalog - Views

View types

Unity Catalog - Volumes (opens in a new tab)

  • Use cases

    • Register landing areas for raw data produced by external systems to support its processing in the early stages of ETL pipelines and other data engineering activities.
    • Register staging locations for ingestion. e.g. using Auto Loader, COPY INTO, or CTAS (CREATE TABLE AS) statements.
    • Provide file storage locations for data scientists, data analysts, and machine learning engineers to use as parts of their exploratory data analysis and other data science tasks.
    • Give Databricks users access to arbitrary files produced and deposited in cloud storage by other systems. For example, large collections of unstructured data (such as image, audio, video, and PDF files) captured by surveillance systems or IoT devices, or library files (JARs and Python wheel files) exported from local dependency management systems or CI/CD pipelines.
    • Store operational data, such as logging or checkpointing files.
    • Recommended for managing all access to non-tabular data in cloud object storage and for storing workload support files.
  • catalog.schema.volume

  • Access pattern

    /Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>

    or

    dbfs:/Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>
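
    A sketch of creating a managed volume and accessing files through it; the catalog/schema/volume names follow the /Volumes/workspace/experiment/atp path used in the streaming table example:

    CREATE VOLUME workspace.experiment.atp;

    LIST '/Volumes/workspace/experiment/atp/';

    SELECT * FROM read_files('/Volumes/workspace/experiment/atp/*.csv', format => 'csv', header => true);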

Separation of metastore management

  • Databricks workspaces management before and after Unity Catalog

Centralized user management and access control

  • Identities

    Managed at 2 levels: account level and workspace level

    • Users
    • Groups
    • Service principals
  • Identity federation

    Allow identities to be created once at the account level and then assigned to multiple workspaces as needed.

  • Each catalog typically has its own managed storage location to store managed tables and volumes, providing physical data isolation at the catalog level.

    The Unity Catalog object model

    3 Level hierarchy: catalog.schema.table

    • Catalog

      Object privileges

      Catalog - Object privileges

    • Schema (Database)

      Object privileges

      Schema - Object privileges

    • Table / View / Function

      Object privileges

      Table - Object privileges

Securable objects (opens in a new tab)

Privilege types by securable object
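
A sketch of granting privileges at each level of the catalog.schema.table hierarchy (object names and the group are hypothetical):

GRANT USE CATALOG ON CATALOG sales TO `data_engineers`;
GRANT USE SCHEMA ON SCHEMA sales.emea TO `data_engineers`;
GRANT SELECT ON TABLE sales.emea.orders TO `data_engineers`;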

Docs - UI (opens in a new tab)

In the GitHub repo, Unity Catalog UI (opens in a new tab)

Docs - CLI (opens in a new tab)

In the GitHub repo, CLI - uc (opens in a new tab)

Unity Catalog - Integration with Confluent stack

Integration with Confluent stack

Unity Catalog - Delta Sharing (opens in a new tab)

Unity Catalog - Predictive Optimization (opens in a new tab)

  • Identifies tables that would benefit from ANALYZE, OPTIMIZE, and VACUUM operations and queues them to run using serverless compute for jobs.
  • Only applies to managed tables
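
A sketch of controlling predictive optimization per object, assuming the ALTER ... PREDICTIVE OPTIMIZATION syntax and a hypothetical catalog/schema:

ALTER CATALOG sales ENABLE PREDICTIVE OPTIMIZATION;

ALTER SCHEMA sales.emea INHERIT PREDICTIVE OPTIMIZATION;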

Notebooks

Notebooks - Examples - Pivot in SQL (opens in a new tab)

DevOps

Provision Databricks Workspace

Provision Databricks projects and resources

  • Databricks Asset Bundles (opens in a new tab)

    high-level view of a development and CI/CD pipeline with DAB

    • Infrastructure-as-code definition file in YAML format to define resources and configuration
    • Use Databricks CLI to operate
    • Use another CI/CD runner (e.g. Jenkins) to schedule and manage the pipeline for provisioning jobs

Databricks Git Folders (Repos) (opens in a new tab)

  • Gist

    • Clone, push to, and pull from a remote Git repository.
    • Create and manage branches for development work, including merging, rebasing, and resolving conflicts.
    • Create notebooks, including IPYNB notebooks, and edit them and other files.
    • Visually compare differences upon commit and resolve merge conflicts.
  • Git operations (opens in a new tab)

Data Access

Object | Object identifier | File path | Cloud URI
External location | no | no | yes
Managed table | yes | no | no
External table | yes | no | yes
Managed volume | no | yes | no
External volume | no | yes | yes

File access (opens in a new tab)

  • Locations

    • Unity Catalog volumes

    • Workspace files

    • Cloud object storage

    • DBFS mounts and DBFS root

    • Ephemeral storage attached to the driver node of the cluster

  • Access patterns

    URI-style paths

    POSIX-style paths
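
    A sketch of the two styles against read_files, assuming an external location for the URI form and a Unity Catalog volume for the POSIX form (both paths are hypothetical):

    SELECT * FROM read_files('s3://bucket/raw/events/', format => 'csv', header => true);                  -- URI-style path
    SELECT * FROM read_files('/Volumes/workspace/experiment/atp/*.csv', format => 'csv', header => true);  -- POSIX-style path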

Resources