Databricks Certified - Data Engineer Professional
LEARNING PATHWAY 2: PROFESSIONAL DATA ENGINEERING
1. Databricks Streaming and Lakeflow Spark Declarative Pipelines
2. Databricks Data Privacy
3. Databricks Performance Optimization
4. Automated Deployment with Databricks Asset Bundles
Exam Guide
Section 1: Developing Code for Data Processing using Python and SQL - 22%
- Using Python and Tools for Development
  - Design and implement a scalable Python project structure optimized for Databricks Asset Bundles (DABs), enabling modular development, deployment automation, and CI/CD integration.
  - Manage and troubleshoot external third-party library installations and dependencies in Databricks, including PyPI packages, local wheels, and source archives.
  - Develop User-Defined Functions (UDFs) using Pandas/Python UDFs (see the sketch after this list).
- Building and Testing an ETL Pipeline with Lakeflow Declarative Pipelines, SQL, and Apache Spark on the Databricks Platform
  - Build and manage reliable, production-ready data pipelines for batch and streaming data using Lakeflow Declarative Pipelines and Auto Loader.
  - Create and automate ETL workloads using Jobs via the UI, APIs, or CLI.
  - Explain the advantages and disadvantages of streaming tables compared to materialized views.
  - Use APPLY CHANGES APIs to simplify CDC in Lakeflow Declarative Pipelines.
  - Compare Spark Structured Streaming and Lakeflow Declarative Pipelines to determine the optimal approach for building scalable ETL pipelines.
  - Create a pipeline component that uses control flow operators (e.g., if/else, for/each, etc.).
  - Choose the appropriate configs for environments and dependencies, high memory for notebook tasks, and auto-optimization to disallow retries.
  - Develop unit and integration tests using assertDataFrameEqual, assertSchemaEqual, DataFrame.transform, and testing frameworks to ensure code correctness, including the built-in debugger.
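To ground the UDF and testing objectives above, here is a minimal sketch assuming PySpark 3.5+ on Databricks; the function and column names (fahrenheit_to_celsius, temp_f) are illustrative, not part of the exam guide.

```python
# Minimal sketch: a vectorized Pandas UDF plus a unit test built with
# DataFrame.transform and assertDataFrameEqual. Names are illustrative.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf
from pyspark.testing.utils import assertDataFrameEqual

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double")
def fahrenheit_to_celsius(f: pd.Series) -> pd.Series:
    # Runs batch-by-batch over Arrow-backed pandas Series
    return (f - 32) * 5.0 / 9.0

def add_celsius(df):
    # Kept as a plain function so tests can apply it via DataFrame.transform
    return df.withColumn("temp_c", fahrenheit_to_celsius(col("temp_f")))

# Unit test: compare actual vs. expected DataFrames (schema and rows)
source = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
expected = spark.createDataFrame([(32.0, 0.0), (212.0, 100.0)], ["temp_f", "temp_c"])
assertDataFrameEqual(source.transform(add_celsius), expected)
```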
Section 2: Data Ingestion & Acquisition - 7%
- Design and implement data ingestion pipelines to efficiently ingest a variety of data formats, including Delta Lake, Parquet, ORC, AVRO, JSON, CSV, XML, text, and binary, from diverse sources such as message buses and cloud storage.
- Create an append-only data pipeline capable of handling both batch and streaming data using Delta (see the sketch below).
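As a concrete reference for the append-only objective, here is a minimal sketch using Auto Loader with a Delta target; the paths, catalog, and table names are placeholders, not values from the guide.

```python
# Minimal sketch: Auto Loader incrementally ingests JSON files from cloud
# storage and appends them to a Delta table. Paths and names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = (
    spark.readStream.format("cloudFiles")                 # Auto Loader source
    .option("cloudFiles.format", "json")                   # file format to ingest
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/orders")
    .load("/Volumes/main/raw/orders")
)

(
    raw.writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/orders")
    .outputMode("append")                                   # append-only semantics
    .trigger(availableNow=True)                             # run as an incremental batch
    .toTable("main.bronze.orders")                          # Delta target table
)
```

The same code serves batch and streaming: trigger(availableNow=True) drains whatever is new and stops, while removing the trigger runs the pipeline continuously.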
Section 3: Data Transformation, Cleansing, and Quality - 10%
- Write efficient Spark SQL and PySpark code to apply advanced data transformations, including window functions, joins, and aggregations, to manipulate and analyze large datasets.
- Develop a quarantining process for bad data with Lakeflow Declarative Pipelines or Auto Loader in classic jobs (see the sketch below).
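A minimal sketch of the window-function and quarantining objectives in PySpark; the validation rule and table names are assumptions for illustration.

```python
# Minimal sketch: window-function transformation plus a quarantine split that
# routes invalid records to a separate table. Names and rules are illustrative.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.table("main.bronze.orders")

# Window function: rank each customer's orders by amount, highest first
w = Window.partitionBy("customer_id").orderBy(F.col("amount").desc())
ranked = orders.withColumn("amount_rank", F.row_number().over(w))

# Quarantine pattern: valid rows go to silver, bad rows to a quarantine table
is_valid = F.col("amount").isNotNull() & (F.col("amount") >= 0)
ranked.filter(is_valid).write.mode("append").saveAsTable("main.silver.orders")
ranked.filter(~is_valid).write.mode("append").saveAsTable("main.quarantine.orders_bad")
```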
Section 4: Data Sharing and Federation - 5%
- Demonstrate Delta Sharing securely between Databricks deployments using Databricks-to-Databricks sharing (D2D) or to external platforms using the open sharing protocol (D2O) (see the sketch below).
- Configure Lakehouse Federation with proper governance across supported source systems.
- Use Delta Sharing to share live data from the Lakehouse to any computing platform.
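To illustrate the open sharing protocol from the recipient side, here is a minimal sketch with the delta-sharing Python client; the profile path and the share, schema, and table names are placeholders supplied by a provider.

```python
# Minimal sketch: consume a Delta Sharing share via the open sharing protocol
# (pip install delta-sharing). Profile path and share names are placeholders.
import delta_sharing

profile = "/Volumes/main/shares/config.share"   # credential file from the provider

# Discover the tables the provider has shared
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table into pandas: <profile>#<share>.<schema>.<table>
df = delta_sharing.load_as_pandas(f"{profile}#retail_share.sales.orders")
```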
Section 5: Monitoring and Alerting - 10%
- Monitoring
  - Use system tables for observability over resource utilization, cost, auditing, and workload monitoring (see the sketch after this list).
  - Use the Query Profiler UI and Spark UI to monitor workloads.
  - Use the Databricks REST APIs / Databricks CLI for monitoring jobs and pipelines.
  - Use Lakeflow Declarative Pipelines event logs to monitor pipelines.
- Alerting
  - Use SQL Alerts to monitor data quality.
  - Use the Workflows UI and Jobs API to set up job status and performance issue notifications.
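As one concrete example of the system-tables objective, the sketch below aggregates 30 days of DBU usage from system.billing.usage; it assumes the billing system schema is enabled in the workspace.

```python
# Minimal sketch: cost observability from the billing system table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

usage = (
    spark.table("system.billing.usage")
    .filter(F.col("usage_date") >= F.date_sub(F.current_date(), 30))
    .groupBy("usage_date", "sku_name")
    .agg(F.sum("usage_quantity").alias("dbus"))
    .orderBy("usage_date")
)
usage.show()
```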
Section 6: Cost & Performance Optimisation - 13%
- Understand how and why using Unity Catalog managed tables reduces operational overhead and maintenance burden.
- Understand Delta optimization techniques, such as deletion vectors and liquid clustering.
- Understand the optimization techniques used by Databricks to ensure performance of queries on large datasets (data skipping, file pruning, etc.).
- Apply Change Data Feed (CDF) to address specific limitations of streaming tables and improve latency (see the sketch below).
- Use the query profile to analyze a query and identify bottlenecks, such as poor data skipping, inefficient join types, and data shuffling.
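A minimal sketch of the Change Data Feed objective: enable CDF on a Delta table, then stream only its row-level changes; table names are placeholders.

```python
# Minimal sketch: enable Change Data Feed (CDF) and read changes as a stream.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One-time setup: enable CDF on the source table
spark.sql("""
  ALTER TABLE main.silver.orders
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Downstream consumers read only inserts/updates/deletes, not full snapshots
changes = (
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .table("main.silver.orders")
)
```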
Section 7: Ensuring Data Security and Compliance - 10%
- Applying Data Security Mechanisms
  - Use ACLs to secure workspace objects, enforcing the principle of least privilege and consistent policy enforcement.
  - Use row filters and column masks to filter and mask sensitive table data (see the sketch after this list).
  - Apply anonymization and pseudonymization methods, such as hashing, tokenization, suppression, and generalization, to confidential data.
- Ensuring Compliance
  - Implement a compliant batch and streaming pipeline that detects and applies masking of PII to ensure data privacy.
  - Develop a data purging solution ensuring compliance with data retention policies.
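To make the row-filter and column-mask objective concrete, here is a minimal sketch issued through spark.sql; the function, table, group, and region values are illustrative assumptions.

```python
# Minimal sketch: a row filter and a column mask defined as SQL UDFs and
# attached to a Unity Catalog table. Names and group values are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Row filter: admins see everything, everyone else only the 'US' region
spark.sql("""
  CREATE OR REPLACE FUNCTION main.security.us_filter(region STRING)
  RETURN IF(is_account_group_member('admins'), true, region = 'US')
""")
spark.sql("""
  ALTER TABLE main.silver.customers
  SET ROW FILTER main.security.us_filter ON (region)
""")

# Column mask: redact email for anyone outside the pii_readers group
spark.sql("""
  CREATE OR REPLACE FUNCTION main.security.mask_email(email STRING)
  RETURN CASE WHEN is_account_group_member('pii_readers') THEN email ELSE '***' END
""")
spark.sql("""
  ALTER TABLE main.silver.customers
  ALTER COLUMN email SET MASK main.security.mask_email
""")
```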
Section 8: Data Governance - 7%
- Create and add descriptions/metadata about enterprise data to make it more discoverable (see the sketch below).
- Demonstrate understanding of the Unity Catalog permission inheritance model.
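As a small illustration of the discoverability objective, the sketch below adds table and column comments plus tags via SQL; object names and tag values are placeholders.

```python
# Minimal sketch: descriptions and tags that make data discoverable in
# Catalog Explorer. Object names and tag values are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("COMMENT ON TABLE main.silver.customers IS 'One row per active customer, refreshed nightly'")
spark.sql("ALTER TABLE main.silver.customers ALTER COLUMN email COMMENT 'Primary contact email (PII)'")
spark.sql("ALTER TABLE main.silver.customers SET TAGS ('domain' = 'sales', 'contains_pii' = 'true')")
```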
Section 9: Debugging and Deploying - 10%
- Debugging and Troubleshooting
  - Identify pertinent diagnostic information using the Spark UI, cluster logs, system tables, and query profiles to troubleshoot errors.
  - Analyze errors and remediate failed job runs with job repairs and parameter overrides (see the sketch after this list).
  - Use Lakeflow Declarative Pipelines event logs and the Spark UI to debug Lakeflow Declarative Pipelines and Spark pipelines.
- Deploying CI/CD
  - Build and deploy Databricks resources using Databricks Asset Bundles.
  - Configure and integrate with Git-based CI/CD workflows using Databricks Git Folders for notebook and code deployment.
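A minimal sketch of repairing a failed job run with parameter overrides using the Databricks SDK for Python; the run ID, task key, and parameter names are hypothetical, and the exact keyword arguments should be checked against the SDK version in use.

```python
# Minimal sketch: remediate a failed run via the Jobs repair API
# (pip install databricks-sdk). Run ID, task key, and parameters are hypothetical.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()          # auth from environment variables or a config profile

failed_run_id = 123456789      # hypothetical run ID taken from the Jobs UI/API

# Re-run only the failed task, overriding a job-level parameter for the retry
w.jobs.repair_run(
    run_id=failed_run_id,
    rerun_tasks=["load_orders"],
    job_parameters={"processing_date": "2024-01-15"},
)
```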
Section 10: Data Modelling - 6%
- Design and implement scalable data models using Delta Lake to manage large datasets.
- Simplify data layout decisions and optimize query performance using Liquid Clustering (see the sketch below).
- Identify the benefits of using Liquid Clustering over partitioning and Z-Ordering.
- Design dimensional models for analytical workloads, ensuring efficient querying and aggregation.
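Finally, a minimal sketch of the Liquid Clustering objective, creating a clustered Delta table instead of using partitioning or Z-Ordering; table and column names are placeholders.

```python
# Minimal sketch: a Delta table with Liquid Clustering instead of partitions
# or Z-Ordering. Table and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
  CREATE TABLE IF NOT EXISTS main.gold.fact_sales (
    sale_id BIGINT,
    customer_id BIGINT,
    sale_date DATE,
    amount DECIMAL(18, 2)
  )
  CLUSTER BY (sale_date, customer_id)
""")

# Clustering keys can be changed later without rewriting existing data files
spark.sql("ALTER TABLE main.gold.fact_sales CLUSTER BY (customer_id)")

# OPTIMIZE incrementally clusters newly written data
spark.sql("OPTIMIZE main.gold.fact_sales")
```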