Overview
Driver
- The process running the `main()` function of the application and creating the `SparkContext` (see the sketch below)
- Contains the `SparkContext` object
- Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads.
- Data cannot be shared across different Spark applications (instances of `SparkContext`), meaning no inter-process sharing.
- Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network, to minimize network latency between the driver and the executors.
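A minimal PySpark sketch of the above: the script itself is the driver program, and running its `main()` creates the `SparkSession` and the `SparkContext` it wraps. The application name is illustrative.

```python
from pyspark.sql import SparkSession

def main():
    # Creating the SparkSession here also creates the application's SparkContext;
    # this process acts as the driver for the lifetime of the application.
    spark = (
        SparkSession.builder
        .appName("driver-demo")  # illustrative name
        .getOrCreate()
    )
    sc = spark.sparkContext
    print(sc.applicationId, sc.master)
    spark.stop()

if __name__ == "__main__":
    main()
```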
Driver - Deploy mode
- Client (default): the driver runs locally as an external client
- Cluster: the driver is deployed on the worker nodes
Cluster Manager
- Types
  - Standalone
  - Hadoop YARN
  - Kubernetes
- Master URL
  - Kubernetes: `k8s://https://${k8s_apiserver_host}:${k8s_apiserver_port}` (see the sketch below)
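A hedged sketch of pointing a SparkSession at a cluster manager through the master URL. The API server address and container image are placeholders, and a real Kubernetes submission is usually done via `spark-submit` with additional `spark.kubernetes.*` settings.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Kubernetes master URL in the k8s://https://host:port form shown above (placeholder address).
    .master("k8s://https://k8s-apiserver.example.com:6443")
    .appName("master-url-demo")
    # Executors run from a Spark container image; the image tag here is an assumption.
    .config("spark.kubernetes.container.image", "apache/spark:4.0.0")
    .getOrCreate()
)
```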
- Playground
  - How to quickly set up a local Spark development environment
  - Spark + Iceberg + Jupyter
Code
Deployment
Docker
Resources
AWS Prescriptive Guidance - Key topics in Apache Spark
References
Apache Spark JIRA
Spark - ScalaDoc
Spark - Configuration
PySpark
Reference - PySpark API Reference
Reference - PySpark - Examples
PySpark - Python Spark Connect Client
PySpark - Spark SQL and DataFrames
Spark SQL, DataFrames and Datasets Guide
PySpark - Spark SQL API Reference
PySpark - Pandas API on Spark
PySpark - Structured Streaming
PySpark - Machine Learning (MLlib)
PySpark - Spark Core and RDDs
PySpark - Compatibility Matrix

Spark SQL
Spark SQL - Built-in Functions
Global Temporary View
Structured Streaming
- Structured Streaming Programming Guide - API using Datasets and DataFrames
- Structured Streaming - PySpark API Reference
- The Internals of Spark Structured Streaming
Input Sources
- Built-in sources (see the sketch below)
  - File source
  - Kafka source
  - Socket source (testing)
  - Rate source (testing)
  - Rate per Micro-Batch source (testing)
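A short sketch of two of the built-in sources. The JSON directory path and schema are placeholders; the rate source is handy for testing because it needs no external system.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sources-demo").getOrCreate()

# File source: treats new files landing in a directory as a stream.
files_df = (
    spark.readStream
    .format("json")
    .schema("id INT, ts TIMESTAMP")  # streaming file sources need an explicit schema
    .load("/tmp/incoming/")          # placeholder directory
)

# Rate source (testing): generates (timestamp, value) rows at a fixed rate.
rate_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
```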
Stateful streaming
- A stateful Structured Streaming query requires incremental updates to intermediate state information.
  - Streaming aggregation
  - Streaming `dropDuplicates`
  - Stream-stream joins
  - Custom stateful applications
- A stateless Structured Streaming query only tracks information about which rows have been processed from the source to the sink (see the sketch below).
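A sketch contrasting the two, built on the rate source so it is self-contained; the bucketing key is arbitrary.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stateful-vs-stateless").getOrCreate()
rate = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Stateless: each row can be processed on its own; Spark only tracks source progress.
stateless_df = rate.select((F.col("value") * 2).alias("doubled"))

# Stateful: running counts per bucket require intermediate state across micro-batches.
stateful_df = rate.groupBy((F.col("value") % 10).alias("bucket")).count()
```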
Output mode
- Only stateful streams containing aggregations require an output mode configuration.
- For stateless streaming, all output modes behave the same.
- Output modes (see the sketch below)
  - Append mode (default)
  - Update mode
  - Complete mode
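A sketch of setting the output mode on `writeStream`, using the console sink and a small aggregation for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("output-mode-demo").getOrCreate()
counts = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    .groupBy((F.col("value") % 10).alias("bucket"))
    .count()
)

query = (
    counts.writeStream
    .outputMode("complete")  # "update" also works here; "append" needs a watermark on aggregations
    .format("console")
    .start()
)
# query.awaitTermination()
```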
Streaming Joins
Stream-Static Joins (see the sketch below)
- Inner joins
- Left outer joins (left side is a streaming DataFrame)
- Right outer joins (right side is a streaming DataFrame)
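A stream-static join sketch with the stream on the left, so inner and left outer joins are the supported shapes; the lookup values are made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

# Streaming (left) side.
stream_df = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    .withColumn("key", F.col("value") % 3)
)

# Static (right) side: a small in-memory lookup table.
static_df = spark.createDataFrame([(0, "red"), (1, "green"), (2, "blue")], ["key", "color"])

joined = stream_df.join(static_df, on="key", how="left_outer")
```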
Stream-Stream Joins
Trigger intervals
By default, ProcessingTime = 500ms.
- Fixed interval micro-batches
- Continuous with fixed checkpoint interval: continuously process data as it arrives
- Available-now micro-batch (replaces the one-time micro-batch): consumes all available records as an incremental batch, with the ability to configure batch size (see the sketch below)
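A sketch of the trigger settings; the intervals are arbitrary and the console sink is only for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-demo").getOrCreate()
rate = spark.readStream.format("rate").load()

# Fixed-interval micro-batches every 30 seconds.
query = (
    rate.writeStream
    .format("console")
    .trigger(processingTime="30 seconds")
    .start()
)

# Swap the trigger line for the other styles:
#   .trigger(availableNow=True)      # consume everything currently available, then stop
#   .trigger(continuous="1 second")  # continuous processing with a fixed checkpoint interval
```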
Watermarking
- Gist
  - Allows Spark to understand when to close the aggregate window and produce the correct aggregate result.
  - Spark waits to close and output the windowed aggregation once the max event time seen minus the specified watermark is greater than the upper bound of the window.
  - Watermarks can only be used when you are running your streaming application in append or update output mode.
  - You define a watermark by specifying a timestamp field and a value representing the time threshold for late data to arrive, e.g. `WATERMARK timestamp DELAY OF INTERVAL 3 MINUTES` (see the sketch below).
- Feature Deep Dive: Watermarking in Apache Spark Structured Streaming
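A sketch mirroring the 3-minute watermark above, applied to a 5-minute tumbling-window count over the rate source's event-time column.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

windowed_counts = (
    events
    .withWatermark("timestamp", "3 minutes")              # tolerate up to 3 minutes of late data
    .groupBy(F.window(F.col("timestamp"), "5 minutes"))   # 5-minute tumbling windows
    .count()
)

# In append mode, a window's row is emitted only after the watermark passes its upper bound.
query = (
    windowed_counts.writeStream
    .outputMode("append")
    .format("console")
    .start()
)
```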
Spark Declarative Pipelines (SDP)
- Gist
  - Apache Spark 4.1+
- Resources
SDP - Flows
- A flow is a transformation relationship from a source to a target dataset.
- Supports both streaming and batch semantics.
- SDP shares the same streaming flow types (Append, Update, Complete) as Spark Structured Streaming.
SDP - Datasets
- Streaming Table
  - Incremental processing to process only new data as it arrives
- Materialized View
  - Precomputed
  - Always has exactly one batch flow writing to it
- Temporary View
SDP - Pipelines
- A pipeline can contain one or more flows, streaming tables, and materialized views.
SDP - Pipeline Projects
- A pipeline project is a set of source files containing the code that defines the datasets and flows that make up a pipeline.
- Source files can be `.py` or `.sql` files.
- The pipeline spec file is called `pipeline.yml` by default.
SDP - Python
- SDP evaluates the code that defines a pipeline multiple times during planning and pipeline runs. Python functions that define datasets should include only the code required to define the table or view.
- The function used to define a dataset must return a Spark DataFrame.
- Never use methods that save or write to files or tables as part of your SDP dataset code (see the sketch below).
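A hedged sketch of a Python dataset definition, assuming the `pyspark.pipelines` decorator API from the Spark 4.1 preview docs; the decorator and module names may differ in your release, and the source table is hypothetical.

```python
from pyspark import pipelines as dp          # assumed module alias from the SDP preview docs
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.active()                # the pipeline runner provides the active session

@dp.materialized_view                        # assumed decorator; the function name becomes the dataset name
def daily_orders():
    # Only definition logic here: return a DataFrame, never save/write inside the function.
    return (
        spark.read.table("orders")           # hypothetical source table
        .groupBy(F.to_date("order_ts").alias("order_date"))
        .count()
    )
```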
SDP - SQL
- The PIVOT clause is not supported in SDP SQL.
- When using the for loop pattern to define datasets in Python, ensure that the list of values passed to the for loop is always additive (see the sketch below).
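A hedged sketch of the for-loop pattern under the same assumed decorator API as above; the `name=` keyword and the source table are assumptions, and the point is that the list should only ever grow.

```python
from pyspark import pipelines as dp          # assumed module alias, as above
from pyspark.sql import SparkSession

spark = SparkSession.active()
regions = ["emea", "amer", "apac"]           # additive list: append new values, never remove

for region in regions:
    @dp.materialized_view(name=f"orders_{region}")   # assumed keyword argument
    def _(region=region):                            # bind the loop variable via a default argument
        return spark.read.table("orders").where(f"region = '{region}'")  # hypothetical table
```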
Kubernetes
Kubernetes - Kubeflow Spark Operator
Local setup
- Install a k8s cluster (k3d, minikube, kind, etc.)
- Install Helm
- Add the Helm repo
  ```shell
  helm repo add spark-operator https://kubeflow.github.io/spark-operator
  helm repo update
  ```
- Install the Spark Operator chart
  ```shell
  helm install $release spark-operator/spark-operator \
    --namespace $namespace \
    --create-namespace
  ```
- Install a service account for Spark Operator
  ```shell
  kubectl apply -f https://github.com/kubeflow/spark-operator/blob/master/config/rbac/spark-application-rbac.yaml
  ```
- Run a Spark application as a demo. Ensure the name of the service account created in the previous step is used here.
  ```shell
  kubectl apply -f https://github.com/kubeflow/spark-operator/blob/master/examples/spark-pi.yaml
  ```
- Check the Spark application status
  ```shell
  kubectl get sparkapp $spark_app_name -A
  ```
Kubernetes - Apache Spark Kubernetes Operator
- Official project
- Younger project compared to Kubeflow
Catalog API
Apache Gluten
- Gist (see the config sketch below)
  - Transform Spark’s whole-stage physical plan to a Substrait plan and send it to native
  - Offload performance-critical data processing to native libraries
  - Define clear JNI interfaces for native libraries
  - Switch available native backends easily
  - Reuse Spark’s distributed control flow
  - Manage data sharing between JVM and native
  - Extensible to support more native accelerators
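A hedged sketch of enabling Gluten through Spark's plugin mechanism. `spark.plugins` and the off-heap settings are standard Spark configs; the Gluten plugin class and columnar shuffle manager names follow the Gluten docs but vary by release (older releases used the `io.glutenproject` namespace), so verify them against your version, and the Gluten jars must be on the classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gluten-demo")
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin")  # assumed plugin class name
    .config("spark.memory.offHeap.enabled", "true")             # native backends work on off-heap memory
    .config("spark.memory.offHeap.size", "2g")
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.sort.ColumnarShuffleManager")  # assumed, per Gluten docs
    .getOrCreate()
)
```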
Performance Tuning
Spark UI
- Spark Web UI – Understanding Spark Execution
- Databricks Docs - Debugging with the Spark UI
Symptoms - Spill
- Spill is what happens when Spark runs low on memory: it starts to move data from memory to disk, and this can be quite expensive (see the sketch below).
- It is most common during data shuffling.
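One common knob for spill, shown as a sketch: give each shuffle task less data by raising the shuffle partition count, or let AQE size partitions at runtime. The values are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spill-demo").getOrCreate()

# More (smaller) shuffle partitions mean less data per task and less chance of spilling to disk.
spark.conf.set("spark.sql.shuffle.partitions", "400")  # default is 200; tune to your data volume
spark.conf.set("spark.sql.adaptive.enabled", "true")   # let AQE right-size partitions at runtime
```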
Symptoms - Skew
- Skew is when one or just a few tasks take much longer than the rest.
- This results in poor cluster utilization and longer jobs.
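A sketch of one common mitigation: Adaptive Query Execution's skew-join handling, which splits oversized partitions at runtime. The tuning values shown are the stock defaults.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-demo").getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# What counts as "skewed": larger than factor * median partition size AND above the byte threshold.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```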
UDF
UDF - SQL
```sql
-- Example
CREATE OR REPLACE FUNCTION my_catalog.my_schema.calculate_bmi(weight DOUBLE, height DOUBLE)
RETURNS DOUBLE
LANGUAGE SQL
RETURN
SELECT weight / (height * height);
```
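A quick sketch of calling the UDF from PySpark once it exists in a catalog the session can resolve (the catalog and schema names are the example's own).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-udf-demo").getOrCreate()

spark.sql(
    "SELECT my_catalog.my_schema.calculate_bmi(70.0, 1.75) AS bmi"
).show()
```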