Apache Spark

Overview

Driver

  • The process running the main() function of the application and creating the SparkContext
  • Contains the SparkContext object
  • Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads.
  • Data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system; in other words, there is no inter-process sharing.
  • Because the driver schedules tasks on the cluster, it should run close to the worker nodes, preferably on the same local area network, to minimize network latency between the driver and the executors.
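
A minimal PySpark sketch of the relationship above: the driver is the process that builds the SparkSession (and with it the SparkContext) and schedules actions on the executors. Names and values are illustrative.

    # The driver process: creates the SparkSession/SparkContext and schedules work.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("driver-demo").getOrCreate()
    print(spark.sparkContext.applicationId)  # the SparkContext lives in the driver
    print(spark.range(10).count())           # the action is scheduled by the driver, run on executors
    spark.stop()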

Driver - Deploy mode

  • Client (default)

    The driver runs on the submitting machine, outside the cluster, as an external client.

  • Cluster

    The driver is launched on one of the worker nodes inside the cluster.

Cluster Manager

  • Types

    • Standalone
    • Hadoop YARN
    • Kubernetes
  • Master URL

    • Kubernetes

      k8s://https://${k8s_apiserver_host}:${k8s_apiserver_port}
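
A hedged sketch of using the Kubernetes master URL from a PySpark driver in client mode; the API server address, container image, and executor count are placeholder assumptions.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("k8s-client-demo")
        .master("k8s://https://k8s-apiserver.example.com:6443")            # ${k8s_apiserver_host}:${k8s_apiserver_port}
        .config("spark.kubernetes.container.image", "apache/spark:4.0.0")  # assumed executor image
        .config("spark.executor.instances", "2")
        .getOrCreate()
    )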

Playground

Code

Deployment

Docker

Resources

AWS Prescriptive Guidance - Key topics in Apache Spark

References

Apache Spark JIRA

Spark - ScalaDoc

Spark - Configuration

PySpark

Reference - PySpark API Reference

Reference - PySpark - Examples

PySpark - Python Spark Connect Client

PySpark - Spark SQL and DataFrames

Spark SQL, DataFrames and Datasets Guide

PySpark - Spark SQL API Reference

PySpark - Pandas API on Spark

PySpark - Structured Streaming

PySpark - Machine Learning (MLlib)

PySpark - Spark Core and RDDs

PySpark - Compatibility Matrix

Spark Python Supportability Matrix

Spark SQL

Spark SQL - Built-in Functions

Global Temporary View
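
A minimal sketch of a global temporary view: it is registered in the system-preserved global_temp database and is visible to other SparkSessions within the same application.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("global-temp-view-demo").getOrCreate()

    spark.range(5).createGlobalTempView("numbers")
    spark.sql("SELECT * FROM global_temp.numbers").show()                      # must qualify with global_temp
    spark.newSession().sql("SELECT count(*) FROM global_temp.numbers").show()  # another session, same application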

Structured Streaming

Input Sources

  • Built-in sources

    • File source
    • Kafka source
    • Socket source (testing)
    • Rate source (testing)
    • Rate per Micro-Batch source (testing)
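
A minimal sketch using the rate source from the list above, which is handy for testing because it generates rows with timestamp and value columns without any external dependency; the console sink and options are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rate-source-demo").getOrCreate()

    stream = (
        spark.readStream
        .format("rate")                  # built-in testing source
        .option("rowsPerSecond", 10)
        .load()
    )

    query = stream.writeStream.format("console").start()
    query.awaitTermination(30)           # run briefly for the demo
    query.stop()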

Stateful streaming

  • A stateful Structured Streaming query requires incremental updates to intermediate state information.

    • Streaming aggregation
    • Streaming dropDuplicates
    • Stream-stream joins
    • Custom stateful applications
  • A stateless Structured Streaming query only tracks information about which rows have been processed from the source to the sink.
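
A minimal sketch contrasting the two: the projection keeps no intermediate state, while the streaming aggregation must maintain running counts across micro-batches (column names come from the rate source; the bucketing is illustrative).

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stateful-vs-stateless").getOrCreate()
    rate = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    stateless = rate.select("timestamp", "value")                           # no intermediate state
    stateful = rate.groupBy((F.col("value") % 10).alias("bucket")).count()  # streaming aggregation keeps state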

Output mode

  • Only stateful streams containing aggregations require an output mode configuration.
  • For stateless streaming, all output modes behave the same.
  • Output mode
    • Append mode (default)
    • Update mode
    • Complete mode
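
A minimal sketch of setting an output mode on a stateful aggregation; complete mode is used here because append mode would additionally require a watermark on the aggregation.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("output-mode-demo").getOrCreate()

    counts = (
        spark.readStream.format("rate").option("rowsPerSecond", 5).load()
        .groupBy((F.col("value") % 10).alias("bucket"))
        .count()
    )

    query = (
        counts.writeStream
        .outputMode("complete")   # alternatives: "append" (needs a watermark here), "update"
        .format("console")
        .start()
    )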

Streaming Joins

Stream-Static Joins

  • Inner joins
  • Left outer joins (left side is a streaming DataFrame)
  • Right outer joins (right side is a streaming DataFrame)
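
A minimal sketch of a stream-static left outer join, with the streaming DataFrame on the left as required; the static dimension table and the join condition are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

    static_dim = spark.createDataFrame([(0, "even"), (1, "odd")], ["id", "label"])  # static side
    stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()      # streaming side (left)

    joined = stream.join(static_dim, (stream["value"] % 2) == static_dim["id"], "left_outer")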

Stream-Stream Joins

Trigger intervals

By default, ProcessingTime=500ms

  • Fixed interval micro-batches

  • Continuous with fixed checkpoint interval

    Continuously process data as it arrives

  • Available-now micro-batch (replaces One-time micro-batch)

    Consumes all available records as an incremental batch with the ability to configure batch size
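
A minimal sketch of the trigger options listed above; normally only one trigger is chosen per query, and not every source/sink combination supports every trigger.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("trigger-demo").getOrCreate()
    stream = spark.readStream.format("rate").load()

    # Fixed-interval micro-batches
    q1 = stream.writeStream.format("console").trigger(processingTime="10 seconds").start()

    # Available-now: process everything currently available as an incremental batch, then stop
    q2 = stream.writeStream.format("console").trigger(availableNow=True).start()

    # Continuous processing with a fixed checkpoint interval
    q3 = stream.writeStream.format("console").trigger(continuous="1 second").start()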

Watermarking

  • Gist

    • Allows Spark to determine when to close an aggregation window and produce the correct aggregate result.

    • Spark closes and outputs a windowed aggregation once the maximum event time seen, minus the specified watermark delay, is greater than the upper bound of the window.

    • Watermarks can only be used when you are running your streaming application in append or update output modes.

    • You define a watermark by specifying a timestamp field and a value representing the time threshold for late data to arrive.

      WATERMARK timestamp DELAY OF INTERVAL 3 MINUTES
  • Feature Deep Dive: Watermarking in Apache Spark Structured Streaming
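
A minimal DataFrame sketch of the same idea: a 3-minute watermark on the event-time column combined with a 5-minute tumbling window, written out in append mode.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("watermark-demo").getOrCreate()
    events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    windowed_counts = (
        events
        .withWatermark("timestamp", "3 minutes")        # tolerate 3 minutes of late data
        .groupBy(F.window("timestamp", "5 minutes"))
        .count()
    )

    query = windowed_counts.writeStream.outputMode("append").format("console").start()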

Spark Declarative Pipelines (SDP)

SDP - Flows

  • A flow is a transformation relationship from source to target dataset.
  • Supports both streaming and batch semantics
  • SDP shares the same streaming flow type (Append, Update, Complete) as Spark Structured Streaming.

SDP - Datasets

  • Streaming Table

    • Processed incrementally, handling only new data as it arrives
  • Materialized View

    • Precomputed
    • Always has exactly one batch flow writing to it
  • Temporary View

SDP - Pipelines

  • A pipeline can contain one or more flows, streaming tables, and materialized views.

SDP - Pipeline Projects

  • A pipeline project is a set of source files containing the code that defines the datasets and flows that make up a pipeline.
  • Source files can be .py or .sql files.
  • Pipeline spec file by default is called pipeline.yml.

SDP - Python

  • SDP evaluates the code that defines a pipeline multiple times during planning and pipeline runs. Python functions that define datasets should include only the code required to define the table or view.
  • The function used to define a dataset must return a Spark DataFrame.
  • Never use methods that save or write to files or tables as part of your SDP dataset code.
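
A heavily hedged sketch of an SDP Python source file, assuming the pyspark.pipelines module and a materialized_view decorator as described in the SDP docs, plus a spark session provided by the pipeline runtime; the source table name is made up.

    from pyspark import pipelines as dp           # assumed import path
    from pyspark.sql import functions as F

    @dp.materialized_view()                       # assumed decorator name
    def daily_orders():
        # Only dataset-defining code: return a DataFrame, never write it yourself.
        return (
            spark.read.table("samples.orders")    # hypothetical source table
            .groupBy(F.to_date("order_ts").alias("order_date"))
            .count()
        )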

SDP - SQL

  • The PIVOT clause is not supported in SDP SQL.
  • When using the for loop pattern to define datasets in Python, ensure that the list of values passed to the for loop is always additive.

Kubernetes

Kubernetes - Kubeflow Spark Operator

Local set up

  1. Install a k8s cluster (k3d, minikube, kind, etc.)

  2. Install Helm

  3. Add Helm Repo

    helm repo add spark-operator https://kubeflow.github.io/spark-operator
    helm repo update
  4. Install Spark Operator chart

    helm install $release spark-operator/spark-operator \
      --namespace $namespace \
      --create-namespace
  5. Install a service account for Spark Operator

    kubectl apply -f https://raw.githubusercontent.com/kubeflow/spark-operator/master/config/rbac/spark-application-rbac.yaml
  6. Run a Spark application as demo

    Ensure the name of the service account created in the previous step is used here.

    kubectl apply -f https://raw.githubusercontent.com/kubeflow/spark-operator/master/examples/spark-pi.yaml
  7. Check the Spark application status

    kubectl get sparkapp -A
    kubectl get sparkapp $spark_app_name -n <app_namespace>

Kubernetes - Apache Spark Kubernetes Operator

  • The official Apache Spark operator subproject
  • A younger project than the Kubeflow Spark Operator

Catalog API
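
A minimal sketch of the session-level Catalog API (spark.catalog) for inspecting catalogs, databases, and tables.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("catalog-api-demo").getOrCreate()

    print(spark.catalog.currentCatalog())   # e.g. spark_catalog
    print(spark.catalog.currentDatabase())  # e.g. default
    print(spark.catalog.listDatabases())
    print(spark.catalog.listTables())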

Apache Gluten

  • Gist

    • Transform Spark’s whole stage physical plan to Substrait plan and send to native
    • Offload performance-critical data processing to native library
    • Define clear JNI interfaces for native libraries
    • Switch available native backends easily
    • Reuse Spark’s distributed control flow
    • Manage data sharing between JVM and native
    • Extensible to support more native accelerators

Performance Tuning

Spark UI

Symptoms - Spill

  • Spill is what happens when Spark runs low on memory. It starts to move data from memory to disk, and this can be quite expensive.
  • It is most common during data shuffling.

Symptoms - Skew

  • Skew is when one or just a few tasks take much longer than the rest.
  • This results in poor cluster utilization and longer jobs.
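
A hedged sketch of configuration knobs commonly checked when the Spark UI shows spill or skew; the values are illustrative starting points, not recommendations.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("tuning-demo")
        .config("spark.sql.adaptive.enabled", "true")           # AQE (on by default in recent releases)
        .config("spark.sql.adaptive.skewJoin.enabled", "true")  # let AQE split skewed shuffle partitions
        .config("spark.sql.shuffle.partitions", "400")          # more partitions can reduce per-task spill
        .getOrCreate()
    )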

UDF

UDF - SQL

-- Example: scalar SQL UDF
CREATE OR REPLACE FUNCTION my_catalog.my_schema.calculate_bmi(weight DOUBLE, height DOUBLE)
RETURNS DOUBLE
LANGUAGE SQL
RETURN weight / (height * height);
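
A hedged PySpark usage sketch, assuming a Spark version with SQL UDF support; the catalog/schema qualifiers are dropped for a plain local session.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-udf-demo").getOrCreate()

    spark.sql("""
        CREATE OR REPLACE FUNCTION calculate_bmi(weight DOUBLE, height DOUBLE)
        RETURNS DOUBLE
        RETURN weight / (height * height)
    """)

    spark.sql("SELECT calculate_bmi(70.0, 1.75) AS bmi").show()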