Data Ecosystem

Compute Frameworks

Apache Hadoop

  • Data-intensive computing framework implementing the MapReduce programming model
  • Batch processing of large data sets
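
As a concrete sketch of the MapReduce model, the classic word count can run under Hadoop Streaming with two small Python scripts (file names and the streaming setup are illustrative):

```python
#!/usr/bin/env python3
# mapper.py - emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - Hadoop delivers mapper output sorted by key,
# so counts for a word can be summed in a single pass
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t", 1)
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{count}")
        count = 0
    current_word = word
    count += int(n)
if current_word is not None:
    print(f"{current_word}\t{count}")
```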

Apache Spark

When to use Spark

  • Large-scale data processing

    For most heavy data-processing use cases, Spark is a strong default: it is optimized for operations like table joins, filtering, and aggregation.

  • Data parallelism

    Spark excels at data parallelism: applying the same operation to every element of a large dataset. It's ideal for ETL, analytics reporting, feature engineering, and data preprocessing (a PySpark sketch follows this list).

  • Machine learning

    Spark's MLlib (and its newer DataFrame-based spark.ml API) is optimized for large-scale machine learning algorithms and statistical modeling.
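
A minimal PySpark sketch of the join/filter/aggregate pattern described above (paths and column names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical input tables
orders = spark.read.parquet("s3://bucket/orders/")
customers = spark.read.parquet("s3://bucket/customers/")

# Join, filter, and aggregate: the data-parallel work Spark is optimized for
revenue = (
    orders.join(customers, "customer_id")
          .filter(F.col("status") == "completed")
          .groupBy("country")
          .agg(F.sum("amount").alias("revenue"))
)
revenue.write.mode("overwrite").parquet("s3://bucket/revenue/")
```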

Spark SQL (for SQL users)

  • SQL interface for Spark programs
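
A minimal sketch of the SQL interface: register a file-backed DataFrame as a view, then query it with plain SQL (paths and names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# Expose a Parquet dataset to SQL as a temporary view
spark.read.parquet("s3://bucket/orders/").createOrReplaceTempView("orders")

top = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
top.show()
```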

PySpark (for Python users)

  • Python API for Spark

Ray

When to use Ray

  • Task parallelism

    Ray is designed for task parallelism, where multiple tasks run concurrently and independently. It's particularly efficient for computation-focused tasks.

  • Specific workloads

    Use Ray for workloads where Spark is less optimized, such as reinforcement learning, hierarchical time series forecasting, simulation modeling, hyperparameter search, deep learning training, and high-performance computing (HPC).
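
A minimal sketch of Ray's task parallelism: decorate a function with @ray.remote, launch many independent tasks, and gather the results (the workload itself is a stand-in):

```python
import random

import ray

ray.init()  # start a local Ray runtime, or connect to a cluster

@ray.remote
def run_trial(seed: int) -> float:
    # Stand-in for a simulation or hyperparameter trial
    random.seed(seed)
    return random.random()

# Tasks run concurrently and independently; futures come back immediately
futures = [run_trial.remote(s) for s in range(16)]
results = ray.get(futures)
print(max(results))
```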

Spark on Ray (RayDP)

  • Runs Apache Spark on a Ray cluster (rather than replacing it), so Spark data processing and Ray-native ML workloads can share one program and one cluster
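
A sketch of how RayDP starts a Spark session whose executors run as Ray actors; parameter values are illustrative, and the raydp package must be installed alongside Ray and PySpark:

```python
import ray
import raydp

ray.init()

# Spark executors are launched as Ray actors inside the same cluster
spark = raydp.init_spark(
    app_name="etl-on-ray",
    num_executors=2,
    executor_cores=2,
    executor_memory="2GB",
)

df = spark.range(1000).toDF("id")
print(df.count())

raydp.stop_spark()
```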

Dask

  • Python-native alternative to Apache Spark; a good fit for small-to-medium datasets and pandas/NumPy-style workloads
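
A minimal Dask sketch: the pandas-like API is lazy and partitioned, and nothing executes until .compute() (file pattern and columns are hypothetical):

```python
import dask.dataframe as dd

# One logical dataframe over many CSV partitions
df = dd.read_csv("logs-*.csv")

# Builds a lazy task graph; .compute() triggers parallel execution
daily_bytes = df.groupby("date")["bytes"].sum().compute()
print(daily_bytes.head())
```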

Polars

  • Fast pandas replacement: a DataFrame library written in Rust
  • Single-node workloads
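
A minimal Polars sketch (recent Polars versions; file and column names are hypothetical). The lazy API lets Polars push filters down and parallelize across cores:

```python
import polars as pl

out = (
    pl.scan_csv("sales.csv")              # lazy scan, nothing read yet
      .filter(pl.col("amount") > 0)
      .group_by("region")
      .agg(pl.col("amount").sum().alias("total"))
      .collect()                          # execute the optimized plan
)
print(out)
```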

DuckDB

  • In-process analytical (OLAP) SQL database
  • Single-node workloads
  • A strong choice for small datasets and local jobs
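
A minimal DuckDB sketch: it runs in-process and can query files directly, with no server or load step (the Parquet file is hypothetical):

```python
import duckdb

result = duckdb.sql("""
    SELECT category, count(*) AS n, avg(price) AS avg_price
    FROM 'events.parquet'
    GROUP BY category
    ORDER BY n DESC
""")
print(result.df())  # materialize as a pandas DataFrame
```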

Query Engines

Apache Hive

  • SQL-like (HiveQL) abstraction over the Hadoop Java API that enables SQL analytics on big data in Hadoop

Apache Pig

  • Another DSL (Pig Latin) abstraction over the Hadoop Java API, meant to simplify data analytics for non-Java developers
  • Effectively discontinued; the last release (0.17.0) shipped in 2017

Presto

  • Petabyte (PB) scale
  • Interactive SQL query engine for big data
  • Intended as a replacement for Hive
  • Forked into PrestoDB and Trino (originally PrestoSQL) after the original creators left Facebook
  • AWS Athena was initially based on Presto; newer engine versions are based on Trino

Apache Drill

  • Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage

Metadata Management

Apache Hive Metastore

  • Central repository for table metadata (schemas, partitions, storage locations); also used by Spark, Presto/Trino, and other engines

Unity Catalog

  • Unified catalog and governance layer for data and AI assets, originating from Databricks

Lakehouse

Delta Lake

  • Open table format from Databricks that brings ACID transactions to data lakes: Parquet files plus a transaction log
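
A minimal Delta Lake sketch with PySpark, assuming the delta-spark package is installed (the path is illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-sketch")
    # Standard Delta Lake session configuration
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.range(100).toDF("id")
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")
spark.read.format("delta").load("/tmp/delta/events").show()
```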

Apache Iceberg

  • Open Table Format

  • Enables the use of SQL tables for big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig to safely work with the same tables, at the same time.

  • Addresses the performance and usability challenges of using Apache Hive tables in large and demanding data lake environments.

  • Apache Iceberg - Spark and Iceberg Quickstart
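
A minimal sketch in the spirit of the quickstart linked above: configure a local Iceberg catalog in a Spark session, then create and query an Iceberg table with SQL. It assumes the matching iceberg-spark-runtime jar is on the classpath; names and paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # A file-based ("hadoop" type) catalog named "local"
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp())")
spark.sql("SELECT * FROM local.db.events").show()
```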

Workflow Scheduler

Apache Airflow

  • The de facto first choice for modern workflow orchestration: pipelines are defined as DAGs in Python code
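
A minimal Airflow 2.x DAG sketch: two Python tasks with an explicit dependency (the pipeline itself is hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting...")

def load():
    print("loading...")

with DAG(
    dag_id="daily_etl",                 # hypothetical pipeline
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                  # "schedule_interval" on older 2.x
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load                 # run extract before load
```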

Apache Oozie

  • Legacy workflow scheduler for Hadoop jobs (MapReduce, Pig, Hive); largely superseded by Airflow

NoSQL Database

Apache HBase

  • Schema-less: only column families are fixed up front; columns can be added per row
  • Wide-column NoSQL store, modeled after Google Bigtable
  • Requires HDFS for storage
  • Auto-sharding
  • Can be integrated with Hive for SQL-like queries
  • High write throughput, suitable for write-heavy workloads
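
A small sketch using the third-party happybase client via HBase's Thrift gateway (host, table, and column names are illustrative):

```python
import happybase

# Connect through the HBase Thrift server
conn = happybase.Connection("hbase-thrift-host")
table = conn.table("events")

# Rows are keyed byte strings; columns live inside column families ("cf")
table.put(b"user1|2024-01-01", {b"cf:clicks": b"42"})

row = table.row(b"user1|2024-01-01")
print(row[b"cf:clicks"])
```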

User Interface

Apache Zeppelin

  • Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R, and more.
  • Good integration with Apache Spark and Apache Hadoop

Hue (Hadoop User Experience)

  • SQL Cloud Editor for Hadoop workloads

Data Ingestion

Apache NiFi

  • Visual, flow-based tool for routing and transforming data between systems

Apache Kafka

  • Distributed event streaming platform built around a partitioned, replicated commit log
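
A minimal sketch with the third-party kafka-python client: produce one event to a topic, then read it back (broker address and topic are illustrative):

```python
from kafka import KafkaConsumer, KafkaProducer

# Producer: append an event to the "events" topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"user": 1, "action": "click"}')
producer.flush()

# Consumer: read the topic as an ordered stream from the beginning
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for msg in consumer:
    print(msg.value)
    break
```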

Apache Flume

  • Distributed service for collecting and aggregating large volumes of log data into HDFS

Apache Sqoop

  • Transfers bulk data between Hadoop and relational databases (RDBMS)
  • Retired to the Apache Attic in 2021