Data Ecosystem

Compute Frameworks

Apache Hadoop

  • Data-intensive computing framework implementing the MapReduce programming model
  • Batch processing of large data sets
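
As a concrete sketch of the MapReduce model, the classic word count can run under Hadoop Streaming with two small Python scripts (file names and the streaming setup are illustrative):

```python
#!/usr/bin/env python3
# mapper.py - emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - Hadoop delivers mapper output sorted by key,
# so counts for a word can be summed in a single pass
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t", 1)
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{count}")
        count = 0
    current_word = word
    count += int(n)
if current_word is not None:
    print(f"{current_word}\t{count}")
```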

Apache Spark

When to use Spark

  • Large-scale data processing

    For most heavy data-processing use cases, Spark is a strong default: it is optimized for operations like table joins, filtering, and aggregation.

  • Data parallelism

    Spark excels at data parallelism: applying the same operation to every element of a large dataset. It's ideal for ETL, analytics reporting, feature engineering, and data preprocessing (a PySpark sketch follows this list).

  • Machine learning

    Spark's MLlib (and its newer DataFrame-based spark.ml API) is optimized for large-scale machine learning algorithms and statistical modeling.
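
A minimal PySpark sketch of the join/filter/aggregate pattern described above (paths and column names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical input tables
orders = spark.read.parquet("s3://bucket/orders/")
customers = spark.read.parquet("s3://bucket/customers/")

# Join, filter, and aggregate: the data-parallel work Spark is optimized for
revenue = (
    orders.join(customers, "customer_id")
          .filter(F.col("status") == "completed")
          .groupBy("country")
          .agg(F.sum("amount").alias("revenue"))
)
revenue.write.mode("overwrite").parquet("s3://bucket/revenue/")
```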

Spark SQL (for SQL users)

  • SQL interface for Spark programs
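
A minimal sketch of the SQL interface: register a file-backed DataFrame as a view, then query it with plain SQL (paths and names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# Expose a Parquet dataset to SQL as a temporary view
spark.read.parquet("s3://bucket/orders/").createOrReplaceTempView("orders")

top = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
top.show()
```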

PySpark (for Python users)

  • Python API for Spark

Ray

When to use Ray

  • Task parallelism

    Ray is designed for task parallelism, where multiple tasks run concurrently and independently. It's particularly efficient for computation-focused tasks.

  • Specific workloads

    Use Ray for workloads where Spark is less optimized, such as reinforcement learning, hierarchical time series forecasting, simulation modeling, hyperparameter search, deep learning training, and high-performance computing (HPC).
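
A minimal sketch of Ray's task parallelism: decorate a function with @ray.remote, launch many independent tasks, and gather the results (the workload itself is a stand-in):

```python
import random

import ray

ray.init()  # start a local Ray runtime, or connect to a cluster

@ray.remote
def run_trial(seed: int) -> float:
    # Stand-in for a simulation or hyperparameter trial
    random.seed(seed)
    return random.random()

# Tasks run concurrently and independently; futures come back immediately
futures = [run_trial.remote(s) for s in range(16)]
results = ray.get(futures)
print(max(results))
```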

Spark on Ray (RayDP)

  • Runs Apache Spark on a Ray cluster (rather than replacing it), so Spark data processing and Ray-native ML workloads can share one program and one cluster
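
A sketch of how RayDP starts a Spark session whose executors run as Ray actors; parameter values are illustrative, and the raydp package must be installed alongside Ray and PySpark:

```python
import ray
import raydp

ray.init()

# Spark executors are launched as Ray actors inside the same cluster
spark = raydp.init_spark(
    app_name="etl-on-ray",
    num_executors=2,
    executor_cores=2,
    executor_memory="2GB",
)

df = spark.range(1000).toDF("id")
print(df.count())

raydp.stop_spark()
```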

Dask

  • Python-native alternative to Apache Spark; a good fit for small-to-medium datasets and pandas/NumPy-style workloads
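
A minimal Dask sketch: the pandas-like API is lazy and partitioned, and nothing executes until .compute() (file pattern and columns are hypothetical):

```python
import dask.dataframe as dd

# One logical dataframe over many CSV partitions
df = dd.read_csv("logs-*.csv")

# Builds a lazy task graph; .compute() triggers parallel execution
daily_bytes = df.groupby("date")["bytes"].sum().compute()
print(daily_bytes.head())
```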

Polars

  • Fast pandas replacement: a DataFrame library written in Rust
  • Single-node workloads
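
A minimal Polars sketch (recent Polars versions; file and column names are hypothetical). The lazy API lets Polars push filters down and parallelize across cores:

```python
import polars as pl

out = (
    pl.scan_csv("sales.csv")              # lazy scan, nothing read yet
      .filter(pl.col("amount") > 0)
      .group_by("region")
      .agg(pl.col("amount").sum().alias("total"))
      .collect()                          # execute the optimized plan
)
print(out)
```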

DuckDB

  • In-process analytical (OLAP) SQL database
  • Single-node workloads
  • A strong choice for small datasets and local jobs
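
A minimal DuckDB sketch: it runs in-process and can query files directly, with no server or load step (the Parquet file is hypothetical):

```python
import duckdb

result = duckdb.sql("""
    SELECT category, count(*) AS n, avg(price) AS avg_price
    FROM 'events.parquet'
    GROUP BY category
    ORDER BY n DESC
""")
print(result.df())  # materialize as a pandas DataFrame
```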

Query Engines

Apache Hive

  • SQL-like (HiveQL) abstraction over the Hadoop Java API that enables SQL analytics on big data in Hadoop

Apache Pig

  • Another DSL (Pig Latin) abstraction over the Hadoop Java API, meant to simplify data analytics for non-Java developers
  • Effectively discontinued; the last release (0.17.0) shipped in 2017

Presto

  • Petabyte (PB) scale
  • Interactive SQL query engine for big data
  • Intended as a replacement for Hive
  • Forked into PrestoDB and Trino (originally PrestoSQL) after the original creators left Facebook
  • AWS Athena was initially based on Presto; newer engine versions are based on Trino

Apache Drill

  • Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage

Metadata Management

Apache Hive Metastore

  • Central repository for table metadata (schemas, partitions, storage locations); also used by Spark, Presto/Trino, and other engines

Unity Catalog

  • Unified catalog and governance layer for data and AI assets, originating from Databricks

Lakehouse

Delta Lake

  • Open table format from Databricks that brings ACID transactions to data lakes: Parquet files plus a transaction log
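
A minimal Delta Lake sketch with PySpark, assuming the delta-spark package is installed (the path is illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-sketch")
    # Standard Delta Lake session configuration
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.range(100).toDF("id")
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")
spark.read.format("delta").load("/tmp/delta/events").show()
```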

Apache Iceberg

  • Open Table Format

  • Enables the use of SQL tables for big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig to safely work with the same tables, at the same time.

  • Addresses the performance and usability challenges of using Apache Hive tables in large and demanding data lake environments.

  • Apache Iceberg - Spark and Iceberg Quickstart
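
A minimal sketch in the spirit of the quickstart linked above: configure a local Iceberg catalog in a Spark session, then create and query an Iceberg table with SQL. It assumes the matching iceberg-spark-runtime jar is on the classpath; names and paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # A file-based ("hadoop" type) catalog named "local"
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp())")
spark.sql("SELECT * FROM local.db.events").show()
```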

Workflow Scheduler

Apache Airflow

  • The de facto first choice for modern workflow orchestration: pipelines are defined as DAGs in Python code
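
A minimal Airflow 2.x DAG sketch: two Python tasks with an explicit dependency (the pipeline itself is hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting...")

def load():
    print("loading...")

with DAG(
    dag_id="daily_etl",                 # hypothetical pipeline
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                  # "schedule_interval" on older 2.x
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load                 # run extract before load
```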

Apache Oozie

  • Legacy workflow scheduler for Hadoop jobs (MapReduce, Pig, Hive); largely superseded by Airflow

NoSQL Database

Apache HBase

  • Schema-less: only column families are fixed up front; columns can be added per row
  • Wide-column NoSQL store, modeled after Google Bigtable
  • Requires HDFS for storage
  • Auto-sharding
  • Can be integrated with Hive for SQL-like queries
  • High write throughput, suitable for write-heavy workloads
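
A small sketch using the third-party happybase client via HBase's Thrift gateway (host, table, and column names are illustrative):

```python
import happybase

# Connect through the HBase Thrift server
conn = happybase.Connection("hbase-thrift-host")
table = conn.table("events")

# Rows are keyed byte strings; columns live inside column families ("cf")
table.put(b"user1|2024-01-01", {b"cf:clicks": b"42"})

row = table.row(b"user1|2024-01-01")
print(row[b"cf:clicks"])
```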

User Interface

Apache Zeppelin

  • Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R, and more.
  • Good integration with Apache Spark and Apache Hadoop

Hue (Hadoop User Experience)

  • SQL Cloud Editor for Hadoop workloads

Data Ingestion

Apache NiFi

  • Visual, flow-based tool for routing and transforming data between systems

Apache Kafka

  • Distributed event streaming platform built around a partitioned, replicated commit log
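
A minimal sketch with the third-party kafka-python client: produce one event to a topic, then read it back (broker address and topic are illustrative):

```python
from kafka import KafkaConsumer, KafkaProducer

# Producer: append an event to the "events" topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"user": 1, "action": "click"}')
producer.flush()

# Consumer: read the topic as an ordered stream from the beginning
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for msg in consumer:
    print(msg.value)
    break
```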

Apache Flume

  • Distributed service for collecting and aggregating large volumes of log data into HDFS

Apache Sqoop

  • Transfers bulk data between Hadoop and relational databases (RDBMS)
  • Retired to the Apache Attic in 2021