Compute Frameworks
Apache Hadoop
- Data-intensive computing framework implementing the MapReduce programming model
- Batch processing of large data sets
Apache Spark
When to use Spark
- Large-scale data processing: For most use cases involving extensive data processing, Spark is highly recommended because it is optimized for tasks like table joins, filtering, and aggregation.
- Data parallelism: Spark excels at data parallelism, i.e. applying the same operation to each element of a large dataset. It's ideal for ETL, analytics reporting, feature engineering, and data preprocessing (see the sketch after this list).
- Machine learning: Spark's MLlib and Spark ML libraries are optimized for large-scale machine learning algorithms and statistical modeling.
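A minimal PySpark sketch of the join/filter/aggregate pattern described above; the input paths, column names, and schemas are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical inputs: a fact table and a dimension table.
orders = spark.read.parquet("/data/orders")
customers = spark.read.parquet("/data/customers")

# Join, filter, and aggregate: the data-parallel operations Spark is optimized for.
daily_revenue = (
    orders.join(customers, "customer_id")
    .filter(F.col("status") == "shipped")
    .groupBy("country", "order_date")
    .agg(F.sum("amount").alias("revenue"))
)
daily_revenue.write.mode("overwrite").parquet("/data/daily_revenue")
```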
Spark SQL (for SQL users)
- SQL interface for Spark programs
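A short sketch of the Spark SQL interface: register a DataFrame as a temporary view and query it with plain SQL. The `orders` path and columns are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# Expose a DataFrame to SQL by registering it as a temporary view.
spark.read.parquet("/data/orders").createOrReplaceTempView("orders")

top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
top_customers.show()
```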
PySpark (for Python users)
Ray
When to use Ray
- Task parallelism: Ray is designed for task parallelism, where multiple tasks run concurrently and independently. It's particularly efficient for computation-focused tasks (see the sketch after this list).
- Specific workloads: Use Ray for workloads where Spark is less optimized, such as reinforcement learning, hierarchical time series forecasting, simulation modeling, hyperparameter search, deep learning training, and high-performance computing (HPC).
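A minimal Ray sketch of task parallelism: each `@ray.remote` call becomes an independent task scheduled across the cluster. The `score` function is a hypothetical stand-in for one trial or simulation run.

```python
import ray

ray.init()

@ray.remote
def score(config: dict) -> float:
    # Stand-in for an expensive, independent computation
    # (e.g. one hyperparameter trial or one simulation run).
    return sum(v * v for v in config.values())

# Submit 100 independent tasks; they run concurrently.
futures = [score.remote({"lr": i * 0.01, "depth": i}) for i in range(100)]
results = ray.get(futures)  # gather results in submission order
print(max(results))
```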
Spark on Ray (RayDP)
- Apache Spark replacement
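A minimal sketch of RayDP, assuming the `raydp` package is installed alongside Ray; `init_spark` launches Spark executors inside the Ray cluster and hands back an ordinary SparkSession.

```python
import ray
import raydp

ray.init()

# Boot Spark on the Ray cluster; parameters are illustrative sizing choices.
spark = raydp.init_spark(
    app_name="spark-on-ray",
    num_executors=2,
    executor_cores=2,
    executor_memory="2GB",
)
print(spark.range(1000).count())  # a regular Spark job, executed on Ray
raydp.stop_spark()
```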
Dask
- Apache Spark replacement for smaller datasets
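A Dask sketch mirroring the pandas API on a partitioned dataset; the Parquet path and column names are assumptions.

```python
import dask.dataframe as dd

# Dask mirrors the pandas API but partitions the frame and evaluates lazily.
df = dd.read_parquet("/data/events")
summary = df[df["status"] == "ok"].groupby("user_id")["latency_ms"].mean()
print(summary.compute())  # .compute() triggers the actual execution
```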
Polars
- Pandas replacement
- Single-node workload
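A Polars sketch of the same filter/aggregate pattern on a single node; the file name and columns are hypothetical, and `group_by` is the spelling used in recent Polars releases.

```python
import polars as pl

# Polars' expression API runs on a multi-threaded Rust engine, single node.
df = pl.read_csv("events.csv")
out = (
    df.filter(pl.col("status") == "ok")
    .group_by("user_id")
    .agg(pl.col("latency_ms").mean().alias("avg_latency"))
)
print(out)
```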
DuckDB
- Single-node workload
- Best choice for small datasets and local jobs
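A DuckDB sketch: the in-process engine queries a Parquet file directly with SQL, with no server to manage and no import step. The file and columns are hypothetical.

```python
import duckdb

con = duckdb.connect()  # in-memory database, runs inside the Python process
rows = con.execute(
    "SELECT user_id, AVG(latency_ms) AS avg_latency "
    "FROM read_parquet('events.parquet') "
    "GROUP BY user_id ORDER BY avg_latency DESC LIMIT 10"
).fetchall()
print(rows)
```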
Query Engines
Apache Hive
- SQL-like (HiveQL) abstraction over Hadoop Java API to enable SQL analytics for big data on Hadoop
Apache Pig
- Another DSL abstraction over Hadoop Java API to simplify data analytics for non-Java developers
- Discontinued in 2017
Presto
- PB scale
- Interactive SQL query engine for big data
- Intended as a replacement for Hive
- Forked into PrestoDB and Trino separately; AWS Athena is based on Presto and, later, Trino (see the client sketch below).
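A sketch using the `trino` Python client (DB-API style); the host, user, and the TPC-H catalog are assumptions about the deployment.

```python
from trino.dbapi import connect

# Assumes a Trino coordinator on localhost:8080 with the TPC-H connector enabled.
conn = connect(host="localhost", port=8080, user="analyst",
               catalog="tpch", schema="tiny")
cur = conn.cursor()
cur.execute("SELECT nationkey, name FROM nation LIMIT 5")
for row in cur.fetchall():
    print(row)
```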
Apache Drill
- Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Metadata Management
Apache Hive Metastore
Unity Catalog
Lakehouse
Delta Lake
Apache Iceberg
- Open Table Format
- Enables the use of SQL tables for big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig to safely work with the same tables at the same time.
- Addresses the performance and usability challenges of using Apache Hive tables in large and demanding data lake environments.
- Apache Iceberg - Spark and Iceberg Quickstart
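A minimal PySpark-with-Iceberg sketch in the spirit of the quickstart linked above, assuming a local Hadoop-type catalog; the runtime package version and warehouse path are assumptions and should match your Spark and Iceberg versions.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-sketch")
    # Assumed versions: Spark 3.5 / Iceberg 1.5.0; adjust to your cluster.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Iceberg tables are plain SQL tables; any configured engine can share them.
spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp())")
spark.sql("SELECT * FROM local.db.events").show()
```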
Workflow Scheduler
Apache Airflow
- Modern first choice for workflow orchestration
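A minimal Airflow DAG sketch: two Python tasks chained with `>>`, scheduled daily. The task bodies are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from source")

def load():
    print("write data to warehouse")

# A two-task DAG: extract runs before load, once per day.
with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # `schedule_interval` on Airflow < 2.4
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
```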
Apache Oozie
NoSQL Database
Apache HBase
- Schema-less
- Column-oriented NoSQL
- Requires HDFS for storage
- Auto-sharding
- Can be integrated with Hive for SQL-like queries
- High write throughput, suitable for write-heavy workloads (see the client sketch after this list)
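A sketch using the third-party `happybase` client over HBase's Thrift gateway; the table name, column family, and Thrift port are assumptions.

```python
import happybase

# Connect to the HBase Thrift server (assumed to run on localhost:9090).
connection = happybase.Connection("localhost", port=9090)
table = connection.table("metrics")

# HBase is schema-less beyond column families: rows are keyed byte strings,
# and columns are created on write ("cf" is an assumed column family name).
table.put(b"row-001", {b"cf:temp": b"21.5", b"cf:unit": b"C"})
print(table.row(b"row-001"))
```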
User Interface
Apache Zeppelin
- Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R, and more.
- Good integration with Apache Spark and Apache Hadoop
Hue (Hadoop User Experience)
- SQL Cloud Editor for Hadoop workloads
Data Ingestion
Apache NiFi
Apache Kafka
Apache Flume
Apache Sqoop
- Transfers bulk data between Hadoop and RDBMS
- Discontinued in 2021