Big Data - Computing
Apache Hadoop
- Data-intensive computing framework implementing the MapReduce programming model
- Batch processing of large data sets
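The MapReduce model mentioned above can be illustrated with a minimal single-process sketch. This is not Hadoop's Java API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and only mirror the three conceptual stages. A real Hadoop job distributes them across a cluster and reads its input from HDFS.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for each word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over an in-memory data set."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big compute"]))
```

Because each map call sees one record and each reduce call sees one key, both phases can run in parallel on separate machines, which is what makes the model suitable for batch processing of large data sets.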
Apache Spark
- Batch processing, streaming, machine learning, and graph processing
Apache Tez
Spark SQL
- SQL interface for Spark programs
Big Data - Query Engines
Apache Hive
- SQL-like (HiveQL) abstraction over the Hadoop Java API that enables SQL analytics for big data on Hadoop
Apache Pig
- Another DSL abstraction (Pig Latin) over the Hadoop Java API to simplify data analytics for non-Java developers
- No new release since 0.17.0 (2017); effectively dormant
Presto
- PB-scale interactive SQL query engine for big data
- Intended as a replacement for Hive
- Forked into PrestoDB and Trino, which are now developed separately
- AWS Athena was originally based on Presto and later moved to Trino
Apache Drill
- Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Big Data - Metadata Management
Apache Hive Metastore
Apache Iceberg
- Open table format
- Enables the use of SQL tables for big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig to safely work with the same tables at the same time
- Addresses the performance and usability challenges of using Apache Hive tables in large and demanding data lake environments
- Apache Iceberg - Spark and Iceberg Quickstart
Big Data - Workflow Scheduler
Apache Airflow
- Modern first choice for workflow orchestration
Apache Oozie
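At its core, a workflow scheduler runs tasks in an order that respects their dependencies. The sketch below shows only that idea, using Python's standard `graphlib` module. It is not Airflow's or Oozie's API, and the task names (`extract`, `transform`, `load`, `report`) are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return task names ordered so that every task follows its upstream tasks.

    `dependencies` maps each task name to the set of tasks it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: each task lists the tasks that must finish first.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

if __name__ == "__main__":
    for task in execution_order(pipeline):
        print("run", task)
```

Real schedulers add what this sketch leaves out: time-based triggers, retries, parallel execution of independent tasks, and persistent run state.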
Big Data - Database
Apache HBase
- Schema-less
- Column-oriented NoSQL
- Requires HDFS for storage
- Auto-sharding
- Can be integrated with Hive for SQL-like queries
- High write throughput, suitable for write-heavy workloads
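Auto-sharding in HBase works by splitting a table into regions, each covering a contiguous range of row keys. The sketch below illustrates only the range lookup and is a deliberate simplification. `RegionMap` and its split keys are hypothetical, not part of any HBase client API. Real HBase compares keys as byte arrays and splits regions automatically as they grow.

```python
from bisect import bisect_right


class RegionMap:
    """Toy model of range-based sharding over sorted row keys."""

    def __init__(self, split_keys):
        # Each split key is the inclusive start of the next region.
        self.split_keys = sorted(split_keys)

    def region_for(self, row_key):
        """Return the index of the region whose key range contains row_key."""
        return bisect_right(self.split_keys, row_key)


if __name__ == "__main__":
    regions = RegionMap(["g", "n", "t"])  # four regions: <g, [g,n), [n,t), >=t
    for key in ["apple", "hbase", "zebra"]:
        print(key, "-> region", regions.region_for(key))
```

Because rows are assigned by key range, sequential keys such as timestamps all land in the same region, which is why row-key design matters for write-heavy workloads.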
Big Data - User Interface
Apache Zeppelin
- Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R, and more
- Good integration with Apache Spark and Apache Hadoop
Hue (Hadoop User Experience)
- SQL Cloud Editor for Hadoop workloads
Big Data - Data Ingestion
Apache NiFi
Apache Kafka
Apache Flume
Apache Sqoop
- Transfers bulk data between Hadoop and RDBMS
- Retired to the Apache Attic in 2021