Big Data Ecosystem

Big Data - Computing

Apache Hadoop

Data-intensive computing framework implementing the MapReduce programming model
Batch processing of large data sets
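The MapReduce model can be illustrated with a minimal word-count sketch in plain Python. This is not Hadoop's Java API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and everything runs in a single process, whereas Hadoop distributes each phase across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big compute", "data lake"]))
```

Because the map and reduce functions only see one record or one key at a time, a framework is free to run many copies of them in parallel on different machines.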

Apache Spark

Batch processing, streaming, machine learning, and graph processing

Apache Tez

Spark SQL

SQL interface for Spark programs

Big Data - Query Engines

Apache Hive

SQL-like (HiveQL) abstraction over Hadoop Java API to enable SQL analytics for big data on Hadoop

Apache Pig

Another DSL abstraction (Pig Latin) over Hadoop Java API to simplify data analytics for non-Java developers
Last release in 2017; no longer actively developed

Presto

Petabyte (PB) scale
Interactive SQL query engine for big data
Intended as a replacement for Hive
Split into two separate projects: PrestoDB and Trino (formerly PrestoSQL)
AWS Athena was originally based on Presto; its newer engine versions are based on Trino.

Apache Drill

Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage

Big Data - Metadata Management

AWS Big Data Blog - Choosing an open table format for your transactional data lake on AWS

Apache Hive Metastore

Metadata is stored in an RDBMS
lakeFS - Hive Metastore – It Didn’t Age Well

Apache Iceberg

Open Table Format
Enables the use of SQL tables for big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig to safely work with the same tables, at the same time.
Addresses the performance and usability challenges of using Apache Hive tables in large and demanding data lake environments.
Apache Iceberg - Spark and Iceberg Quickstart

Big Data - Workflow Scheduler

Apache Airflow

Modern first choice for workflow orchestration
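Workflow orchestrators such as Airflow model a pipeline as a directed acyclic graph (DAG) of tasks and run each task only after its upstream tasks have finished. The sketch below illustrates that idea with Python's standard-library `graphlib`; it is not Airflow's API, and the task names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return the tasks in an order that respects their dependencies.

    `dependencies` maps each task name to the set of tasks that must
    finish before it can start. A cycle raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical ETL pipeline: each task depends on the previous one.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

if __name__ == "__main__":
    for task in execution_order(pipeline):
        print("run", task)
```

A real scheduler adds what this sketch leaves out: time-based triggering, retries, parallel execution of independent tasks, and persistent run state.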

Apache Oozie

Big Data - Database

Apache HBase

Schema-less
Column-oriented NoSQL
Requires HDFS for storage
Auto-sharding
Can be integrated with Hive for SQL-like queries
High write throughput, suitable for write-heavy workloads

Big Data - User Interface

Apache Zeppelin

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R, and more.
Good integration with Apache Spark and Apache Hadoop

Hue (Hadoop User Experience)

SQL Cloud Editor for Hadoop workloads

Big Data - Data Ingestion

Apache NiFi

Apache Kafka

Apache Flume

Apache Sqoop

Transfers bulk data between Hadoop and RDBMS
Retired to the Apache Attic in 2021
