Big Data Ecosystem

Big Data - Computing

Apache Hadoop

  • Data-intensive computing framework implementing the MapReduce programming model
  • Batch processing of large data sets
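
The MapReduce model can be illustrated with a minimal word-count sketch in plain Python. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and everything runs in one process, whereas Hadoop distributes each phase across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key between the two phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big compute"]))
```

Because each map call sees only one record and each reduce call sees only one key, both phases can be parallelized independently, which is what makes the model suitable for batch processing of large data sets.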

Apache Spark

  • Batch processing, streaming, machine learning, and graph processing

Apache Tez

Spark SQL

  • SQL interface for Spark programs

Big Data - Query Engines

Apache Hive

  • SQL-like (HiveQL) abstraction over the Hadoop Java API that enables SQL analytics on big data stored in Hadoop

Apache Pig

  • Another DSL abstraction (Pig Latin) over the Hadoop Java API that simplifies data analytics for non-Java developers
  • Effectively dormant: the last release (0.17.0) was in 2017

Presto

  • Petabyte scale
  • Interactive SQL query engine for big data
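  • Designed at Facebook as a faster alternative to Hive for interactive queries
  • Now split into two projects: PrestoDB (the original line) and Trino (forked in 2019 as PrestoSQL, renamed in 2020)
  • Amazon Athena was originally based on Presto; its engine version 3 is based on Trino.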

Apache Drill

  • Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage

Big Data - Metadata Management

Apache Hive Metastore

Apache Iceberg

  • Open Table Format

  • Enables the use of SQL tables for big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig to safely work with the same tables, at the same time.

  • Addresses the performance and usability challenges of using Apache Hive tables in large and demanding data lake environments.

  • Apache Iceberg - Spark and Iceberg Quickstart
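
One way to picture how several engines can safely work with the same table at the same time is optimistic concurrency over immutable snapshots, the general approach Iceberg takes. The sketch below is a toy in-memory model, not Iceberg's API: `ToyTable`, `Snapshot`, `append_files`, and `CommitConflict` are hypothetical names, and the file names are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table: an id plus the data files it contains."""
    snapshot_id: int
    data_files: Tuple[str, ...]


class CommitConflict(Exception):
    """Raised when a writer's base snapshot is no longer the current one."""


@dataclass
class ToyTable:
    # The snapshot log; the last entry is the current table state.
    snapshots: List[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])

    @property
    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def append_files(self, base: Snapshot, new_files: List[str]) -> Snapshot:
        """Commit new files only if `base` is still the current snapshot."""
        if base.snapshot_id != self.current.snapshot_id:
            raise CommitConflict("table changed since the base snapshot was read")
        snap = Snapshot(base.snapshot_id + 1, base.data_files + tuple(new_files))
        self.snapshots.append(snap)
        return snap


if __name__ == "__main__":
    table = ToyTable()
    base = table.current                      # two writers read the same base
    table.append_files(base, ["a.parquet"])   # first writer commits
    try:
        table.append_files(base, ["b.parquet"])  # second writer is stale
    except CommitConflict as exc:
        print("conflict:", exc)
```

A reader that holds on to a `Snapshot` keeps seeing a consistent set of files even while writers commit. In this toy the losing writer simply gets an exception; a real engine would typically retry against the new current state.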

Big Data - Workflow Scheduler

Apache Airflow

  • Widely adopted workflow orchestrator; pipelines are defined as DAGs in Python
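
Workflow orchestrators such as Airflow model a pipeline as a directed acyclic graph (DAG) of tasks and run each task only after its upstream tasks have finished. The sketch below illustrates that idea with Python's standard-library `graphlib`; it is not Airflow's API, and `run_workflow`, the task names, and the `upstream` mapping are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(
    tasks: Dict[str, Callable[[], None]],
    upstream: Dict[str, Set[str]],
) -> List[str]:
    """Run every task after all of its upstream tasks; return the run order.

    `upstream` maps a task name to the names it depends on.
    graphlib raises CycleError if the dependencies contain a cycle.
    """
    graph = {name: upstream.get(name, set()) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        tasks[name]()
    return order


if __name__ == "__main__":
    log: List[str] = []
    tasks = {
        "load": lambda: log.append("load"),
        "extract": lambda: log.append("extract"),
        "transform": lambda: log.append("transform"),
    }
    upstream = {"transform": {"extract"}, "load": {"transform"}}
    print(run_workflow(tasks, upstream))
```

This sketch runs tasks sequentially in one process. A real scheduler adds what makes orchestration hard in practice: schedules, retries, parallel execution of independent tasks, and persisted state.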

Apache Oozie

Big Data - Database

Apache HBase

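  • Schema-less: only column families are declared up front; columns within them can vary per row
  • Wide-column (column-family-oriented) NoSQL store
  • Typically stores its data on HDFS
  • Auto-sharding: tables are split into regions by row-key range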
  • Can be integrated with Hive for SQL-like queries
  • High write throughput, suitable for write-heavy workloads
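
The data model and sharding described above can be sketched in a few lines of plain Python. This is a toy in-memory illustration, not the HBase client API: `TinyWideColumnTable`, its methods, and the fixed `split_keys` are hypothetical, and real HBase splits regions automatically as they grow rather than using boundaries fixed in advance.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class TinyWideColumnTable:
    """Toy wide-column table: row key -> "family:qualifier" -> versioned values."""

    def __init__(self, split_keys: List[str]) -> None:
        # Sorted row-key boundaries. Region 0 holds keys below the first
        # boundary; region i holds keys in [split_keys[i-1], split_keys[i]).
        self.split_keys = sorted(split_keys)
        self.rows: Dict[str, Dict[str, List[Tuple[int, str]]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def region_for(self, row_key: str) -> int:
        """Return the index of the region whose key range contains row_key."""
        return bisect.bisect_right(self.split_keys, row_key)

    def put(self, row_key: str, column: str, value: str, timestamp: int) -> None:
        """Append a new version; rows need not share the same set of columns."""
        self.rows[row_key][column].append((timestamp, value))

    def get(self, row_key: str, column: str) -> Optional[str]:
        """Return the value with the highest timestamp, or None if absent."""
        versions = self.rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]


if __name__ == "__main__":
    table = TinyWideColumnTable(split_keys=["m"])
    table.put("alice", "info:email", "old@example.com", timestamp=1)
    table.put("alice", "info:email", "new@example.com", timestamp=2)
    table.put("zed", "stats:logins", "7", timestamp=1)
    print(table.get("alice", "info:email"), table.region_for("zed"))
```

Writes here are plain appends of new versions rather than in-place updates, which hints at why this style of store suits write-heavy workloads.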

Big Data - User Interface

Apache Zeppelin

  • Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R, and more.
  • Good integration with Apache Spark and Apache Hadoop

Hue (Hadoop User Experience)

  • SQL Cloud Editor for Hadoop workloads

Big Data - Data Ingestion

Apache NiFi

Apache Kafka

Apache Flume

Apache Sqoop

  • Transfers bulk data between Hadoop and relational databases
  • Retired to the Apache Attic in 2021