Apache Hive

Storage

The warehouse directory path is /user/hive/warehouse.
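
As a rough illustration of how that path is used, the sketch below builds the conventional directory for a managed table and its partitions. It assumes the default layout (tables of the `default` database sit directly under the warehouse directory, other databases get a `<database>.db` directory, and each partition column adds a `key=value` level). The table name `events`, the database name `sales`, and the helper `table_location` are hypothetical, and a table created with an explicit `LOCATION` would not follow this pattern.

```python
from posixpath import join

WAREHOUSE_DIR = "/user/hive/warehouse"  # warehouse path stated above


def table_location(table, database="default", partition=None):
    """Return the conventional directory of a managed table or one of its partitions.

    Assumes the default Hive layout; hypothetical helper, not a Hive API.
    """
    parts = [WAREHOUSE_DIR]
    if database != "default":
        # Non-default databases live in a "<database>.db" directory.
        parts.append(f"{database}.db")
    parts.append(table)
    # Each partition column adds one "key=value" directory level.
    for key, value in (partition or {}).items():
        parts.append(f"{key}={value}")
    return join(*parts)


print(table_location("events"))
print(table_location("events", database="sales", partition={"dt": "2024-03-27"}))
```

The second call shows why the partition count matters later on: every distinct partition value becomes its own directory.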

Interface - Beeline

Directly
```
beeline -u jdbc:hive2://localhost:10000
```

With Docker

docker exec -it $CONTAINER_NAME beeline -u 'jdbc:hive2://localhost:10000'

Sample output

2024-03-27 16:20:03,590 WARN  [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present.  Continuing without it.
scan complete in 2ms
Connecting to jdbc:hive2://localhost:10000/
Connected to: Apache Hive (version 1.1.0-cdh5.7.0)
Driver: Hive JDBC (version 1.1.0-cdh5.7.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.7.0 by Apache Hive
0: jdbc:hive2://localhost:10000/>

Interface - HiveServer2 Web UI

http://localhost:10002/

Insights

Insights - Drawbacks

The main drawback is the lack of fine-grained metadata and the reliance on directory listings to locate partition data, which gives query planning O(N) complexity in the number of partitions.

Table and partition metadata are stored in the Hive Metastore, which is backed by an RDBMS. The metastore must be queried first to find out which directories hold the requested data.

Then, the directories are listed to find out the actual data files.

The Hive Metastore and directory listing become significant bottlenecks as the partition count of a table grows. This is especially true on object storage, where listing a directory is really a scan for files with a given prefix, which is quite costly.
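
The two planning steps described above can be sketched with a toy model. This is a simplified simulation under stated assumptions, not Hive's actual planner: `METASTORE` stands in for the RDBMS-backed metastore, `OBJECT_STORE` for a flat key space that only supports prefix scans, and the table name, dates, and file names are made up.

```python
# Toy stand-ins (hypothetical data): 30 daily partitions with 2 files each.
OBJECT_STORE = [
    f"warehouse/events/dt=2024-03-{day:02d}/part-{n}.parquet"
    for day in range(1, 31)
    for n in range(2)
]
METASTORE = {
    f"2024-03-{day:02d}": f"warehouse/events/dt=2024-03-{day:02d}/"
    for day in range(1, 31)
}


def list_prefix(prefix, counter):
    """Simulate an object-store listing: scan every key for a matching prefix."""
    counter["list_calls"] += 1
    return [key for key in OBJECT_STORE if key.startswith(prefix)]


def plan_hive_style(wanted_dates):
    """Step 1: ask the metastore for directories. Step 2: list each directory."""
    counter = {"list_calls": 0}
    directories = [METASTORE[d] for d in wanted_dates if d in METASTORE]
    files = []
    for directory in directories:
        files.extend(list_prefix(directory, counter))
    return files, counter["list_calls"]


files, calls = plan_hive_style(sorted(METASTORE))
print(len(files), "files found with", calls, "listing calls")
```

The number of listing calls equals the number of partitions touched, so a query over N partitions pays for N prefix scans before it reads any data.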

The next-generation table formats leverage metadata to do the heavy lifting. The metadata defines the table structure, partitions, and the data files that compose the table, eliminating the need to query a metastore and list directories.

Apache Iceberg uses a snapshot approach: it performs an O(1) RPC to read the snapshot file instead of listing directories. As a result, Apache Iceberg tables scale easily, and query planning performance does not degrade as the partition count grows.
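
A minimal sketch of the snapshot idea, under heavy simplification: the `SNAPSHOT` dictionary below is a made-up stand-in and not Iceberg's real format (which chains a table metadata file, a manifest list, and manifest files). It only illustrates that the set of data files and their partition values is recorded in metadata, so planning needs one metadata read rather than one listing per partition.

```python
# Hypothetical snapshot: every data file is recorded with its partition value.
SNAPSHOT = {
    "snapshot_id": 1,
    "data_files": [
        {
            "path": f"warehouse/events/data/{day:02d}-{n}.parquet",
            "partition": {"dt": f"2024-03-{day:02d}"},
        }
        for day in range(1, 31)
        for n in range(2)
    ],
}


def read_snapshot(counter):
    """Simulate the single remote read that fetches the snapshot metadata."""
    counter["reads"] += 1
    return SNAPSHOT


def plan_snapshot_style(wanted_dates):
    """Select data files by filtering the metadata; no directory is listed."""
    counter = {"reads": 0}
    snapshot = read_snapshot(counter)
    wanted = set(wanted_dates)
    files = [
        entry["path"]
        for entry in snapshot["data_files"]
        if entry["partition"]["dt"] in wanted
    ]
    return files, counter["reads"]


files, reads = plan_snapshot_style(["2024-03-01", "2024-03-02"])
print(len(files), "files selected with", reads, "metadata read(s)")
```

Filtering the entries still costs local work, but the number of remote calls stays at one however many partitions the query touches.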

Reference

Docker Image
