Connection
-
Beeline
-
Directly
beeline -u jdbc:hive2://localhost:10000
-
With Docker
dxcit $CONTAINER_NAME beeline -u 'jdbc:hive2://localhost:10000'
-
Sample output
2024-03-27 16:20:03,590 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
scan complete in 2ms
Connecting to jdbc:hive2://localhost:10000/
Connected to: Apache Hive (version 1.1.0-cdh5.7.0)
Driver: Hive JDBC (version 1.1.0-cdh5.7.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.7.0 by Apache Hive
0: jdbc:hive2://localhost:10000/>
-
Insights
Insights - Drawbacks
The main drawback of Hive tables is the lack of fine-grained metadata and the reliance on listing partition directories, which costs O(N) in the number of partitions during query planning.
Table and partition metadata are stored in the Hive Metastore, which is backed by an RDBMS and must be queried to find out which directories hold the requested data.
Those directories are then listed to find the actual data files.
Both the Hive Metastore and directory listing become significant bottlenecks as the partition count of a table grows, especially because a directory listing on object storage is really a scan for files sharing a prefix, which is quite costly.
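The two planning steps described above can be sketched with a toy model in Python. This is only an illustration under stated assumptions, not how Hive is implemented: ToyObjectStore, plan_hive_style, the events table and its paths are all hypothetical names, and the metastore is reduced to a plain dictionary.

```python
# Toy model of Hive-style query planning. All names here are hypothetical;
# nothing in this sketch is a real Hive or object-store API.

class ToyObjectStore:
    """Flat key space, like an object store: 'directories' are only key prefixes."""

    def __init__(self):
        self.keys = []
        self.list_calls = 0  # counts how many listings the planner issues

    def put(self, key):
        self.keys.append(key)

    def list_prefix(self, prefix):
        # A "directory listing" is really a scan for keys sharing the prefix.
        self.list_calls += 1
        return [k for k in self.keys if k.startswith(prefix)]


def plan_hive_style(metastore, store, table, wanted):
    """Return the data files of every partition accepted by `wanted`."""
    files = []
    # Step 1: ask the metastore which partition directories exist.
    for partition, location in metastore[table].items():
        if wanted(partition):
            # Step 2: one listing per selected partition directory.
            files.extend(store.list_prefix(location))
    return files


# Build a table with 30 daily partitions, two data files each.
store = ToyObjectStore()
metastore = {"events": {}}
for day in range(1, 31):
    partition = f"dt=2024-03-{day:02d}"
    location = f"warehouse/events/{partition}/"
    metastore["events"][partition] = location
    store.put(location + "part-00000.parquet")
    store.put(location + "part-00001.parquet")

files = plan_hive_style(metastore, store, "events", lambda p: True)
print(len(files), "files found with", store.list_calls, "listing calls")
```

In this sketch a full scan over 30 partitions issues 30 listing calls, so the number of listings grows linearly with the number of partitions read.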
The next-generation table formats leverage metadata to do the heavy lifting. The metadata defines the table structure, partitions, and the data files that compose the table, eliminating the need to query a metastore and list directories.
Apache Iceberg uses a snapshot approach and performs an O(1) RPC to read the snapshot file, so Apache Iceberg tables can scale easily without query planning slowing down as the partition count increases.
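The snapshot idea can be sketched with a toy model in Python. This is a deliberately simplified illustration, not the Iceberg API or file layout: ToyObjectStore, commit_snapshot, plan_snapshot_style and the snapshot-1.json name are hypothetical, and a single JSON file stands in for the table metadata (real Iceberg metadata is layered rather than one flat file).

```python
# Toy model of snapshot-based planning. Hypothetical names throughout;
# this is not the Apache Iceberg API or on-disk format.
import json


class ToyObjectStore:
    """Key/value object store that counts remote calls."""

    def __init__(self):
        self.objects = {}
        self.get_calls = 0
        self.list_calls = 0

    def put(self, key, body=""):
        self.objects[key] = body

    def get(self, key):
        self.get_calls += 1
        return self.objects[key]

    def list_prefix(self, prefix):
        self.list_calls += 1
        return [k for k in self.objects if k.startswith(prefix)]


def commit_snapshot(store, table_path, entries):
    """Write one metadata object that names every data file of the table."""
    key = f"{table_path}/metadata/snapshot-1.json"
    store.put(key, json.dumps(entries))
    return key


def plan_snapshot_style(store, snapshot_key, wanted):
    """Plan a query from the snapshot alone: one read, no directory listing."""
    entries = json.loads(store.get(snapshot_key))
    return [e["path"] for e in entries if wanted(e["partition"])]


# Build a table with 30 daily partitions, two data files each.
store = ToyObjectStore()
entries = []
for day in range(1, 31):
    partition = f"dt=2024-03-{day:02d}"
    for n in range(2):
        path = f"warehouse/events/{partition}/part-{n:05d}.parquet"
        store.put(path)
        entries.append({"partition": partition, "path": path})
snapshot_key = commit_snapshot(store, "warehouse/events", entries)

planned = plan_snapshot_style(store, snapshot_key, lambda p: p == "dt=2024-03-15")
print(planned)
print(store.get_calls, "get call(s),", store.list_calls, "listing call(s)")
```

In this sketch the planner issues a single get and no listings regardless of how many partitions the table has; filtering the entries still happens in memory, but the number of remote calls stays constant.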