Apache Hive

Hive is an open-source, petabyte-scale data warehouse framework that facilitates reading, writing, and managing large datasets residing in distributed storage such as HDFS (the Hadoop Distributed File System) and other compatible blob stores such as Amazon S3.

While Hive was originally built to compile queries into MapReduce jobs, most modern deployments run Hive on Tez, which is architecturally similar to Spark. Hive supports analysis using HiveQL, a SQL-like language, and inherits the benefits of Hadoop such as scalability, redundancy, and adeptness with large datasets.
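For instance, a simple HiveQL aggregation reads just like standard SQL (the table and column names below are hypothetical):

    SELECT country, COUNT(*) AS visit_count
    FROM page_views
    WHERE view_date >= '2024-01-01'
    GROUP BY country
    ORDER BY visit_count DESC
    LIMIT 10;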

Created by Facebook in 2008 to provide an accessible way to query its massive volume of user-generated data, Hive is the oldest and most mature of the SQL on Hadoop engines available. Thus, Hive is the preferred choice for organizations looking for the most stable SQL on Hadoop engine.

Apache Hive Considerations

Stability

First released in 2008, Hive predates the other SQL on Hadoop engines by roughly five years, making it the most stable and mature option, and it is still being actively developed and improved today. This makes Hive an ideal choice for organizations that value long-term stability and a proven track record.

Query Throughput

Hive is capable of joining extremely large (billion-row) tables together easily. While faster in-memory SQL engines sometimes fail on extremely large fact-table to fact-table joins, Hive’s processing engine handles these joins reliably. Modern versions of Hive on Tez also no longer need to trade off between high throughput and fast queries.
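As a sketch of what such a query looks like (table and column names are hypothetical), a fact-table to fact-table join is expressed in plain HiveQL, and Hive distributes the work across the cluster:

    SELECT o.customer_id,
           SUM(o.amount)     AS total_spent,
           COUNT(c.click_id) AS total_clicks
    FROM orders o
    JOIN clicks c
      ON o.customer_id = c.customer_id
    GROUP BY o.customer_id;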

Strong Developer Community

One big advantage that Hive has is its grounding in the open source community and its broad base of community support. One example of this support was the Stinger Initiative, a movement across the Hive developer community to bring interactive querying and additional functionality to Hive, which resulted in the creation of Tez, a new execution engine for Hive.

What use cases is Hive really great for?

Batch and ETL processing

Yahoo! JAPAN tested batch execution of a random sample of 2,000 SQL queries on Hive and Impala, and the results are striking. The vast majority of the 2,000 queries (around 1,800) returned in under 20 seconds, while the return times of the remainder were much more widely distributed. Hive’s support for massive parallelism in batch processing made it the clear winner in that benchmark.

Hive Architecture

Apache Hive was originally constructed to translate SQL queries into MapReduce processes, but due to the large latency associated with using MapReduce to execute queries, modern Hive runs on Tez, an application framework that avoids writing intermediate results back to disk.

The team at Spry benchmarked Hive on MapReduce against Hive on Tez and two other SQL on Hadoop engines, and Hive on Tez performed at least as well as, and in some cases better than, the fastest query engine across several different types of queries.

Types of data

Hive can process structured, semi-structured (JSON, XML, logs, etc.), and unstructured data. A SerDe (serializer/deserializer) may be necessary to tell Hive how to parse semi-structured or unstructured data.
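For example, a table over raw JSON log files can point Hive at the HCatalog JsonSerDe; the table name, columns, and path below are hypothetical:

    CREATE EXTERNAL TABLE app_events (
      event_time STRING,
      user_id    BIGINT,
      payload    MAP<STRING, STRING>
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION '/data/raw/app_events';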

Range of row (quantity)

Hive is designed to run enormous queries over massive amounts of data. As of 2014, Facebook was storing upwards of 300 PB of data in Hive, with 600 TB of new data being generated every day.

Hadoop Hive Use & Management

Once data is loaded into HDFS or another distributed file system, you can create Hive tables over the data. There are two types of tables in Hive, external and internal (managed), which let you control how the data is loaded and managed. The main difference between the two is that when you drop an internal table, both the metadata and the underlying data are deleted. If you drop an external table, only the metadata is removed; the underlying data remains on HDFS.
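A minimal sketch of the two flavors (table names and paths are hypothetical): the managed table lets Hive own its data, while the external table only registers metadata over files that already live on HDFS.

    -- Managed table: DROP TABLE removes both metadata and data
    CREATE TABLE staging_sales (
      sale_id     BIGINT,
      customer_id BIGINT,
      amount      DOUBLE
    );

    -- External table: DROP TABLE removes only the metadata
    CREATE EXTERNAL TABLE raw_sales (
      sale_id     BIGINT,
      customer_id BIGINT,
      amount      DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw/sales';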

A CREATE TABLE statement creates and defines a new table in Hive. The CREATE statement can include a PARTITIONED BY clause to define the partitioning columns. Partitioning tells Hive how to lay the data out on storage: each partition gets its own subdirectory holding the data for that particular partition. In addition to partitioning, Hive also supports bucketing via the CLUSTERED BY clause, which organizes the data in each partition into a fixed number of files. Bucketing can improve performance for certain types of joins and works best on fields with high cardinality.
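As a sketch (names are hypothetical), the two clauses combine like this: each sale_date value gets its own subdirectory, and the rows within it are hashed on customer_id into 32 bucket files.

    CREATE TABLE sales (
      sale_id     BIGINT,
      customer_id BIGINT,
      amount      DOUBLE
    )
    PARTITIONED BY (sale_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC;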

Once you have created a table, you can either load data directly from a file using the LOAD DATA command or insert it from a query using INSERT.
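Both paths might look like this, reusing the hypothetical tables sketched above; LOAD DATA moves files into place as-is, while INSERT ... SELECT runs a query to populate the target.

    -- Move files already on HDFS into the external table's location
    LOAD DATA INPATH '/data/incoming/sales' INTO TABLE raw_sales;

    -- Populate one partition of the managed, partitioned table from a query
    INSERT INTO TABLE sales PARTITION (sale_date = '2024-01-01')
    SELECT sale_id, customer_id, amount
    FROM raw_sales;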

Maintenance

One of the quickest wins that you can have for your Hive cluster is upgrading to Apache Tez if you haven’t done so already. Other techniques include using different file types to optimize compression and reads, and refactoring your table architecture to maximize query performance.
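As a hedged sketch of both ideas (table names are hypothetical), you can switch the execution engine for a session and rewrite a table into compressed, columnar ORC:

    -- Run this session's queries on Tez instead of MapReduce
    SET hive.execution.engine=tez;

    -- Copy an existing table into ORC with Snappy compression for faster scans
    CREATE TABLE sales_orc
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY')
    AS
    SELECT * FROM raw_sales;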