It's super easy to get lost in the world of big data technologies. There are so many of them that it seems a day never passes without the advent of a new one. Still, such fast development is only half the trouble. The real problem is that it's difficult to understand the functionality and the intended use of the existing technologies.
To find out which technology suits their needs, IT managers often compare the candidates side by side. We've also conducted an academic study to make a clear distinction between Apache Hive and Apache HBase, two important technologies that are frequently used in Hadoop implementation projects.
Data model comparison
Apache Hive's data model
To understand Apache Hive's data model, you should get familiar with its three main components: a table, a partition, and a bucket.
A Hive table doesn't differ much from a relational database table (the main difference is that there are no relations between the tables). Hive tables can be managed or external. To understand the difference between these two types, let's look at two operations: loading data and dropping a table. When you load data into a managed table, you actually move the files from their original location in the Hadoop Distributed File System (HDFS) into the Hive directory (which is also in HDFS). And when you drop such a table, you delete the data it contains from the directory. In the case of external tables, Hive doesn't load the data into the Hive directory but creates a "ghost table" that records where the actual data is physically stored in HDFS. So, when you drop an external table, only that record is removed and the data is not affected.
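The load and drop behavior described above can be imitated with ordinary folders. This is only a toy sketch: local directories stand in for HDFS paths, and `load_managed`, `drop_managed`, `drop_external`, and the file names are hypothetical helpers made up for illustration, not part of any Hive API.

```python
import shutil
import tempfile
from pathlib import Path

# Toy model only: local folders stand in for HDFS, and every helper below
# is a hypothetical name, not a Hive API.

def load_managed(source_file: Path, warehouse_dir: Path, table: str) -> Path:
    """Move a file into the table's folder under the warehouse directory."""
    table_dir = warehouse_dir / table
    table_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source_file), str(table_dir / source_file.name))
    return table_dir

def drop_managed(table_dir: Path) -> None:
    """Dropping a managed table deletes the data it contains."""
    shutil.rmtree(table_dir)

def drop_external(catalog: dict, table: str) -> None:
    """Dropping an external table only forgets where the data lives."""
    del catalog[table]

root = Path(tempfile.mkdtemp())
warehouse = root / "warehouse"
landing = root / "landing"
landing.mkdir()

managed_src = landing / "customers.csv"
managed_src.write_text("1,US\n2,CA\n")
external_src = landing / "orders.csv"
external_src.write_text("10,1\n")

# Managed table: the file leaves its original location.
managed_dir = load_managed(managed_src, warehouse, "customers")
moved = not managed_src.exists() and (managed_dir / "customers.csv").exists()

# External table: the catalog entry merely points at the file.
catalog = {"orders": external_src}

drop_managed(managed_dir)
drop_external(catalog, "orders")

managed_data_gone = not managed_dir.exists()
external_data_kept = external_src.exists()
print(moved, managed_data_gone, external_data_kept)

shutil.rmtree(root)  # clean up the temporary folders
```

The three flags show the asymmetry: the managed table's data is moved on load and deleted on drop, while the external table's file outlives its catalog entry.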
Both managed and external tables can be further broken down into partitions. A partition represents the rows of the table grouped together based on a partition key. Each partition is stored as a separate folder in the Hive directory. For instance, the table below can be partitioned by country, and the rows for each country will be stored together. Of course, this example is simplified. In real life, you'll deal with more than three partitions and many more than four rows in each, and partitioning will help you significantly reduce the execution time of queries that filter on the partition key.
| Customer ID | Country | State/Province | City | Gender | Family status | … |
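To make the folder-per-partition idea concrete, here is a minimal sketch that groups rows by country and then answers a query by reading a single folder. The sample rows, the `customers` table name, and both helper functions are hypothetical. The `country=<value>` folder naming follows Hive's convention for partition directories.

```python
from collections import defaultdict

# Hypothetical sample rows (customer_id, country, city); not taken from the article.
rows = [
    (1, "US", "Austin"),
    (2, "CA", "Toronto"),
    (3, "US", "Denver"),
    (4, "DE", "Berlin"),
]

def partition_by_country(rows, table_dir="customers"):
    """Group rows into one folder per value of the partition key."""
    partitions = defaultdict(list)
    for customer_id, country, city in rows:
        # Partition folders are named <column>=<value>.
        partitions[f"{table_dir}/country={country}"].append((customer_id, city))
    return dict(partitions)

def query_country(partitions, country, table_dir="customers"):
    """Read only the folder for one country instead of scanning every row."""
    return partitions.get(f"{table_dir}/country={country}", [])

partitions = partition_by_country(rows)
print(sorted(partitions))
print(query_country(partitions, "US"))
```

A query that filters on country touches one folder only, which is where the reduction in execution time comes from.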
You can break your data further into buckets, which are even easier to manage and enable faster query execution. Let's take the partition with the US data from our previous example and cluster it into buckets based on the Customer ID column. When you specify the number of buckets, Hive applies a hash function to the chosen column, which assigns a hash value to each row in the partition, and then "packs" the rows into that number of buckets according to their hash values. So, if we have 10 million Customer IDs in the partition and specify the number of buckets as 50, each bucket will contain about 200,000 rows. As a result, if you need to find the data about a particular customer, Hive can go directly to the relevant bucket instead of scanning the whole partition.
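The arithmetic behind bucketing can be sketched in a few lines. The example below assumes a plain modulo on an integer ID as a stand-in for Hive's actual hash function, and it uses 10,000 sequential IDs rather than 10 million to stay fast. `bucket_for` is a hypothetical helper.

```python
from collections import Counter

NUM_BUCKETS = 50

def bucket_for(customer_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Assumed stand-in for Hive's hash: plain modulo on an integer ID."""
    return customer_id % num_buckets

# Count how many of 10,000 sequential IDs land in each bucket.
sizes = Counter(bucket_for(cid) for cid in range(1, 10_001))

print(len(sizes), min(sizes.values()), max(sizes.values()))  # buckets used, smallest, largest
print(10_000_000 // NUM_BUCKETS)  # rows per bucket at the article's scale
print(bucket_for(123_457))        # the only bucket to search for this customer
```

Because the same function that placed a row also tells you where to look for it, a lookup by Customer ID needs to inspect one bucket out of fifty.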
Apache HBase's data model
HBase also stores data in tables. The cells in an HBase table are organized by row keys and column families. Each column family has a set of storage properties (for example, row key encoding and data compression rules). Within a column family, column qualifiers identify the individual columns, which eases data management. Neither row keys nor column qualifiers have a data type assigned (they are always treated as bytes).
| Customer ID | Country | State | City | Gender | Family status | … |
Every cell has a timestamp, or, in other words, bears the mark of when it was written. This info is crucial during read operations, as it allows identifying the most recent (and therefore most up-to-date) version of the data. You can specify a timestamp during a write operation; otherwise, HBase assigns the current timestamp to the cell automatically.
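One way to picture the model described so far is a nested map: row key, then column family, then column qualifier, then timestamp. The sketch below is an in-memory toy under that assumption. `put`, `get`, and the sample keys are hypothetical and do not mirror the HBase client API.

```python
import time

# Toy in-memory model: row key -> column family -> qualifier -> {timestamp: value}.
# put() and get() are hypothetical helpers, not the HBase client API.
table = {}

def put(row_key: bytes, family: str, qualifier: bytes, value: bytes, timestamp=None) -> int:
    """Store one cell version; default to the current time in milliseconds."""
    if timestamp is None:
        timestamp = int(time.time() * 1000)
    versions = (
        table.setdefault(row_key, {})
        .setdefault(family, {})
        .setdefault(qualifier, {})
    )
    versions[timestamp] = value
    return timestamp

def get(row_key: bytes, family: str, qualifier: bytes) -> bytes:
    """Return the version with the highest timestamp, i.e. the most recent one."""
    versions = table[row_key][family][qualifier]
    return versions[max(versions)]

# Two writes with explicit timestamps: the later one wins on read.
put(b"cust-001", "address", b"city", b"Austin", timestamp=100)
put(b"cust-001", "address", b"city", b"Denver", timestamp=200)
print(get(b"cust-001", "address", b"city"))

# A write without a timestamp gets the current time, so it becomes the newest version.
auto_ts = put(b"cust-001", "address", b"city", b"Boston")
print(get(b"cust-001", "address", b"city"))
```

Older versions stay in the map next to the newest one. The read simply picks the highest timestamp.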
Data in a table is lexicographically sorted based on row keys, and to store closely related data together, a developer needs to design a good algorithm for row key composition.
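A short sketch shows why row key composition matters. Python compares `bytes` values byte by byte, which matches the lexicographic ordering described above. The key layout (`country#zero-padded id`) and the `make_row_key` helper are illustrative assumptions, not a recommended HBase schema.

```python
# Hypothetical row keys. Byte-wise comparison puts "customer-10" before "customer-2".
naive_keys = [b"customer-2", b"customer-10", b"customer-1"]
print(sorted(naive_keys))

def make_row_key(country: str, customer_id: int) -> bytes:
    """Compose a key so one country's customers sort next to each other;
    zero-padding makes numeric order agree with byte order."""
    return f"{country}#{customer_id:010d}".encode("ascii")

keys = [make_row_key(c, i) for c, i in [("US", 10), ("CA", 7), ("US", 2), ("CA", 30)]]
print(sorted(keys))
```

With the composed keys, all rows for one country are adjacent and ordered by ID, so a range read over a country prefix touches only closely related data.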
As to partitioning, HBase does it automatically based on the row keys, splitting a table into ranges of rows called regions. Still, you can manage the process by setting the start and end row keys for each region.
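Range-based partitioning by row key can be sketched as a binary search over sorted split points. The split keys and the `region_for` and `region_bounds` helpers below are hypothetical. The sketch assumes each range includes its start key and excludes its end key.

```python
from bisect import bisect_right

# Hypothetical split points: three keys define four key ranges.
# Range i covers [start, end); the first starts at b"" and the last is open-ended.
split_keys = [b"g", b"n", b"t"]

def region_for(row_key: bytes, splits=split_keys) -> int:
    """Index of the key range that contains row_key."""
    return bisect_right(splits, row_key)

def region_bounds(index: int, splits=split_keys):
    """Start (inclusive) and end (exclusive) row keys; None marks an open end."""
    start = splits[index - 1] if index > 0 else b""
    end = splits[index] if index < len(splits) else None
    return start, end

for key in [b"alice", b"g", b"mallory", b"zed"]:
    idx = region_for(key)
    print(key, idx, region_bounds(idx))
```

Changing the entries in `split_keys` is the toy equivalent of adjusting the start and end row keys by hand.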
Key takeaways on data models
- Both Hive and HBase can organize data to enable quick access to the required records and reduce query execution time (though their approaches to partitioning differ).
- Both Hive and HBase act as data management agents. When somebody says that Hive or HBase stores data, it really means the data is stored in a data store (usually in HDFS). This means the success of your Hadoop endeavor goes beyond either/or technology choices and strongly depends on other important factors, such as calculating the required cluster size correctly and integrating all the architectural components seamlessly.
Hive as an analytical query engine
Hive is specifically designed to enable data analytics. To successfully perform this task, it uses its dedicated Hive Query Language (HiveQL), which is very similar to analytics-tuned SQL.
Initially, Hive converted HiveQL queries into Hadoop MapReduce jobs, simplifying the lives of developers who could bypass more complicated MapReduce code. Running queries in Hive usually took some time, since Hive scanned all the available data sets, if not specified otherwise. It was possible to limit the volume of scanned data by specifying the partitions and buckets that Hive had to address. Anyway, that was batch processing. Nowadays, Apache Hive is also able to convert queries into Apache Tez or Apache Spark jobs.
The earliest versions of Hive did not provide record-level updates, inserts, and deletes, which was one of Hive's most serious limitations. This functionality appeared only in version 0.14.0 (though with some constraints: for example, the table's file format must be ORC).
HBase as a data manager that supports queries
Being a data manager, HBase alone is not intended for analytical queries. It doesn't have a dedicated query language. To run CRUD (create, read, update, and delete) and search queries, it has a JRuby-based shell, which offers simple data manipulation commands, such as Get, Put, and Scan. For the first two, you must specify the row key, while a scan runs over a whole range of rows.
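The difference between key-based operations and scans can be illustrated with a small in-memory stand-in. `ToyTable` and its methods are hypothetical and only imitate the behavior described above. They are not the HBase shell commands or client API, and the sample keys and `info:city` column are made up.

```python
from bisect import bisect_left
from typing import Optional

class ToyTable:
    """In-memory stand-in for an HBase table; not the real shell or client API."""

    def __init__(self):
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key: bytes, column: bytes, value: bytes) -> None:
        """Write one cell; the row key is mandatory."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key: bytes) -> dict:
        """Read one row by its exact key."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_row: bytes = b"", stop_row: Optional[bytes] = None):
        """Yield rows in key order from start_row (inclusive) to stop_row (exclusive)."""
        keys = sorted(self._rows)
        for key in keys[bisect_left(keys, start_row):]:
            if stop_row is not None and key >= stop_row:
                break
            yield key, dict(self._rows[key])

t = ToyTable()
t.put(b"US#0002", b"info:city", b"Denver")
t.put(b"CA#0007", b"info:city", b"Toronto")
t.put(b"US#0010", b"info:city", b"Austin")

print(t.get(b"CA#0007"))
# b"$" is the byte right after b"#", so this range covers every key starting with b"US#".
print([key for key, _ in t.scan(start_row=b"US#", stop_row=b"US$")])
```

Get and Put address exactly one row, while the scan walks a contiguous, sorted range of keys.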
HBase's primary purpose is to offer random read and write access to data in HDFS. At the same time, one can surely say that HBase contributes to fast analytics by enabling consistent reads. This is possible because HBase writes each row to only one server, so there is no need to compare multiple data versions from different nodes. Besides, HBase handles append operations very well. It also enables updates and deletes, but copes with these two less gracefully.
In Hive 3.0.0, indexing was removed. Prior to that, it was possible to create indexes on columns, though the advantage of faster queries had to be weighed against the cost of indexing during write operations and the extra space for storing the indexes. Anyway, Hive's data model, with its ability to group data into buckets (which can be created for any column, not only for the keyed one), offers an approach similar to the one that indexing provides.
HBase enables multi-layered indexing. But again, you have to think about the trade-off between faster read responses on the one hand and slower writes and the cost of storing indexes on the other.
Key takeaways on query performance
- Running analytical queries is exactly the task for Hive. HBase's primary task is to ingest data as well as run CRUD and search queries.
- HBase handles row-level updates, deletes, and inserts well. For Hive, these operations have long been a stumbling block, which the Hive community is working to eliminate.
To sum it up
There are many similarities between Hive and HBase. Both are data management agents, and both are strongly interconnected with HDFS. The main difference between the two is that HBase is tailored to perform CRUD and search queries, while Hive handles analytical ones. These two technologies complement each other and are frequently used together in Hadoop consulting projects so businesses can make the most of both applications' strengths.