Notes on HBase

So far as I know, HBase is the first open source “table”-style storage system in the big data space. It is an implementation of the BigTable paper published by Google. If you read the paper or the reference guide, though, HBase does not look like a table. The paper tells you that it is a

sparse, distributed, persistent, multidimensional sorted map.

There are a lot of details packed into this definition. If you want to find out what each of these terms means, you can go through this article. After reading it, you would rather call HBase a map than a table, because the data structure is

(row key, column family:qualifier, timestamp) -> value

From the perspective of an RDBMS, each table in HBase can still be thought of as, say, a table in MySQL. Data is organized into rows, and each row is composed of columns (grouped into column families). A cell is addressed by its row key and column name. So far, everything is the same as in an RDBMS: you can insert rows, manipulate fields within a row, retrieve a row by its row key, and so on. The timestamp is really an internal concept that should not be used by the user or even appear in the data model; in fact, it serves as a version number. So to the user there are still just two dimensions (row and column), as in an ordinary RDBMS table.
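To make the model concrete, here is a minimal sketch with the HBase Java client that writes and then reads one cell addressed by row key and column family:qualifier. The table name "user", the column family "info", and the qualifier "name" are made-up examples, not anything the original text prescribes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DataModelExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("user"))) {
                // Write one cell: (row key "u001", column "info:name") -> "alice"
                Put put = new Put(Bytes.toBytes("u001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
                table.put(put);

                // Read it back by addressing the cell with row key and column family:qualifier
                Get get = new Get(Bytes.toBytes("u001"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));  // prints "alice"
            }
        }
    }

Note that the timestamp never appears in the client code above; HBase assigns it internally as a version number, which is exactly why it is better left out of the user-facing data model.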

HBase has been around for quite a few years, so it is not hard to find a good introduction to what is going on inside. This blog has really nice diagrams and covers some of the most important building blocks of HBase comprehensively; I'd like to elaborate on a few of them below.

Data Organization

Each row in an HBase table is broken into cells for persistence, and each cell is in fact a key-value pair whose key is (row key, column family:qualifier, timestamp). Data is first written to an in-memory cache called the MemStore. Once the MemStore accumulates enough data, it is flushed into a persistent file called an HFile. Data in both the MemStore and the HFile is sorted by key. A single flush may generate more than one HFile, because each row is broken down by column family: if the table has two column families, each flush produces two separate HFiles.
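As a rough illustration (this is a toy sketch, not HBase's actual implementation), the MemStore behaves like an in-memory map kept sorted by cell key, so a flush can simply stream the entries out in key order and the resulting HFile is sorted for free. The encoding of the key below is made up for readability.

    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Toy sketch of the idea behind the MemStore: an in-memory map kept sorted by
    // (row key, column family:qualifier, timestamp).
    public class MemStoreSketch {
        // Encode the cell coordinates into one sortable string key.
        static String cellKey(String row, String cfAndQualifier, long timestamp) {
            // Newer versions of a cell sort first in HBase; inverting the timestamp mimics that.
            return row + "/" + cfAndQualifier + "/" + (Long.MAX_VALUE - timestamp);
        }

        public static void main(String[] args) {
            ConcurrentSkipListMap<String, byte[]> memstore = new ConcurrentSkipListMap<>();
            memstore.put(cellKey("u002", "info:name", 2L), "bob".getBytes());
            memstore.put(cellKey("u001", "info:name", 1L), "alice".getBytes());
            memstore.put(cellKey("u001", "info:age", 1L), "30".getBytes());

            // "Flushing" just walks the map in key order, so the output is already sorted.
            for (Map.Entry<String, byte[]> e : memstore.entrySet()) {
                System.out.println(e.getKey() + " -> " + new String(e.getValue()));
            }
        }
    }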

A row is not stored as a whole, especially if it has been updated after insertion. Therefore, each time the client reads a row, HBase may have to go through the MemStore and several HFiles to bring all the fields together and assemble the row. This is known as the read amplification problem.

HBase keeps its metadata in a special HBase table called hbase:meta. The metadata maintains a set of pointers where the key is a three-value tuple [table, region start key, region id] and the value is the hosting RegionServer. If the client wants to find a specific row key, it goes through this metadata to find the right RegionServer; the RegionServer then goes through its managed regions to find the required data. An HFile carries multiple layers of indexes, bloom filters, and time range information so that as much unnecessary data as possible can be skipped.
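The client library performs this metadata lookup transparently, but the same information can be inspected through the RegionLocator API. A minimal sketch follows; the table name and row key are placeholders of my own.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RegionLookupExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 RegionLocator locator = conn.getRegionLocator(TableName.valueOf("user"))) {
                // Ask which region (and thus which RegionServer) holds a given row key.
                HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("u001"));
                System.out.println("row u001 is served by " + location.getServerName());
            }
        }
    }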

Data Write

Writing data into HBase has three steps. First, the data is written to a WAL on HDFS. Second, the data is written to the MemStore. Finally, when the MemStore reaches a size threshold, the data is flushed to HDFS as HFiles. In fact, once the data is durably recorded in the WAL, the write is already safe, even though it is not yet visible at that moment: if the RegionServer crashes before the MemStore is flushed, the data can be replayed from the WAL once the region comes back online.
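For what it is worth, the client can also tune how aggressively an edit is synced to the WAL through the Durability setting on a mutation. A small sketch, with made-up table and column names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WriteDurabilityExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("user"))) {
                Put put = new Put(Bytes.toBytes("u003"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("carol"));
                // SYNC_WAL makes sure the edit is synced to the WAL before the call
                // returns; SKIP_WAL trades away exactly the crash safety described above.
                put.setDurability(Durability.SYNC_WAL);
                table.put(put);
            }
        }
    }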

HBase does not support cross-row transactions. However, it does support atomic intra-row operations, and since a row is really a set of key-value pairs, the key-value pairs belonging to one row are effectively applied as a single transaction.
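A small sketch of that intra-row atomicity, using RowMutations to apply a put and a delete to the same row as one atomic unit. The row, family, and qualifiers are illustrative names of my own.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowAtomicityExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("user"))) {
                byte[] row = Bytes.toBytes("u001");
                byte[] cf = Bytes.toBytes("info");

                Put put = new Put(row);
                put.addColumn(cf, Bytes.toBytes("city"), Bytes.toBytes("Tokyo"));
                Delete delete = new Delete(row);
                delete.addColumn(cf, Bytes.toBytes("temp"));

                // Both mutations touch the same row, so HBase applies them atomically:
                // a reader sees either none or all of them.
                RowMutations mutations = new RowMutations(row);
                mutations.add(put);
                mutations.add(delete);
                table.mutateRow(mutations);
            }
        }
    }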

HBase provides strong consistency for reads and writes, which means every read sees the result of the most recently completed write; the result depends solely on when the request reaches HBase. In the HBase architecture this is possible because each region is served by a single RegionServer at a time, and HDFS gives the RegionServers a consistent view of the underlying files, including the WAL.

Compaction and Split

HDFS is designed for batch processing, so a huge number of small files causes an unacceptable memory footprint on the NameNode (and makes read amplification worse). HBase therefore compacts HFiles periodically to keep the file count down. There are two kinds of compaction: minor and major. A minor compaction collects a subset of HFiles (belonging to the same column family) and merges them into one. A major compaction merges all HFiles belonging to the same column family into one. Since a major compaction consumes quite a lot of resources, it should be scheduled carefully. From this point of view, HBase is not really suitable for use cases where data is rewritten very frequently.
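For reference, compactions can also be requested explicitly through the Admin API; the requests are asynchronous and the servers schedule the actual work, which is why a major compaction is usually triggered during off-peak hours. The table name below is a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CompactionExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                TableName table = TableName.valueOf("user");
                // Request a minor compaction of the table's store files.
                admin.compact(table);
                // Request a major compaction; the call returns immediately and the
                // RegionServers perform the merge in the background.
                admin.majorCompact(table);
            }
        }
    }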

A region is a contiguous range of row keys. Once a region reaches a certain size (e.g., 1 GB), it is split into two. This keeps regions from growing too big, which matters because the region is the unit of parallel access in HBase; a MapReduce program, for instance, treats each region as one split. Regions are split automatically by default, but the user can also predefine the split boundaries when creating the table, as sketched below.
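Here is a sketch of pre-splitting at table creation time with the HBase 2.x Admin API. The table name, column family, and split boundaries are made up for illustration; three boundaries yield four initial regions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptor;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                TableDescriptor desc = TableDescriptorBuilder
                        .newBuilder(TableName.valueOf("events"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                        .build();
                // Pre-split into four regions at these row key boundaries, so writes
                // spread across RegionServers from day one instead of waiting for
                // automatic splits.
                byte[][] splitKeys = {
                        Bytes.toBytes("25"),
                        Bytes.toBytes("50"),
                        Bytes.toBytes("75"),
                };
                admin.createTable(desc, splitKeys);
            }
        }
    }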

If regions are split only automatically, one needs to be careful about the row key design. For instance, if the row keys are monotonically increasing numbers, newly arriving data will always be inserted into the latest region, creating a write hotspot. In general, row keys should be spread out enough to prevent both read and write hotspots; one common trick is salting the key, sketched below.
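The following toy sketch salts a monotonically increasing sequence number with a small hash-derived bucket prefix so that consecutive keys land in different regions. The bucket count and key layout are my own invention for illustration.

    import org.apache.hadoop.hbase.util.Bytes;

    // Toy sketch of "salting" a monotonically increasing row key.
    public class SaltedKeyExample {
        static final int BUCKETS = 16;

        static byte[] saltedKey(long sequenceId) {
            int bucket = Math.floorMod(Long.hashCode(sequenceId), BUCKETS);
            // e.g. "07|0000000000000123": the prefix spreads writes across regions,
            // while the zero-padded sequence keeps ordering within each bucket so
            // per-bucket range scans are still possible.
            return Bytes.toBytes(String.format("%02d|%016d", bucket, sequenceId));
        }

        public static void main(String[] args) {
            for (long id = 100; id < 105; id++) {
                System.out.println(Bytes.toString(saltedKey(id)));
            }
        }
    }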

Finally, I’d like to give my two cents on HBase:

  • Better documentation: The current documentation is not well written. The text contains a lot of references to Jira issues and external articles, which is really disruptive. As discussed above, HBase is really about tables, not some complicated map. I understand the differences, but as a beginner I prefer concepts closer to my existing mindset.
  • SQL interface: Providing a SQL-like interface would be much friendlier than the raw Java APIs and would help users play with HBase much more easily.
  • Predefined partitioning: Predefined partitioning feels more controllable than automatic splits. There is a way to define the split boundaries at table creation, but it still feels quite indirect. In practice, people usually know how to partition their data for the best performance; this is a well-adopted practice with RDBMSs.
  • Local filesystem: HDFS is designed for batch access, whereas HBase is used for random access, even with frequent updates. A local filesystem might be a better fit.

I also read about Kudu and MapR-DB. From an architectural point of view, they share a lot of design patterns with HBase. The difference is that Kudu and MapR-DB were built on the shoulders of HBase, so they could avoid some of its pitfalls. Anyway, HBase is still a very well designed piece of software and makes a very good study case. Thanks to the community and the contributors.