Support Questions

yebinianhi · ‎05-05-2017

I would like some advice about the HBase schema design.

For example, there are 2000 patients,

1. Each patient has a name, sex, age, hospital_ID.

2. Each patient will be recorded activity data such as heart bits, location and steps every minute.

3. Each patient will take several questionnaires.

how to organise the HBase table?

My current idea is to use the patient_ID as the row key. each patient will have only one row in the HBase table. But, all activity data will be grouped in the nested table. The activity data table will have millions of rows.

So, the table will have three column families.

CF1:info,

CF2:activity_data,

CF3:questionnaires.

Then,

CF1:info includes (name, sex, age, ID)

CF2:activity_data (data(a nested table))

CF3:questionnaires (questionnaired_ID (a nested table))

I don't know whether this is a smart way to design the HBase schema.

Please provide me with some advice.

Thank you very much

gkeys · ‎05-05-2017

Below are the two most important aspects of HBase table design.

Key design

The most important design aspect is row key design. Proper row key design allows subsecond query results against billions of rows of data because row keys are hashed and sorted which results in fast random find and contiguous scanning of rows.

Your query pattern determines row key design. Thus if you want to query by patient, patient id is the proper row key. Keep in mind that row keys can be composite (multiple keys concatenated). For example, if you want to query by hospital but only over a range of dates, you would want a compound key of hospital_id-date. In this case, a query of a date range for a hospital will find and scan the contiguous rows of hospital-date1 to hospital-date2 and avoid scanning all other rows (even if there were a billion rows).

Column Family design

There is really no such thing as a nested table in hbase ... sometimes it is called nested entity. The main idea is really a column family. A single column family contains one or more columns. Column families must be defined at table creation time but columns can be added dynamically after table creation (if an insert statement states a column that does not exist for a column family it will create it). Column families thus can be seen as holding an array of information that may have different lengths among rows (keys). You do not have to use it that way: you can always use identical columns for each column family. Another feature of column families is that they are written to their own files. Thus queries read only the column families holding the columns in the queries. This allows you to design very wide tables (hundreds of columns) and read only a subset of columns for each query (resulting in faster performance). Also, column families can have different properties, e.g one can be compressed and others not. Thus the general rule is to group columns that will be queried together into the same column family and allow the number of columns in a column family to be dynamic among records if you wish.

Altogether a row key defines a single row or record. All rows of a table have the same number of one or more column families. Each column family can have the same or different numbers of columns among rows because new columns can be added at insert-time for a particular record, and not necessarily at table create-time (for all records).

The best reference for HBase table design is probably: http://shop.oreilly.com/product/0636920014348.do

This is a good overview of key design and scan behavior. It is for using HBase with the Phoenix SQL interface (which creates and queries native HBase tables underneath): https://www.slideshare.net/Hadoop_Summit/w-145p230-ataylorv2

View solution in original post

ebinjoy999 · ‎08-27-2018

Great, explained well.

Cloudera Community

Support Questions

HBase schema design for complex data