Support Questions

yebinianhi · ‎05-05-2017

I would like some advice about the HBase schema design.

For example, there are 2000 patients,

1. Each patient has a name, sex, age, hospital_ID.

2. Each patient will be recorded activity data such as heart bits, location and steps every minute.

3. Each patient will take several questionnaires.

how to organise the HBase table?

My current idea is to use the patient_ID as the row key. each patient will have only one row in the HBase table. But, all activity data will be grouped in the nested table. The activity data table will have millions of rows.

So, the table will have three column families.

CF1:info,

CF2:activity_data,

CF3:questionnaires.

Then,

CF1:info includes (name, sex, age, ID)

CF2:activity_data (data(a nested table))

CF3:questionnaires (questionnaired_ID (a nested table))

I don't know whether this is a smart way to design the HBase schema.

Please provide me with some advice.

Thank you very much

gkeys · ‎05-05-2017

Below are the two most important aspects of HBase table design.

Key design

The most important design aspect is row key design. Proper row key design allows subsecond query results against billions of rows of data because row keys are hashed and sorted which results in fast random find and contiguous scanning of rows.

Your query pattern determines row key design. Thus if you want to query by patient, patient id is the proper row key. Keep in mind that row keys can be composite (multiple keys concatenated). For example, if you want to query by hospital but only over a range of dates, you would want a compound key of hospital_id-date. In this case, a query of a date range for a hospital will find and scan the contiguous rows of hospital-date1 to hospital-date2 and avoid scanning all other rows (even if there were a billion rows).

Column Family design

There is really no such thing as a nested table in hbase ... sometimes it is called nested entity. The main idea is really a column family. A single column family contains one or more columns. Column families must be defined at table creation time but columns can be added dynamically after table creation (if an insert statement states a column that does not exist for a column family it will create it). Column families thus can be seen as holding an array of information that may have different lengths among rows (keys). You do not have to use it that way: you can always use identical columns for each column family. Another feature of column families is that they are written to their own files. Thus queries read only the column families holding the columns in the queries. This allows you to design very wide tables (hundreds of columns) and read only a subset of columns for each query (resulting in faster performance). Also, column families can have different properties, e.g one can be compressed and others not. Thus the general rule is to group columns that will be queried together into the same column family and allow the number of columns in a column family to be dynamic among records if you wish.

Altogether a row key defines a single row or record. All rows of a table have the same number of one or more column families. Each column family can have the same or different numbers of columns among rows because new columns can be added at insert-time for a particular record, and not necessarily at table create-time (for all records).

The best reference for HBase table design is probably: http://shop.oreilly.com/product/0636920014348.do

This is a good overview of key design and scan behavior. It is for using HBase with the Phoenix SQL interface (which creates and queries native HBase tables underneath): https://www.slideshare.net/Hadoop_Summit/w-145p230-ataylorv2

View solution in original post

gkeys · ‎05-05-2017

Below are the two most important aspects of HBase table design.

Key design

The most important design aspect is row key design. Proper row key design allows subsecond query results against billions of rows of data because row keys are hashed and sorted which results in fast random find and contiguous scanning of rows.

Your query pattern determines row key design. Thus if you want to query by patient, patient id is the proper row key. Keep in mind that row keys can be composite (multiple keys concatenated). For example, if you want to query by hospital but only over a range of dates, you would want a compound key of hospital_id-date. In this case, a query of a date range for a hospital will find and scan the contiguous rows of hospital-date1 to hospital-date2 and avoid scanning all other rows (even if there were a billion rows).

Column Family design

There is really no such thing as a nested table in hbase ... sometimes it is called nested entity. The main idea is really a column family. A single column family contains one or more columns. Column families must be defined at table creation time but columns can be added dynamically after table creation (if an insert statement states a column that does not exist for a column family it will create it). Column families thus can be seen as holding an array of information that may have different lengths among rows (keys). You do not have to use it that way: you can always use identical columns for each column family. Another feature of column families is that they are written to their own files. Thus queries read only the column families holding the columns in the queries. This allows you to design very wide tables (hundreds of columns) and read only a subset of columns for each query (resulting in faster performance). Also, column families can have different properties, e.g one can be compressed and others not. Thus the general rule is to group columns that will be queried together into the same column family and allow the number of columns in a column family to be dynamic among records if you wish.

Altogether a row key defines a single row or record. All rows of a table have the same number of one or more column families. Each column family can have the same or different numbers of columns among rows because new columns can be added at insert-time for a particular record, and not necessarily at table create-time (for all records).

The best reference for HBase table design is probably: http://shop.oreilly.com/product/0636920014348.do

This is a good overview of key design and scan behavior. It is for using HBase with the Phoenix SQL interface (which creates and queries native HBase tables underneath): https://www.slideshare.net/Hadoop_Summit/w-145p230-ataylorv2

yebinianhi · ‎05-05-2017

Hi Greg,

Many thanks for your reply.

I have one question about the column family and column design.

If I have a column family called info which stores users name, age, sex and so on. And another column family store a time series activity data. for example, each patient will record 1 billion times for all 5 different features. Then, the column family "info" will be duplicated 1 billion times.

How could i avoid this problem?

Many thanks for your help in advance

gkeys · ‎05-05-2017

This is where Phoenix will be quite useful: it is a SQL interface to HBase and does joins with other tables (but they should be simple joins ... usually just one table). So you can have one table with both patient_id and time forming a composite key with columns holding data on that patient that changes with time. Then you can have a lookup table with patient_id as key and join the two whenever you need to. The link in the original answer with the slide share shows you how to build these tables. This may help also: https://phoenix.apache.org/joins.html

I am not sure of your specific query patterns but based on the info given, this should solve your problem.

yebinianhi · ‎05-05-2017

Many thanks for your help.

you really do me a big favour

I will have a look of the materials which you provided first.

thanks,

gkeys · ‎05-12-2017

@Bin Ye ... how is your follow-up going?

yebinianhi · ‎05-12-2017

@Greg Keys Hi, thanks for asking. I decided to split the table and store the data separately. Thanks.

gkeys · ‎05-12-2017

@Bin YeIf you found the answer useful, please accept or upvote ... that is how the community works 🙂

yebinianhi · ‎05-12-2017

thanks, i accpeted and voted

gkeys · ‎05-12-2017

Hi @Bin Ye Keep posting (questions, answers, articles) and sharing your experience ... everyone in the community benefits 🙂

Cloudera Community

Support Questions

HBase schema design for complex data