How to perform denormalization in HBase?

Expert Contributor

Hi All,

We are trying to migrate our existing RDBMS (SQL database) system to Hadoop, and we are planning to use HBase for it. However, we are not sure how to denormalize the SQL data so that it can be stored in HBase's column-oriented format.

Is it possible? If so, what would be the best approach?

Which HBase version is required for this?

Any suggestions would be appreciated.

1 ACCEPTED SOLUTION

Guru

@Mohan V

I would:

  1. Land the data in a landing zone in HDFS. Decide whether to keep this zone going forward (you may want to reuse the raw data).
  2. Use Pig scripts to transform the data into tab-delimited output, one file per HBase table (see the next step and the Pig sketch after this list).
  • Importantly, this involves inserting a key as the first column of the resulting TSV file. HBase, of course, is all about well-designed keys.
  • Use Pig's CONCAT() function to build a key from existing fields. It is often useful to concatenate the fields into a composite key with a "-" separating each field.
  • Each TSV output is used to bulk load a single HBase table (next step). Write these files to a tmp directory in HDFS so they can serve as input to the next step.
  • Note: you could take your Pig scripting to the next level and create a single flexible Pig script that generates the TSV output for all HBase tables. See https://community.hortonworks.com/content/kbentry/51884/pig-doing-yoga-how-to-build-superflexible-pi.... It is not necessary, though.
  3. Do a bulk import into the corresponding HBase table for each TSV. See the HBase documentation on bulk loading, and the command sketch at the end of this post. (Inserting record by record will be much too slow for large tables.)
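
As a rough sketch of step 2, here is what a minimal Pig script could look like. The input path, field names, and the fields chosen for the composite key are only placeholders for illustration:

    -- load the raw relational export from the HDFS landing zone
    customers = LOAD '/landing/customers' USING PigStorage(',')
        AS (customer_id:chararray, region:chararray, name:chararray, email:chararray);

    -- build a composite row key from existing fields, separated by "-"
    keyed = FOREACH customers GENERATE
        CONCAT(region, CONCAT('-', customer_id)) AS rowkey,
        name, email;

    -- write a tab-delimited file to a tmp dir in HDFS for the bulk-load step
    STORE keyed INTO '/tmp/hbase_load/customers' USING PigStorage('\t');

The row key ends up as the first column of the TSV, which is exactly where the bulk-load tooling expects it.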

I have used this workflow frequently, including loading 2.53 billion relational records into an HBase table. The more you do it, the more of it you find yourself automating.
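
For the bulk import in step 3, one common route is ImportTsv with bulk output followed by LoadIncrementalHFiles. The table name, column family, and paths below are placeholders, and the target table (with its column family) must already exist:

    # generate HFiles from the TSV produced by Pig (column mapping must match the TSV layout)
    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
      -Dimporttsv.columns=HBASE_ROW_KEY,cf:name,cf:email \
      -Dimporttsv.bulk.output=/tmp/hbase_load/customers_hfiles \
      customers /tmp/hbase_load/customers

    # move the generated HFiles into the live table
    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
      /tmp/hbase_load/customers_hfiles customers

ImportTsv treats tab as its default separator, which matches the Pig output above.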
