
HBase: Composite key for ImportTsv

Rising Star

Hi dear experts!

I'm trying to load data from CSV files on HDFS into HBase with ImportTsv.

It works perfectly fine when HBASE_ROW_KEY is a single CSV column, but I don't know how to create a composite HBASE_ROW_KEY from two columns.

For example, I have a CSV with 3 columns:
row1, 1, abc
row1, 2, dd
row2, 1, iop
row3, 1, kk

and a row can be uniquely identified by the first two columns.

Any input will be highly appreciated!

1 ACCEPTED SOLUTION

Cloudera Employee
ImportTsv is not currently capable of this. You would have to preprocess the columns first to create the composite key in a new CSV file, then use ImportTsv to import that new file, which contains the composite key column and its data.
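As a rough illustration of that preprocessing step, here is a minimal Python sketch; the underscore separator and the three-column layout from the question are illustrative assumptions, not anything ImportTsv requires:

```python
import csv
import io

def add_composite_key(in_file, out_file, sep="_"):
    """Prepend a composite key column built from the first two CSV columns."""
    reader = csv.reader(in_file, skipinitialspace=True)
    writer = csv.writer(out_file)
    for row in reader:
        composite = row[0] + sep + row[1]   # e.g. "row1_1" (separator is an assumption)
        writer.writerow([composite] + row)  # new key column first, then original columns

# Small demo on the sample rows from the question
src = io.StringIO("row1, 1, abc\nrow1, 2, dd\n")
dst = io.StringIO()
add_composite_key(src, dst)
print(dst.getvalue())
```

The new first column can then be mapped to HBASE_ROW_KEY in the ImportTsv column specification.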


5 REPLIES

Mentor
ImportTsv is a simple utility and does not currently support this.

Perhaps you can take a look at Kite SDK's HBase and CSV dataset handling capabilities, which are capable of these tasks (although it uses the more efficient Avro encoding instead of plaintext during serialisation). Read more at http://kitesdk.org/docs/1.1.0/

Mentor
Of course, an easier way to use ImportTsv itself is to transform your CSV input via a custom mapper (passed via the configuration key "importtsv.mapper.class") and "merge" the two columns together before the CSV parser maps them into the designated fields.

This is the default Map class for ImportTSV, for reference: https://github.com/cloudera/hbase/blob/cdh5.7.0-release/hbase-server/src/main/java/org/apache/hadoop...


New Contributor

Hi! Sorry for digging up this thread, but I am currently facing the same problem and decided to run an MR job to transform my data before importing it.

However, I am unsure what the data output should look like for HBase to understand it. As far as I know, HBase saves everything as bytes anyway, but handles timestamps differently. So, say I want to use Factory_ID:YYYYMMDD:Order_ID:UID as my composite key. Should I output the fields with ":" as a separator, or just one after another? Will HBase be able to use this information to shard the table into different regions?

Thanks in advance!

Mentor
You are right that it's all just byte sequences to HBase, and that it sorts everything lexicographically. You do not need a separator character for HBase to understand field boundaries (it would not serve as one), unless you want the extra bytes for readability, or for recovering the individual data elements from variable-length keys if that's a use case.
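To make that concrete, here is a hypothetical Python sketch; the field names come from the question above, and the zero-padded widths are assumptions. It shows why separator-free composite keys need fixed-width fields to keep the lexicographic byte order meaningful:

```python
def make_key(factory_id: str, date: str, order_id: int, uid: str) -> bytes:
    # Without separators, fields must be fixed-width: otherwise "1"+"23"
    # and "12"+"3" would produce identical byte sequences.
    # Widths (4 for factory, 8 for order id) are illustrative assumptions.
    return (factory_id.zfill(4) + date + str(order_id).zfill(8) + uid).encode()

k1 = make_key("7", "20230101", 42, "u1")
k2 = make_key("7", "20230102", 1, "u1")
print(k1 < k2)  # lexicographic byte order follows the date order
```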

HBase 'sharding' (splitting) can be specified manually at table creation time if you know your key pattern and ranges; this is strongly recommended so the table scales from the beginning. Otherwise, HBase computes key midpoints by analysing the keys in byte form and splits on those whenever a region reaches the split size threshold.
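If you pre-split, you supply the boundary keys yourself at table creation (for example via the SPLITS option in the HBase shell). A hypothetical sketch, assuming keys lead with a zero-padded numeric Factory_ID as above, of computing evenly spaced region boundaries:

```python
def factory_split_points(num_factories: int, num_regions: int) -> list:
    """Boundary keys dividing a zero-padded factory-id keyspace into regions.

    The 4-character zero padding matches the illustrative key layout above.
    """
    step = num_factories // num_regions
    return [str(i * step).zfill(4).encode() for i in range(1, num_regions)]

print(factory_split_points(100, 4))  # three boundaries -> four regions
```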