Support Questions

Find answers, ask questions, and share your expertise

Join two tables into one HBase table

avatar

I have two telecom csv files : one is for CDR(Call Data Records) and one for CRM(Customer Relationship Management).

The CDR file has the following columns : Anumber, ContractCode, Aoperator, Bnumber, Boperator, Direction, Type, Month, Category, numberOfCalls, duration, longitude, latitude.

The CRM file has the following columns : Anumber, age, day_c, month_c, year_c, zip_code, offerType, offer, gender.

I want to create a HBase table joining the data from the CRM table to the corresponding rows, for machine learning purposes.

Any ideas ? Thanks !

1 ACCEPTED SOLUTION

avatar

You can create a table with some column family of interest and run the ImportTsv on the both the files separately by specifying the Anumber field as HBASE_ROW_KEY then columns in the same Anumber will be combined into single row. Is this something you are looking?

Ex:

 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:age,cf:day_c,cf:month_c,cf:year_c,cf:zip_code,cf:offerType,cf:offer,cf:gender -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:ContractCode,cf:Aoperator,cf:Bnumber, cf:Boperator, cf:Direction, cf:Type, cf:Month, cf:Category, cf:numberOfCalls, cf:duration, cf:longitude, cf:latitude -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile

View solution in original post

5 REPLIES 5

avatar

@Houssem Alayet is Anumber common and going to be row key for the table?

avatar

Anumber can be found in multiple rows, it is the phone number of the caller and yes I want to use it as row key

avatar

You can create a table with some column family of interest and run the ImportTsv on the both the files separately by specifying the Anumber field as HBASE_ROW_KEY then columns in the same Anumber will be combined into single row. Is this something you are looking?

Ex:

 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:age,cf:day_c,cf:month_c,cf:year_c,cf:zip_code,cf:offerType,cf:offer,cf:gender -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:ContractCode,cf:Aoperator,cf:Bnumber, cf:Boperator, cf:Direction, cf:Type, cf:Month, cf:Category, cf:numberOfCalls, cf:duration, cf:longitude, cf:latitude -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile

avatar

In the CDR file, a caller could make multiple calls to different numbers (Bnumber), what I want is to add the corresponding CRM information to each row and not combining them all into one. Do you see what I am after ?

avatar

Ya got it. Just combining that from raw files is not possible. Instead you need to create two tables for CDR data and CRM data and you can write MR job or java client based on data size with following steps.

1) Scan CDR table and get Bnumber

2) Call get on CRM table to get the corresponding details.

3) Prepared puts from the get call on CRM and add them to the ROW get from CDR and write the CDR.

or else you can use Apache Phoenix so that you can utilise UPSERT SELECT features which simplify things.

http://phoenix.apache.org/

http://phoenix.apache.org/language/index.html#upsert_select