Created 03-23-2017 08:41 AM
I have two telecom CSV files: one for CDR (Call Data Records) and one for CRM (Customer Relationship Management).
The CDR file has the following columns: Anumber, ContractCode, Aoperator, Bnumber, Boperator, Direction, Type, Month, Category, numberOfCalls, duration, longitude, latitude.
The CRM file has the following columns: Anumber, age, day_c, month_c, year_c, zip_code, offerType, offer, gender.
I want to create an HBase table that joins the CRM data to the corresponding CDR rows, for machine learning purposes.
Any ideas? Thanks!
Created 03-23-2017 08:50 AM
@Houssem Alayet Is Anumber common to both files, and is it going to be the row key for the table?
Created 03-23-2017 10:22 AM
Anumber can be found in multiple rows; it is the phone number of the caller, and yes, I want to use it as the row key.
Created 03-23-2017 10:32 AM
You can create a table with a column family of interest and run ImportTsv on both files separately, specifying the Anumber field as HBASE_ROW_KEY; columns with the same Anumber will then be combined into a single row. Is this what you are looking for?
Ex:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,cf:age,cf:day_c,cf:month_c,cf:year_c,cf:zip_code,cf:offerType,cf:offer,cf:gender -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,cf:ContractCode,cf:Aoperator,cf:Bnumber,cf:Boperator,cf:Direction,cf:Type,cf:Month,cf:Category,cf:numberOfCalls,cf:duration,cf:longitude,cf:latitude -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
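If the target table does not exist yet, here is a minimal sketch of creating it with a single column family through the HBase Java Admin API; the table name "datatsv" and the family "cf" are just placeholders mirroring the example commands above, so adjust them to your own naming.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateImportTable {
    public static void main(String[] args) throws Exception {
        // Placeholder table/family names mirroring the importtsv example above.
        TableName tableName = TableName.valueOf("datatsv");
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            if (!admin.tableExists(tableName)) {
                HTableDescriptor descriptor = new HTableDescriptor(tableName);
                descriptor.addFamily(new HColumnDescriptor("cf"));
                admin.createTable(descriptor);
            }
        }
    }
}

Also note that -Dimporttsv.bulk.output makes the job write HFiles instead of loading the data directly, so you still have to run the completebulkload tool afterwards on the output directory to load it into the table.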
Created 03-23-2017 10:36 AM
In the CDR file, a caller could make multiple calls to different numbers (Bnumber). What I want is to add the corresponding CRM information to each row, not combine them all into one. Do you see what I am after?
Created 03-23-2017 10:54 AM
Yes, got it. Combining that directly from the raw files is not possible. Instead, you need to create two tables, one for the CDR data and one for the CRM data, and then write an MR job or a Java client (depending on data size) with the following steps:
1) Scan the CDR table and get the Anumber of each row.
2) Call Get on the CRM table to fetch the corresponding details.
3) Prepare Puts from the CRM Get result, add them to the row retrieved from CDR, and write it back to the CDR table (see the sketch below).
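Here is a rough, hypothetical sketch of steps 1-3 with the plain HBase Java client. The table names "cdr" and "crm", the column family "cf", and an "Anumber" column in the CDR table are assumptions; the CDR row key is also assumed to be something like Anumber plus a call-specific suffix so individual calls are not collapsed.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CdrCrmJoin {
    private static final byte[] CF = Bytes.toBytes("cf"); // assumed column family

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table cdr = connection.getTable(TableName.valueOf("cdr"));   // assumed table name
             Table crm = connection.getTable(TableName.valueOf("crm"))) { // assumed table name

            // 1) Scan the CDR table; Anumber is assumed to be stored as a column.
            try (ResultScanner scanner = cdr.getScanner(new Scan().addFamily(CF))) {
                for (Result cdrRow : scanner) {
                    byte[] anumber = cdrRow.getValue(CF, Bytes.toBytes("Anumber"));
                    if (anumber == null) {
                        continue;
                    }

                    // 2) Get the matching CRM row (CRM row key = Anumber).
                    Result crmRow = crm.get(new Get(anumber));
                    if (crmRow.isEmpty()) {
                        continue;
                    }

                    // 3) Copy every CRM cell into a Put on the CDR row and write it back.
                    Put put = new Put(cdrRow.getRow());
                    for (Cell cell : crmRow.rawCells()) {
                        put.addColumn(CF, CellUtil.cloneQualifier(cell), CellUtil.cloneValue(cell));
                    }
                    cdr.put(put);
                }
            }
        }
    }
}

For larger data sets you would do the same thing inside a MapReduce job over the CDR table instead of a single-threaded scan, and batch the Gets and Puts.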
Or else you can use Apache Phoenix, so that you can use its UPSERT SELECT feature, which simplifies things.