Created 03-23-2017 08:41 AM
I have two telecom CSV files: one for CDR (Call Data Records) and one for CRM (Customer Relationship Management).
The CDR file has the following columns: Anumber, ContractCode, Aoperator, Bnumber, Boperator, Direction, Type, Month, Category, numberOfCalls, duration, longitude, latitude.
The CRM file has the following columns: Anumber, age, day_c, month_c, year_c, zip_code, offerType, offer, gender.
I want to create an HBase table that joins the CRM data to the corresponding CDR rows, for machine learning purposes.
Any ideas? Thanks!
Created 03-23-2017 08:50 AM
@Houssem Alayet Is Anumber common to both files, and is it going to be the row key for the table?
Created 03-23-2017 10:22 AM
Anumber can be found in multiple rows; it is the phone number of the caller, and yes, I want to use it as the row key.
Created 03-23-2017 10:32 AM
You can create a table with a column family of interest and run ImportTsv on both files separately, specifying the Anumber field as HBASE_ROW_KEY; columns with the same Anumber will then be combined into a single row. Is this what you are looking for?
Ex:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,cf:age,cf:day_c,cf:month_c,cf:year_c,cf:zip_code,cf:offerType,cf:offer,cf:gender -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,cf:ContractCode,cf:Aoperator,cf:Bnumber,cf:Boperator,cf:Direction,cf:Type,cf:Month,cf:Category,cf:numberOfCalls,cf:duration,cf:longitude,cf:latitude -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
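If the target table does not exist yet, here is a minimal sketch of creating it with a single column family through the HBase Java Admin API; the table name "datatsv" and the family "cf" are just placeholders mirroring the example commands above, so adjust them to your own naming.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateImportTable {
    public static void main(String[] args) throws Exception {
        // Placeholder table/family names mirroring the importtsv example above.
        TableName tableName = TableName.valueOf("datatsv");
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            if (!admin.tableExists(tableName)) {
                HTableDescriptor descriptor = new HTableDescriptor(tableName);
                descriptor.addFamily(new HColumnDescriptor("cf"));
                admin.createTable(descriptor);
            }
        }
    }
}

Also note that -Dimporttsv.bulk.output makes the job write HFiles instead of loading the data directly, so you still have to run the completebulkload tool afterwards on the output directory to load it into the table.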
Created 03-23-2017 10:36 AM
In the CDR file, a caller could make multiple calls to different numbers (Bnumber). What I want is to add the corresponding CRM information to each row, not combine them all into one. Do you see what I am after?
Created 03-23-2017 10:54 AM
Yes, got it. Combining that directly from the raw files is not possible. Instead, you need to create two tables, one for the CDR data and one for the CRM data, and then write an MR job or a Java client (depending on data size) with the following steps:
1) Scan the CDR table and get the Anumber of each row.
2) Call Get on the CRM table to fetch the corresponding details.
3) Prepare Puts from the CRM Get result, add them to the row retrieved from CDR, and write it back to the CDR table (see the sketch below).
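Here is a rough, hypothetical sketch of steps 1-3 with the plain HBase Java client. The table names "cdr" and "crm", the column family "cf", and an "Anumber" column in the CDR table are assumptions; the CDR row key is also assumed to be something like Anumber plus a call-specific suffix so individual calls are not collapsed.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CdrCrmJoin {
    private static final byte[] CF = Bytes.toBytes("cf"); // assumed column family

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table cdr = connection.getTable(TableName.valueOf("cdr"));   // assumed table name
             Table crm = connection.getTable(TableName.valueOf("crm"))) { // assumed table name

            // 1) Scan the CDR table; Anumber is assumed to be stored as a column.
            try (ResultScanner scanner = cdr.getScanner(new Scan().addFamily(CF))) {
                for (Result cdrRow : scanner) {
                    byte[] anumber = cdrRow.getValue(CF, Bytes.toBytes("Anumber"));
                    if (anumber == null) {
                        continue;
                    }

                    // 2) Get the matching CRM row (CRM row key = Anumber).
                    Result crmRow = crm.get(new Get(anumber));
                    if (crmRow.isEmpty()) {
                        continue;
                    }

                    // 3) Copy every CRM cell into a Put on the CDR row and write it back.
                    Put put = new Put(cdrRow.getRow());
                    for (Cell cell : crmRow.rawCells()) {
                        put.addColumn(CF, CellUtil.cloneQualifier(cell), CellUtil.cloneValue(cell));
                    }
                    cdr.put(put);
                }
            }
        }
    }
}

For larger data sets you would do the same thing inside a MapReduce job over the CDR table instead of a single-threaded scan, and batch the Gets and Puts.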
Or else you can use Apache Phoenix, so that you can use its UPSERT SELECT feature, which simplifies things.