- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Join two tables into one HBase table
- Labels:
-
Apache HBase
Created ‎03-23-2017 08:41 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have two telecom csv files : one is for CDR(Call Data Records) and one for CRM(Customer Relationship Management).
The CDR file has the following columns : Anumber, ContractCode, Aoperator, Bnumber, Boperator, Direction, Type, Month, Category, numberOfCalls, duration, longitude, latitude.
The CRM file has the following columns : Anumber, age, day_c, month_c, year_c, zip_code, offerType, offer, gender.
I want to create a HBase table joining the data from the CRM table to the corresponding rows, for machine learning purposes.
Any ideas ? Thanks !
Created ‎03-23-2017 10:32 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You can create a table with some column family of interest and run the ImportTsv on the both the files separately by specifying the Anumber field as HBASE_ROW_KEY then columns in the same Anumber will be combined into single row. Is this something you are looking?
Ex:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:age,cf:day_c,cf:month_c,cf:year_c,cf:zip_code,cf:offerType,cf:offer,cf:gender -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:ContractCode,cf:Aoperator,cf:Bnumber, cf:Boperator, cf:Direction, cf:Type, cf:Month, cf:Category, cf:numberOfCalls, cf:duration, cf:longitude, cf:latitude -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
Created ‎03-23-2017 08:50 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Houssem Alayet is Anumber common and going to be row key for the table?
Created ‎03-23-2017 10:22 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Anumber can be found in multiple rows, it is the phone number of the caller and yes I want to use it as row key
Created ‎03-23-2017 10:32 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You can create a table with some column family of interest and run the ImportTsv on the both the files separately by specifying the Anumber field as HBASE_ROW_KEY then columns in the same Anumber will be combined into single row. Is this something you are looking?
Ex:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:age,cf:day_c,cf:month_c,cf:year_c,cf:zip_code,cf:offerType,cf:offer,cf:gender -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:ContractCode,cf:Aoperator,cf:Bnumber, cf:Boperator, cf:Direction, cf:Type, cf:Month, cf:Category, cf:numberOfCalls, cf:duration, cf:longitude, cf:latitude -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
Created ‎03-23-2017 10:36 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
In the CDR file, a caller could make multiple calls to different numbers (Bnumber), what I want is to add the corresponding CRM information to each row and not combining them all into one. Do you see what I am after ?
Created ‎03-23-2017 10:54 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Ya got it. Just combining that from raw files is not possible. Instead you need to create two tables for CDR data and CRM data and you can write MR job or java client based on data size with following steps.
1) Scan CDR table and get Bnumber
2) Call get on CRM table to get the corresponding details.
3) Prepared puts from the get call on CRM and add them to the ROW get from CDR and write the CDR.
or else you can use Apache Phoenix so that you can utilise UPSERT SELECT features which simplify things.
