Created 08-15-2018 11:38 PM
Hello- I have HDP 3.0 with Sqoop v1.4.7.
What is the best way to migrate my data from an external RDBMS into something query-able from Phoenix? I want to make sure I import it in a way that will give very fast queries.
Do I need to Sqoop it into HDFS first, or can I go directly into HBase? It looks like the Sqoop-to-Phoenix integration is not yet complete, so I believe I will need to Sqoop the data into HDFS or HBase and then connect it to Phoenix. Can someone show (or point) me to how to do that?
This post makes me think that I will need to go RDBMS>HDFS>CSV>Phoenix, please tell me that is not true...
Thanks!
Created 08-16-2018 07:00 AM
@Daniel Zafar What is the data size on the RDBMS? Also, is this a one-time migration or will it be a scheduled one?
- If it is a one-time migration and the size is reasonable, you can export the RDBMS data in CSV format and import it directly into Phoenix. (This is the simplest approach.)
- RDBMS>HDFS using Sqoop, then create a Hive table with the Phoenix storage handler (https://phoenix.apache.org/hive_storage_handler.html); see the sketch after this list.
- Or RDBMS>HDFS>CSV>Phoenix (what you mentioned in the description).
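For the storage-handler option, a rough, untested sketch; the table, column, ZooKeeper, and host names are all placeholders you would replace:

# Hive DDL (run e.g. through beeline) creating a Hive table backed by Phoenix
# via the Phoenix storage handler. Requires the phoenix-hive jar on Hive's
# aux classpath. After Sqooping to HDFS, you would define a staging external
# table over those files and INSERT ... SELECT into this one.
beeline -u "jdbc:hive2://hiveserver-host:10000" -e "
CREATE TABLE my_table_phoenix (
  id     BIGINT,
  name   STRING,
  amount DOUBLE
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  'phoenix.table.name'             = 'MY_TABLE',
  'phoenix.zookeeper.quorum'       = 'zk-host',
  'phoenix.zookeeper.client.port'  = '2181',
  'phoenix.zookeeper.znode.parent' = '/hbase-unsecure',
  'phoenix.rowkeys'                = 'id'
);"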
Created 08-16-2018 07:09 AM
Just realised we have one more approach:
- RDBMS>HBase using Sqoop, then create a table in Phoenix (map the HBase table in Phoenix), roughly as sketched below.
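This is untested and every name, host, and column mapping below is a placeholder, but the shape of it would be:

# 1) Sqoop straight from the RDBMS into an HBase table.
sqoop import \
  --connect "jdbc:sqlserver://sqlserver-host:1433;databaseName=mydb" \
  --username myuser -P \
  --table MY_TABLE \
  --hbase-table MY_TABLE \
  --column-family cf \
  --hbase-row-key ID \
  --hbase-create-table

# 2) Map that HBase table into Phoenix as a view (run through sqlline.py).
#    Sqoop writes cell values as strings, so the columns are declared VARCHAR;
#    the qualifiers must match the column names Sqoop used.
cat > /tmp/map_my_table.sql <<'EOF'
CREATE VIEW "MY_TABLE" (
  "ID" VARCHAR PRIMARY KEY,
  "cf"."NAME" VARCHAR,
  "cf"."AMOUNT" VARCHAR
);
EOF
/usr/hdp/current/phoenix-client/bin/sqlline.py zk-host:2181:/hbase-unsecure /tmp/map_my_table.sql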
Created 08-16-2018 02:55 PM
Thanks so much for the advice! My table occupies 1.5 terabytes in MS SQL Server. This will be a one-time migration. I feel that exporting to a CSV would take days of processing time. Your last approach sounds the best, but I've heard that it is not possible based on this post and this post (my table does have a float column).
That being said, what is my best option? I'm thinking that I may try to split up my table into n CSV files and load them sequentially into Phoenix. Would that be the best option for data at this size?
Created 08-16-2018 03:17 PM
In this case, the best option I see is to use Sqoop to load the data from the RDBMS into HDFS (this will be a parallel copy and should be fast), then use the Phoenix bulk-load MapReduce job to load that HDFS data into Phoenix.
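For the Sqoop step, something along these lines should do the parallel copy; the JDBC URL, credentials, split column, and mapper count are placeholders to adjust for your environment:

# Pull the table out of SQL Server into comma-delimited files on HDFS.
sqoop import \
  --connect "jdbc:sqlserver://sqlserver-host:1433;databaseName=mydb" \
  --username myuser -P \
  --table MY_TABLE \
  --split-by ID \
  --num-mappers 8 \
  --fields-terminated-by ',' \
  --target-dir /tmp/my_table_csv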
Created 08-16-2018 04:28 PM
@Daniel Zafar Don't forget to Accept my answer if it helped you. 🙂
Created 08-16-2018 04:42 PM
@Sandeep Nemuri I'm not sure I follow what you are talking about. The page you pointed to shows a bulk load from CSV>Phoenix or HDFS JSON>Phoenix. Can you provide a link or a command for going from Sqoop's HDFS output to Phoenix directly?
Created 08-16-2018 04:52 PM
@Daniel Zafar Since Phoenix sits on top of HBase, the data will be in HBase in any case. It is just that Phoenix uses its own encoding while writing the data into tables. More details in this thread.
Created 08-16-2018 05:03 PM
@Sandeep Nemuri I edited my question above, do you mind taking a look at it? I'm seeing a CsvBulkLoadTool and a JsonBulkLoadTool. How will I bulk load my Sqoop-loaded data?
Created 08-16-2018 06:17 PM
@Daniel Zafar The doc is a bit misleading; the command below should help you load the HDFS CSV files into a Phoenix table. (Note that the input path is an HDFS path.)
HADOOP_CLASSPATH=/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/etc/hbase/conf \
  hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar \
  org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  --table EXAMPLE \
  --input /tmp/data.csv
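One thing the command assumes: the target table (EXAMPLE here) already exists in Phoenix with columns matching the CSV layout. A minimal placeholder DDL, run through sqlline.py, could look like this:

# Create the target Phoenix table before running CsvBulkLoadTool.
# Column names/types are placeholders; match them to your Sqoop output.
cat > /tmp/create_example.sql <<'EOF'
CREATE TABLE EXAMPLE (
  ID     BIGINT NOT NULL PRIMARY KEY,
  NAME   VARCHAR,
  AMOUNT DOUBLE
);
EOF
/usr/hdp/current/phoenix-client/bin/sqlline.py zk-host:2181:/hbase-unsecure /tmp/create_example.sql

CsvBulkLoadTool also accepts --zookeeper to point at the HBase quorum explicitly and --delimiter if your Sqoop output is not comma-separated, and the --input path should be able to point at the directory of part files Sqoop produced rather than a single CSV.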