Best practice to import data into HBase/Phoenix?


Hello, I have HDP 3.0 with Sqoop v1.4.7.

What is the best way to migrate my data from an external RDBMS into something queryable from Phoenix? I want to make sure I import it in a way that gives very fast queries.

Do I need to Sqoop it into HDFS first, or can I go directly into HBase? It looks like the Sqoop-to-Phoenix connector is not yet complete, so I believe I will need to sqoop the data into HDFS or HBase and then connect it to Phoenix. Can someone show me (or point me to) how to do that?

This post makes me think that I will need to go RDBMS > HDFS > CSV > Phoenix; please tell me that is not true...

Thanks!

1 ACCEPTED SOLUTION


In this case, the best option I see is to use Sqoop to load the data from the RDBMS into HDFS (this will be a parallel copy and should be fast), then use the Phoenix bulk-load MapReduce job to load that HDFS data into Phoenix.
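To illustrate, a rough sketch of the Sqoop step could look like the command below. The JDBC URL, credentials, table name, target directory and mapper count are placeholders you would replace with your own; Sqoop writes comma-delimited text files into the target directory, which the Phoenix bulk loader can then read.

sqoop import --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" --username myuser --password-file /user/myuser/.db.password --table MY_TABLE --target-dir /tmp/my_table_csv --fields-terminated-by ',' --num-mappers 8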


9 REPLIES


@Daniel Zafar What is the data size in the RDBMS? Also, is this a one-time migration or will it be a scheduled one?

- If it is a one-time migration and the size is reasonable, you can export the RDBMS data in CSV format and import it directly into Phoenix (this is the simplest approach).

- RDBMS > HDFS using Sqoop, then create a Hive table with the Phoenix storage handler (https://phoenix.apache.org/hive_storage_handler.html); see the sketch after this list.

- Or RDBMS > HDFS > CSV > Phoenix (what you mentioned in the description).
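For the storage-handler option, a rough sketch of the Hive DDL is shown below. The table name, columns, row key and ZooKeeper settings are placeholders that will differ in your cluster; the linked doc lists the full set of table properties.

-- Hive table backed by Phoenix; rows written to it through Hive land in the Phoenix table
create table phoenix_loaded_table (
  id int,
  name string,
  price float
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "PHOENIX_LOADED_TABLE",
  "phoenix.zookeeper.quorum" = "zk-host",
  "phoenix.zookeeper.znode.parent" = "/hbase-unsecure",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.rowkeys" = "id",
  "phoenix.column.mapping" = "id:ID, name:NAME, price:PRICE"
);

A plain Hive staging table defined over the Sqoop output directory can then be copied into it with an INSERT ... SELECT.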


Just realised we have one more approach:

- RDBMS > HBase using Sqoop, then create a table or view in Phoenix that maps the existing HBase table (a rough sketch follows below).
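As a rough sketch of that route (the connection string, table, column family and column names are placeholders): Sqoop can write straight into an HBase table, and Phoenix can then map it with a view created from sqlline.

sqoop import --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" --username myuser -P --table MY_TABLE --hbase-table MY_TABLE --column-family cf --hbase-row-key ID --hbase-create-table

CREATE VIEW "MY_TABLE" ("ID" VARCHAR PRIMARY KEY, "cf"."NAME" VARCHAR, "cf"."PRICE" VARCHAR);

Note that Sqoop stores HBase cell values as plain strings rather than in Phoenix's own encoding, which is likely why the posts about float columns say this mapping breaks for numeric types; columns mapped this way generally have to be read as VARCHAR.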


@Sandeep Nemuri

Thanks so much for the advice! My table occupies 1.5 terabytes in MS SQL Server. This will be a one-time migration. I feel that exporting to a CSV would take days of processing time. Your last approach sounds the best, but I've heard that it is not possible based on this post and this post (my table does have a float column).

That being said, what is my best option? I'm thinking that I may try to split my table into n CSV files and load them sequentially into Phoenix. Would that be the best option for data of this size?


In this case, the best option I see is to use Sqoop to load the data from the RDBMS into HDFS (this will be a parallel copy and should be fast), then use the Phoenix bulk-load MapReduce job to load that HDFS data into Phoenix.


@Daniel Zafar Don't forget to Accept my answer if it helped you. 🙂


@Sandeep Nemuri I'm not sure I follow what you are talking about. The page you pointed to shows a bulk load from CSV > Phoenix or HDFS JSON > Phoenix. Can you provide a link or a command showing how one would go from Sqoop's HDFS output to Phoenix directly?


@Daniel Zafar Since Phoenix sits on top of HBase, the data will be in HBase in any case. It is just that Phoenix uses its own encoding when writing data into tables. More details in this thread.


@Sandeep Nemuri I edited my question above; do you mind taking a look at it? I'm seeing a CsvBulkLoadTool and a JsonBulkLoadTool. How do I bulk load my Sqoop-loaded data?


@Daniel Zafar The doc is a bit misleading. The command below should help you load the HDFS CSV files into a Phoenix table. (Note that the input path is an HDFS path.)

HADOOP_CLASSPATH=/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/etc/hbase/conf hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /tmp/data.csv
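A couple of notes on that command, as far as I recall the CsvBulkLoadTool's options (double-check against the tool's own help output): the --input path is a MapReduce input path, so it can point at the directory of part files Sqoop produced rather than a single file; there is a --delimiter option if your Sqoop output uses a field delimiter other than a comma, and a --zookeeper option if the quorum is not picked up from the client configuration.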