Created 09-29-2016 07:25 PM
We have a 250 GB CSV file that contains 60 million records and roughly 600 columns. The file currently lives in HDFS, and we are trying to ingest it into HBase and put a Phoenix table on top of it.
The approach we have tried so far is to create a Hive table backed by HBase and then execute an INSERT OVERWRITE command in Hive, which ingests the data into HBase.
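For reference, the current setup looks roughly like this (table, column family, and column names are placeholders, and only a few of the ~600 columns are shown):

-- sketch of the Hive-over-HBase approach; names below are placeholders
CREATE TABLE hbase_target (rowkey STRING, col1 STRING, col2 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:col1,cf:col2')
TBLPROPERTIES ('hbase.table.name' = 'target_table');

INSERT OVERWRITE TABLE hbase_target
SELECT * FROM csv_staging_table;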
The biggest problem is that the job currently takes about 3-4 days to run! This is on a 10-node cluster with medium-spec nodes (30 GB of RAM and 2 TB of disk on each). Any advice on how to speed this up, or on different methods that would be more efficient?
Created 09-30-2016 06:15 AM
You can create the table directly in Phoenix and load the data using the CsvBulkLoadTool.
http://phoenix.apache.org/bulk_dataload.html#Loading_via_MapReduce
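A minimal invocation would look something like the following (the jar name, table name, input path, and ZooKeeper quorum are placeholders; the Phoenix table must already exist):

# placeholder jar, table, input path, and quorum; adjust to your environment
hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table EXAMPLE_TABLE \
    --input /data/input/big_file.csv \
    --zookeeper zk1,zk2,zk3:2181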
With your data volume there will be thousands of mappers running. The number of reducers depends on the number of regions, so to increase parallelism you can pre-split the table by providing split points in the DDL statement. You can also enable compression on the table to reduce IO and the amount of data shuffled during the bulk load.
http://phoenix.apache.org/language/index.html#create_table
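As a rough sketch (column names, split points, and the compression codec are placeholders; real split points should follow your row-key distribution, and in practice you would use many more of them so each region server holds several regions):

-- placeholder columns and split points; only a few of the ~600 columns shown
CREATE TABLE IF NOT EXISTS EXAMPLE_TABLE (
    ROW_ID VARCHAR NOT NULL PRIMARY KEY,
    COL1 VARCHAR,
    COL2 VARCHAR
)
COMPRESSION='SNAPPY'
SPLIT ON ('20000000', '40000000');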
Alternatively, you can use the ImportTsv and completebulkload tools to load the data directly into the HBase table.
https://hbase.apache.org/book.html#importtsv
https://hbase.apache.org/book.html#completebulkload
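For example (the table name, column mapping, and paths are placeholders; the first step writes HFiles, the second hands them to the region servers):

# 1. Generate HFiles from the CSV (the placeholder mapping covers only a few
#    of the ~600 columns; the real mapping must list every column in order)
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.separator=, \
    -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
    -Dimporttsv.bulk.output=/tmp/hfile_output \
    example_table /data/input/big_file.csv

# 2. Move the generated HFiles into the table
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
    /tmp/hfile_output example_table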
Here are some additional configurations you can add to mapred-site.xml to improve job performance:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
Created 10-03-2016 06:54 PM
Thanks for that detailed post; there seem to be two really good approaches there.
Which approach would likely provide better performance? It seems like the CsvBulkLoadTool might be better than ImportTsv, but I wanted to verify.
Created 10-04-2016 05:51 AM
If the Phoenix schema you are going to map onto the HBase table has a composite primary key, data types other than strings, or secondary indexes, then use the CsvBulkLoadTool; otherwise you can go ahead with ImportTsv, which performs better. The remaining optimizations help in both cases, so you can use them either way.
Created 10-05-2016 01:56 PM
Thanks @Rajeshbabu Chintaguntla
Created 09-30-2016 05:45 PM
It seems that your job does not use any parallelism. In addition to the approaches suggested in this thread, chunking the CSV input file into multiple parts could also help; it does manually what Phoenix would do for you. The number of chunks should be determined by the resources you want to use, but with your cluster you could probably split the file into at least 25 parts of roughly 10 GB each.
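A rough sketch of that manual chunking, assuming the rows can simply be split by line count (paths and the rows-per-chunk figure are illustrative, and the node doing the split needs enough local disk to hold the pieces):

# Stream the CSV out of HDFS, cut it into ~25 pieces of ~2.4M rows each
# (60M rows / 25), then put the pieces back so the load job gets more input
# splits to process in parallel. All paths are placeholders.
hdfs dfs -cat /data/input/big_file.csv | split -l 2400000 - chunk_
hdfs dfs -mkdir -p /data/input/chunks
hdfs dfs -put chunk_* /data/input/chunks/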
Created 10-03-2016 06:51 PM
Thanks for the insight. Based on your comment, does Phoenix chunk the data automatically if we ingest it through Phoenix?
Created 12-20-2016 03:58 PM
Yes. It does. Phoenix was designed to allow that level of parallelism and data locality.