Support Questions

Find answers, ask questions, and share your expertise

Hbase data ingestion

Explorer

We have a 250GB CSV file that contains 60 Million records and roughly 600 columns. The file lives within HDFS currently and we are trying to ingest it into HBase and have a phoenix table on top of it.

The approach we tried so far was to create a Hive table backed by HBase and then execute an overwrite command in Hive which ingests the data in HBase.

The biggest problem we have is that the job currently takes about 3-4 days to run!! this is running on a 10 node cluster with medium spec cluster (30GB of RAM each node, and 2TB on each). Any advice on how to speed this up or different methods that can be more efficient?

1 ACCEPTED SOLUTION

@Ramy Mansour

You can directly create table in phoenix and load data using CsvBulkLoadTool.

http://phoenix.apache.org/bulk_dataload.html#Loading_via_MapReduce

Currently with your data there will be 1000's of mappers running. The number of reducers depending on the number regions so increase the parallelization you can presplit the table by providing split points in DDL statement. You can also compress the table to reduce IO or shuffle data during bulkload tool.

http://phoenix.apache.org/language/index.html#create_table

Or else you can directly use ImportTsv and completeBulkload bulkload tools for loading data into HBase table directly.

https://hbase.apache.org/book.html#importtsv

https://hbase.apache.org/book.html#completebulkload

Here some more configurations can be provided to mapred-site.xml to improve the job performance.

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

View solution in original post

8 REPLIES 8

Mentor

@Ramy Mansour

I found this interesting to read it could help !

Link

@Ramy Mansour

You can directly create table in phoenix and load data using CsvBulkLoadTool.

http://phoenix.apache.org/bulk_dataload.html#Loading_via_MapReduce

Currently with your data there will be 1000's of mappers running. The number of reducers depending on the number regions so increase the parallelization you can presplit the table by providing split points in DDL statement. You can also compress the table to reduce IO or shuffle data during bulkload tool.

http://phoenix.apache.org/language/index.html#create_table

Or else you can directly use ImportTsv and completeBulkload bulkload tools for loading data into HBase table directly.

https://hbase.apache.org/book.html#importtsv

https://hbase.apache.org/book.html#completebulkload

Here some more configurations can be provided to mapred-site.xml to improve the job performance.

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

Explorer

@Rajeshbabu Chintaguntla

Thanks for that detailed post, there seems to be two really good approaches there.

Which approach would likely provide better performance? It seems like the CsvBulkLoadTool might better than ImportTsv but wanted to verify.

@Ramy Mansour

if phoenix schema you are going to map to HBase table have any composite primary key, data types other than strings or secondary indexes then you can use CsvBulkLoadTool otherwise you can go ahead with ImportTsv which performs better. And the remaining optimizations helps for both the cases so you can use them.

Explorer

@Ramy Mansour

It seems that your job does not use any parallelism. Among those suggested in this thread, chunking out the CSV input file in multiple parts could also help. It would do what Phoenix would do, but manually. Number of chunks should be determined based on resources that you want to use but for your cluster resources you could probably split the file in at least 25 parts of 10 GB each.

Explorer

@Constantin Stanca

Thanks for the insight. Based on your comment, does Phoenix chunk the data automically if we ingest it through it?

Explorer

Yes. It does. Phoenix was designed to allow that level of parallelism and data locality.

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.