<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Hbase data ingestion in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106065#M42350</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/12351/mansour-ramy.html" nodeid="12351"&gt;@Ramy Mansour&lt;/A&gt; &lt;/P&gt;&lt;P&gt;If the Phoenix schema you are going to map to the HBase table has a composite primary key, data types other than strings, or secondary indexes, then use CsvBulkLoadTool; otherwise you can go ahead with ImportTsv, which performs better. The remaining optimizations help in both cases, so you can apply them either way.&lt;/P&gt;</description>
    <pubDate>Tue, 04 Oct 2016 12:51:16 GMT</pubDate>
    <dc:creator>rchintaguntla</dc:creator>
    <dc:date>2016-10-04T12:51:16Z</dc:date>
    <item>
      <title>Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106059#M42344</link>
      <description>&lt;P&gt;We have a 250 GB CSV file that contains 60 million records and roughly 600 columns. The file currently lives in HDFS, and we are trying to ingest it into HBase with a Phoenix table on top of it.&lt;/P&gt;&lt;P&gt;The approach we have tried so far was to create a Hive table backed by HBase and then execute an overwrite command in Hive, which ingests the data into HBase.&lt;/P&gt;&lt;P&gt;The biggest problem is that the job currently takes about 3-4 days to run! It is running on a 10-node cluster of medium-spec machines (30 GB of RAM and 2 TB of disk per node). Any advice on how to speed this up, or on different methods that could be more efficient?&lt;/P&gt;</description>
      <pubDate>Fri, 30 Sep 2016 02:25:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106059#M42344</guid>
      <dc:creator>ramym</dc:creator>
      <dc:date>2016-09-30T02:25:18Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106060#M42345</link>
      <description>&lt;P&gt;@&lt;A href="https://community.hortonworks.com/users/12351/mansour-ramy.html"&gt;Ramy Mansour&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I found this an interesting read; it could help!&lt;/P&gt;&lt;P&gt;&lt;A href="http://cdn.oreillystatic.com/en/assets/1/event/119/Bulk%20Loading%20Your%20Big%20Data%20into%20Apache%20HBase,%20a%20Full%20Walkthrough%20Presentation.pdf"&gt;Link&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 30 Sep 2016 02:55:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106060#M42345</guid>
      <dc:creator>Shelton</dc:creator>
      <dc:date>2016-09-30T02:55:19Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106061#M42346</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/12351/mansour-ramy.html" nodeid="12351"&gt;@Ramy Mansour&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You can create the table directly in Phoenix and load the data using CsvBulkLoadTool.&lt;/P&gt;&lt;P&gt;&lt;A href="http://phoenix.apache.org/bulk_dataload.html#Loading_via_MapReduce" target="_blank"&gt;http://phoenix.apache.org/bulk_dataload.html#Loading_via_MapReduce&lt;/A&gt;&lt;/P&gt;&lt;P&gt;With your data there will be thousands of mappers running. The number of reducers depends on the number of regions, so to increase parallelization you can pre-split the table by providing split points in the DDL statement. You can also enable compression on the table to reduce I/O and the amount of data shuffled during the bulk load.&lt;/P&gt;&lt;P&gt;&lt;A href="http://phoenix.apache.org/language/index.html#create_table" target="_blank"&gt;http://phoenix.apache.org/language/index.html#create_table&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Alternatively, you can use the ImportTsv and completebulkload tools to load the data into the HBase table directly.&lt;/P&gt;&lt;P&gt;&lt;A href="https://hbase.apache.org/book.html#importtsv" target="_blank"&gt;https://hbase.apache.org/book.html#importtsv&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://hbase.apache.org/book.html#completebulkload" target="_blank"&gt;https://hbase.apache.org/book.html#completebulkload&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Here are some additional properties that can be added to mapred-site.xml to improve job performance:&lt;/P&gt;&lt;PRE&gt;&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapreduce.map.output.compress&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapreduce.map.output.compress.codec&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;org.apache.hadoop.io.compress.SnappyCodec&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 30 Sep 2016 13:15:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106061#M42346</guid>
      <dc:creator>rchintaguntla</dc:creator>
      <dc:date>2016-09-30T13:15:55Z</dc:date>
    </item>
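    <!-- Editor's sketch of the two bulk-load paths described in the reply above. Table names,
         column names, paths, and split points are hypothetical; adjust them for your schema
         and cluster. These commands require a running Hadoop/HBase/Phoenix cluster.

    # Path 1: Phoenix CsvBulkLoadTool, needed for composite primary keys, typed
    # columns, or secondary indexes. Pre-split the table in the DDL, e.g.:
    #   CREATE TABLE EXAMPLE (id VARCHAR PRIMARY KEY, cf.c1 VARCHAR)
    #     SPLIT ON ('g', 'n', 'u');
    hadoop jar phoenix-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        --table EXAMPLE \
        --input /data/example.csv

    # Path 2: plain HBase ImportTsv writing HFiles, then completebulkload to
    # move them into the regions. Faster when all columns are simple strings.
    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
        -Dimporttsv.separator=',' \
        -Dimporttsv.columns=HBASE_ROW_KEY,cf:c1,cf:c2 \
        -Dimporttsv.bulk.output=/tmp/example_hfiles \
        exampletable /data/example.csv
    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
        /tmp/example_hfiles exampletable
    -->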
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106062#M42347</link>
      <description>&lt;P&gt;&lt;A href="https://community.hortonworks.com/users/12351/mansour-ramy.html"&gt;@Ramy Mansour&lt;/A&gt;&lt;/P&gt;&lt;P&gt;It seems that your job does not use any parallelism. In addition to the options suggested in this thread, splitting the CSV input file into multiple parts could also help; it would do manually what Phoenix would otherwise do for you. The number of chunks should be chosen based on the resources you want to use, but with your cluster you could probably split the file into at least 25 parts of 10 GB each.&lt;/P&gt;</description>
      <pubDate>Sat, 01 Oct 2016 00:45:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106062#M42347</guid>
      <dc:creator>cstanca</dc:creator>
      <dc:date>2016-10-01T00:45:22Z</dc:date>
    </item>
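    <!-- Editor's sketch of the manual chunking suggested above, using the standard GNU
         `split` utility. File names and chunk counts are illustrative; for the 250 GB
         file you would split by size instead, e.g. `split -C 10G data.csv part_`.

    ```shell
    # Split a CSV into line-aligned chunks so no record is cut in half.
    printf 'a,1\nb,2\nc,3\nd,4\n' > data.csv
    split -n l/2 data.csv part_    # 2 chunks, split only on line boundaries
    wc -l part_*                   # each chunk holds whole lines
    ```

    Concatenating the parts reproduces the original file, so each chunk can be
    bulk-loaded independently and in parallel.
    -->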
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106063#M42348</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/3486/cstanca.html" nodeid="3486"&gt;@Constantin Stanca&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Thanks for the insight. Based on your comment, does Phoenix chunk the data automatically if we ingest it through Phoenix?&lt;/P&gt;</description>
      <pubDate>Tue, 04 Oct 2016 01:51:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106063#M42348</guid>
      <dc:creator>ramym</dc:creator>
      <dc:date>2016-10-04T01:51:56Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106064#M42349</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/425/rchintaguntla.html" nodeid="425"&gt;@Rajeshbabu Chintaguntla&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Thanks for that detailed post; there seem to be two really good approaches there.&lt;/P&gt;&lt;P&gt;Which approach would likely provide better performance? It seems like CsvBulkLoadTool might be better than ImportTsv, but I wanted to verify.&lt;/P&gt;</description>
      <pubDate>Tue, 04 Oct 2016 01:54:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106064#M42349</guid>
      <dc:creator>ramym</dc:creator>
      <dc:date>2016-10-04T01:54:35Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106065#M42350</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/12351/mansour-ramy.html" nodeid="12351"&gt;@Ramy Mansour&lt;/A&gt; &lt;/P&gt;&lt;P&gt;If the Phoenix schema you are going to map to the HBase table has a composite primary key, data types other than strings, or secondary indexes, then use CsvBulkLoadTool; otherwise you can go ahead with ImportTsv, which performs better. The remaining optimizations help in both cases, so you can apply them either way.&lt;/P&gt;</description>
      <pubDate>Tue, 04 Oct 2016 12:51:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106065#M42350</guid>
      <dc:creator>rchintaguntla</dc:creator>
      <dc:date>2016-10-04T12:51:16Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106066#M42351</link>
      <description>&lt;P&gt;Thanks &lt;A href="https://community.hortonworks.com/questions/59159/hbase-data-ingestion.html#"&gt;@Rajeshbabu Chintaguntla&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 05 Oct 2016 20:56:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106066#M42351</guid>
      <dc:creator>ramym</dc:creator>
      <dc:date>2016-10-05T20:56:42Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106067#M42352</link>
      <description>&lt;P&gt;Yes. It does. Phoenix was designed to allow that level of parallelism and data locality.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Dec 2016 23:58:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106067#M42352</guid>
      <dc:creator>coneal77</dc:creator>
      <dc:date>2016-12-20T23:58:42Z</dc:date>
    </item>
  </channel>
</rss>

