
Reducer is running very slow in hbase bulk load

1 ACCEPTED SOLUTION

Guru

There are a couple of optimizations you can try (below), but they almost certainly will not reduce the job duration from more than 24 hours to a few hours. It is likely that your cluster is too small for the amount of processing you are doing. In that case, your best bet is to break your 200GB data set into smaller chunks and bulk load each one sequentially (see the sketch below), or preferably add more nodes to your cluster. Also, be sure that you are not bulk loading while the scheduled major compaction is running.
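As an illustration only, a chunked sequential bulk load could look roughly like the following, assuming the input has already been split into per-chunk HDFS directories; the paths, table name, and column mapping are placeholders, not taken from your job:

#!/bin/bash
# Rough sketch: bulk load one chunk at a time instead of the whole 200GB at once.
# Assumes chunk directories like /data/chunks/part01, /data/chunks/part02, ...
for chunk in $(hdfs dfs -ls -C /data/chunks); do
  name=$(basename "$chunk")

  # Generate HFiles for this chunk only (placeholder table and column mapping)
  hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
    -Dimporttsv.bulk.output=/tmp/hfiles/"$name" \
    my_table "$chunk"

  # Load the generated HFiles before starting the next chunk
  hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
    /tmp/hfiles/"$name" my_table
done

Running the chunks one after another keeps each MapReduce job small enough that the reducers are not fighting for memory and disk at the same time.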

Optimizations: in addition to looking at your log, go to Ambari and see what is maxing out ... memory? CPU?

This link gives a good overview of optimizing HBase loads.

https://www.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.analy...

It is not focused on bulk loading specifically, but the tuning it describes still applies.

Note: for each property mentioned, set it in your importtsv script as

-D<property>=<value> \

One thing that usually helps MapReduce jobs is compressing the map output so it travels across the wire to the reducers faster:

-Dmapred.compress.map.output=true \

-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
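For instance, a hypothetical importtsv run with those two properties set might look like this (the table name, columns, and paths are placeholders, not from your script):

# Hedged example only: my_table, the column mapping, and the paths are placeholders.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dmapred.compress.map.output=true \
  -Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
  -Dimporttsv.bulk.output=/tmp/hfiles \
  my_table /data/input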

As mentioned though, it is likely that your cluster is not scaled properly for your workload.
