Phoenix CsvBulkLoadTool is very slow. (-Dmapreduce.job.reduces)


Hi all

Environment is HDP 2.4.

Sqoop transfers all rows from the RDBMS to HDFS in approximately 1 minute.

CsvBulkLoadTool upserts all rows into the table in 9 minutes.

There is no secondary index.

CSV import path:
hdfs dfs -ls /user/xxsqoop/TABLE1_NC2
Found 5 items
-rw-r--r--   3 bigdata hdfs            0 2017-04-03 19:13 /user/xxsqoop/TABLE1_NC2/_SUCCESS
-rw-r--r--   3 bigdata hdfs  365,036,758 2017-04-03 19:13 /user/xxsqoop/TABLE1_NC2/part-m-00000
-rw-r--r--   3 bigdata hdfs  188,504,177 2017-04-03 19:12 /user/xxsqoop/TABLE1_NC2/part-m-00001
-rw-r--r--   3 bigdata hdfs  340,190,219 2017-04-03 19:13 /user/xxsqoop/TABLE1_NC2/part-m-00002
-rw-r--r--   3 bigdata hdfs  256,850,726 2017-04-03 19:13 /user/xxsqoop/TABLE1_NC2/part-m-00003


Phoenix import command:
HADOOP_CLASSPATH=/etc/hbase/conf:$(hbase mapredcp) hadoop jar \
  /usr/hdp/2.4.0.0-169/phoenix/phoenix-4.4.0.2.4.0.0-169-client.jar \
  org.apache.phoenix.mapreduce.CsvBulkLoadTool -Dmapreduce.job.reduces=4 \
  --table TB.TABLE1 --input /user/xxsqoop/TABLE1_NC2 --zookeeper lat01bigdatahwdn:2181:/hbase-unsecure --delimiter '^A'
...
...
17/04/03 19:24:08 INFO mapreduce.HFileOutputFormat2: Configuring 1 reduce partitions to match current region count



Phoenix table DDL:
CREATE TABLE TB.TABLE1
(
   COL1	    DATE        NOT NULL	
  ,COL2     SMALLINT    NOT NULL     
  ,COL3     INTEGER     NOT NULL	
  ,COL4     SMALLINT    NOT NULL	  
  ,COL5     VARCHAR(8)
  ...
  ...  
  ,CONSTRAINT pk PRIMARY KEY (COL1,COL2,COL3,COL4)
)
DATA_BLOCK_ENCODING='FAST_DIFF', TTL=604800, COMPRESSION='SNAPPY';




CsvBulkLoadTool always uses only 1 reducer, which makes the CSV import very slow.

I thought the importer would run with 4 reduce tasks (-Dmapreduce.job.reduces=4). Why is the setting ignored?

1 REPLY

Pre-split the table by hand, or use salt buckets (SALT_BUCKETS), which will automatically add splits. Either way intrinsically increases the number of reducers for your job; see the sketch below.
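
A minimal sketch of the salt-bucket approach, assuming the table can be dropped and re-created before re-running the bulk load (SALT_BUCKETS=4 is only an illustrative value; size it based on your cluster and data volume):

DROP TABLE IF EXISTS TB.TABLE1;
CREATE TABLE TB.TABLE1
(
   COL1     DATE        NOT NULL
  ,COL2     SMALLINT    NOT NULL
  ,COL3     INTEGER     NOT NULL
  ,COL4     SMALLINT    NOT NULL
  ,COL5     VARCHAR(8)
  ...
  ,CONSTRAINT pk PRIMARY KEY (COL1,COL2,COL3,COL4)
)
DATA_BLOCK_ENCODING='FAST_DIFF', TTL=604800, COMPRESSION='SNAPPY',
SALT_BUCKETS=4;  -- table is created with 4 regions, one per salt bucket

If you prefer to pre-split by hand instead, Phoenix's CREATE TABLE also accepts a SPLIT ON (...) clause with explicit split points, but with a leading DATE column in the primary key, salting is usually the simpler option. Note that salting prepends a salt byte to the row key; Phoenix handles this transparently at query time by fanning scans out across the buckets.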

Most kinds of bulk-load jobs into HBase will use a number of reducers equal to the number of Regions for the table. This is because the bulk load process requires files to be grouped together per-region.
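
For reference, re-running the same CsvBulkLoadTool command from above against the re-created, salted table should then get one reducer per region (4 in this sketch). The -Dmapreduce.job.reduces=4 flag can be dropped, since HFileOutputFormat2 derives the reducer count from the region count and overrides it anyway:

# Reducer count follows the region count, so -Dmapreduce.job.reduces is omitted here.
HADOOP_CLASSPATH=/etc/hbase/conf:$(hbase mapredcp) hadoop jar \
  /usr/hdp/2.4.0.0-169/phoenix/phoenix-4.4.0.2.4.0.0-169-client.jar \
  org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  --table TB.TABLE1 --input /user/xxsqoop/TABLE1_NC2 \
  --zookeeper lat01bigdatahwdn:2181:/hbase-unsecure --delimiter '^A'
# With 4 regions, HFileOutputFormat2 should now report 4 reduce partitions
# instead of 1 when it configures the job.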
