
Phoenix CsvBulkLoadTool is very slow. (-Dmapreduce.job.reduces)


Hi all

Environment is HDP 2.4.

Sqoop transfers all rows from the RDBMS to HDFS in approximately 1 minute.

CsvBulkLoadTool upserts the same rows into the Phoenix table in 9 minutes.

There is no secondary index.

CSV import path:
hdfs dfs -ls /user/xxsqoop/TABLE1_NC2
Found 5 items
-rw-r--r--   3 bigdata hdfs            0 2017-04-03 19:13 /user/xxsqoop/TABLE1_NC2/_SUCCESS
-rw-r--r--   3 bigdata hdfs  365,036,758 2017-04-03 19:13 /user/xxsqoop/TABLE1_NC2/part-m-00000
-rw-r--r--   3 bigdata hdfs  188,504,177 2017-04-03 19:12 /user/xxsqoop/TABLE1_NC2/part-m-00001
-rw-r--r--   3 bigdata hdfs  340,190,219 2017-04-03 19:13 /user/xxsqoop/TABLE1_NC2/part-m-00002
-rw-r--r--   3 bigdata hdfs  256,850,726 2017-04-03 19:13 /user/xxsqoop/TABLE1_NC2/part-m-00003


Phoenix import command:
HADOOP_CLASSPATH=/etc/hbase/conf:$(hbase mapredcp) hadoop jar \
  /usr/hdp/2.4.0.0-169/phoenix/phoenix-4.4.0.2.4.0.0-169-client.jar \
  org.apache.phoenix.mapreduce.CsvBulkLoadTool -Dmapreduce.job.reduces=4 \
  --table TB.TABLE1 --input /user/xxsqoop/TABLE1_NC2 \
  --zookeeper lat01bigdatahwdn:2181:/hbase-unsecure --delimiter '^A'
...
...
17/04/03 19:24:08 INFO mapreduce.HFileOutputFormat2: Configuring 1 reduce partitions to match current region count



Phoenix table DDL:
CREATE TABLE TB.TABLE1
(
   COL1	    DATE        NOT NULL	
  ,COL2     SMALLINT    NOT NULL     
  ,COL3     INTEGER     NOT NULL	
  ,COL4     SMALLINT    NOT NULL	  
  ,COL5     VARCHAR(8)
  ...
  ...  
  ,CONSTRAINT pk PRIMARY KEY (COL1,COL2,COL3,COL4)
)
DATA_BLOCK_ENCODING='FAST_DIFF', TTL=604800, COMPRESSION='SNAPPY';




CsvBulkLoadTool always uses 1 reducer, which makes the CSV import very slow.

I thought the importer would run with 4 reducers (-Dmapreduce.job.reduces=4).

1 REPLY

Pre-split the table by hand or use SALT_BUCKETS, which automatically adds splits. This will intrinsically increase the number of reducers for your job.

Most bulk-load jobs into HBase use a number of reducers equal to the number of regions in the table, because the bulk-load process requires the output files to be grouped per region.
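
For illustration, a minimal sketch of the same DDL re-created with salting (salting has to be chosen at table creation time; the bucket count of 16 is only an example value, and the elided columns are kept as "..." from the original DDL). Without salting, the same effect can be achieved by appending a SPLIT ON (...) clause with hand-picked split points for the leading DATE key.

CREATE TABLE TB.TABLE1
(
   COL1      DATE        NOT NULL
  ,COL2      SMALLINT    NOT NULL
  ,COL3      INTEGER     NOT NULL
  ,COL4      SMALLINT    NOT NULL
  ,COL5      VARCHAR(8)
  ...
  ,CONSTRAINT pk PRIMARY KEY (COL1,COL2,COL3,COL4)
)
DATA_BLOCK_ENCODING='FAST_DIFF', TTL=604800, COMPRESSION='SNAPPY',
SALT_BUCKETS=16;  -- example value: roughly match it to your region server / reducer capacity

A table created this way starts out pre-split into 16 regions, so HFileOutputFormat2 should configure 16 reduce partitions to match the region count instead of 1, and -Dmapreduce.job.reduces does not need to be set at all.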