Any option to SKIP header line(s) when using the Phoenix CsvBulkLoadTool?

Running the following code:

hadoop jar /usr/hdp/current/phoenix-client/phoenix-4.4.0.2.3.4.0-3485-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    -z <Zookeeper nodes>:2181:/hbase-unsecure \
    -d $'\t' \
    -g \
    --table <DB>.<TBL> \
    --input /data/product/inbound/<FNAME>.TXT

Is there any way to skip the first line of the input file? Specifically, is there a parameter on the CsvBulkLoadTool that would allow skipping a header row, like what Hive gives you with 'tblproperties ("skip.header.line.count"="1")'?
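For comparison, a minimal sketch of the Hive behavior being referred to; the table name, column names, and the hive -e wrapper are made up purely for illustration:

# Hypothetical Hive external table over the same tab-delimited directory;
# the skip.header.line.count property tells Hive to ignore the first line of each file.
hive -e 'CREATE EXTERNAL TABLE product_inbound (col1 STRING, col2 STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
         LOCATION "/data/product/inbound/"
         TBLPROPERTIES ("skip.header.line.count"="1");'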

Thanks!

1 ACCEPTED SOLUTION

Master Guru

Hi @bpreachuk, the bulk load tool doesn't have such a feature (see the Phoenix bulk-loading documentation). For smaller files, up to "tens of megabytes", you can use the single-threaded psql.py tool, which can interpret the first line as a list of column names via its "-h in-line" option. For the bulk MapReduce tool this is genuinely hard to implement: every mapper gets a chunk of the input file, and only the mapper holding the very first line should drop it, so that chunk would have to be marked in a special way. More details about both commands are on that page.
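For reference, a rough sketch of what the psql.py alternative might look like, reusing the placeholders from the question. The client path, the local staging location, and the exact delimiter handling are assumptions to verify against your Phoenix install; note that psql.py reads from the local filesystem (not HDFS) and selects its CSV loader based on the file extension.

# Single-threaded load of a small tab-delimited file.
# "-h in-line" makes psql.py take the column list from the first line of the
# file, so the header row is consumed instead of being loaded as data.
/usr/hdp/current/phoenix-client/bin/psql.py \
    -t <DB>.<TBL> \
    -h in-line \
    -d $'\t' \
    <Zookeeper nodes>:2181:/hbase-unsecure \
    /local/staging/<FNAME>.csv

For the file sizes mentioned above (tens of megabytes) this avoids MapReduce entirely.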

Thanks Predrag, that's what I thought. psql.py is an option for our smaller files...
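For larger files that still need the MapReduce bulk loader, one common workaround (not covered in this thread) is to strip the header before loading; a rough sketch using the HDFS path from the question, with an arbitrarily chosen output name:

# Drop the first (header) line and write a headerless copy back to HDFS,
# then point CsvBulkLoadTool's --input at the new file.
hdfs dfs -cat /data/product/inbound/<FNAME>.TXT \
    | tail -n +2 \
    | hdfs dfs -put -f - /data/product/inbound/<FNAME>_noheader.TXT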