Any option to SKIP header line(s) when using the Phoenix CsvBulkLoadTool?

Running the following code:

hadoop jar /usr/hdp/current/phoenix-client/phoenix-4.4.0.2.3.4.0-3485-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  -z <Zookeeper nodes>:2181:/hbase-unsecure \
  -d $'\t' \
  -g \
  --table <DB>.<TBL> \
  --input /data/product/inbound/<FNAME>.TXT

Is there any way to skip the first line of the input file? Does the CsvBulkLoadTool have a parameter that skips header rows, similar to what Hive provides with 'tblproperties ("skip.header.line.count"="1")'?

Thanks!

1 ACCEPTED SOLUTION

Re: Any option to SKIP header line(s) when using the Phoenix CsvBulkLoadTool?

Hi @bpreachuk, according to this page the bulk load tool has no such feature. For smaller files (up to "tens of megabytes") you can use the single-threaded psql.py tool, which can interpret the first line as a list of column names when given the "-h in-line" option. For the MapReduce bulk load tool this is genuinely hard to implement: every mapper gets a chunk of the file, and only the one mapper that receives the very first line should drop it, so that chunk would have to be marked in a special way. More details about the commands here.
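
Since the bulk load tool itself cannot skip the header, one possible workaround (not mentioned in the replies, so treat it as an assumption) is to strip the header from a local copy of the file before it is placed in the input directory. A minimal sketch follows; strip_header is a hypothetical helper name and the paths in the usage comments are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: remove the first N lines of a delimited text file so that a
# loader without a skip-header option never sees them.
# strip_header is a hypothetical helper, not part of Phoenix or Hadoop.
strip_header() {
  local input="$1"      # local file that still contains the header
  local output="$2"     # where to write the header-free copy
  local skip="${3:-1}"  # number of leading lines to drop (default 1)
  # tail -n +K prints from line K onward, so dropping N lines means K = N + 1.
  tail -n "+$((skip + 1))" "$input" > "$output"
}

# Hypothetical usage (placeholder file names):
#   strip_header ./with_header.txt ./no_header.txt 1
#   hdfs dfs -put ./no_header.txt /data/product/inbound/
```

This only works when the file can be staged locally (or streamed through a pipe) before the load, so it trades an extra copy of the data for not having to touch the MapReduce job.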

Re: Any option to SKIP header line(s) when using the Phoenix CsvBulkLoadTool?

Thanks Predrag, that's what I thought. psql.py is an option for our smaller files...