Created 02-24-2016 08:27 PM
Running the following code:
hadoop jar /usr/hdp/current/phoenix-client/phoenix-4.4.0.2.3.4.0-3485-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -z <Zookeeper nodes>:2181:/hbase-unsecure -d $'\t' --g --table <DB>.<TBL> --input /data/product/inbound/<FNAME>.TXT
is there any way to skip the first line of the input file - is there a parameter on the CsvBulkLoadTool that would allow a skip row? Specifically like what Hive gives you with 'tblproperties ("skip.header.line.count"="1")'.
Thanks!
Created 02-24-2016 10:52 PM
Hi @bpreachuk, according to this page the bulk load tool doesn't have such a feature, but for smaller files, up to "tens of megabytes" you can use a single threaded psql.py tool which can interpret the first line as a list of columns by using the "-h in-line" option. Thinking about the bulk MR tool it's indeed hard to implement this because every mapper gets a chunk of the file, and we'd like only 1 mapper to remove the very first line, so it will have to be marked in a special way. More details about commands here.
Created 02-24-2016 10:52 PM
Hi @bpreachuk, according to this page the bulk load tool doesn't have such a feature, but for smaller files, up to "tens of megabytes" you can use a single threaded psql.py tool which can interpret the first line as a list of columns by using the "-h in-line" option. Thinking about the bulk MR tool it's indeed hard to implement this because every mapper gets a chunk of the file, and we'd like only 1 mapper to remove the very first line, so it will have to be marked in a special way. More details about commands here.
Created 02-25-2016 02:58 AM
Thanks Predrag, that's what I thought. psql.py is an option for our smaller files...