Any option to SKIP header line(s) when using the Phoenix CsvBulkLoadTool?
- Labels: Apache Phoenix
Created 02-24-2016 08:27 PM
I'm running the following command:
hadoop jar /usr/hdp/current/phoenix-client/phoenix-4.4.0.2.3.4.0-3485-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -z <Zookeeper nodes>:2181:/hbase-unsecure -d $'\t' --g --table <DB>.<TBL> --input /data/product/inbound/<FNAME>.TXT
Is there any way to skip the first line of the input file? Does the CsvBulkLoadTool have a parameter for skipping a header row, like what Hive gives you with 'tblproperties ("skip.header.line.count"="1")'?
Thanks!
Created 02-24-2016 10:52 PM
Hi @bpreachuk, according to this page the bulk load tool doesn't have such a feature. For smaller files, up to "tens of megabytes", you can use the single-threaded psql.py tool, which can interpret the first line as a list of column names if you pass the "-h in-line" option. For the bulk MR tool this is indeed hard to implement: every mapper gets a chunk of the file, and only the one mapper that receives the start of the file should remove the very first line, so that chunk would have to be marked in a special way. More details about commands here.
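Since the bulk load tool cannot skip the header itself, one possible workaround (not something from the Phoenix docs, just a sketch) is to strip the first line before the file is handed to CsvBulkLoadTool. The function names and the staging path below are made up for illustration. Note that the HDFS variant streams the whole file through the client machine, so it is single-threaded as well.

```shell
#!/usr/bin/env bash
# Workaround sketch: remove the header line before the bulk load runs.
# Function names and paths are hypothetical, not part of Phoenix or Hadoop.

# strip_header SRC DST: copy local file SRC to DST without its first line.
strip_header() {
  tail -n +2 "$1" > "$2"
}

# strip_header_hdfs SRC DST: same idea for a file that already sits in HDFS.
# "hdfs dfs -put -" reads from stdin, so the data flows through this client.
strip_header_hdfs() {
  hdfs dfs -cat "$1" | tail -n +2 | hdfs dfs -put - "$2"
}

# Example call (placeholder paths, adjust to your layout):
#   strip_header_hdfs /data/product/inbound/<FNAME>.TXT /data/product/staging/<FNAME>.TXT
```

After that, --input would point at the stripped copy instead of the original file.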
Created 02-25-2016 02:58 AM
Thanks Predrag, that's what I thought. psql.py is an option for our smaller files...
