Created 08-08-2016 04:30 PM
Hi experts,
This is probably a dumb question, but I'll ask anyway 🙂. I want to know how Pig reads the headers from the following dataset, which is stored as a .csv file:
ID,Name,Function
1,Johnny,Student
2,Peter,Engineer
3,Cloud,Teacher
4,Angel,Consultant
I want the first row to be treated as the header of my file. Do I need to write:
A = LOAD 'file' USING PigStorage(',') AS (ID:int,....etc)
Or do I only need to write:
A = LOAD 'file' USING PigStorage(',')
and with only this, Apache Pig already knows that the first line contains the headers of my table?
Thanks!
Created 08-08-2016 04:41 PM
Pig won't automatically interpret the header line of your file, so you need to specify the "as (field1:type, field2:type)" definition. If you just load the file, you will get the header line as a row of data, which you don't want. There are a couple of ways you can deal with that, but using the CSVExcelStorage module from PiggyBank allows you to skip the header row.
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER') AS (field1: int, field2: chararray);
DUMP A;
Another way to do it is:
input_file = LOAD 'input' USING PigStorage(',') AS (row1:chararray, row2:chararray);
ranked = RANK input_file;
NoHeader = FILTER ranked BY (rank_input_file > 1);
New_input_file = FOREACH NoHeader GENERATE row1, row2;
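Applied to the dataset in the question, the CSVExcelStorage approach might look like the sketch below. It assumes the file is saved as 'input.csv' and that piggybank.jar sits at the same path as above; the aliases id, name, and job are chosen here only for illustration and are not read from the header line.

```pig
-- Assumption: piggybank.jar is available at this path.
REGISTER '/tmp/piggybank.jar';

-- SKIP_INPUT_HEADER tells the loader to drop the "ID,Name,Function" line.
-- The schema in AS (...) is still declared by hand; Pig does not infer it.
people = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (id:int, name:chararray, job:chararray);

DUMP people;
```

The header is only skipped, not interpreted, so the field names and types always come from the AS clause.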
Created 08-08-2016 04:48 PM
Try using CSVExcelStorage instead of the regular PigStorage. CSVExcelStorage has an option to either read or skip the header row.
Eg: https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html