Apache PIG - Create a Schema or the Schema is already created?

Hi experts, this is probably a dummy question (but since I have it, I'll ask 🙂). I want to know how Pig reads the headers from the following dataset, which is stored as a .csv file:

ID,Name,Function
1,Johnny,Student
2,Peter,Engineer
3,Cloud,Teacher
4,Angel,Consultant

I want the first row to be treated as the header of my file. Do I need to write:

A = LOAD 'file' USING PigStorage(',') AS (ID:int, ....etc);

Or do I only need to write:

A = LOAD 'file' USING PigStorage(',');

And with only this, does Apache Pig already know that the first line contains the headers of my table?

Thanks!

1 ACCEPTED SOLUTION

@Pedro Rodgers

Pig won't automatically interpret the header line of your file, so you need to specify the "as (field1:type, field2:type)" definition. If you just load the file, you will get the header line as a row of data, which you don't want. There are a couple of ways you can deal with that, but using the CSVExcelStorage module from PiggyBank allows you to skip the header row.

REGISTER '/tmp/piggybank.jar';

A  = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER') AS (field1: int, field2: chararray);

DUMP A;

Another way to do it is:

input_file = load 'input' USING PigStorage(',') as (row1:chararray, row2:chararray);
ranked = rank input_file;
NoHeader = Filter ranked by (rank_input_file > 1);
New_input_file = foreach NoHeader generate row1, row2;
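
Applied to the three-column file from the question, the CSVExcelStorage approach might look like the sketch below. The field names id, name and job_function and the alias people are my own choices for illustration; the jar path and input file name are simply carried over from the example above, so adjust them to your environment. Note that the AS clause is still what defines the schema: SKIP_INPUT_HEADER only drops the first line, it does not read column names from it.

```pig
-- Sketch only: jar path and file name are taken from the example above.
REGISTER '/tmp/piggybank.jar';

-- Skip the header line and declare the schema explicitly.
-- Field names here are illustrative; Pig does not take them from the file.
people = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (id:int, name:chararray, job_function:chararray);

-- Print the schema, then the data rows (the header row should be gone).
DESCRIBE people;
DUMP people;
```

With the sample data from the question, this should leave only the four data rows, with id typed as an int.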

2 REPLIES

Try using CSVExcelStorage instead of the regular PigStorage. CSVExcelStorage has an option to either read or skip the header row.

Eg: https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html