Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Apache PIG - Create a Schema or the Schema is already created?

Rising Star

Hi experts, this is probably a dumb question, but since I have it, here goes 🙂. I want to know how Pig reads the headers from the following dataset, which is stored as a .csv file:

ID,Name,Function
1,Johnny,Student
2,Peter,Engineer
3,Cloud,Teacher
4,Angel,Consultant

I want the first row to be treated as the header of my file. Do I need to write:

A = LOAD 'file' USING PigStorage(',') AS (ID:int,....etc)

Or do I only need to write:

A = LOAD 'file' USING PigStorage(',')

and with only this, will Apache Pig already know that the first line holds the headers of my table?

Thanks!

1 ACCEPTED SOLUTION

Super Guru

@Pedro Rodgers

Pig won't automatically interpret the header line of your file, so you need to specify the "as (field1:type, field2:type)" definition. If you just load the file, you will get the header line as a row of data, which you don't want. There are a couple of ways you can deal with that, but using the CSVExcelStorage module from PiggyBank allows you to skip the header row.

REGISTER '/tmp/piggybank.jar';

A  = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER') AS (field1: int, field2: chararray);

DUMP A;
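Applied to the dataset in the question, the same approach might look like the sketch below. It assumes the file is reachable as 'file' (the name used in the question) and that piggybank.jar sits at the path shown above. The field names id, name, and occupation are illustrative stand-ins for the ID, Name, and Function columns, not names Pig reads from the file.

```pig
-- Sketch only: jar path and file name are taken from the posts above.
REGISTER '/tmp/piggybank.jar';

-- SKIP_INPUT_HEADER drops the first line (ID,Name,Function) while loading.
-- The AS clause still has to name and type every column by hand.
people = LOAD 'file'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (id:int, name:chararray, occupation:chararray);

-- Show the schema Pig will use, then print the data rows.
DESCRIBE people;
DUMP people;
```

DESCRIBE is a quick way to confirm that the schema comes from the AS clause rather than from the header line.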

Another way to do it is:

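input_file = LOAD 'input' USING PigStorage(',') AS (row1:chararray, row2:chararray);
-- RANK with no sort field prepends a sequential number, exposed as rank_input_file
ranked = RANK input_file;
-- Keep every row except the first one (the header)
NoHeader = FILTER ranked BY (rank_input_file > 1);
New_input_file = FOREACH NoHeader GENERATE row1, row2;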
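A further option, not part of the answer above, is to filter on the header's literal value instead of its position. The sketch below assumes the header's first field is exactly the string ID, as in the question's dataset, and uses the illustrative field names id, name, and occupation.

```pig
-- Sketch only: load id as chararray so the header value 'ID' survives the load.
raw = LOAD 'file' USING PigStorage(',')
      AS (id:chararray, name:chararray, occupation:chararray);

-- Drop any row whose first field is the header text.
no_header = FILTER raw BY id != 'ID';

-- Cast id back to int now that the header row is gone.
people = FOREACH no_header GENERATE (int)id AS id, name, occupation;
```

This variant does not depend on the header being the first row Pig sees, but it would also drop a genuine data row whose first field happened to be ID.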


2 REPLIES

New Member

Try CSVExcelStorage instead of the regular PigStorage. CSVExcelStorage has an option to either read or skip the header row.

E.g.: https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html