Support Questions

pooja_kurwai · ‎11-14-2016

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name,$2;
DUMP X;
(John,4.0F)
(Mary,3.8F)
(Bill,3.9F)

Hi Gurus,

I am a newbie, trying to get some hands-on practice using https://pig.apache.org/docs/r0.15.0/basic.html#register

When I try to replicate the above example, it gives a different output for me. See below:

grunt> P = LOAD '/PoojaSahu/student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
grunt> X = FOREACH P GENERATE name,$2;
grunt> DUMP X;
(John,18,4.0F,)
(Mary,19,3.8F,)
(Bill,20,3.9F,)
(Joe,18,3.8F,)

The file has data like below:

hadoop fs -cat /PoojaSahu/student

16/11/14 14:38:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
John,18,4.0F
Mary,19,3.8F
Bill,20,3.9F
Joe,18,3.8F

However, if I use the script below, it gives the same output as shown in the example:

grunt> P= LOAD '/PoojaSahu/student' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);

Could you please let me know what the purpose of PigStorage is, and what I am doing wrong when I try to replicate the example as given in the link?

Many Thanks,

Pooja Sahu

gkeys · ‎11-14-2016

PigStorage

PigStorage needs to know the delimiter of your fields. The default delimiter is tab, which is what you get when you use PigStorage(). You can specify a different delimiter, as you did when you used PigStorage(',').

If your file is comma-delimited and you use PigStorage(), it will not treat the commas as separators; because it cannot find a tab, it sees only one field, and the commas are just characters in that string. The whole line (for example John,18,4.0F) therefore ends up in name, while age and gpa are null, which is why your DUMP shows (John,18,4.0F,) with nothing after the last comma.

When you correctly specify PigStorage(','), it breaks each line into fields separated by commas.
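
To make the difference concrete, here is a rough plain-Python imitation of the splitting step. This is only a sketch and not how Pig is implemented: the function name split_like_pigstorage and its num_fields parameter are made up for illustration, and it skips the type casting that the AS clause would apply.

```python
def split_like_pigstorage(line, delimiter="\t", num_fields=3):
    """Split one text line on the delimiter, mimicking a 3-field schema.

    Hypothetical helper for illustration only; fields that are missing
    from the line are filled with None (Pig would show them as null).
    """
    fields = line.split(delimiter)
    # Pad with None so every declared field has a slot.
    fields += [None] * (num_fields - len(fields))
    # Keep only as many fields as the schema declares.
    return fields[:num_fields]


line = "John,18,4.0F"

# Default delimiter is tab: no tab in the line, so everything lands in field 0.
tab_row = split_like_pigstorage(line)
print((tab_row[0], tab_row[2]))      # ('John,18,4.0F', None)

# Comma delimiter: the line breaks into three separate fields.
comma_row = split_like_pigstorage(line, ",")
print((comma_row[0], comma_row[2]))  # ('John', '4.0F')
```

The first print corresponds to GENERATE name,$2 after loading with PigStorage(), the second to the same statement after loading with PigStorage(',').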

https://pig.apache.org/docs/r0.9.1/func.html#pigstorage

Register

The link you mention (https://pig.apache.org/docs/r0.15.0/basic.html#register) describes REGISTER, which is used to register UDFs. There are two types of functions in Pig: native (built-in) functions and user-defined functions (UDFs). Native functions ship with the Pig binaries, and you do not need to do anything to call them. UDFs are functions you build yourself into a jar file and then register so the script can find them. Since PigStorage is a native function, you do not need to register it; Pig will find it. (Thus the link is not relevant to your script.)

If this is what you were looking for, please let me know by accepting the answer; else, let me know of remaining gaps.

