Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Load a csv file in PIG as tuple

New Member

I have a file like:

id,name,deg,salary,dept
1201,gopal, manager, 50000, TP
1202,manisha, proof reader, 50000, TP

I am trying to load this in PIG using tuple as below:

A = LOAD '/mydir/emp.txt' USING PigStorage(',') AS (t:tuple(a:chararray, b:chararray, c:chararray, d:chararray, e:chararray));

X = FOREACH A GENERATE t.$0, t.$1, t.$2, t.$3, t.$4;

DUMP X;

I am getting a result like :

(,,,,)
(,,,,)

Can somebody help me understand the reason behind this issue?

1 ACCEPTED SOLUTION


Hi @Subhasis Roy

A tuple is a complex data type: an ordered set of fields. When loading text, Pig reads a field as a tuple only if it is written between parentheses, as in this example:

cat data
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)

A = LOAD 'data' AS (t1:tuple(t1a:int, t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
X = FOREACH A GENERATE t1.t1a,t2.$0;

DUMP X;
(3,4)
(1,3)
(2,9)
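
Because this LOAD has no USING clause, Pig falls back to PigStorage with its default delimiter, a tab, so the two tuples on each line of the file have to be separated by a tab character rather than a space. A minimal sketch for writing such a file from the shell, assuming a POSIX `printf` and a local file named `data` as in the example above:

```shell
# Write the three sample rows; \t puts a real tab between the two tuples.
printf '(3,8,9)\t(4,5,6)\n' >  data
printf '(1,4,7)\t(3,7,5)\n' >> data
printf '(2,5,8)\t(9,5,8)\n' >> data
```

Copying the rows out of a web page usually turns the tab into spaces, so generating the file this way avoids a delimiter mismatch.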

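In your case, the data is flat and not between parentheses, so you don't need a tuple in your schema. With your original schema, PigStorage(',') splits each line into five fields and Pig tries to read the first one (for example 1201) as the tuple t. That conversion fails, t is null, and every t.$n comes out empty. Just run this: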

A = LOAD '/tmp/test.csv' USING PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray, e:chararray);

DUMP A;
(1201,gopal, manager, 50000, TP)
(1202,manisha, proof reader, 50000, TP)

If you want only some of the fields, select them by position (here I show only the first four):

X = FOREACH A GENERATE $0, $1, $2, $3;
DUMP X;
(1201,gopal, manager, 50000)
(1202,manisha, proof reader, 50000)

Does this answer your question?

4 REPLIES 4

New Member

Thanks a lot for the help. However, I am now testing with your dataset and code.

dataset --

(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)

code --

A = LOAD '/temp/test.csv' USING PigStorage('\t') As (t1:tuple(t1a:int, t1b:int,t1c:int), t2:tuple(t2a:int,t2b:int,t2c:int));

X = FOREACH A GENERATE t1.t1a, t2.t2a;

DUMP X;

result --

(3,)
(1,)
(2,)

I don't understand why it is not reading the second tuple. Can you help?

New Member

I think the issue was with the formatting of the data file. The problem is resolved now. Thanks a lot for the help.
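
The thread doesn't say what was wrong with the file. One possible cause, offered here only as an assumption, is a space instead of a tab between the two tuples, in which case PigStorage('\t') would see a single field per line. A rough sketch for checking the delimiter from the shell, using `awk` and a hypothetical local copy of the file named `test.csv` (the sample deliberately contains one bad line):

```shell
# Hypothetical file name; substitute a local copy of the file you pass to LOAD.
f=test.csv
# Sample input: line 1 uses a tab, line 2 mistakenly uses a space.
printf '(3,8,9)\t(4,5,6)\n(1,4,7) (3,7,5)\n' > "$f"
# Report every line that does not split into exactly two tab-separated fields.
awk -F'\t' 'NF != 2 { print "line " NR ": " NF " field(s)" }' "$f"
# prints: line 2: 1 field(s)
```

If the command prints nothing for your own file, every line has exactly two tab-separated fields and the cause lies elsewhere.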

Explorer

Hi,

I am facing the same issue. Can you please help me resolve it?

Thanks,

Sam