Load a csv file in PIG as tuple

I have a file like:

id,name,deg,salary,dept

1201,gopal, manager, 50000, TP

1202,manisha, proof reader, 50000, TP

I am trying to load this in Pig as a tuple, as below:

A = LOAD '/mydir/emp.txt' USING PigStorage(',') AS (t:tuple(a:chararray, b:chararray, c:chararray, d:chararray, e:chararray));

X = FOREACH A GENERATE t.$0, t.$1, t.$2, t.$3, t.$4;

DUMP X;

I am getting a result like:

(,,,,)

(,,,,)

Can somebody help me understand the reason behind this issue?

1 ACCEPTED SOLUTION

Hi @Subhasis Roy

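Tuples represent complex data types. In the input file, a tuple is written between parentheses, as in this example (the two tuples on each line are separated by a tab, which is the default delimiter for LOAD):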

cat data
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)

A = LOAD 'data' AS (t1:tuple(t1a:int, t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
X = FOREACH A GENERATE t1.t1a,t2.$0;

DUMP X;
(3,4)
(1,3)
(2,9)

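In your case, your data is simple and not between parentheses, so you don't need a tuple in your schema. With your original schema, PigStorage(',') splits each line on the commas and then tries to read the first field (for example 1201) as a tuple. Because that value is not wrapped in parentheses, the conversion most likely fails and t ends up null, which is why every t.$n comes out empty. Just run this: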

A = LOAD '/tmp/test.csv' USING PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray, e:chararray);

DUMP A;
(1201,gopal, manager, 50000, TP)
(1202,manisha, proof reader, 50000, TP)
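
If you really do want a tuple-valued field, one option is to load the flat fields first and build the tuple afterwards. This is only a minimal sketch, assuming the same file and field names as above; the relation names T and X and the field name t are just for illustration. TOTUPLE is a Pig built-in function.

```pig
-- Load the flat CSV fields first (same schema as above)
A = LOAD '/tmp/test.csv' USING PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray, e:chararray);

-- Wrap the five fields of each row into a single tuple named t
T = FOREACH A GENERATE TOTUPLE(a, b, c, d, e) AS t;

-- Fields inside the tuple can now be addressed by position or by name
X = FOREACH T GENERATE t.$0, t.e;
DUMP X;
```

This keeps the parsing step simple and leaves the tuple construction to Pig, instead of asking the loader to find a tuple in text that has no parentheses.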

If you want only some fields of your data, use this (here I show only the first four fields):

X = FOREACH A GENERATE $0, $1, $2, $3;
DUMP X;
(1201,gopal, manager, 50000)
(1202,manisha, proof reader, 50000)
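
Note that the sample data has a space after most commas, so fields such as deg and salary come back with a leading space. A minimal sketch of one way to clean that up, assuming you want trimmed strings and a numeric salary (the descriptive field names and the relation name B are my own choice, not from your script):

```pig
-- Load everything as chararray so no value is lost to a failed cast
A = LOAD '/tmp/test.csv' USING PigStorage(',') AS (id:chararray, name:chararray, deg:chararray, salary:chararray, dept:chararray);

-- TRIM strips leading and trailing whitespace; the cast turns salary into an int
B = FOREACH A GENERATE id, TRIM(name) AS name, TRIM(deg) AS deg, (int) TRIM(salary) AS salary, TRIM(dept) AS dept;
DUMP B;
```

If the file still contains the header line (id,name,deg,salary,dept), the salary cast should give null for that row, so you may want to filter it out first.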

Does this answer your question?

4 REPLIES

Thanks a lot for the help. However, I am now testing with your dataset and code.

dataset --

(3,8,9) (4,5,6)

(1,4,7) (3,7,5)

(2,5,8) (9,5,8)

code --

A = LOAD '/temp/test.csv' USING PigStorage('\t') As (t1:tuple(t1a:int, t1b:int,t1c:int), t2:tuple(t2a:int,t2b:int,t2c:int));

X = FOREACH A GENERATE t1.t1a, t2.t2a;

DUMP X;

result --

(3,)

(1,)

(2,)

I don't understand why it is not reading the second tuple. Can you help?

I think the issue was with the formatting of the data file. The problem is resolved now. Thanks a lot for the help.
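
For anyone who hits the same symptom, here is a rough sketch of how the file formatting could be checked from Pig itself. It assumes the cause is a space (instead of a tab) between the two tuples, which is a guess and not something confirmed above. TextLoader, STRSPLIT and SIZE are Pig built-ins; the relation names raw and chk and the field name n_fields are just for illustration.

```pig
-- Read every line as one raw string, without any field splitting
raw = LOAD '/temp/test.csv' USING TextLoader() AS (line:chararray);

-- Split each line on tabs and count the resulting pieces
chk = FOREACH raw GENERATE line, SIZE(STRSPLIT(line, '\t')) AS n_fields;
DUMP chk;
```

A line with a real tab between the two tuples should report 2 fields. A line where the tuples are separated by a space should report 1, and in that case PigStorage('\t') sees only one field and has nothing to put into t2.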

Hi,

I am facing the same issue. Can you please help me resolve it?

Thanks,

Sam