Created 05-06-2016 10:46 PM
I have a file like:
id,name,deg,salary,dept
1201,gopal, manager, 50000, TP
1202,manisha, proof reader, 50000, TP
I am trying to load this in Pig using a tuple, as below:
A = LOAD '/mydir/emp.txt' USING PigStorage(',') AS (t:tuple(a:chararray, b:chararray, c:chararray, d:chararray, e:chararray));
X = FOREACH A GENERATE t.$0, t.$1, t.$2, t.$3, t.$4;
DUMP X;
I am getting a result like:
(,,,,)
(,,,,)
Can somebody help me understand the reason behind this issue?
Created 05-07-2016 12:47 AM
Tuples are used to represent complex data types. In the data file, a tuple is enclosed in parentheses, as in this example:
cat data
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
A = LOAD 'data' AS (t1:tuple(t1a:int, t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
X = FOREACH A GENERATE t1.t1a,t2.$0;
DUMP X;
(3,4)
(1,3)
(2,9)
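In your case, the data is simple and not enclosed in parentheses, so you don't need a tuple in your schema. PigStorage(',') splits each line into plain fields, and the first one (for example 1201) most likely cannot be converted to a tuple, so t is null and every t.$n comes back empty. Just run this: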
A = LOAD '/tmp/test.csv' USING PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray, e:chararray);
DUMP A;
(1201,gopal, manager, 50000, TP)
(1202,manisha, proof reader, 50000, TP)
If you want to access only some fields of your data, use this (here I show only the first four fields):
X = FOREACH A GENERATE $0, $1, $2, $3;
DUMP X;
(1201,gopal, manager, 50000)
(1202,manisha, proof reader, 50000)
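Note that the values in this output still carry the space that follows each comma in the file (' manager', ' 50000'), and if the file contains the header line shown in the question, that line is loaded as a data row too. A minimal sketch of one way to handle both, assuming the path from the question, column names taken from the header, and an id that is always numeric (the aliases B and C are only illustrative):

```pig
-- Declare id as int: the header value 'id' cannot be cast and becomes null
A = LOAD '/mydir/emp.txt' USING PigStorage(',')
    AS (id:int, name:chararray, deg:chararray, salary:chararray, dept:chararray);

-- Drop rows whose id is null (the header line, under the assumption above)
B = FILTER A BY id IS NOT NULL;

-- TRIM strips leading and trailing whitespace before salary is cast to int
C = FOREACH B GENERATE id, name, TRIM(deg) AS deg,
    (int)TRIM(salary) AS salary, TRIM(dept) AS dept;
DUMP C;
```

Filtering on a null id also drops any genuine data row with a non-numeric id, so this only fits files where that cannot happen.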
Does this answer your question?
Created 05-07-2016 07:10 AM
Thanks a lot for the help. However, I am now testing with your dataset and code.
dataset --
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
code --
A = LOAD '/temp/test.csv' USING PigStorage('\t') As (t1:tuple(t1a:int, t1b:int,t1c:int), t2:tuple(t2a:int,t2b:int,t2c:int));
X = FOREACH A GENERATE t1.t1a, t2.t2a;
DUMP X;
result --
(3,)
(1,)
(2,)
I don't understand why it is not reading the second tuple. Can you help?
Created 05-07-2016 08:21 AM
I think the issue was with the formatting of the data file. The problem is resolved now. Thanks a lot for the help.
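For anyone hitting the same symptom: PigStorage('\t') splits only on a tab character, so if the two tuples in the file are separated by a space, the whole line probably ends up in the first field and t2is null. One quick way to check, sketched here with the same path as above (R, f1 and f2 are illustrative names), is to load both columns as plain text and look at where the split happens:

```pig
-- Load both columns as chararray so no tuple parsing is involved
R = LOAD '/temp/test.csv' USING PigStorage('\t') AS (f1:chararray, f2:chararray);

-- Each output row shows what ended up in the first and second field
DUMP R;
```

If f2 comes back empty for every row, the file most likely has no tab between the two tuples, and replacing the separator with a real tab should let the original tuple schema load both.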
Created 03-26-2017 06:07 PM
Hi,
I am facing the same issue. Can you please help me resolve it?
Thanks,
Sam