Created 04-03-2017 01:07 PM
Hi all,
I have a text file that looks like this (there is no header...):
Year\tName\tSalary
2015 Marc 100
2016 Marc 200
2017 Marc 300
2015 Lucy 100
2016 Lucy 200
2017 Lucy 300
2015 John 100
2016 John 200
2017 John 300
and I want to calculate the average salary for each employee.
By executing the following code:
a = load '/user/horton/salary' as ( year:int, name:chararray, salary:int );
b = group a by name;
d = FOREACH b GENERATE group as name, AVG( a.salary ) as avgsalary;
describe d;
-- d: {name: chararray, avgsalary: double}
dump d;
I obtained the expected result:
(Marc, 200 )
(Lucy, 200 )
(John, 200 )
But, when I tried the following code:
a = load '/user/horton/salary';
b = FOREACH a GENERATE $0 as year:int, $1 as name:chararray, $2 as salary:int;
-- b: {year: int, name: chararray, salary: int}
c = group b by name;
-- c: {group: chararray, b: {(year: int, name: chararray, salary: int)}}
d = FOREACH c GENERATE group as name, AVG( b.salary ) as avgsalary;
describe d;
-- d: {name: chararray, avgsalary: double}
dump d;
I got an error:
Error 0 Exception while executing (Name: c: Local Rearrange[tuple]{chararray}(false) - scope 33 Operator key: scope-33) org.apache.pig.backend.executionengine.ExecException: ERROR while computing average Initial
Why?
Can anyone help me?
In general, what is the approach when I have a file with a lot of fields and cannot explicitly declare all the field names in the LOAD statement?
Thanks.
Mauro
Created 04-06-2017 11:45 AM
This "looked" right at a glance, so I ran your first script, which worked for me just as it did for you, and then stepped through the second script one line at a time. I ran into the following error on the FOREACH / GENERATE line.
2017-04-06 11:29:05,932 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias c. Backend error : java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String
So... I just explicitly cast everything, as you'll see in my updated script.
a = load '/user/maria_dev/hcc/92577/salary';
b = FOREACH a GENERATE (int) $0 as year:int, (chararray) $1 as name:chararray, (int) $2 as salary:int;
describe b;
dump b;
c = group b by name;
describe c;
dump c;
d = FOREACH c GENERATE group as name, AVG( b.salary ) as avgsalary;
describe d;
dump d;
Here are the (expected) results.
b: {year: int,name: chararray,salary: int}
(2015,Marc,100)
(2016,Marc,200)
(2017,Marc,300)
(2015,Lucy,100)
(2016,Lucy,200)
(2017,Lucy,300)
(2015,John,100)
(2016,John,200)
(2017,John,300)
c: {group: chararray,b: {(year: int,name: chararray,salary: int)}}
(John,{(2017,John,300),(2016,John,200),(2015,John,100)})
(Lucy,{(2017,Lucy,300),(2016,Lucy,200),(2015,Lucy,100)})
(Marc,{(2017,Marc,300),(2016,Marc,200),(2015,Marc,100)})
d: {name: chararray,avgsalary: double}
(John,200.0)
(Lucy,200.0)
(Marc,200.0)
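On the more general question about files with many fields, one possible approach is to load without a schema and cast only the positional fields the calculation actually needs. The following is a minimal sketch, assuming the same tab-delimited file as above and that only the name ($1) and salary ($2) columns matter; it is an illustration, not the only way to do it.

```pig
-- Load with no schema: every field arrives as a bytearray
a = LOAD '/user/horton/salary';

-- Cast only the positional fields that are needed; the rest are never named
b = FOREACH a GENERATE (chararray) $1 AS name, (int) $2 AS salary;

-- Group and average exactly as in the scripts above
c = GROUP b BY name;
d = FOREACH c GENERATE group AS name, AVG(b.salary) AS avgsalary;
DUMP d;
```

Fields that are never referenced stay untouched, so a wide file does not need a full schema just to average one column.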
Good luck and happy Hadooping!!
Created 04-06-2017 12:31 PM
Created 04-06-2017 07:38 PM
The explicit cast (i.e. the "(int)" bit) just casts whatever datatype you initially have (bytearray in this case) to something else. The ":int" formally declares that the new field you're generating needs to be that datatype. As you noticed, both will work and, in fact, doing both is almost overkill in the example above, but it is something I do pretty consistently. It would be more appropriate if you were doing some kind of math or function call where you cast a value, combine it with something of another datatype, and want to be 100% sure of the datatype the resulting value ends up in.
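As a small illustration of that last case, here is a hedged sketch in which a cast field is combined with a constant of another type. The alias name "raised" and the 1.1 factor are made up for the example, and the path is the one from the original question.

```pig
-- Load with no schema, so $2 starts out as a bytearray
a = LOAD '/user/horton/salary';

-- (double) converts the value in $2 before the multiplication;
-- :double declares the datatype of the resulting field
b = FOREACH a GENERATE (chararray) $1 AS name:chararray,
                       (double) $2 * 1.1 AS raised:double;

-- Check the schema that Pig derived for b
DESCRIBE b;
```

Running DESCRIBE after each step, as in the answer above, is a cheap way to confirm the datatype before passing the relation to GROUP or AVG.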
Glad to know you are off and running again. If you think it deserves it, I hope you can "accept" my answer above so it'll get annotated with "Best Answer". Again, good luck and happy Hadooping!!
Created 04-07-2017 06:13 AM
Thanks @Lester Martin. You saved me time!!
Created 04-10-2017 01:41 PM
Glad it did. Please check out https://martin.atlassian.net/wiki/x/AunyBQ and if you think I deserve it, kindly click the "Accept" link on my original answer. Thanks!!