Created 03-14-2018 01:19 AM
I have a dataset as below:
abhi,34,brown,5
john,35,green,6
amy,30,brown,6
Steve,38,blue,6
Brett,35,brown,6
Andy,34,brown,6
Layout of above data set is, Name, age, eye color, height
I want to achieve a result which shows in each age group how many people r there in total, the average height of all people in each age group and how many people are with brown eyes, black eyes and blue eyes in each age group. The result should look like below
34, 2,5.5,2,0,0
35,2,6.0,1,10
and so on..
format of above result set is,
<age>, total no of people in that age, avg height in that age group, no of brown eyes in that age group, no of green eyes in the age group, no of blue eyes in the age group.
My script is as below:
grunt> my_data = LOAD 'customers.txt' using PigStorage() >> as (name:chararray, age:int, eye_color:chararray, height:int);
grunt> my_data = FOREACH my_data>> GENERATE name, age, height,>> (eye_color == 'brown' ? 1 : 0) AS brown_eyes,>> (eye_color == 'blue' ? 1 : 0) AS blue_eyes,>> (eye_color == 'green' ? 1 : 0 ) AS green_eyes;
grunt> by_age = group my_data by age;
grunt> final_data = FOREACH by_age GENERATE >> group as age, >> COUNT(my_data) as num_people,>> AVG(my_data.height) as avg_height,>> SUM(brown_eyes) as num_brown_eyes,>> SUM(blue_eyes) as num_blue_eyes,>> SUM(green_eyes) as num_green_eyes;
I am getting below error after the last line of the script is executed:
2018-03-14 00:44:54,181 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: <line 22, column 8> Invalid field projection. Projected field [brown_eyes] does not exist in schema: group:int,my_data:bag{:tuple(name:chararray,age:int,height:int,brown_eyes:int,blue_eyes:int,green_eyes:int)}.
The schema of the by_age relation clearly shows it contains the field brown_eyes but why I am still getting this error and how can I resolve it please?
Created 03-14-2018 06:09 AM
Can you please modify the script as below and try running
I have changed PigStorage() to PigStorage(',') and used my_data.brown_eyes instead of brown_eyes
my_data = LOAD 'customers.txt' using PigStorage(',') as (name:chararray, age:int, eye_color:chararray, height:int); my_data = FOREACH my_data GENERATE name, age, height, (eye_color == 'brown' ? 1 : 0) AS brown_eyes, (eye_color == 'blue' ? 1 : 0) AS blue_eyes, (eye_color == 'green' ? 1 : 0 ) AS green_eyes; by_age = group my_data by age; final_data = FOREACH by_age GENERATE group as age, COUNT(my_data) as num_people, AVG(my_data.height) as avg_height, SUM(my_data.brown_eyes) as num_brown_eyes, SUM(my_data.blue_eyes) as num_blue_eyes, SUM(my_data.green_eyes) as num_green_eyes;
If this worked for you, please click on the Accept button to accept the answer. This will be helpful for other community users.
.
-Aditya
Created 03-14-2018 06:09 AM
Can you please modify the script as below and try running
I have changed PigStorage() to PigStorage(',') and used my_data.brown_eyes instead of brown_eyes
my_data = LOAD 'customers.txt' using PigStorage(',') as (name:chararray, age:int, eye_color:chararray, height:int); my_data = FOREACH my_data GENERATE name, age, height, (eye_color == 'brown' ? 1 : 0) AS brown_eyes, (eye_color == 'blue' ? 1 : 0) AS blue_eyes, (eye_color == 'green' ? 1 : 0 ) AS green_eyes; by_age = group my_data by age; final_data = FOREACH by_age GENERATE group as age, COUNT(my_data) as num_people, AVG(my_data.height) as avg_height, SUM(my_data.brown_eyes) as num_brown_eyes, SUM(my_data.blue_eyes) as num_blue_eyes, SUM(my_data.green_eyes) as num_green_eyes;
If this worked for you, please click on the Accept button to accept the answer. This will be helpful for other community users.
.
-Aditya
Created 03-14-2018 08:48 PM
Thanks a lot Aditya and Rahul! I executed the corrected code and got desired output!
(30,1,6.0,1,0,0)
(34,2,5.5,2,0,0)
(35,3,5.333333333333333,1,0,2)
(38,1,6.0,0,1,0)
Thanks again for explaining the mistake I was making and correcting it!
Created 03-14-2018 11:34 PM
Thanks a lot Aditya! I executed the corrected code and got desired output!
(30,1,6.0,1,0,0)
(34,2,5.5,2,0,0)
(35,3,5.333333333333333,1,0,2)
(38,1,6.0,0,1,0)
Thanks again for explaining the mistake I was making and correcting it!
Created 03-14-2018 04:30 PM
There are a couple of issues that I can see with your script.
Your first statement that reads the data from the file.
my_data = LOAD 'customers.txt'usingPigStorage()as(name:chararray, age:int, eye_color:chararray, height:int);
You used PigStorage() method without any parameter. If you don't pass any parameter to this method, it will consider TAB as the delimiter. And looking at your data file, you have a comma as the delimiter. So your LOAD statement should look like follows.
my_data = LOAD 'customers.txt'usingPigStorage(',')as(name:chararray, age:int, eye_color:chararray, height:int);
This actually is not the problem that you are facing though. In your last statement, where you are creating the final_data relation, you referred to your columns as
SUM(brown_eyes) as num_brown_eyes,SUM(blue_eyes) as num_blue_eyes SUM(green_eyes) as num_green_eyes
This is incorrect. A describe statement should explain the schema to you.
A describe statement should explain the schema for you.
grunt> describe by_age; by_age: {group: int,my_data: {(name: chararray,age: int,eye_color: chararray,height: int)}}
You can see that all the columns are clubbed inside my_data column. So the reference to these columns should be made as mentioned below.
SUM(my_data.brown_eyes) as num_brown_eyes,SUM(my_data.blue_eyes) as num_blue_eyesSUM(my_data.green_eyes) as num_green_eyes
The same way you have used my_data.height in your code.
So you final generate statement should look like as follows.
final_data = FOREACH by_age GENERATE groupas age, COUNT(my_data)as num_people, AVG(my_data.height)as avg_height, SUM(my_data.brown_eyes)as num_brown_eyes, SUM(my_data.blue_eyes)as num_blue_eyes, SUM(my_data.green_eyes)as num_green_eyes;
All in all, your complete script should look like as shown below.
my_data = LOAD 'customers.txt'usingPigStorage(',')as(name:chararray, age:int, eye_color:chararray, height:int); my_data = FOREACH my_data GENERATE name, age, height,(eye_color =='brown'?1:0) AS brown_eyes,(eye_color =='blue'?1:0) AS blue_eyes,(eye_color =='green'?1:0) AS green_eyes; by_age =group my_data by age; final_data = FOREACH by_age GENERATE groupas age, COUNT(my_data)as num_people, AVG(my_data.height)as avg_height, SUM(my_data.brown_eyes)as num_brown_eyes, SUM(my_data.blue_eyes)as num_blue_eyes, SUM(my_data.green_eyes)as num_green_eyes;
Now you know what were the issues, you will be able to run your script and also prevent those "typos" in future!
Happy coding!
Created 12-12-2018 01:14 PM
I executed the corrected code and got desired output!
(30,1,6.0,1,0,0)
(34,2,5.5,2,0,0)
(35,2,6.0,1,0,1)
(38,1,6.0,0,1,0).
and others showing the output is given below.
(30,1,6.0,1,0,0)
(34,2,5.5,2,0,0)
(35,3,5.333333333333333,1,0,2)
(38,1,6.0,0,1,0)
why there is a difference in the third row? appreciated if anyone explains it.