Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Please see the Cloudera blog for information on the Cloudera Response to CVE-2021-4428

Invalid field projection error in Pig script

I have a dataset as below:

abhi,34,brown,5

john,35,green,6

amy,30,brown,6

Steve,38,blue,6

Brett,35,brown,6

Andy,34,brown,6

Layout of above data set is, Name, age, eye color, height

I want to achieve a result which shows in each age group how many people r there in total, the average height of all people in each age group and how many people are with brown eyes, black eyes and blue eyes in each age group. The result should look like below

34, 2,5.5,2,0,0

35,2,6.0,1,10

and so on..

format of above result set is,

<age>, total no of people in that age, avg height in that age group, no of brown eyes in that age group, no of green eyes in the age group, no of blue eyes in the age group.

My script is as below:

grunt> my_data = LOAD 'customers.txt' using PigStorage() >> as (name:chararray, age:int, eye_color:chararray, height:int);

grunt> my_data = FOREACH my_data>> GENERATE name, age, height,>> (eye_color == 'brown' ? 1 : 0) AS brown_eyes,>> (eye_color == 'blue' ? 1 : 0) AS blue_eyes,>> (eye_color == 'green' ? 1 : 0 ) AS green_eyes;

grunt> by_age = group my_data by age;

grunt> final_data = FOREACH by_age GENERATE >> group as age, >> COUNT(my_data) as num_people,>> AVG(my_data.height) as avg_height,>> SUM(brown_eyes) as num_brown_eyes,>> SUM(blue_eyes) as num_blue_eyes,>> SUM(green_eyes) as num_green_eyes;

I am getting below error after the last line of the script is executed:

2018-03-14 00:44:54,181 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: <line 22, column 8> Invalid field projection. Projected field [brown_eyes] does not exist in schema: group:int,my_data:bag{:tuple(name:chararray,age:int,height:int,brown_eyes:int,blue_eyes:int,green_eyes:int)}.

The schema of the by_age relation clearly shows it contains the field brown_eyes but why I am still getting this error and how can I resolve it please?



1 ACCEPTED SOLUTION

Accepted Solutions

@Abhijnan Kundu,

Can you please modify the script as below and try running

I have changed PigStorage() to PigStorage(',') and used my_data.brown_eyes instead of brown_eyes

my_data = LOAD 'customers.txt' using PigStorage(',') as (name:chararray, age:int, eye_color:chararray, height:int);
my_data = FOREACH my_data GENERATE name, age, height, (eye_color == 'brown' ? 1 : 0) AS brown_eyes, (eye_color == 'blue' ? 1 : 0) AS blue_eyes, (eye_color == 'green' ? 1 : 0 ) AS green_eyes;
by_age = group my_data by age;
final_data = FOREACH by_age GENERATE  group as age,  COUNT(my_data) as num_people, AVG(my_data.height) as avg_height, SUM(my_data.brown_eyes) as num_brown_eyes, SUM(my_data.blue_eyes) as num_blue_eyes, SUM(my_data.green_eyes) as num_green_eyes;

If this worked for you, please click on the Accept button to accept the answer. This will be helpful for other community users.

.

-Aditya

View solution in original post

5 REPLIES 5

@Abhijnan Kundu,

Can you please modify the script as below and try running

I have changed PigStorage() to PigStorage(',') and used my_data.brown_eyes instead of brown_eyes

my_data = LOAD 'customers.txt' using PigStorage(',') as (name:chararray, age:int, eye_color:chararray, height:int);
my_data = FOREACH my_data GENERATE name, age, height, (eye_color == 'brown' ? 1 : 0) AS brown_eyes, (eye_color == 'blue' ? 1 : 0) AS blue_eyes, (eye_color == 'green' ? 1 : 0 ) AS green_eyes;
by_age = group my_data by age;
final_data = FOREACH by_age GENERATE  group as age,  COUNT(my_data) as num_people, AVG(my_data.height) as avg_height, SUM(my_data.brown_eyes) as num_brown_eyes, SUM(my_data.blue_eyes) as num_blue_eyes, SUM(my_data.green_eyes) as num_green_eyes;

If this worked for you, please click on the Accept button to accept the answer. This will be helpful for other community users.

.

-Aditya

View solution in original post

Thanks a lot Aditya and Rahul! I executed the corrected code and got desired output!

(30,1,6.0,1,0,0)

(34,2,5.5,2,0,0)

(35,3,5.333333333333333,1,0,2)

(38,1,6.0,0,1,0)

Thanks again for explaining the mistake I was making and correcting it!

Thanks a lot Aditya! I executed the corrected code and got desired output!

(30,1,6.0,1,0,0)

(34,2,5.5,2,0,0)

(35,3,5.333333333333333,1,0,2)

(38,1,6.0,0,1,0)

Thanks again for explaining the mistake I was making and correcting it!

There are a couple of issues that I can see with your script.

Your first statement that reads the data from the file.

my_data = LOAD 'customers.txt'usingPigStorage()as(name:chararray, age:int, eye_color:chararray, height:int);

You used PigStorage() method without any parameter. If you don't pass any parameter to this method, it will consider TAB as the delimiter. And looking at your data file, you have a comma as the delimiter. So your LOAD statement should look like follows.

my_data = LOAD 'customers.txt'usingPigStorage(',')as(name:chararray, age:int, eye_color:chararray, height:int);

This actually is not the problem that you are facing though. In your last statement, where you are creating the final_data relation, you referred to your columns as

SUM(brown_eyes) as num_brown_eyes,SUM(blue_eyes) as num_blue_eyes SUM(green_eyes) as num_green_eyes

This is incorrect. A describe statement should explain the schema to you.

A describe statement should explain the schema for you.

grunt> describe by_age; 

by_age: {group: int,my_data: {(name: chararray,age: int,eye_color: chararray,height: int)}}

You can see that all the columns are clubbed inside my_data column. So the reference to these columns should be made as mentioned below.

SUM(my_data.brown_eyes) as num_brown_eyes,SUM(my_data.blue_eyes) as num_blue_eyesSUM(my_data.green_eyes) as num_green_eyes

The same way you have used my_data.height in your code.

So you final generate statement should look like as follows.

final_data = FOREACH by_age GENERATE groupas age, COUNT(my_data)as num_people, AVG(my_data.height)as avg_height, SUM(my_data.brown_eyes)as num_brown_eyes, SUM(my_data.blue_eyes)as num_blue_eyes, SUM(my_data.green_eyes)as num_green_eyes;

All in all, your complete script should look like as shown below.

my_data = LOAD 'customers.txt'usingPigStorage(',')as(name:chararray, age:int, eye_color:chararray, height:int);

my_data = FOREACH my_data GENERATE name, age, height,(eye_color =='brown'?1:0) AS brown_eyes,(eye_color =='blue'?1:0) AS blue_eyes,(eye_color =='green'?1:0) AS green_eyes;

by_age =group my_data by age;

final_data = FOREACH by_age GENERATE groupas age, COUNT(my_data)as num_people, AVG(my_data.height)as avg_height, SUM(my_data.brown_eyes)as num_brown_eyes, SUM(my_data.blue_eyes)as num_blue_eyes, SUM(my_data.green_eyes)as num_green_eyes;

Now you know what were the issues, you will be able to run your script and also prevent those "typos" in future!

Happy coding!

New Contributor

I executed the corrected code and got desired output!

(30,1,6.0,1,0,0)

(34,2,5.5,2,0,0)

(35,2,6.0,1,0,1)

(38,1,6.0,0,1,0).

and others showing the output is given below.

(30,1,6.0,1,0,0)

(34,2,5.5,2,0,0)

(35,3,5.333333333333333,1,0,2)

(38,1,6.0,0,1,0)

why there is a difference in the third row? appreciated if anyone explains it.