Error executing DISTINCT Function in Pig

Rising Star

I am trying to execute the following Pig script, but DISTINCT is not working. Am I missing anything? Please help.

A = LOAD '/tmp/admin/data/gpa.txt' using PigStorage(',') AS (name, age, gpa);
B = group A by age;
C = foreach B generate ABS(SUM(A.gpa)), DISTINCT(A.name), MIN(A.gpa)+MAX(A.gpa)/2, group;
dump C;

2016-01-02 04:03:21,049 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. Could not resolve DISTINCT using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Failed to parse: Pig script failed to parse: 
<file script.pig, line 6, column 40> Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve DISTINCT using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
1 ACCEPTED SOLUTION

Expert Contributor

@Vidya SK

DISTINCT in Pig is a relational operator, so it operates on relations rather than on individual fields. That is why DISTINCT(A.name) cannot be resolved as a function. Consider the following.

given_input = load '/given/path' using PigStorage(',') as (col1, col2, col3);

Consider the following situations.

1) Suppose I want to keep only the unique values in col1:

unique_col1 = foreach given_input generate col1;
unique_values = DISTINCT unique_col1;  -- DISTINCT operates only on relations, here unique_col1

Suppose col1 contains data like:

hortonworks
hortonworks
cloudera

Then you get:

cloudera
hortonworks

2) Suppose I want to keep only the unique combinations of col1 and col2:

unique_two_fields = foreach given_input generate col1, col2;

unique_values = DISTINCT unique_two_fields;  -- DISTINCT operates only on relations

Suppose col1 and col2 contain data like:

hortonworks,cloudera
hortonworks,cloudera
hortonworks,hortonworks

Then you get:

hortonworks,cloudera
hortonworks,hortonworks

In short, first project the fields you want to make unique into a relation of their own, and then apply the DISTINCT operator to that relation. If you need aggregations, use GROUP and apply the aggregate functions to the grouped data.
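
For the script in the question, one way to combine both ideas is a nested FOREACH block, where DISTINCT is applied to the bag of names inside each group. This is only a sketch: the field types in the schema and the aliases names, unique_names, abs_sum_gpa and gpa_expr are assumptions added for illustration and are not part of the original script.

```pig
-- Same file as in the question; the types are an assumption so that
-- SUM, MIN and MAX work on doubles instead of untyped bytearrays.
A = LOAD '/tmp/admin/data/gpa.txt' USING PigStorage(',')
        AS (name:chararray, age:int, gpa:double);

B = GROUP A BY age;

C = FOREACH B {
    -- Inside the nested block, A is the bag of rows for the current age.
    names = A.name;                   -- project the name field out of the bag
    unique_names = DISTINCT names;    -- DISTINCT used as a nested relational operator
    GENERATE group AS age,
             ABS(SUM(A.gpa)) AS abs_sum_gpa,
             unique_names,
             -- Kept as written in the question; use (MIN + MAX) / 2 if a midpoint was intended.
             MIN(A.gpa) + MAX(A.gpa) / 2 AS gpa_expr;
};

DUMP C;
```

Each output row should then carry the age, the aggregates, and a bag holding the distinct names for that age. Registering piggybank.jar is not needed for this, since DISTINCT is an operator and not a UDF.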


2 REPLIES

Rising Star

I have included the following REGISTER statement, but I still get the above error.

register '/usr/hdp/current/pig-client/lib/piggybank.jar';
