Error executing DISTINCT Function in Pig

Rising Star

I am trying to execute the following Pig script, but DISTINCT is not working. Am I missing anything? Please help.

A = LOAD '/tmp/admin/data/gpa.txt' using PigStorage(',') AS (name, age, gpa);
B = group A by age;
C = foreach B generate ABS(SUM(A.gpa)), DISTINCT(A.name), MIN(A.gpa)+MAX(A.gpa)/2, group;
dump C;

2016-01-02 04:03:21,049 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. Could not resolve DISTINCT using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Failed to parse: Pig script failed to parse: 
<file script.pig, line 6, column 40> Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve DISTINCT using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
1 ACCEPTED SOLUTION

Expert Contributor

@Vidya SK

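DISTINCT in Pig is a relational operator, so it operates on whole relations rather than on individual fields. It cannot be called like a function inside GENERATE, which is why the parser tries, and fails, to resolve DISTINCT as a function name. Take the following input:

given_input = load '/given/path' using PigStorage(',') as (col1, col2, col3);

Consider the following two situations.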

1) Suppose I want to keep only the unique values of col1:

unique_col1 = foreach given_input generate col1;
unique_values = DISTINCT unique_col1;  -- DISTINCT operates only on relations, here unique_col1

Suppose col1 contains the following data:

hortonworks
hortonworks
cloudera

Then you get:

cloudera
hortonworks

2) Suppose I want to keep only the unique combinations of col1 and col2:

unique_two_fields = foreach given_input generate col1, col2;
unique_values = DISTINCT unique_two_fields;  -- DISTINCT operates only on relations, here unique_two_fields

Suppose col1 and col2 contain the following data:

hortonworks,cloudera
hortonworks,cloudera
hortonworks,hortonworks

Then you get:

hortonworks,cloudera
hortonworks,hortonworks

In other words, first project the fields you want to de-duplicate into a relation of their own, and then apply the DISTINCT operator to that relation. If you also need aggregations, use GROUP and apply the aggregate functions to the grouped data.
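
For the script in the question, one way to combine DISTINCT with GROUP is a nested FOREACH block, where DISTINCT is applied to a projection of the grouped bag before GENERATE. The following is a minimal sketch rather than a tested fix for your environment: the typed schema (chararray, int, double) and the aliases names, unique_names, abs_sum_gpa and gpa_expr are assumptions added for illustration, and the arithmetic is kept exactly as written in the question.

```pig
-- Assumption: explicit field types; the question's script loads untyped fields.
A = LOAD '/tmp/admin/data/gpa.txt' USING PigStorage(',')
    AS (name:chararray, age:int, gpa:double);

B = GROUP A BY age;

C = FOREACH B {
    -- Project the name field out of the grouped bag; the result is itself a bag.
    names = A.name;
    -- Inside a nested block DISTINCT is used as an operator, not as a function call.
    unique_names = DISTINCT names;
    GENERATE group AS age,
             ABS(SUM(A.gpa)) AS abs_sum_gpa,
             unique_names,
             -- Kept as in the question; note that division binds tighter than addition.
             MIN(A.gpa) + MAX(A.gpa) / 2 AS gpa_expr;
};

DUMP C;
```

If only the number of distinct names per age is needed, COUNT(unique_names) can be generated instead of the bag itself.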


2 REPLIES 2

Rising Star

I have included the following REGISTER statement, but I still get the above error.

register '/usr/hdp/current/pig-client/lib/piggybank.jar';
