Pig Count Distinct Error

I’m trying to use Pig to do a group by and a count distinct on a dataset, and I am getting a Java error saying “java.lang.ClassCastException: org.apache.pig.data.SingleTupleBag cannot be cast to org.apache.pig.data.”

Example Dataset

A    B  ids   flag  status
foo  f  1001  1     K
foo  f  1001  1     K
foo  c  1002  1     H
bar  g  1001  1     J
bar  g  1002        P
bar  g  1003  1     L

Here is an example of my code:

testtable = LOAD 'landing.testtable' USING org.apache.hive.hcatalog.pig.HCatLoader();

filtertable = FILTER testtable BY flag != ' ' AND status != 'P';

grpcount = FOREACH (GROUP filtertable BY (A, B)) {
    -- distinct ids within each (A, B) group
    uniqueids = DISTINCT filtertable.ids;
    GENERATE
        group.A AS A_group,
        group.B AS B_group,
        COUNT(uniqueids) AS id_count;
}

STORE grpcount INTO 'landing.grpcount' USING org.apache.hive.hcatalog.pig.HCatStorer();

This is where I get the error “java.lang.ClassCastException: org.apache.pig.data.SingleTupleBag cannot be cast to org.apache.pig.data.” I'm not sure what's wrong here (the Hive table is set up properly with the right data types as well). I assume it's erroring out on grpcount, but I'm not sure why.

I am basically trying to duplicate this SQL code in Pig:

SELECT
    A AS A_group,
    B AS B_group,
    COUNT(DISTINCT ids) AS id_count
FROM landing.testtable
WHERE flag != ' '
  AND status NOT IN ('P')
GROUP BY A, B;
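
For reference, the result this query should produce on the example dataset can be worked out outside Pig. The following is a minimal Python sketch, not part of the Pig job: the names ROWS and count_distinct_ids are made up for illustration, and it assumes the blank flag in the fifth row is a single space.

```python
from collections import defaultdict

# Rows from the example dataset as (A, B, ids, flag, status).
# Assumption: the blank flag in the fifth row is a single space.
ROWS = [
    ("foo", "f", 1001, "1", "K"),
    ("foo", "f", 1001, "1", "K"),
    ("foo", "c", 1002, "1", "H"),
    ("bar", "g", 1001, "1", "J"),
    ("bar", "g", 1002, " ", "P"),
    ("bar", "g", 1003, "1", "L"),
]


def count_distinct_ids(rows):
    """Return {(A, B): number of distinct ids} after the WHERE filter."""
    ids_per_group = defaultdict(set)
    for a, b, ids, flag, status in rows:
        # Same condition as the SQL WHERE clause.
        if flag != " " and status != "P":
            ids_per_group[(a, b)].add(ids)
    return {group: len(ids) for group, ids in ids_per_group.items()}


if __name__ == "__main__":
    for (a, b), id_count in sorted(count_distinct_ids(ROWS).items()):
        print(a, b, id_count)
```

On the six example rows this gives an id_count of 2 for (bar, g), 1 for (foo, c), and 1 for (foo, f), which is what a working grpcount relation should contain.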

I found an alternate solution at https://issues.apache.org/jira/browse/PIG-4515, but I'm not sure how to implement it in my code. I'm using Pig 0.15 on Hortonworks 2.2.0.

Re: Pig Count Distinct Error

Can you provide the DDL for the two Hive tables? Running SHOW CREATE TABLE table_name; should get it quickly.

Re: Pig Count Distinct Error

Hello,

Maybe you need to assign a data type to the count, for example:

1) COUNT(uniqueids) AS id_count:double

2) COUNT(uniqueids) AS id_count:chararray

I usually go for option 2 because Pig does not handle the double data type well when you want to apply a filter.

Greetings and good luck
