Pig Count Distinct Error


I’m trying to use Pig to do a group by and a count distinct on a dataset, and I am getting a Java error saying “java.lang.ClassCastException: cannot be cast to”

Example Dataset

A    B  ids   flag  status
foo  f  1001  1     K
foo  f  1001  1     K
foo  c  1002  1     H
bar  g  1001  1     J
bar  g  1002        P
bar  g  1003  1     L

Here is an example of my code

testtable = LOAD 'landing.testtable' USING org.apache.hive.hcatalog.pig.HCatLoader();

filtertable = FILTER testtable BY flag != ' ' AND status != 'P';

grpcount = FOREACH (GROUP filtertable BY (A, B)) {
    uniqueids = DISTINCT filtertable.ids;
    GENERATE
        group.A AS A_group,
        group.B AS B_group,
        COUNT(uniqueids) AS id_count;
};

STORE grpcount INTO 'landing.grpcount' USING org.apache.hive.hcatalog.pig.HCatStorer();

This is where I get the error “java.lang.ClassCastException: cannot be cast to”. I’m not sure what’s wrong here (the Hive table is set up with the right data types as well). I assume it is failing on grpcount, but I’m not sure why.

But I am basically trying to duplicate this SQL Code in Pig


SELECT
    A AS A_group,
    B AS B_group,
    COUNT(DISTINCT ids) AS id_count
FROM landing.testtable
WHERE flag != ' '
    AND status NOT IN ('P')
GROUP BY A, B;

I found this alternate solution here, but I’m not really sure how to implement it in my code :/. I’m using Pig 0.15 / Hortonworks 2.2.0.


Re: Pig Count Distinct Error

Can you provide the DDL for the two Hive tables? Running SHOW CREATE TABLE table_name; would get it quickly.

Re: Pig Count Distinct Error



You may need to assign a data type to the count, for example:

1) COUNT (uniqueids) AS id_count: double

2) COUNT (uniqueids) AS id_count: chararray

I usually go for option 2 because Pig does not handle the double data type well when you want to apply a filter.
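
As a rough sketch of what an explicit type on the count could look like: COUNT returns a long in Pig, so if the id_count column in landing.grpcount is declared with a different type, a mismatch at store time is one possible cause of a cast error. The int cast below is only an assumption, since the DDL for the target table has not been posted; swap it for whatever type the Hive column actually uses (for example chararray).

```pig
-- Sketch only: assumes id_count in landing.grpcount is an int column (not confirmed in this thread).
grpcount = FOREACH (GROUP filtertable BY (A, B)) {
    uniqueids = DISTINCT filtertable.ids;
    GENERATE
        group.A AS A_group,
        group.B AS B_group,
        -- COUNT returns long; cast explicitly to the assumed column type
        (int)COUNT(uniqueids) AS id_count;
};

-- Print the Pig schema so it can be compared against the Hive table definition
DESCRIBE grpcount;
```

Comparing the DESCRIBE output with the SHOW CREATE TABLE output for landing.grpcount should show whether any column types differ.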

Greetings and good luck

