Pig ERROR 1066




I don't know why, but Pig fails with ERROR 1066 every time I run this script. I have the data attached, and this is the script I'm running.

Can anyone help?

a = load '/pigsample/Salaryinfo.csv' USING PigStorage(',');

b = load '/pigsample/Employeeinfo.csv' USING PigStorage(',');

c = filter b by $4 =='Male';

d = foreach c generate $0 as id:int, $1 as firstname:chararray, $2 as lastname:chararray, $4 as gender:chararray, $6 as city:chararray , $7 as country:chararray, $8 as countrycode:chararray;

e = foreach a generate $0 as iD:int, $1 as firstname:chararray, $2 as lastname:chararray, $3 as salary:double, ToDate($4, 'MM/dd/yyyy') as dateofhire, $5 as company:chararray;

f = join d by id, e by iD;

g = foreach f generate f.d::firstname as firstname;

dump g


This is the input and output I get from the Grunt shell, including the describe output for each relation:

grunt> a = load '/pigsample/Salaryinfo.csv' USING PigStorage(',');

grunt> describe a

Schema for a unknown.

grunt> b = load '/pigsample/Employeeinfo.csv' USING PigStorage(',');

grunt> describe b

Schema for b unknown.

grunt> c = filter b by $4 =='Male';

2016-07-01 19:02:16,356 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).

grunt> describe c

2016-07-01 19:02:21,611 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).

Schema for c unknown.

grunt> d = foreach c generate $0 as id:int, $1 as firstname:chararray, $2 as lastname:chararray, $4 as gender:chararray, $6 as city:chararray , $7 as country:chararray, $8 as countrycode:chararray;

2016-07-01 19:02:35,684 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).

grunt> describe d

2016-07-01 19:02:40,638 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).

d: {id: int,firstname: chararray,lastname: chararray,gender: chararray,city: chararray,country: chararray,countrycode: chararray}

grunt> e = foreach a generate $0 as iD:int, $1 as firstname:chararray, $2 as lastname:chararray, $3 as salary:double, ToDate($4, 'MM/dd/yyyy') as dateofhire, $5 as company:chararray;

2016-07-01 19:44:03,703 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).

2016-07-01 19:44:03,703 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 2 time(s).

grunt> describe e

2016-07-01 19:44:09,159 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).

2016-07-01 19:44:09,159 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).

e: {iD: int,firstname: chararray,lastname: chararray,salary: double,dateofhire: datetime,company: chararray}

grunt> f = join d by id, e by iD;

2016-07-01 19:44:34,194 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).

2016-07-01 19:44:34,194 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 2 time(s).

grunt> describe f

2016-07-01 19:44:38,955 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).

2016-07-01 19:44:38,955 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 2 time(s).

f: {d::id: int,d::firstname: chararray,d::lastname: chararray,d::gender: chararray,d::city: chararray,d::country: chararray,d::countrycode: chararray,e::iD: int,e::firstname: chararray,e::lastname: chararray,e::salary: double,e::dateofhire: datetime,e::company: chararray}

grunt> g = foreach f generate f.d::firstname as firstname;

2016-07-01 19:45:03,037 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).

2016-07-01 19:45:03,037 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 2 time(s).

grunt> describe g

2016-07-01 19:45:08,432 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).

2016-07-01 19:45:08,432 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 2 time(s).

g: {firstname: chararray}

grunt> dump g

2016-07-01 19:45:13,698 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).

2016-07-01 19:45:13,698 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 2 time(s).
2016-07-01 19:45:13,725 [main] INFO - Pig features used in the script: HASH_JOIN,FILTER
2016-07-01 19:45:13,773 [main] INFO - Key [pig.schematuple] was not set... will not generate code.
2016-07-01 19:45:13,812 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2016-07-01 19:45:13,950 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2016-07-01 19:45:13,965 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - number of input files: -1
2016-07-01 19:45:13,988 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager)
2016-07-01 19:45:13,997 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2016-07-01 19:45:13,997 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-only splittees.
2016-07-01 19:45:13,997 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 3 MR operators.
2016-07-01 19:45:13,997 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
2016-07-01 19:45:14,155 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://hwhdpm
2016-07-01 19:45:14,163 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at
2016-07-01 19:45:14,376 [main] INFO - Pig script settings are added to the job
2016-07-01 19:45:14,382 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016-07-01 19:45:14,385 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
2016-07-01 19:45:14,386 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2016-07-01 19:45:14,395 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=2500
2016-07-01 19:45:14,395 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2016-07-01 19:45:14,395 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2016-07-01 19:45:14,922 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/ to DistributedCache through /tmp/temp1278836613/tmp2008058395/pig-
2016-07-01 19:45:15,064 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/ to DistributedCache through /tmp/temp1278836613/tmp-244128717/automaton-1.11-8.jar
2016-07-01 19:45:15,193 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/ to DistributedCache through /tmp/temp1278836613/tmp-1145480432/antlr-runtime-3.4.jar
2016-07-01 19:45:15,339 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/ to DistributedCache through /tmp/temp1278836613/tmp530457831/joda-time-2.9.1.jar
2016-07-01 19:45:15,386 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job
2016-07-01 19:45:15,394 [main] INFO - Key [pig.schematuple] is false, will not generate code.
2016-07-01 19:45:15,395 [main] INFO - Starting process to move generated code to distributed cacche
2016-07-01 19:45:15,395 [main] INFO - Setting key [pig.schematuple.classes] with classes to deserialize []
2016-07-01 19:45:15,498 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2016-07-01 19:45:15,624 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://hwhdpm
2016-07-01 19:45:15,625 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at
2016-07-01 19:45:15,934 [JobControl] WARN org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2016-07-01 19:45:16,009 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-07-01 19:45:16,009 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2016-07-01 19:45:16,037 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2016-07-01 19:45:16,042 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-07-01 19:45:16,042 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2016-07-01 19:45:16,045 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2016-07-01 19:45:16,419 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:2
2016-07-01 19:45:16,667 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1467387563416_0004
2016-07-01 19:45:16,839 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2016-07-01 19:45:17,136 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1467387563416_0004
2016-07-01 19:45:17,181 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://hwhdpmaster02.c
2016-07-01 19:45:17,182 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1467387563416_0004
2016-07-01 19:45:17,182 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases a,b,c,d,e,f
2016-07-01 19:45:17,182 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: a[1,4],e[5,4],f[6,4],b[2,4],c[3,4],d[4,4],f[6,4] C: R:
2016-07-01 19:45:17,197 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2016-07-01 19:45:17,198 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1467387563416_0004]
2016-07-01 19:45:46,336 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2016-07-01 19:45:46,336 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1467387563416_0004]
2016-07-01 19:45:47,346 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2016-07-01 19:45:47,346 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1467387563416_0004 has failed! Stop running all dependent jobs
2016-07-01 19:45:47,346 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-07-01 19:45:47,517 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://hwhdpm
2016-07-01 19:45:47,518 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at
2016-07-01 19:45:47,528 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=FAILED. Redirecting to job history server
2016-07-01 19:45:47,824 [main] ERROR - ERROR 0: java.lang.ClassCastException: cannot be cast to java.lang.Integer
2016-07-01 19:45:47,824 [main] ERROR - 1 map reduce job(s) failed!
2016-07-01 19:45:47,831 [main] INFO - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2016-07-01 19:45:14 2016-07-01 19:45:47 HASH_JOIN,FILTER
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_1467387563416_0004 a,b,c,d,e,f HASH_JOIN,MULTI_QUERY Message: Job failed!
Input(s):
Failed to read data from "/pigsample/Employeeinfo.csv"
Failed to read data from "/pigsample/Salaryinfo.csv"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1467387563416_0004 -> null, null
2016-07-01 19:45:47,831 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2016-07-01 19:45:47,833 [main] ERROR - ERROR 1066: Unable to open iterator for alias g
Details at logfile: /home//pig_1467399695184.log
grunt>



Here is the solution to your problem @Dagmawi Mengistu

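There are two issues here.

First, if you check your logs, the job that runs the join in relation "f" fails with "java.lang.ClassCastException" ("cannot be cast to java.lang.Integer"). Both files are loaded without a schema, so every field is a bytearray by default. Writing "$0 as id:int" declares the join key as int in the schema (which is why describe looks fine), but judging by the exception it does not convert the value, so the job breaks as soon as it actually runs.

Second, the reference f.d::firstname in relation "g" is not the right way to pick a field out of the joined relation.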

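Please find the updated steps below, with an explanation of how to resolve each error. The lines prefixed with // are my annotations; Pig itself uses -- for single-line comments, so remove them or change the prefix before running the script.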

a = load '/pigsample/Salaryinfo.csv' USING PigStorage(',');

b = load '/pigsample/Employeeinfo.csv' USING PigStorage(',');

c = filter b by $4 =='Male';

// In relation "d", note that I have cast the field at index 0 to int. You need to cast explicitly like this to avoid the "java.lang.ClassCastException".

d = foreach c generate (int)$0 as id:int, $1 as firstname:chararray, $2 as lastname:chararray, $4 as gender:chararray, $6 as city:chararray , $7 as country:chararray, $8 as countrycode:chararray;

// Similarly, in relation "e", we again have to explicitly cast the field at index 0 (iD) to int.

e = foreach a generate (int)$0 as iD:int, $1 as firstname:chararray, $2 as lastname:chararray, $3 as salary:double, ToDate($4, 'MM/dd/yyyy') as dateofhire, $5 as company:chararray;

// Relation "f" now works and no longer throws the exception.

f = join d by id, e by iD;


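// In relation "g", you don't need to write f.d::firstname; this will throw "org.apache.pig.backend.executionengine.ExecException". Inside "foreach f", the f. prefix makes Pig treat f as a single-row (scalar) relation, which it is not. You can directly reference the fields of relation "d" that are present in relation "f" like this -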

g = foreach f generate d::firstname as firstname;

// Print output

dump g;

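As an alternative to casting inside each foreach, the types can be declared once at load time, so Pig converts the fields as it reads them. The following is only a sketch under assumptions: the column positions are taken from the script above, the alias names (emp, sal, males, and so on) are my own, and col3 and col5 are placeholder names for the two Employeeinfo columns the original script never uses. It also assumes each file has at least as many columns as the schema lists.

```pig
-- Sketch only: col3 and col5 are placeholder names for unused columns.
-- Fields without a declared type stay bytearray.
emp = LOAD '/pigsample/Employeeinfo.csv' USING PigStorage(',')
      AS (id:int, firstname:chararray, lastname:chararray, col3,
          gender:chararray, col5, city:chararray, country:chararray,
          countrycode:chararray);

sal = LOAD '/pigsample/Salaryinfo.csv' USING PigStorage(',')
      AS (iD:int, firstname:chararray, lastname:chararray, salary:double,
          hiredate:chararray, company:chararray);

-- Fields can now be referenced by name instead of by position.
males = FILTER emp BY gender == 'Male';

sal_typed = FOREACH sal GENERATE iD, salary,
            ToDate(hiredate, 'MM/dd/yyyy') AS dateofhire, company;

-- Both join keys are real ints here, so no explicit cast is needed.
joined = JOIN males BY id, sal_typed BY iD;

-- Use alias::field to pick fields from either side of the join.
result = FOREACH joined GENERATE males::firstname AS firstname,
                                 sal_typed::salary AS salary;

DUMP result;
```

With a load-time schema, describe shows the types right after the load instead of "Schema for a unknown", and values that cannot be converted (for example a header row in the id column) become null rather than failing the job.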
Hope this helps 🙂




