
Pig dump command causes pig script to run forever

Explorer

I have installed a 5-node cluster with Ambari, following the Quick Start guide (CentOS 7):

https://cwiki.apache.org/confluence/display/AMBARI/Quick+Start+for+New+VM+Users

I am running the tutorial "How to Process Data with Apache Pig" and when I add the line

dump join_data;

to the end of the following script:
drivers = LOAD 'drivers.csv' USING PigStorage(',');
raw_drivers = FILTER drivers BY $0>1;
drivers_details = FOREACH raw_drivers GENERATE $0 AS driverId, $1 AS name;
timesheet = LOAD 'timesheet.csv' USING PigStorage(',');
raw_timesheet = FILTER timesheet by $0>1;
timesheet_logged = FOREACH raw_timesheet GENERATE $0 AS driverId, $2 AS hours_logged, $3 AS miles_logged;
grp_logged = GROUP timesheet_logged by driverId;
sum_logged = FOREACH grp_logged GENERATE group as driverId,
    SUM(timesheet_logged.hours_logged) as sum_hourslogged,
    SUM(timesheet_logged.miles_logged) as sum_mileslogged;
join_sum_logged = JOIN sum_logged by driverId, drivers_details by driverId;
join_data = FOREACH join_sum_logged GENERATE $0 as driverId, $4 as name, $1 as hours_logged, $2 as miles_logged;

dump join_data;

The script never finishes; it just keeps running. Why is this, and why can I not use the dump command?

Re: Pig dump command causes pig script to run forever

Expert Contributor

@John Cleveland

Could you please post the complete script? I don't see where the relation join_data is defined in your script.

Also, post the log details.

Re: Pig dump command causes pig script to run forever

Explorer

The entire script is there. Just scroll down.

thanks john

Re: Pig dump command causes pig script to run forever

Could you also add some small sample files (or links to where to get them) for the timesheet and driver CSV files?

Re: Pig dump command causes pig script to run forever

Explorer

I'll get the logs to you in a bit, thanks.

Re: Pig dump command causes pig script to run forever

Explorer

I am not really doing anything fancy. Essentially I am just following the tutorials for the sandbox. I am just using a cluster.

1. Here is the tutorial link. I am just following this tutorial, but on a cluster that I set up using the quickstart tutorial (https://cwiki.apache.org/confluence/display/AMBARI/Quick+Start+for+New+VM+Users)

https://hortonworks.com/hadoop-tutorial/how-to-use-basic-pig-commands/

The links to the CSV files are all on the tutorial page.

2. Log files: job-1496737154796-0008-logs.txt

3. FULL SCRIPT:

truck_events = LOAD '/user/maria_dev/truck_event_text_partition.csv' USING PigStorage(',')
    AS (driverId:int, truckId:int, eventTime:chararray,
        eventType:chararray, longitude:double, latitude:double,
        eventKey:chararray, correlationId:long, driverName:chararray,
        routeId:long, routeName:chararray, eventDate:chararray);
DESCRIBE truck_events;
truck_events_subset = LIMIT truck_events 100;
DESCRIBE truck_events_subset;

DUMP truck_events_subset;

Re: Pig dump command causes pig script to run forever

Expert Contributor

@John Cleveland

Your script looks correct, but I see the error below in your log file. I'm not sure why it is unable to set up the load function.

2017-06-06 19:14:44,999 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2081: Unable to setup the load function.

I ran the same script with the same dataset and it completed within a minute. Please find the screenshots attached.

16064-screen-shot-2017-06-06-at-80307-pm.png


screen-shot-2017-06-06-at-80321-pm.png

Re: Pig dump command causes pig script to run forever

Explorer

Satish,

Thanks. Just looking at your screenshot showed me that I had made a stupid error in my URL. My script will work just as yours above does. However, here is the problem:

If you take out the LIMIT, it will run forever.

In your above script:

DUMP truck_events;

This change alone is enough to make my script run seemingly forever. Could you try it, just to see what happens with yours? Thanks.
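
One way to narrow this down, sketched under the assumption that the job is stalled or slow rather than truly looping: DUMP prints every tuple of the relation to the console, so for the full dataset it is common to STORE the relation to HDFS and DUMP only a small sample. The output directory below is a hypothetical path, not one from the tutorial.

```pig
-- Load the full dataset with the same schema as in the script above.
truck_events = LOAD '/user/maria_dev/truck_event_text_partition.csv' USING PigStorage(',')
    AS (driverId:int, truckId:int, eventTime:chararray,
        eventType:chararray, longitude:double, latitude:double,
        eventKey:chararray, correlationId:long, driverName:chararray,
        routeId:long, routeName:chararray, eventDate:chararray);

-- Write the whole relation to HDFS instead of printing it with DUMP.
-- '/user/maria_dev/truck_events_out' is a hypothetical output directory;
-- it must not already exist when the script runs.
STORE truck_events INTO '/user/maria_dev/truck_events_out' USING PigStorage(',');

-- Print only a small sample to the console for inspection.
truck_events_sample = LIMIT truck_events 10;
DUMP truck_events_sample;
```

If the STORE variant also never finishes, the problem is more likely in job submission or cluster resources than in DUMP itself.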

Re: Pig dump command causes pig script to run forever

Explorer

Here is a summary of what I have discovered so far. If I open a Grunt shell as the hdfs user and run Pig commands as hdfs, there are no problems.

I usually work as the vagrant user, and that was giving me problems on the command line.

So I suspect that something similar is happening when I use Pig View through ambari-server.

Summary: I can use the shell to run Pig scripts on a node that has pig-client, as long as I open the Grunt shell as hdfs. However, how do I apply this finding to Ambari's Pig View?

thanks
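
Since the same commands work as hdfs but not as vagrant, one thing worth checking is HDFS ownership and permissions. This is an assumption; nothing in the thread confirms the cause. A minimal sketch using the Grunt shell's built-in fs command, where /user/vagrant is a guess at that user's HDFS home directory:

```pig
-- Run these in the Grunt shell as the user that has problems (e.g. vagrant).
-- List the HDFS home directories with their owners and permissions.
fs -ls /user
-- Check whether this user has a home directory (hypothetical path).
fs -ls /user/vagrant
-- Check that the input file used by the script is visible to this user.
fs -ls /user/maria_dev/truck_event_text_partition.csv
```

If a directory is missing or owned by someone else, the same checks would apply to whichever user Pig View submits jobs as.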
