I'm running a very simple Pig script, shown below:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|') AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdbLink:chararray);
nameLookup = FOREACH metadata GENERATE movieID, movieTitle, ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;
ratingsByMovie = GROUP ratings BY movieID;
avgRatings = FOREACH ratingsByMovie GENERATE group AS movieID, AVG(ratings.rating) AS avgRating;
fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;
fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;
oldestFiveStarMovies = ORDER fiveStarsWithData BY nameLookup::releaseTime;
DUMP oldestFiveStarMovies;
But after I hit the Execute button in Pig View, the script has been running for the past hour and I cannot see any progress. I have attached a screenshot as well.
The data I am using consists of around 100,000 ratings from around 1,000 users. Does this happen by default? Is it normal for Pig to take this long?
Is there an error here? I am pretty sure there is no error in the code, but Pig is still taking too long to execute the script.
Can someone please shed some light on this and guide me?
Based on the code below, releaseDate is not declared in the schema. Did you mean to use 'videoRelease' instead of 'releaseDate'?
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|') AS (movieID:int, movieTitle:chararray, videoRelease:chararray, imdbLink:chararray);
nameLookup = FOREACH metadata GENERATE movieID, movieTitle,ToUnixTime(ToDate(releaseDate,'dd-MMM-yyyy')) AS releaseTime;
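Also change this line:
fiveStarMovies = FILTER avgRatings BY avgrating >4.0;
to:
fiveStarMovies = FILTER avgRatings BY avgRating >4.0;
Field names in Pig are case-sensitive, so avgrating does not match the avgRating field defined earlier.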
The slowness could be due to a lack of resources in YARN. Check whether any YARN applications are already running. You can see this in the ResourceManager UI: go to YARN -> Quick Links -> ResourceManager UI.
See whether your application is in the Accepted or Running state. Also observe the memory taken, vcores used, etc. An application that stays in Accepted is usually still waiting for resources. If your data is small, the script should finish within a minute.
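If you prefer the command line to the ResourceManager UI, the same check can probably be done with the yarn CLI. This is only a rough sketch: it assumes the yarn client is on the PATH of the sandbox shell, and the function name list_active_yarn_apps is just a placeholder I made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative, not part of any tool):
# lists YARN applications that are waiting (ACCEPTED) or executing (RUNNING).
list_active_yarn_apps() {
  if command -v yarn >/dev/null 2>&1; then
    # "yarn application -list" with an -appStates filter shows each
    # application's ID, name, state and tracking URL.
    yarn application -list -appStates ACCEPTED,RUNNING
  else
    # Assumption: the command is meant to run on a cluster or sandbox node.
    echo "yarn CLI not found on PATH; run this on a cluster node" >&2
    return 127
  fi
}

# Run the check; do not abort the shell if yarn is unavailable.
list_active_yarn_apps || echo "could not query YARN from this machine"
```

If an older application shows up as RUNNING while your Pig job sits in ACCEPTED, that older job is likely holding the resources your script is waiting for.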
I edited the code and the question as well. But then the Ambari server suddenly crashed, I think. In the sandbox console, when I typed 'ambari-server restart', I got the following error:
ambari-server restart
Using python /usr/bin/python
Restarting ambari-server
Ambari Server is not running
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start.........Unable to determine server PID. Retrying...
......Unable to determine server PID. Retrying...
......Unable to determine server PID. Retrying...
ERROR: Exiting with exit code -1.
REASON: Ambari Server java process died with exitcode 255. Check /var/log/ambari-server/ambari-server.out for more information.
I have posted about this error in a new question: https://community.hortonworks.com/questions/158800/ambari-server-restart-ambari-server-java-process-...