Created 04-13-2016 05:29 PM
Hi,
When I run a Spark job through Zeppelin, I get the following output and the job just hangs and never returns. Does anyone have any idea how I could debug and fix this?
I'm running Spark 1.6 on HDP 2.4.
Thanks,
Mike
INFO [2016-04-13 18:01:17,746] ({Thread-65} Logging.scala[logInfo]:58) - Block broadcast_4 stored as values in memory (estimated size 305.3 KB, free 983.8 KB)
INFO [2016-04-13 18:01:17,860] ({Thread-65} Logging.scala[logInfo]:58) - Block broadcast_4_piece0 stored as bytes in memory (estimated size 25.9 KB, free 1009.7 KB)
INFO [2016-04-13 18:01:17,876] ({dispatcher-event-loop-0} Logging.scala[logInfo]:58) - Added broadcast_4_piece0 in memory on 148.88.72.84:56438 (size: 25.9 KB, free: 511.0 MB)
INFO [2016-04-13 18:01:17,893] ({Thread-65} Logging.scala[logInfo]:58) - Created broadcast 4 from textFile at NativeMethodAccessorImpl.java:-2
INFO [2016-04-13 18:01:18,162] ({Thread-65} FileInputFormat.java[listStatus]:249) - Total input paths to process : 1
INFO [2016-04-13 18:01:18,279] ({Thread-65} Logging.scala[logInfo]:58) - Starting job: count at <string>:3
INFO [2016-04-13 18:01:18,317] ({dag-scheduler-event-loop} Logging.scala[logInfo]:58) - Got job 2 (count at <string>:3) with 2 output partitions
INFO [2016-04-13 18:01:18,321] ({dag-scheduler-event-loop} Logging.scala[logInfo]:58) - Final stage: ResultStage 2 (count at <string>:3)
INFO [2016-04-13 18:01:18,322] ({dag-scheduler-event-loop} Logging.scala[logInfo]:58) - Parents of final stage: List()
INFO [2016-04-13 18:01:18,325] ({dag-scheduler-event-loop} Logging.scala[logInfo]:58) - Missing parents: List()
INFO [2016-04-13 18:01:18,333] ({dag-scheduler-event-loop} Logging.scala[logInfo]:58) - Submitting ResultStage 2 (PythonRDD[8] at count at <string>:3), which has no missing parents
INFO [2016-04-13 18:01:18,366] ({dag-scheduler-event-loop} Logging.scala[logInfo]:58) - Block broadcast_5 stored as values in memory (estimated size 6.2 KB, free 1015.9 KB)
INFO [2016-04-13 18:01:18,406] ({dag-scheduler-event-loop} Logging.scala[logInfo]:58) - Block broadcast_5_piece0 stored as bytes in memory (estimated size 3.7 KB, free 1019.6 KB)
INFO [2016-04-13 18:01:18,407] ({dispatcher-event-loop-1} Logging.scala[logInfo]:58) - Added broadcast_5_piece0 in memory on 148.88.72.84:56438 (size: 3.7 KB, free: 511.0 MB)
INFO [2016-04-13 18:01:18,410] ({dag-scheduler-event-loop} Logging.scala[logInfo]:58) - Created broadcast 5 from broadcast at DAGScheduler.scala:1006
INFO [2016-04-13 18:01:18,416] ({dag-scheduler-event-loop} Logging.scala[logInfo]:58) - Submitting 2 missing tasks from ResultStage 2 (PythonRDD[8] at count at <string>:3)
INFO [2016-04-13 18:01:18,417] ({dag-scheduler-event-loop} Logging.scala[logInfo]:58) - Adding task set 2.0 with 2 tasks
INFO [2016-04-13 18:01:18,428] ({dag-scheduler-event-loop} Logging.scala[logInfo]:58) - Added task set TaskSet_2 tasks to pool default
WARN [2016-04-13 18:01:23,225] ({Timer-0} Logging.scala[logWarning]:70) - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
WARN [2016-04-13 18:01:38,225] ({Timer-0} Logging.scala[logWarning]:70) - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
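For reference, the paragraph itself is nothing fancy. As the textFile/count calls in the log suggest, it boils down to something like this (the path here is illustrative, not my real one):

%pyspark
# Minimal shape of the hanging paragraph: read a text file, count its lines.
lines = sc.textFile("hdfs:///tmp/sample.txt")
print(lines.count())   # this count() job is what never gets scheduled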
Created 04-13-2016 06:40 PM
This looks like Yarn is not able to allocate containers for the executors. When you look at the Yarn Resource Manager UI, is there a job from Zeppelin in the ACCEPTED state? If so, how much memory is available for Yarn to allocate (shown on the same UI)? If the job is in the ACCEPTED state and there is not enough memory available, it will not start until Yarn frees up resources. If that is the case, try giving Yarn more memory in Ambari.
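If a shell is handier than the UI, roughly the same picture is available from the yarn CLI (a sketch; assumes the yarn client is installed on the node you are on):

# Applications stuck waiting for containers show up in the ACCEPTED state
yarn application -list -appStates ACCEPTED

# The per-node memory Yarn can hand out is driven by the standard property
# yarn.nodemanager.resource.memory-mb, which Ambari exposes under Yarn -> Configs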
Created 04-13-2016 05:57 PM
Mike - Did you try running the same job from spark-shell? Does it succeed there and only fail when run from Zeppelin?
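For a PySpark job, that check would look something like this (master URL and the pasted code are illustrative; adjust for your setup):

# Start a PySpark shell against the same Yarn cluster (Spark 1.6-era syntax)
pyspark --master yarn-client

# then paste the body of the Zeppelin paragraph, e.g.:
# sc.textFile("hdfs:///tmp/sample.txt").count()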
Created 04-14-2016 01:03 PM
I can run the same job from the pyspark shell with no problems; it executes immediately.
Created 04-14-2016 01:15 PM
User: zeppelin | Name: Zeppelin | Type: SPARK | Queue: default | StartTime: Wed Apr 13 17:41:12 +0100 2016 | FinishTime: N/A | State: RUNNING | FinalStatus: UNDEFINED
It says that it's still running and is using 66% of the queue/cluster memory.
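(Those numbers are from the RM UI; the CLI should show the same thing, with the application ID below being a placeholder:)

yarn application -status application_<id>   # shows state plus aggregate resource allocation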
Created 04-17-2016 02:32 PM
What about cores? The Yarn RM UI should show the number of cores that Yarn has available to allocate. Are there any cores still available? Are there any other jobs in the running state? If you click on the application master link on the Yarn RM UI, that should take you to the Spark UI; is it showing any jobs as incomplete?
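If the CLI is easier, per-node vcore usage is visible there too (a sketch; the node ID comes from the first command):

# List NodeManagers, then inspect one for vcore usage vs. capacity
yarn node -list
yarn node -status <node-id>   # compare CPU-Used against CPU-Capacity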
Created 04-18-2016 10:04 AM
I think this was the issue: Ambari had auto-configured only one vCore. When I increased it, the problem seemed to go away.
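For anyone who lands here later: I believe the underlying yarn-site.xml properties behind that Ambari setting are the standard vcore ones below (values are illustrative, not a recommendation):

yarn.nodemanager.resource.cpu-vcores = 8          # vcores each NodeManager can offer
yarn.scheduler.maximum-allocation-vcores = 8      # cap on any single container request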
Created 05-27-2016 04:40 PM
I'm having similar issues. Which parameter did you change in Ambari? For me it seems to fail part way through, often at the last stage.
Created 04-13-2016 06:58 PM
Another option you can try is clicking on the "Interpreters" link at the top of the Zeppelin page, finding the "spark" interpreter, and clicking the "restart" button on the right-hand side. Then make sure that your notebook page shows "Connected" with a green dot, meaning it is talking successfully to the Spark driver.
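If the in-UI restart doesn't clear it, the heavier-handed option is restarting the whole Zeppelin daemon from the shell (run from the Zeppelin installation directory; the exact path depends on how it was installed):

bin/zeppelin-daemon.sh restart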