
Spark shows all jobs completed; IPython still waiting

Explorer

Hello,

 

I am running IPython -> Livy to send jobs to my CDH 5.9.0 cluster running Spark. My job runs through a few operations reading files from HDFS into dataframes and then doing some operations on those dataframes. The code then reaches a cell with a join and stops progressing. If I leave it alone for long enough, the session is eventually killed.
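Roughly, the notebook does something like this (a simplified sketch with made-up paths and column names, not the actual job; sc and sqlContext are the ones the Livy PySpark session already provides):

# Simplified sketch of the notebook cells -- paths and columns are made up.
# `sc` and `sqlContext` come from the Livy PySpark session (CDH 5.9 / Spark 1.6).
orders = sqlContext.read.parquet("hdfs:///data/orders")        # reads from HDFS work fine
customers = sqlContext.read.parquet("hdfs:///data/customers")  # so do the other dataframe ops

# The cell that never returns in the notebook, even though the Spark UI
# shows the corresponding job as finished:
joined = orders.join(customers, orders["cust_id"] == customers["id"])
print(joined.limit(10).collect())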

 

I am not sure how to debug this. YARN shows the job as still running. Spark shows all jobs completed and no active or pending jobs. All the Spark jobs say they succeeded, though some were skipped. If I go to the details for the last stage, all statuses say "Success." The executor logs all end with "Finished task ###. #### bytes sent to driver." The thread dump for the driver shows a lot of waiting threads. If I run the job via pyspark directly, not through IPython/Livy, it works fine. But there are no errors in the Livy log either.
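One way I can think of to split the layers (just a sketch; the host, port, and URL below are placeholders, not my real environment) is to ask Livy's REST API directly what state it thinks the session is in, independent of the notebook client:

import requests  # assumes the requests library is available on the client machine

LIVY_URL = "http://livy-host:8998"   # placeholder for your Livy server

# Sessions as Livy itself sees them (states like starting, idle, busy, dead).
# A session stuck in "busy" would point at Spark/Livy; "idle" while the
# notebook cell still spins would point at the client <-> Livy messaging.
for s in requests.get(LIVY_URL + "/sessions").json()["sessions"]:
    print(s["id"], s["kind"], s["state"])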

 

I'm not sure how to figure this out. Any thoughts?

Thanks!

1 REPLY

Explorer

A bit more info... (this is also cross-posted to the Project Jupyter list)

 

I think that messaging is getting screwed up between PySpark and Livy. When the last cell is executed, I see this on the client side:
 
 
2017-03-08 22:24:48,505 INFO    EventsHandler   InstanceId: 0e1c8fd2-047e-4337-b264-5b64ba74de5a,EventName: notebookStatementExecutionStart,Timestamp: 2017-03-08 22:24:48.504920,SessionGuid: 03d14478-6adc-4b34-abef-b9b6fd400543,LivyKind: pyspark,SessionId: 8,StatementGuid: f1933b11-b767-4a18-b311-c48901ad8369
2017-03-08 22:24:48,788 DEBUG   Command Status of statement 8 is running.
2017-03-08 22:24:50,920 DEBUG   Command Status of statement 8 is running.
 
 ...and it never comes back.
 
On the Livy end, I see:
 
17/03/08 17:26:26 INFO ContextLauncher: 17/03/08 17:26:26 INFO scheduler.DAGScheduler: ResultStage 17 (collect at <stdin>:5) finished in 1.521 s
17/03/08 17:26:26 INFO ContextLauncher: 17/03/08 17:26:26 INFO scheduler.DAGScheduler: Job 8 finished: collect at <stdin>:5, took 3.729078 s
17/03/08 17:26:27 DEBUG RpcDispatcher: [ClientProtocol] Registered outstanding rpc 230 (com.cloudera.livy.rsc.BaseProtocol$GetReplJobResult).
17/03/08 17:26:27 DEBUG KryoMessageCodec: Encoded message of type com.cloudera.livy.rsc.rpc.Rpc$MessageHeader (6 bytes)
17/03/08 17:26:27 DEBUG KryoMessageCodec: Encoded message of type com.cloudera.livy.rsc.BaseProtocol$GetReplJobResult (91 bytes)
17/03/08 17:26:27 DEBUG KryoMessageCodec: Decoded message of type com.cloudera.livy.rsc.rpc.Rpc$MessageHeader (6 bytes)
17/03/08 17:26:27 DEBUG KryoMessageCodec: Decoded message of type com.cloudera.livy.rsc.rpc.Rpc$NullMessage (2 bytes)
17/03/08 17:26:27 DEBUG RpcDispatcher: [ClientProtocol] Received RPC message: type=REPLY id=230 payload=com.cloudera.livy.rsc.rpc.Rpc$NullMessage
17/03/08 17:26:28 DEBUG RpcDispatcher: [ClientProtocol] Registered outstanding rpc 231 (com.cloudera.livy.rsc.BaseProtocol$GetReplJobResult).
 
ad infinitum
 
So, with my limited knowledge, it looks to me like Livy thinks it has sent the result of a finished job, but PySpark never received it.
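A way to test that hypothesis (again just a sketch; the host is a placeholder and the session id is the one from the client log above) is to fetch the statements straight from Livy's REST API, bypassing the notebook client entirely:

import requests

LIVY_URL = "http://livy-host:8998"   # placeholder
session_id = 8                       # SessionId from the client log above

stmts = requests.get("{0}/sessions/{1}/statements".format(LIVY_URL, session_id)).json()
last = stmts["statements"][-1]

# If the state is "available" and output["status"] is "ok" while the notebook
# cell is still waiting, Livy is holding the finished result and the problem
# is in delivering it back to the client.
print(last["state"])
print(last["output"])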
Anyone seen this before? Any thoughts?