
Spark shows all jobs completed; IPython still waiting

Explorer

Hello,

 

I am running IPython -> Livy to send jobs to my CDH 5.9.0 cluster running Spark. My job runs through a few operations reading files from HDFS into dataframes and then doing some operations on those dataframes. The code then reaches a cell with a join and stops progressing. If I leave it alone for long enough, the session is eventually killed.
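Roughly, the notebook does something like this (a simplified sketch with made-up paths and column names, not the actual job; sc and sqlContext are the ones the Livy PySpark session already provides):

# Simplified sketch of the notebook cells -- paths and columns are made up.
# `sc` and `sqlContext` come from the Livy PySpark session (CDH 5.9 / Spark 1.6).
orders = sqlContext.read.parquet("hdfs:///data/orders")        # reads from HDFS work fine
customers = sqlContext.read.parquet("hdfs:///data/customers")  # so do the other dataframe ops

# The cell that never returns in the notebook, even though the Spark UI
# shows the corresponding job as finished:
joined = orders.join(customers, orders["cust_id"] == customers["id"])
print(joined.limit(10).collect())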

 

I am not sure how to debug this. YARN shows the job as still running. Spark shows all jobs completed and no active or pending jobs. All the Spark jobs say they succeeded, though some were skipped. If I go to the details for the last stage, all statuses say "Success." The executor logs all end with "Finished task ###. #### bytes sent to driver." The thread dump for the driver shows a lot of waiting threads. If I run the job via pyspark directly, not through IPython/Livy, it works fine. But there are no errors in the Livy log either.
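One way I can think of to split the layers (just a sketch; the host, port, and URL below are placeholders, not my real environment) is to ask Livy's REST API directly what state it thinks the session is in, independent of the notebook client:

import requests  # assumes the requests library is available on the client machine

LIVY_URL = "http://livy-host:8998"   # placeholder for your Livy server

# Sessions as Livy itself sees them (states like starting, idle, busy, dead).
# A session stuck in "busy" would point at Spark/Livy; "idle" while the
# notebook cell still spins would point at the client <-> Livy messaging.
for s in requests.get(LIVY_URL + "/sessions").json()["sessions"]:
    print(s["id"], s["kind"], s["state"])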

 

I'm not sure how to figure this out. Any thoughts?

Thanks!

1 REPLY

Explorer

A bit more info... (this is also cross-posted to the Project Jupyter list)

 

I think that messaging is getting screwed up between PySpark and Livy. When the last cell is executed, I see this on the client side:
 
 
2017-03-08 22:24:48,505 INFO    EventsHandler   InstanceId: 0e1c8fd2-047e-4337-b264-5b64ba74de5a,EventName: notebookStatementExecutionStart,Timestamp: 2017-03-08 22:24:48.504920,SessionGuid: 03d14478-6adc-4b34-abef-b9b6fd400543,LivyKind: pyspark,SessionId: 8,StatementGuid: f1933b11-b767-4a18-b311-c48901ad8369
2017-03-08 22:24:48,788 DEBUG   Command Status of statement 8 is running.
2017-03-08 22:24:50,920 DEBUG   Command Status of statement 8 is running.
 
 ...and it never comes back.
 
On the Livy end, I see:
 
17/03/08 17:26:26 INFO ContextLauncher: 17/03/08 17:26:26 INFO scheduler.DAGScheduler: ResultStage 17 (collect at <stdin>:5) finished in 1.521 s
17/03/08 17:26:26 INFO ContextLauncher: 17/03/08 17:26:26 INFO scheduler.DAGScheduler: Job 8 finished: collect at <stdin>:5, took 3.729078 s
17/03/08 17:26:27 DEBUG RpcDispatcher: [ClientProtocol] Registered outstanding rpc 230 (com.cloudera.livy.rsc.BaseProtocol$GetReplJobResult).
17/03/08 17:26:27 DEBUG KryoMessageCodec: Encoded message of type com.cloudera.livy.rsc.rpc.Rpc$MessageHeader (6 bytes)
17/03/08 17:26:27 DEBUG KryoMessageCodec: Encoded message of type com.cloudera.livy.rsc.BaseProtocol$GetReplJobResult (91 bytes)
17/03/08 17:26:27 DEBUG KryoMessageCodec: Decoded message of type com.cloudera.livy.rsc.rpc.Rpc$MessageHeader (6 bytes)
17/03/08 17:26:27 DEBUG KryoMessageCodec: Decoded message of type com.cloudera.livy.rsc.rpc.Rpc$NullMessage (2 bytes)
17/03/08 17:26:27 DEBUG RpcDispatcher: [ClientProtocol] Received RPC message: type=REPLY id=230 payload=com.cloudera.livy.rsc.rpc.Rpc$NullMessage
17/03/08 17:26:28 DEBUG RpcDispatcher: [ClientProtocol] Registered outstanding rpc 231 (com.cloudera.livy.rsc.BaseProtocol$GetReplJobResult).
 
ad infinitum
 
So, with my limited knowledge, it looks to me like Livy thinks it has sent the result of a finished job, but PySpark never received it.
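A way to test that hypothesis (again just a sketch; the host is a placeholder and the session id is the one from the client log above) is to fetch the statements straight from Livy's REST API, bypassing the notebook client entirely:

import requests

LIVY_URL = "http://livy-host:8998"   # placeholder
session_id = 8                       # SessionId from the client log above

stmts = requests.get("{0}/sessions/{1}/statements".format(LIVY_URL, session_id)).json()
last = stmts["statements"][-1]

# If the state is "available" and output["status"] is "ok" while the notebook
# cell is still waiting, Livy is holding the finished result and the problem
# is in delivering it back to the client.
print(last["state"])
print(last["output"])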
Anyone seen this before? Any thoughts?