Support Questions

cjervis · ‎03-25-2022

Hello Team,

Spark job through Oozie is failing with the below exception in the Prod cluster. Note that the same job passes in the lower clusters (f.e. Dev)

22/03/24 05:05:22 INFO spark.SparkContext: Invoking stop() from shutdown hook
22/03/24 05:05:22 ERROR scheduler.AsyncEventQueue: Listener EventLoggingListener threw an exception
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:477)
        at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:627)
        at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:583)
        at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:134)
        at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:145)
        at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:145)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:145)
        at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:191)
        at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:57)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
        at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
        at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
        at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
        at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
        at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
        at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1231)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
22/03/24 05:05:22 INFO server.AbstractConnector: Stopped Spark@12bcc45b{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
22/03/24 05:05:22 INFO ui.SparkUI: Stopped Spark web UI at http://xxxxxxxxxxxxxxx:33687
22/03/24 05:05:22 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
22/03/24 05:05:22 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
22/03/24 05:05:22 INFO cluster.SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
22/03/24 05:05:22 ERROR util.Utils: Uncaught exception in thread shutdown-hook-0
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:477)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1685)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1745)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1742)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1757)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1723)
        at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:249)
        at org.apache.spark.SparkContext$$anonfun$stop$9$$anonfun$apply$mcV$sp$7.apply(SparkContext.scala:1966)
        at org.apache.spark.SparkContext$$anonfun$stop$9$$anonfun$apply$mcV$sp$7.apply(SparkContext.scala:1966)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.SparkContext$$anonfun$stop$9.apply$mcV$sp(SparkContext.scala:1966)
        at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1269)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:1965)
        at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:578)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1874)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
        at scala.util.Try$.apply(Try.scala:192)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
22/03/24 05:05:22 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/03/24 05:05:22 INFO memory.MemoryStore: MemoryStore cleared
22/03/24 05:05:22 INFO storage.BlockManager: BlockManager stopped
22/03/24 05:05:22 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
22/03/24 05:05:22 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/03/24 05:05:22 INFO spark.SparkContext: Successfully stopped SparkContext
22/03/24 05:05:22 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
22/03/24 05:05:22 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.

Already looked into the NameNode logs and couldn't find any ERROR regarding this. Please help in resolving the issue.

Best Regards

Sayed016 · ‎03-29-2022

With the help of @mszurap we could narrow down the issue. There are 2 issues, the first one was coming due to the OOM and the second one was from the application itself.

Below are some of the logs that we noticed during the Oozie job run.

22/03/23 13:18:54 INFO mapred.SparkHadoopMapRedUtil: attempt_20220323131847_0000_m_000000_0: Committed
22/03/23 13:18:54 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1384 bytes result sent to driver
22/03/23 13:19:55 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
22/03/23 13:19:55 INFO storage.DiskBlockManager: Shutdown hook called
22/03/23 13:19:55 INFO util.ShutdownHookManager: Shutdown hook called

from Miklos, "executor.Executor" ... "RECEIVED SIGNAL TERM" is completely normal that an executor is killed by the AM/Driver. Since the Spark job was succeeding in the lower environments (like Dev/Test) the suggestion was to check if the application is using the same dependencies too in lower environments (get the Spark event logs for the good and bad run). Also to check the driver YARN logs, there could be a possibility of some abrupt exit due to an OOM.

We then looked in the direction of the OOM, and also checked if there were no System.exit() calls in the Spark code. We updated the driver memory to 2GB and ran the job, now we can see the actual error (the error from the application). Hope this helps someone in the future.

View solution in original post

mszurap · ‎03-28-2022

Hello @Sayed016 , In general the

java.io.IOException: Filesystem closed

message happens when the same or a different thread in the same JVM called the "FileSystem.close()" (see JavaDoc) method - and something later tries to access the HDFS filesystem. (in this case the "EventLoggingListener.stop()" tries to access the HDFS to flush the Spark event logs to HDFS)

FileSystem.close() should not be called by any custom code, as there is a single shared instance of the FileSystem object in any given JVM instance and it can cause failures for the still running frameworks like Spark.

This suggests that the Spark application has the above FileSystem.close() call somewhere in the code. Please review the code and remove those.

Hope that helps.

Best regards, Miklos

Sayed016 · ‎03-29-2022

With the help of @mszurap we could narrow down the issue. There are 2 issues, the first one was coming due to the OOM and the second one was from the application itself.

Below are some of the logs that we noticed during the Oozie job run.

22/03/23 13:18:54 INFO mapred.SparkHadoopMapRedUtil: attempt_20220323131847_0000_m_000000_0: Committed
22/03/23 13:18:54 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1384 bytes result sent to driver
22/03/23 13:19:55 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
22/03/23 13:19:55 INFO storage.DiskBlockManager: Shutdown hook called
22/03/23 13:19:55 INFO util.ShutdownHookManager: Shutdown hook called

from Miklos, "executor.Executor" ... "RECEIVED SIGNAL TERM" is completely normal that an executor is killed by the AM/Driver. Since the Spark job was succeeding in the lower environments (like Dev/Test) the suggestion was to check if the application is using the same dependencies too in lower environments (get the Spark event logs for the good and bad run). Also to check the driver YARN logs, there could be a possibility of some abrupt exit due to an OOM.

We then looked in the direction of the OOM, and also checked if there were no System.exit() calls in the Spark code. We updated the driver memory to 2GB and ran the job, now we can see the actual error (the error from the application). Hope this helps someone in the future.

VidyaSargur · ‎03-29-2022

@Sayed016, Thanks for providing the details on the issue and credits to @mszurap! I have accepted your response as a solution, which should surely help others in the future.

Regards,

Vidya Sargur,
Community Manager

Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:
Community Guidelines
How to use the forum

Support Questions

Spark job through Oozie is failing with - scheduler.AsyncEventQueue: Listener EventLoggingListener threw an exception