Spark on YARN: error in localizer.ResourceLocalizationService

Rising Star

I am trying to set up a working cluster.

I am using hadoop-2.7.4 and spark-2.2.0-bin-hadoop2.7.

I have configured shared storage in:

  • spark.local.dir
  • spark.yarn.stagingDir
  • spark.eventLog.dir
  • yarn.nodemanager.remote-app-log-dir
  • yarn.nodemanager.log-dirs
But when I run spark-submit on the master, the following happens:
  • I can see that the master received the job
  • it started to upload it to an available worker
  • the worker receives the job and produces the following output:
17/09/28 11:46:12 INFO ipc.Server: Auth successful for appattempt_1506588244498_0002_000001 (auth:SIMPLE)
17/09/28 11:46:12 INFO authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1506588244498_0002_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
17/09/28 11:46:12 INFO containermanager.ContainerManagerImpl: Start request for container_1506588244498_0002_01_000001 by user usertest1
17/09/28 11:46:12 INFO containermanager.ContainerManagerImpl: Creating a new application reference for app application_1506588244498_0002
17/09/28 11:46:12 INFO application.ApplicationImpl: Application application_1506588244498_0002 transitioned from NEW to INITING
17/09/28 11:46:12 INFO nodemanager.NMAuditLogger: USER=usertest1	IP=10.0.0.101	OPERATION=Start Container Request	TARGET=ContainerManageImpl	RESULT=SUCCESS	APPID=application_1506588244498_0002	CONTAINERID=container_1506588244498_0002_01_000001
17/09/28 11:46:12 INFO application.ApplicationImpl: Adding container_1506588244498_0002_01_000001 to application application_1506588244498_0002
17/09/28 11:46:12 WARN logaggregation.LogAggregationService: Remote Root Log Dir [/mnt/docs/Grid/griddata/logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users.
17/09/28 11:46:12 WARN logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
17/09/28 11:46:12 INFO application.ApplicationImpl: Application application_1506588244498_0002 transitioned from INITING to RUNNING
17/09/28 11:46:12 INFO container.ContainerImpl: Container container_1506588244498_0002_01_000001 transitioned from NEW to LOCALIZING
17/09/28 11:46:12 INFO containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1506588244498_0002
17/09/28 11:46:12 INFO localizer.LocalizedResource: Resource file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/__spark_libs__1914925494756523748.zip transitioned from INIT to DOWNLOADING
17/09/28 11:46:12 INFO localizer.LocalizedResource: Resource file:/home/local/VELOQUANT/usertest1/.sparkStaging/application_1506588244498_0002/__spark_conf__.zip transitioned from INIT to DOWNLOADING
17/09/28 11:46:12 INFO localizer.ResourceLocalizationService: Created localizer for container_1506588244498_0002_01_000001
17/09/28 11:46:12 INFO localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /tmp/nm-local-dir/nmPrivate/container_1506588244498_0002_01_000001.tokens. Credentials list: 
17/09/28 11:46:12 INFO nodemanager.DefaultContainerExecutor: Initializing user usertest1
17/09/28 11:46:12 INFO nodemanager.DefaultContainerExecutor: Copying from /tmp/nm-local-dir/nmPrivate/container_1506588244498_0002_01_000001.tokens to /tmp/nm-local-dir/usercache/usertest1/appcache/application_1506588244498_0002/container_1506588244498_0002_01_000001.tokens
17/09/28 11:46:12 INFO nodemanager.DefaultContainerExecutor: Localizer CWD set to /tmp/nm-local-dir/usercache/usertest1/appcache/application_1506588244498_0002 = file:/tmp/nm-local-dir/usercache/usertest1/appcache/application_1506588244498_0002
17/09/28 11:46:12 INFO authorize.ServiceAuthorizationManager: Authorization successful for usertest1 (auth:SIMPLE) for protocol=interface org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
17/09/28 11:46:12 WARN localizer.ResourceLocalizationService: { file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/__spark_libs__1914925494756523748.zip, 1506588370000, ARCHIVE, null } failed: File file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/__spark_libs__1914925494756523748.zip does not exist
java.io.FileNotFoundException: File file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/__spark_libs__1914925494756523748.zip does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
17/09/28 11:46:12 INFO localizer.LocalizedResource: Resource file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/__spark_libs__1914925494756523748.zip(->/tmp/nm-local-dir/usercache/usertest1/filecache/14/__spark_libs__1914925494756523748.zip) transitioned from DOWNLOADING to FAILED
17/09/28 11:46:12 INFO container.ContainerImpl: Container container_1506588244498_0002_01_000001 transitioned from LOCALIZING to LOCALIZATION_FAILED
17/09/28 11:46:12 INFO localizer.LocalResourcesTrackerImpl: Container container_1506588244498_0002_01_000001 sent RELEASE event on a resource request { file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/__spark_libs__1914925494756523748.zip, 1506588370000, ARCHIVE, null } not present in cache.
17/09/28 11:46:12 INFO localizer.ResourceLocalizationService: Unknown localizer with localizerId container_1506588244498_0002_01_000001 is sending heartbeat. Ordering it to DIE
java.io.InterruptedIOException: Call interrupted
at org.apache.hadoop.ipc.Client.call(Client.java:1470)
at org.apache.hadoop.ipc.Client.call(Client.java:1413)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy75.heartbeat(Unknown Source)
at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:63)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:255)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:130)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1117)
17/09/28 11:46:12 WARN nodemanager.NMAuditLogger: USER=usertest1	OPERATION=Container Finished - Failed	TARGET=ContainerImpl	RESULT=FAILURE	DESCRIPTION=Container failed with state: LOCALIZATION_FAILED	APPID=application_1506588244498_0002	CONTAINERID=container_1506588244498_0002_01_000001
17/09/28 11:46:12 INFO container.ContainerImpl: Container container_1506588244498_0002_01_000001 transitioned from LOCALIZATION_FAILED to DONE
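
In the output above, the resource that fails to localize has a file: URI under /tmp, and the stack trace goes through RawLocalFileSystem, so the NodeManager looked for the archive on its own local disk. A minimal Python sketch for pulling such entries out of a NodeManager log is shown here. The function name and the regular expression are illustrative assumptions based only on the WARN line in this post, not part of any Hadoop tooling.

```python
import re

# Matches the WARN line the NodeManager printed in this post when a resource
# could not be localized. The pattern is inferred from that one line only.
FAILED = re.compile(
    r"ResourceLocalizationService: \{ (?P<uri>\S+), \d+, \w+, \S+ \} failed"
)


def failed_resources(log_text):
    """Return a list of (scheme, path) for each resource that failed to localize."""
    results = []
    for match in FAILED.finditer(log_text):
        scheme, _, path = match.group("uri").partition(":")
        results.append((scheme, path))
    return results


# The WARN line from the log in this post, split only for readability.
SAMPLE = (
    "17/09/28 11:46:12 WARN localizer.ResourceLocalizationService: "
    "{ file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/"
    "__spark_libs__1914925494756523748.zip, 1506588370000, ARCHIVE, null } "
    "failed: File file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/"
    "__spark_libs__1914925494756523748.zip does not exist"
)

if __name__ == "__main__":
    for scheme, path in failed_resources(SAMPLE):
        print(scheme, path)
```

A `file` scheme here suggests, but does not prove, that the archive was staged on the submitting host's local /tmp rather than on storage every node can read.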
2 REPLIES

Super Collaborator

Hi @ilia kheifets,

Can you please verify that the UID in /etc/passwd and the GIDs for the user "usertest1" are consistent and available across the cluster (in case of local authentication)?
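
One way to make that check concrete is to collect the passwd entries from each node and compare them. The sketch below assumes you have already gathered the text of /etc/passwd (or the output of `getent passwd`, which uses the same colon-separated format) per node; the function names and the node names in the example are hypothetical.

```python
def passwd_ids(passwd_text, user):
    """Return (uid, gid) for `user` from passwd-format text, or None if absent."""
    for line in passwd_text.splitlines():
        if not line or line.startswith("#"):
            continue
        # passwd format: name:password:UID:GID:GECOS:home:shell
        fields = line.split(":")
        if len(fields) >= 4 and fields[0] == user:
            return int(fields[2]), int(fields[3])
    return None


def ids_by_node(passwd_by_node, user):
    """Map each node name to the (uid, gid) found for `user` on that node."""
    return {node: passwd_ids(text, user) for node, text in passwd_by_node.items()}


def is_consistent(passwd_by_node, user):
    """True only if every node has the user with the same UID and GID."""
    ids = set(ids_by_node(passwd_by_node, user).values())
    return len(ids) == 1 and None not in ids


if __name__ == "__main__":
    # Hypothetical data: two nodes whose UIDs for usertest1 disagree.
    nodes = {
        "node-a": "usertest1:x:1001:1001::/home/usertest1:/bin/bash\n",
        "node-b": "usertest1:x:1002:1001::/home/usertest1:/bin/bash\n",
    }
    print(ids_by_node(nodes, "usertest1"))
    print(is_consistent(nodes, "usertest1"))
```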

Apart from that, you can try running in yarn client mode, and copy core-site.xml from /etc/hadoop/conf to the Spark conf directory on all the nodes in the cluster.
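
When comparing core-site.xml across nodes, it can help to read out the value that decides where unqualified paths resolve. The sketch below parses Hadoop's `<configuration>/<property>` layout and returns a named property such as fs.defaultFS. The helper name and the hostname in the sample are hypothetical. If fs.defaultFS is not set, Hadoop falls back to file:///, which would be consistent with the file: URIs in the log, though that is only a guess about this cluster.

```python
import xml.etree.ElementTree as ET


def hadoop_property(xml_text, name):
    """Return the value of property `name` from Hadoop-style XML, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        found = prop.findtext("name")
        if found is not None and found.strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Hypothetical core-site.xml content; the hostname and port are placeholders.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example:8020</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(hadoop_property(SAMPLE_CORE_SITE, "fs.defaultFS"))
```

Running the same function over the file from each node, and from the Spark conf directory, shows quickly whether they agree.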

I presume something is wrong with sharing the libraries between the nodes in the cluster.

Hope this helps!!

Rising Star

The user is the same domain user, with the same UID.
core-site.xml is the same file across all nodes,