Created 09-29-2017 01:45 PM
I am trying to set up a working cluster.
I am using hadoop-2.7.4 and spark-2.2.0-bin-hadoop2.7.
I have configured shared storage in
17/09/28 11:46:12 INFO ipc.Server: Auth successful for appattempt_1506588244498_0002_000001 (auth:SIMPLE)
17/09/28 11:46:12 INFO authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1506588244498_0002_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
17/09/28 11:46:12 INFO containermanager.ContainerManagerImpl: Start request for container_1506588244498_0002_01_000001 by user usertest1
17/09/28 11:46:12 INFO containermanager.ContainerManagerImpl: Creating a new application reference for app application_1506588244498_0002
17/09/28 11:46:12 INFO application.ApplicationImpl: Application application_1506588244498_0002 transitioned from NEW to INITING
17/09/28 11:46:12 INFO nodemanager.NMAuditLogger: USER=usertest1 IP=10.0.0.101 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1506588244498_0002 CONTAINERID=container_1506588244498_0002_01_000001
17/09/28 11:46:12 INFO application.ApplicationImpl: Adding container_1506588244498_0002_01_000001 to application application_1506588244498_0002
17/09/28 11:46:12 WARN logaggregation.LogAggregationService: Remote Root Log Dir [/mnt/docs/Grid/griddata/logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users.
17/09/28 11:46:12 WARN logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
17/09/28 11:46:12 INFO application.ApplicationImpl: Application application_1506588244498_0002 transitioned from INITING to RUNNING
17/09/28 11:46:12 INFO container.ContainerImpl: Container container_1506588244498_0002_01_000001 transitioned from NEW to LOCALIZING
17/09/28 11:46:12 INFO containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1506588244498_0002
17/09/28 11:46:12 INFO localizer.LocalizedResource: Resource file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/__spark_libs__1914925494756523748.zip transitioned from INIT to DOWNLOADING
17/09/28 11:46:12 INFO localizer.LocalizedResource: Resource file:/home/local/VELOQUANT/usertest1/.sparkStaging/application_1506588244498_0002/__spark_conf__.zip transitioned from INIT to DOWNLOADING
17/09/28 11:46:12 INFO localizer.ResourceLocalizationService: Created localizer for container_1506588244498_0002_01_000001
17/09/28 11:46:12 INFO localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /tmp/nm-local-dir/nmPrivate/container_1506588244498_0002_01_000001.tokens.
Credentials list:
17/09/28 11:46:12 INFO nodemanager.DefaultContainerExecutor: Initializing user usertest1
17/09/28 11:46:12 INFO nodemanager.DefaultContainerExecutor: Copying from /tmp/nm-local-dir/nmPrivate/container_1506588244498_0002_01_000001.tokens to /tmp/nm-local-dir/usercache/usertest1/appcache/application_1506588244498_0002/container_1506588244498_0002_01_000001.tokens
17/09/28 11:46:12 INFO nodemanager.DefaultContainerExecutor: Localizer CWD set to /tmp/nm-local-dir/usercache/usertest1/appcache/application_1506588244498_0002 = file:/tmp/nm-local-dir/usercache/usertest1/appcache/application_1506588244498_0002
17/09/28 11:46:12 INFO authorize.ServiceAuthorizationManager: Authorization successful for usertest1 (auth:SIMPLE) for protocol=interface org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
17/09/28 11:46:12 WARN localizer.ResourceLocalizationService: { file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/__spark_libs__1914925494756523748.zip, 1506588370000, ARCHIVE, null } failed: File file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/__spark_libs__1914925494756523748.zip does not exist
java.io.FileNotFoundException: File file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/__spark_libs__1914925494756523748.zip does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
	at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
	at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
17/09/28 11:46:12 INFO localizer.LocalizedResource: Resource file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/__spark_libs__1914925494756523748.zip(->/tmp/nm-local-dir/usercache/usertest1/filecache/14/__spark_libs__1914925494756523748.zip) transitioned from DOWNLOADING to FAILED
17/09/28 11:46:12 INFO container.ContainerImpl: Container container_1506588244498_0002_01_000001 transitioned from LOCALIZING to LOCALIZATION_FAILED
17/09/28 11:46:12 INFO localizer.LocalResourcesTrackerImpl: Container container_1506588244498_0002_01_000001 sent RELEASE event on a resource request { file:/tmp/spark-ff52ce90-2a7f-4cfb-992c-1b74cca0eed7/__spark_libs__1914925494756523748.zip, 1506588370000, ARCHIVE, null } not present in cache.
17/09/28 11:46:12 INFO localizer.ResourceLocalizationService: Unknown localizer with localizerId container_1506588244498_0002_01_000001 is sending heartbeat. Ordering it to DIE
java.io.InterruptedIOException: Call interrupted
	at org.apache.hadoop.ipc.Client.call(Client.java:1470)
	at org.apache.hadoop.ipc.Client.call(Client.java:1413)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy75.heartbeat(Unknown Source)
	at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:63)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:255)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:130)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1117)
17/09/28 11:46:12 WARN nodemanager.NMAuditLogger: USER=usertest1 OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1506588244498_0002 CONTAINERID=container_1506588244498_0002_01_000001
17/09/28 11:46:12 INFO container.ContainerImpl: Container container_1506588244498_0002_01_000001 transitioned from LOCALIZATION_FAILED to DONE
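The key failure above is the FileNotFoundException: the NodeManager is trying to localize __spark_libs__1914925494756523748.zip from a file:/tmp/... path, which only exists on the host that submitted the job, so any other node fails. One common workaround (a sketch, not confirmed as the fix for this cluster; the HDFS path and host names here are hypothetical, and $SPARK_HOME is assumed to point at the spark-2.2.0-bin-hadoop2.7 install) is to stage the Spark jars on HDFS once and set spark.yarn.archive, so containers localize them from shared storage instead of the driver's local /tmp:

```shell
# Package the Spark jars once (uncompressed, as the Spark docs suggest)
# and upload the archive to HDFS so every NodeManager can fetch it.
jar cv0f spark-libs.jar -C "$SPARK_HOME/jars/" .
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put spark-libs.jar /spark/jars/

# Then in $SPARK_HOME/conf/spark-defaults.conf on the submitting host:
#   spark.yarn.archive  hdfs:///spark/jars/spark-libs.jar
```

With spark.yarn.archive set, spark-submit should log that it is using the HDFS archive instead of "Uploading libraries under SPARK_HOME".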
Created 10-02-2017 08:38 AM
Hi @ilia kheifets,
Can you please verify that the UID and GIDs in /etc/passwd for the user "usertest1" are consistent and available across the cluster (in the case of local authentication)?
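A quick way to compare (the host names below are placeholders for your cluster nodes):

```shell
# Print the numeric UID/GID mapping for usertest1 on each node;
# the output lines must be identical cluster-wide, otherwise
# localized files can end up owned by the wrong account.
for h in node1 node2 node3; do
  printf '%s: ' "$h"
  ssh "$h" 'id usertest1'
done
```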
Apart from that, you can try running in YARN client mode, and copy core-site.xml from /etc/hadoop/conf to the Spark conf directory on all nodes in the cluster.
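Something along these lines (host names and the application jar/script are placeholders; $SPARK_HOME is assumed to point at your Spark install on each node):

```shell
# Distribute the Hadoop client config into Spark's conf directory
# on every node, so executors resolve the same defaultFS.
for h in node1 node2 node3; do
  scp /etc/hadoop/conf/core-site.xml "$h:$SPARK_HOME/conf/"
done

# Retry in client mode: the driver runs locally, so localization
# errors surface directly in the spark-submit console output.
spark-submit --master yarn --deploy-mode client your_app.py
```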
I presume something is wrong with how the libraries are shared between the nodes in the cluster.
Hope this helps!!
Created 10-09-2017 07:02 AM
The user is the same domain user, with the same UID.
core-site.xml is the same file across all nodes.