
Spark on Yarn job runs a couple of times


New Contributor

I run a job with the following statement:

 

spark-submit --class com.decisioniq.spark.SparkTrainCSVParser --master yarn-client --driver-memory 64M --num-executors 1 data-integration-1.0-SNAPSHOT-jar-with-dependencies.jar test_FD001.csv 12

 

This job runs several times (sometimes twice, other times three times), but then it hangs, repeating:

15/02/12 22:02:18 INFO Client: Application report for application_1423792519222_0005 (state: ACCEPTED)
15/02/12 22:02:19 INFO Client: Application report for application_1423792519222_0005 (state: ACCEPTED)

 

for a long time, never moving on. Is my job polluting the memory pool that I have available? If so, how do I clean it up?

 

12 Replies

Re: Spark on Yarn job runs a couple of times

Master Collaborator

What does the YARN Resource Manager show you? I suspect it will show that the cluster isn't allocating the resources you ask for. That's strange, since you should be asking for the default of 1 core and 512 MB, which is tiny. By the way, 64M is way too small for the driver. I don't see evidence that this is the error, but it sounds like a problem.
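
As an aside (these are standard YARN CLI commands, not from the original thread; the application ID is simply the one from the log above), the same information is available without the web UI:

yarn application -list -appStates ACCEPTED,RUNNING
yarn application -status application_1423792519222_0005

The -status report includes a Diagnostics field that typically explains why an application is stuck in ACCEPTED (for example, that no node has enough free memory for the AM container).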

Re: Spark on Yarn job runs a couple of times

New Contributor

I suspect you are right that my problem is the amount of resources I have available, but I want to find evidence of that. I am struggling to find where the system is telling me that it is running out of resources.

 

The interesting thing about my problem is that if I wait long enough, whatever was tying up resources goes away and I am able to run the job once again. Anyway, I just want to find where the system is telling me that it needs more memory.

 

Container Memory (yarn.nodemanager.resource.memory-mb): 1.95 GiB

Container Memory Increment (yarn.scheduler.increment-allocation-mb): 64 MiB

 

Here is the command I used.

spark-submit --class com.decisioniq.spark.SparkTrainCSVParser --master yarn-client --num-executors 1 --executor-memory 724M --driver-memory 128M data-integration-1.0-SNAPSHOT-cleanCache.jar test_FD001_small.csv 12

 

The first time I ran the command above, it worked.

 

The second time, the job stalls:

 

Here are the last couple of lines from the Spark YARN client where it stalls. It looks like I have about 375 MB of free memory left.

15/02/14 09:11:31 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on diq-demo:54267 (size: 2.2 KB, free: 375.0 MB)
15/02/14 09:11:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on diq-demo:54267 (size: 20.6 KB, free: 375.0 MB)
15/02/14 09:11:35 INFO BlockManagerInfo: Added rdd_1_0 in memory on diq-demo:54267 (size: 246.0 KB, free: 374.8 MB)

 

The ResourceManager has these statements at the end.

2015-02-14 09:11:17,408 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM registration appattempt_1423792519222_0022_000001
2015-02-14 09:11:17,408 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=geo	IP=104.131.184.19	OPERATION=Register App Master	TARGET=ApplicationMasterService	RESULT=SUCCESS	APPID=application_1423792519222_0022	APPATTEMPTID=appattempt_1423792519222_0022_000001
2015-02-14 09:11:17,409 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1423792519222_0022_000001 State change from LAUNCHED to RUNNING
2015-02-14 09:11:17,409 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1423792519222_0022 State change from ACCEPTED to RUNNING
2015-02-14 09:11:21,526 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1423792519222_0022_01_000002 Container Transitioned from NEW to ALLOCATED
2015-02-14 09:11:21,527 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=geo	OPERATION=AM Allocated Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1423792519222_0022	CONTAINERID=container_1423792519222_0022_01_000002
2015-02-14 09:11:21,527 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1423792519222_0022_01_000002 of capacity <memory:1152, vCores:1> on host diq-demo:8041, which has 2 containers, <memory:1664, vCores:2> used and <memory:333, vCores:2> available after allocation
2015-02-14 09:11:22,568 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : diq-demo:8041 for container : container_1423792519222_0022_01_000002
2015-02-14 09:11:22,569 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1423792519222_0022_01_000002 Container Transitioned from ALLOCATED to ACQUIRED
2015-02-14 09:11:23,617 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1423792519222_0022_01_000002 Container Transitioned from ACQUIRED to RUNNING
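
For what it's worth, the numbers in that last RM line add up if one assumes the stock Spark 1.x executor memory overhead of 384 MB (an assumption; the thread never states the setting):

724 MB (--executor-memory) + 384 MB overhead = 1108 MB, rounded up to the 64 MiB increment = 1152 MB, the executor container in the log
1152 MB + 512 MB for the AM container (see the NodeManager log below) = 1664 MB used
1.95 GiB total (~1997 MB) - 1664 MB = 333 MB available

That leftover 333 MB is too small for another AM or executor container, so a second concurrent submission would sit in ACCEPTED.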

 

 

The NodeManager keeps spewing these two statements over and over. 

2015-02-14 09:26:28,498 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 17031 for container-id container_1423792519222_0022_01_000002: 386.3 MB of 1.1 GB physical memory used; 1.4 GB of 2.4 GB virtual memory used
2015-02-14 09:26:28,529 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 16946 for container-id container_1423792519222_0022_01_000001: 165.4 MB of 512 MB physical memory used; 787.4 MB of 1.0 GB virtual memory used

 

Container Logs:

container_1423792519222_0022_01_000002 - Container doing the work

15/02/14 09:11:35 INFO storage.MemoryStore: ensureFreeSpace(251911) called with curMem=378017, maxMem=393247457
15/02/14 09:11:35 INFO storage.MemoryStore: Block rdd_1_0 stored as bytes in memory (estimated size 246.0 KB, free 374.4 MB)
15/02/14 09:11:35 INFO storage.BlockManagerMaster: Updated info of block rdd_1_0

 

container_1423792519222_0022_01_000001 - Application Master

15/02/14 09:11:23 INFO yarn.ExecutorRunnable: Setting up executor with commands: List(LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/lib/native:$LD_LIBRARY_PATH", $JAVA_HOME/bin/java, -server, -XX:OnOutOfMemoryError='kill %p', -Xms724m -Xmx724m , -Djava.io.tmpdir=$PWD/tmp, -Dspark.yarn.app.container.log.dir=<LOG_DIR>, org.apache.spark.executor.CoarseGrainedExecutorBackend, akka.tcp://sparkDriver@diq-demo:58389/user/CoarseGrainedScheduler, 1, diq-demo, 1, application_1423792519222_0022, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr)
15/02/14 09:11:23 INFO impl.ContainerManagementProtocolProxy: Opening proxy : diq-demo:8041

 

What I would like to find here are log statements that help me determine that my memory allocation is unbalanced.

 

 

Re: Spark on Yarn job runs a couple of times

Master Collaborator

At a glance that looks reasonable. It even looks like you have allocated the YARN containers, but can you confirm that? You can see this in the Resource Manager. If so, then you're running, and what happens from there is up to your job.

Re: Spark on Yarn job runs a couple of times

New Contributor

How can I confirm that I have allocated YARN containers? I thought I did with the logs from the Resource Manager.

 

What I am confused about is why my job runs sometimes and then, under the same configuration, it will not run later. I recently updated my Spark job to call .unpersist() on the RDD to remove it from memory, but it seems like something in the system is cluttering memory for the next runs.
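
For illustration, here is a minimal Scala sketch of that pattern (the real SparkTrainCSVParser internals never appear in this thread, so everything inside main is hypothetical), including the sc.stop() call that actually ends the YARN application:

import org.apache.spark.{SparkConf, SparkContext}

object SparkTrainCSVParser {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkTrainCSVParser"))
    try {
      val rows = sc.textFile(args(0)).cache() // cache so repeated passes reuse the parsed rows
      println(rows.count())                   // placeholder for the actual parsing/training work
      rows.unpersist()                        // releases this RDD's executor storage memory
    } finally {
      sc.stop()                               // ends the YARN application and frees its containers
    }
  }
}

If sc.stop() is never reached, the application and its containers can outlive the useful work, which is the "old jobs still running" scenario raised at the end of this thread.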

Re: Spark on Yarn job runs a couple of times

Master Collaborator

Are you looking at the Resource Manager? There is no need to dig around in logs. You really need to know whether you have executors or not. It sounds like your cluster lacks the resources to schedule your job.
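
(On a stock install, the Resource Manager web UI is usually at http://<resourcemanager-host>:8088/cluster, and a running Spark driver serves its own UI on port 4040, as mentioned below; both ports are the defaults and may differ per cluster.)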

Re: Spark on Yarn job runs a couple of times

New Contributor

I had executors that I could see in the Spark UI (:4040). I think there are additional memory resources (from the YARN infrastructure) that, once used during the first run, are not returned to the "free" memory pool. I just don't know where these are or how to get the system to tell me where the memory resource mismatch is. I made sure that I was not overcommitting memory on the host using Cloudera Manager.

 

What I have done instead, rather than counting memory bytes (which is what I am trying to do), is run the jobs in '--master local' mode. This gives me predictability, and I can rerun the job many times (I ran it 10 times); the resources are released back to me, which I could see from the 'top' command.

 

This command allows me to run it multiple times. 

spark-submit --class com.decisioniq.spark.SparkTrainCSVParser --master local --deploy-mode client --driver-memory 128M --executor-memory 724M  data-integration-1.0-SNAPSHOT-cleanCache.jar test_FD001.csv 12
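
One caveat (general Spark behavior, not something discussed in the thread): with --master local everything runs inside a single driver JVM, so --executor-memory has no effect here and only --driver-memory sizes the process. Local mode also bypasses YARN scheduling entirely, which is likely why these runs are repeatable.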

 

I'll use YARN when I can allocate the appropriate resources to give me stability.

Re: Spark on Yarn job runs a couple of times

Master Collaborator

The reason I say you should look at the resource manager is that it will tell you exactly what is running and what resources are available. But you are showing that you have allocated executors on YARN, right? So that's not the problem. 

 

There is no way you can 'leak' memory here. That's not an issue.

Re: Spark on Yarn job runs a couple of times

New Contributor

Yes, I am showing resource allocation (workers) in the Spark UI, and the Resource Manager picture was attached earlier in this thread for the stalled job.

 

I don't mean to imply a memory leak in YARN, but what I do know is that the YARN resources I use during the first run of this job keep items in memory even after the run has completed; otherwise I would be able to run these jobs over and over again, just like I can in master=local mode. This may be my misunderstanding about how YARN resources are managed across back-to-back application launches. The most I have seen this job run is twice before the stall occurs, at which point I can restart the cluster or the host and get another run in.

 

Looking at the picture of what my Resource Manager looked like during the YARN-managed run, it shows I am using 83.3% of resources (so clearly there are some resources left), but maybe the remaining memory (1.95 GB - 1.63 GB = ~320 MB) is insufficient for what the job requires on this second run, so it stalls waiting.

 

The reason I think the Resource Manager is insufficient to help me answer my question is that I see available resources (only 83.3% used, 320 MB available), but my job does not complete the run.

Re: Spark on Yarn job runs a couple of times

Master Collaborator

You mean executors are still running from old jobs? Are you sure the old jobs are shut down? They don't seem to be. That would explain why eventually you have no resources available. 320 MB is not enough to run one executor with the default memory request.
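
Putting numbers on that last point (using the 512 MB default quoted earlier in the thread and, as an assumption, the usual Spark 1.x minimum overhead of 384 MB):

default executor container: 512 MB + 384 MB overhead = 896 MB requested
memory left on the node: roughly 320-333 MB

896 MB is well above what remains, so the request can never be scheduled and the application waits in ACCEPTED indefinitely, which matches the hang reported at the top of the thread.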