Member since: 07-27-2015
Posts: 35
Kudos Received: 2
Solutions: 1

My Accepted Solutions

Title | Views | Posted
---|---|---
 | 24751 | 04-06-2018 01:05 AM
08-12-2015
10:40 PM
I have a 2-node cluster, with each node having 8GB RAM and 4 cores. On both nodes there are apps already running that have consumed 2 cores each, which leaves me with 2 free cores per node (4 total). Memory used is 4GB of the total 16GB available to YARN containers.

Some important properties:
- yarn.nodemanager.resource.memory-mb = 20GB (overcommitted, as I see)
- yarn.scheduler.minimum-allocation-mb = 1GB
- yarn.scheduler.maximum-allocation-mb = 5.47GB
- yarn.nodemanager.resource.cpu-vcores = 12
- yarn.scheduler.minimum-allocation-vcores = 1
- yarn.scheduler.maximum-allocation-vcores = 12

I am using the Fair Scheduler. With the above settings, when I spark-submit, the app remains in the ACCEPTED state. Here is what I am requesting:
- spark.driver.memory = 2G
- spark.master = yarn-client
- spark.executor.memory = 1G
- num-executors = 2
- executor-memory = 1G
- executor-cores = 1

As I see it, I am requesting a total of 3 cores (1 for the driver, by default, and 1 x 2 for the executors). A single node does not have 3 free cores, but it does have 2, so ideally the containers should be distributed across the 2 nodes. I am not sure why the Spark job remains in the ACCEPTED state; my default queue shows only 25% usage. I also notice the following settings for my root.default queue:
- Used Capacity: 25.0%
- Used Resources: <memory:4096, vCores:4>
- Num Schedulable Applications: 2
- Num Non-Schedulable Applications: 1
- Num Containers: 4
- Max Schedulable Applications: 2
- Max Schedulable Applications Per User: 2

Why do I only get 4 containers in total? Or does this indicate the currently used containers (which in my case is 4)? Also, why is Max Schedulable Applications only 2? I have not set any user-level or queue-level limits under the Dynamic Resource Pool settings.
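For reference, a spark-submit invocation matching the resources above would look roughly like the following sketch (the main class and jar path are placeholders, not part of the original submission):

```bash
# A minimal sketch; the jar path and main class are placeholders,
# while the resource flags mirror the values described in the post.
spark-submit \
  --master yarn-client \
  --driver-memory 2g \
  --num-executors 2 \
  --executor-memory 1g \
  --executor-cores 1 \
  --class com.example.MyApp \
  /path/to/app.jar
```

If I understand yarn-client mode correctly, the driver runs on the submitting machine, so on the YARN side the request amounts to the ApplicationMaster container plus the two executor containers.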
Labels:
- Apache Spark
- Apache YARN
08-10-2015
09:35 AM
I also think we could probably compress the binaries before they are copied to HDFS and have YARN uncompress them somehow?
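A rough sketch of that idea, assuming the binaries live under ./app-bin locally and /apps/myapp is a placeholder HDFS directory:

```bash
# Compress the binaries once on the client side before uploading.
tar czf myapp-bin.tar.gz -C ./app-bin .

# Placeholder HDFS location for the application binaries.
hdfs dfs -mkdir -p /apps/myapp
hdfs dfs -put -f myapp-bin.tar.gz /apps/myapp/
```

If the archive is then registered in the application submission context as a LocalResource of type ARCHIVE, YARN's resource localization should unpack it on each NodeManager, so nothing has to be decompressed by hand.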
08-07-2015
08:16 PM
Are there any recommendations for speeding up the deployment of app binaries to YARN? I've been using the RM REST APIs to submit apps, with the binaries located on HDFS. This tends to take a lot of time when the binaries to be deployed as a YARN app are large (say, >500MB or more), and also when the number of containers I need is high. I could probably speed this up by:
1. Turning off the default 3 replicas on HDFS
2. Using the HDFS cluster-wide cache, which can help avoid disk block reads
3. Using YARN resource localization
Do you have any recommendations that are definitely known to speed this up? Thanks, Sumit
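For what it's worth, here is how I picture options 1 and 2 with standard HDFS commands (the /apps/myapp path, file name, and cache pool name are placeholders):

```bash
# Option 1: write the binaries with a single replica so the upload
# does not wait on the full 3-way replication pipeline.
hdfs dfs -D dfs.replication=1 -put -f myapp-bin.tar.gz /apps/myapp/

# Option 2: pin the binaries into the HDFS centralized cache so the
# DataNodes can serve the blocks from memory during localization.
hdfs cacheadmin -addPool myapp-pool
hdfs cacheadmin -addDirective -path /apps/myapp -pool myapp-pool
```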
Labels:
- Apache YARN
- HDFS
07-29-2015
09:41 PM
Does the FairScheduler take only memory into consideration when making a decision, or does it also use vcores? If the decision can depend on multiple factors, then this may be another CR wherein the user can find out the exact reason (possibly through an API call) why an app is in the ACCEPTED state (such as memory, cores, disk space, queue limits, etc.).
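As a sketch of what can be queried today (the RM host and application id below are placeholders): the ResourceManager's per-application REST resource exposes a diagnostics field, which sometimes hints at why an app is still sitting in ACCEPTED.

```bash
# rm-host and the application id are placeholders; look at the
# "diagnostics" field in the returned JSON.
curl -s "http://rm-host:8088/ws/v1/cluster/apps/application_1234567890123_0001"
```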
07-29-2015
08:48 PM
You beat me to the answer 🙂 Yes, I figured this has to be set in the NodeManager Advanced Configuration Snippet (Safety Valve) for yarn-site.xml. Thanks!
07-29-2015
08:43 PM
Thanks, Wilfred. I'd agree about not setting it to false; that's my thinking too. The main reason to use that setting is to be able to do some functional testing without getting into tuning just yet. So, is there a way I can set this property through the UI?
07-28-2015
10:32 PM
On point 1, I think I am getting hit by https://issues.apache.org/jira/browse/YARN-3103
07-28-2015
06:08 AM
HBase on YARN. On a side note:
1. Is there a reason for the security token to just fail like that after 15 minutes of trying, or do I have a setup problem? That seems to be what caused the first attempt to be killed.
2. The last line about the null container - I see it often. Is that a bug, and can it be ignored?
Thanks, Sumit
07-28-2015
03:55 AM
1 Kudo
Hi Harsh - You are right, there is a prior attempt which got killed. Here are some log snippets, as you asked.

Attempt 1 - app becomes RUNNING:

2015-07-24 14:20:40,980 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000001 State change from LAUNCHED to RUNNING
2015-07-24 14:20:40,981 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1437726395811_0010 State change from ACCEPTED to RUNNING

Some hrs later the tokens are renewed (900000ms):

2015-07-25 13:56:35,841 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: Rolling master-key for container-tokens
2015-07-25 13:56:35,841 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: Going to activate master-key with key-id 1834122077 in 900000ms
2015-07-25 13:56:35,841 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Rolling master-key for nm-tokens
2015-07-25 13:56:35,842 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Going to activate master-key with key-id 516071750 in 900000ms

The following 2 log lines keep repeating for the next 900000ms, filling up the logs:

2015-07-25 13:56:35,920 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1437726395811_0010_000001
2015-07-25 13:56:35,920 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1437726395811_0010_000001
...
2015-07-25 14:11:35,772 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1437726395811_0010_000001
2015-07-25 14:11:35,772 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1437726395811_0010_000001

This fails (not sure why) and leads to app termination:

2015-07-25 14:11:35,877 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8030: readAndProcess from client 10.65.144.85 threw exception [org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid AMRMToken from appattempt_1437726395811_0010_000001]
2015-07-25 14:11:36,888 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1437726395811_0010_01_000001 Container Transitioned from RUNNING to COMPLETED

1st attempt done (RUNNING --> ACCEPTED):

2015-07-25 14:11:36,888 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000001 State change from RUNNING to FINAL_SAVING
2015-07-25 14:11:36,889 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000001 State change from FINAL_SAVING to FAILED
2015-07-25 14:11:36,890 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1437726395811_0010 State change from RUNNING to ACCEPTED

2nd attempt starts:

2015-07-25 14:11:36,890 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1437726395811_0010_000002 to scheduler from user: root
2015-07-25 14:11:36,891 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1437726395811_0010_000002 State change from SUBMITTED to SCHEDULED

Not sure what this Null container indicates:

2015-07-25 14:11:36,910 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
07-28-2015
12:52 AM
1 Kudo
I notice in the RM logs that my application sometimes transitions back from the RUNNING state to the ACCEPTED state. Under what conditions would this happen? I thought this usually happens when the RM or AM dies and the applications are recovered; such apps would transition from RUNNING --> ACCEPTED. Is that correct? However, in my case both RM and NM recovery are disabled:
yarn.resourcemanager.recovery.enabled = false
yarn.nodemanager.recovery.enabled = false
Thanks, Sumit
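One way I can think of to check this, sketched with a placeholder RM host: the appattempts REST resource lists every attempt the RM has created for an application, so a second attempt appearing next to a failed first one would explain the RUNNING --> ACCEPTED transition without any RM/NM recovery being involved.

```bash
# rm-host is a placeholder; the app id is taken from the RM log
# snippets elsewhere in this thread. Each attempt the RM started is
# listed, with its start time and AM container id.
curl -s "http://rm-host:8088/ws/v1/cluster/apps/application_1437726395811_0010/appattempts"
```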
Labels:
- Apache YARN