Member since: 05-26-2022
Posts: 7
Kudos Received: 0
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
| 9224 | 06-02-2022 08:47 AM
06-02-2022 08:47 AM
Hi guys! We finally solved the problem. To fix it, we moved all content from the "yarn.nodemanager.recovery.dir" path to another location (i.e. mv yarn-rm-recovery yarn-rm-recovery-backup), created yarn-rm-recovery again, and granted permission to yarn:hadoop on the folder. After that, we could start the NodeManager with no error. Thanks all!
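Roughly the steps we ran on the affected node (the paths below are only illustrative; use the value of "yarn.nodemanager.recovery.dir" from your own configuration):

```bash
# Move the existing recovery-store content aside
# (path is an example, use the directory configured in yarn.nodemanager.recovery.dir):
mv /appN/yarn/yarn-rm-recovery /appN/yarn/yarn-rm-recovery-backup

# Recreate the recovery dir and hand it back to the yarn user:
mkdir /appN/yarn/yarn-rm-recovery
chown yarn:hadoop /appN/yarn/yarn-rm-recovery

# Then start the NodeManager role from Cloudera Manager again.
```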
05-27-2022 01:21 PM
@mszurap, I did what you said: I moved all content out of /appN/yarn/nm, including the directories "filecache", "nmPrivate" and "usercache", so "/appN/yarn/nm" was left with 0 directories and 0 files. Then I started the NodeManager from Cloudera Manager and got the same error; every other service on that machine starts successfully, except the NodeManager. The directory "/appN/yarn/nm" also stays empty, even after trying to start it through CM.

I realized that when I run "yarn nodemanager" as the root user, the NodeManager runs with no error (with some parameters different from the CM start command), but Cloudera Manager doesn't recognize the node. When I start it through CM, the command has the same parameters as on the NodeManagers that are fine. Maybe it is something with the yarn user?

About the RM StateStore, I didn't find any information about that. The link you sent says to run "-format-state-store" only if the ResourceManager is not running, and in our cluster it is running fine and recognizing 3 of the 5 nodes. Unfortunately, it is a production cluster, so I don't think I can stop the whole of YARN. Do you have any suggestion?
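For reference, emptying the local dirs looked roughly like this (the backup location is just an example):

```bash
# Move the NodeManager local-dir content aside (backup path is illustrative):
mkdir -p /appN/yarn/nm-backup
mv /appN/yarn/nm/filecache /appN/yarn/nm/nmPrivate /appN/yarn/nm/usercache /appN/yarn/nm-backup/

# The (now empty) local dir must still be owned by the yarn user:
chown yarn:hadoop /appN/yarn/nm
ls -la /appN/yarn/nm
```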
05-27-2022 07:43 AM
Hi @mszurap, yes, all NodeManagers are configured the same way; I checked it in Cloudera Manager, setting by setting. The NodeManager Local Directories setting is also set and is readable/writable by the "yarn" user. I checked the same path on the other nodes and the permissions were right; the only difference is that on the machines working fine, the "nmPrivate" dir was updated minutes ago, while on node 03 (with issues) the last update was May 9 at 17:31 (roughly the time we noticed the node had shut down). The folders aren't empty.

About the NullPointerException, it's curious: we started the NodeManager with DEBUG and TRACE logging enabled, and in DEBUG we found the following, right before the WARN about the NPE:

2022-05-26 19:51:17,963 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DeletionTask: Running DeletionTask : FileDeletionTask : id : 2543016 user : null subDir : null baseDir : null
2022-05-26 19:51:17,963 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DeletionTask: NM deleting absolute path : null
2022-05-26 19:51:17,964 DEBUG org.apache.hadoop.util.concurrent.ExecutorHelper: afterExecute in thread: DeletionService #0, runnable type: java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask
2022-05-26 19:51:17,964 WARN org.apache.hadoop.util.concurrent.ExecutorHelper: Execution exception when running task in DeletionService #0
2022-05-26 19:51:17,965 WARN org.apache.hadoop.util.concurrent.ExecutorHelper: Caught exception in thread DeletionService #0:
java.lang.NullPointerException: path cannot be null
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
at org.apache.hadoop.fs.FileContext.fixRelativePart(FileContext.java:270)
at org.apache.hadoop.fs.FileContext.delete(FileContext.java:768)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.FileDeletionTask.run(FileDeletionTask.java:109)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

It says the user is "null", subDir is "null" and baseDir is "null", but I don't know where YARN goes to find that user, subDir and baseDir. Any idea? Thanks!
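One guess on my side: maybe these deletion tasks are being replayed from the NodeManager recovery store on startup. If that is the case, I would look at the directory configured in "yarn.nodemanager.recovery.dir", something like this (the path is only an example):

```bash
# Example only: point this at your yarn.nodemanager.recovery.dir value
RECOVERY_DIR=/appN/yarn/yarn-rm-recovery

# The directory should exist and be owned by the yarn user:
ls -la "$RECOVERY_DIR"
```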
05-26-2022 11:41 AM
I've tried to start the NodeManager with DEBUG logging enabled and got the log attached; that is the only WARN we have in the log.
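For the record, this is roughly how I ran it from the shell with DEBUG logging (just a sketch; the CM-managed start uses its own command line):

```bash
# Run the NodeManager in the foreground with DEBUG logging on the console.
# Ideally run this as the yarn user so permissions match the CM-managed process.
export HADOOP_ROOT_LOGGER="DEBUG,console"
yarn nodemanager
```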
05-26-2022 11:24 AM
On those 2 nodes (03 and 04), I stopped the NodeManager because it was consuming 100% of CPU, and I didn't want it to cause any more incidents on the cluster.
05-26-2022 11:20 AM
Kind of, Elias! Most of the time it is healthy, but now I've seen there is an NTP clock integrity problem on 1 of the 9 machines. The NodeManager problem affected hosts SPAPCRK03 and SPAPCRK04; SPAPCRK03 is healthy, and SPAPCRK04 has an NTP clock alert. I ran "ntpdc -np" and got the output attached, but I don't know if that NTP problem is the reason the NodeManager doesn't start.
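The checks I ran on SPAPCRK04 were along these lines (ntpq is an alternative if ntpdc is not available on the host):

```bash
# Show peers, offsets and jitter as seen by the local ntpd:
ntpdc -np
# Roughly equivalent check with ntpq, depending on the ntp version installed:
ntpq -pn
```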
05-26-2022 09:39 AM
Hello folks! I have a cluster with 9 machines running CDH 6.2 (on premise): 3 masters, 1 edge and 5 workers. I am not able to bring the NodeManager up on 2 of the 5 workers. 3 of them are OK, and the other 2 give me the following log (attached): no error, but a WARN with a "NullPointerException". When I start the NodeManager, it doesn't fail in Cloudera Manager, but I get two alerts:
- NodeManager cannot connect to ResourceManager
- ResourceManager could not connect to the Web Server of the NodeManager
Also, I can't access the /jmx of the server, and when I start the NodeManager through Cloudera Manager, CPU usage goes to 100%. On those 2 workers, the RegionServer and DataNode work fine; the problem is only with the NodeManager. Any suggestions, please?
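For the /jmx check I am doing something like the following (8042 is the default NodeManager webapp port; ours may differ):

```bash
# Query the NodeManager JMX endpoint on one of the affected workers:
curl -s http://SPAPCRK03:8042/jmx | head
```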
Labels:
- Apache Hadoop
- Apache YARN
- Cloudera Manager