SYMPTOMS: When local disk utilization on multiple NodeManagers climbs beyond the configured limit, the nodes turn "unhealthy" and are blacklisted so they are no longer used for container/task allocation, reducing the effective cluster capacity.
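
The "limit" referred to here is the NodeManager disk health checker threshold. A minimal yarn-site.xml sketch of that property follows; the 90.0 value shown is the usual default and matches the 90% threshold mentioned below, it is not a value prescribed by this article.

<property>
  <!-- Disk utilization (%) above which an NM local directory is marked bad; 90.0 is the usual default -->
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>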

ROOT CAUSE: A burst or rapid rate of submitted jobs with a substantial NM usercache resource-localization footprint can rapidly fill the NM local temporary filesystem, with negative consequences for stability. The core issue is that the NM continues to localize resources beyond the maximum local cache size (yarn.nodemanager.localizer.cache.target-size-mb, default 10 GB). Since the maximum local cache size is effectively not taken into account when localizing new resources (cleanup runs only periodically; the default cache cleanup interval is 10 minutes, controlled by yarn.nodemanager.localizer.cache.cleanup.interval-ms), this leads to a self-destructive scenario: once filesystem utilization reaches the 90% threshold, the NM automatically de-registers from the RM, effectively leading to an NM outage.
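
For reference, both properties mentioned above are set in yarn-site.xml; the values below are simply the usual defaults (10 GB cache target, 10-minute cleanup interval) and are shown only as an illustrative sketch, not as tuning advice from this article.

<property>
  <!-- Target size of the NM local resource cache, in MB; 10240 (10 GB) is the usual default -->
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>10240</value>
</property>
<property>
  <!-- Interval between cache cleanup runs, in ms; 600000 (10 minutes) is the usual default -->
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>
</property>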

This issue can take many NMs offline simultaneously and is therefore quite critical in terms of platform stability.

SOLUTION: Provision larger and/or multiple mount points for these local directories. No consensus has been reached in the discussion on whether an HDFS filesystem could be used for these directories.
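
As a sketch of the mount-point approach, the NM local directories can be spread across several mounts via yarn.nodemanager.local-dirs in yarn-site.xml; the paths below are hypothetical examples, not values from the discussion.

<property>
  <!-- Comma-separated list of local directories used for resource localization;
       spreading them across separate mounts reduces the chance that one full disk
       marks the whole NM unhealthy -->
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data01/yarn/local,/data02/yarn/local,/data03/yarn/local</value>
</property>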

REFERENCE:

https://issues.apache.org/jira/browse/YARN-5140
