Created 05-08-2021 12:19 AM
Hi,
I'm running a 6-node Hadoop cluster on HDP 3.1.4 and have run into a strange situation, so could anyone please help me out?
In the Ambari Web UI, the three NodeManager nodes all look fine. However, when we submit an application to YARN, every application fails.
While investigating the logs, I found that the Web UI of one of the NodeManager nodes was not responding and the following message was being logged.
WARN org.apache.hadoop.util.SysInfoLinux: Couldn't read /proc/meminfo: can't determine memory settings
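If it helps with diagnosis, these are the basic checks I plan to run on the affected node the next time it gets into this state, to confirm whether /proc/meminfo is readable from a shell and whether the kernel is reporting memory pressure (I have not captured their output yet):
$ head -n 5 /proc/meminfo
$ free -m
$ dmesg | grep -iE 'oom|memory' | tail -n 20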
Restarting the NodeManager service resolves the situation for a while, but it keeps coming back.
Has anyone faced the same problem? Any comments are highly appreciated.
Thank you in advance.
Best Regards,
Yosuke
Created 05-09-2021 03:56 AM
/proc/meminfo reports the amount of free and used memory (both physical and swap) on the system, as well as the shared memory and buffers used by the kernel. You can capture it from the faulty node like this:
$ cat /proc/meminfo >> /tmp/faultynode.txt
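Then do the same on a healthy node and diff the two files (the working-node file name here is just a placeholder):
$ cat /proc/meminfo >> /tmp/workingnode.txt
$ diff /tmp/faultynode.txt /tmp/workingnode.txt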
Since this is related to memory settings, can you compare the Transparent Huge Pages (THP) settings between the faulty and the working NodeManager/DataNode hosts?
My advice is to review the "Prepare the Environment" steps in the HDP documentation.
Here is an extract from my VirtualBox node:
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
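If THP turns out to be enabled on the faulty node, one common way to disable it at runtime on RHEL/CentOS is shown below; adjust for your OS and make it persistent (for example via /etc/rc.d/rc.local or a tuned profile) so it survives reboots:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/defrag
# cat /sys/kernel/mm/transparent_hugepage/enabled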
Please revert with your findings.
Created 05-11-2021 01:48 PM
Thank you for the response.
I haven't checked the settings yet, but we have not explicitly configured THP to be disabled. I googled and found some articles saying that THP is not recommended for Hadoop workloads.
I think you are pointing out that our NodeManager went wrong because some workload occurred that the node running the NodeManager couldn't handle. Is my understanding correct?
I will apply this setting in our test environment first. Since I don't know how this situation arose, I will need some time to verify it. I will post again with any progress or additional questions.
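As a rough plan, once the test cluster is reconfigured I will check the setting on every NodeManager host before re-submitting the jobs, along these lines (the host names are placeholders):
$ for h in nm-node1 nm-node2 nm-node3; do echo "== $h =="; ssh "$h" 'cat /sys/kernel/mm/transparent_hugepage/enabled'; done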
Thank you very much again.
Best regards,
Yosuke