We have a 4-node Hadoop cluster: 2 master nodes and 2 data nodes. After some time our data nodes start failing, and when we check the log section it always says "cannot allocate memory".
HDP version: 2.3.6
HAWQ version: 2.0.0
Linux OS: CentOS 6.0
We are getting the following error; the data nodes crash with these logs:
os::commit_memory(0x00007fec816ac000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
vm.overcommit_memory is set to 2
DataNode heap size: 2 GB
NameNode heap size: 2 GB
MemTotal: 30946088 kB
MemFree: 11252496 kB
Buffers: 496376 kB
Cached: 11938144 kB
SwapCached: 0 kB
Active: 15023232 kB
Inactive: 3116316 kB
Active(anon): 5709860 kB
Inactive(anon): 394092 kB
Active(file): 9313372 kB
Inactive(file): 2722224 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 15728636 kB
SwapFree: 15728636 kB
Dirty: 280 kB
Writeback: 0 kB
AnonPages: 5705052 kB
Mapped: 461876 kB
Shmem: 398936 kB
Slab: 803936 kB
SReclaimable: 692240 kB
SUnreclaim: 111696 kB
KernelStack: 33520 kB
PageTables: 342840 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 31201680 kB
Committed_AS: 26896520 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 73516 kB
VmallocChunk: 34359538628 kB
HardwareCorrupted: 0 kB
AnonHugePages: 2887680 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 6132 kB
DirectMap2M: 2091008 kB
DirectMap1G: 29360128 kB
Looks like the datanodes might be crashing because of the following setting: vm.overcommit_memory = 2
- Suggestion: This memory-related crash appears to be caused by an OS setting. The kernel's memory overcommit setting (vm.overcommit_memory) is set to 2, whereas it should have been set to 0. You can change it as follows:

echo 0 > /proc/sys/vm/overcommit_memory
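A minimal sketch of how to inspect the current setting, and (as root) how to make the change survive a reboot via the standard sysctl mechanism:

```shell
# Inspect the current overcommit policy (prints 0, 1, or 2)
cat /proc/sys/vm/overcommit_memory

# Inspect the current overcommit ratio (used only when the policy is 2)
cat /proc/sys/vm/overcommit_ratio

# As root, the echo shown above takes effect immediately. To also persist
# the value across reboots, append it to /etc/sysctl.conf and reload:
#   echo "vm.overcommit_memory = 0" >> /etc/sysctl.conf
#   sysctl -p
```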
Background on this setting:
Please refer to the following doc to know more about it: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Gui...
overcommit_memory — defines the conditions that determine whether a large memory request is accepted or denied. There are three possible values for this parameter:
- 0 — The default setting. The kernel performs heuristic memory overcommit handling by estimating the amount of memory available and failing requests that are blatantly invalid. Unfortunately, since memory is allocated using a heuristic rather than a precise algorithm, this setting can sometimes allow available memory on the system to be overloaded.
- 1 — The kernel performs no memory overcommit handling. Under this setting, the potential for memory overload is increased, but so is performance for memory-intensive tasks.
- 2 — The kernel denies requests for memory equal to or larger than the sum of total available swap and the percentage of physical RAM specified in overcommit_ratio. This setting is best if you want a lesser risk of memory overcommitment.
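For value 2, the limit the kernel enforces is CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100. Plugging in the numbers from the /proc/meminfo output in the question (with a 50% ratio, which is the apparent setting since it reproduces the reported CommitLimit exactly):

```shell
# Kernel formula when vm.overcommit_memory = 2:
#   CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100
# Values below are taken from the question's /proc/meminfo (in kB):
swap_total=15728636
mem_total=30946088
ratio=50   # apparent vm.overcommit_ratio on this host

echo $(( swap_total + mem_total * ratio / 100 ))   # 31201680, matching the reported CommitLimit
```

Note how close Committed_AS (26896520 kB) already is to that limit, which explains why further allocations fail with errno=12.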
Since you tagged this question with HAWQ, I'm guessing you installed HAWQ on the cluster. One likely reason this is happening is that the HAWQ Ambari install sets your datanodes (where HAWQ is installed) to use an overcommit value of 2, with a default ratio of 50%, which you are supposed to change based on your memory configuration. This ratio should ideally be 90% or more. With 50%, your services most likely couldn't use even half of the datanode RAM.
You will find this config under the HAWQ service as a slider control (if overcommit is set to 2). You can either change the overcommit value to 0 on the HAWQ segment nodes, or keep it at 2 with a ratio of 90% or higher. Update it via Ambari rather than directly at the OS level, so Ambari stays in sync. It is strongly recommended to run at least the HAWQ master node with an overcommit value of 2. You can do this by putting it on a dedicated node and creating a separate config group for the HAWQ master(s). Hope this helps.