Created 01-13-2019 06:03 PM
We have a Hadoop cluster with DataNode machines.
We noticed that the CPU load average on the DataNode machines is high:
uptime
17:27:46 up 263 days, 3:39, 3 users, load average: 7.94, 6.66, 7.38
After a short check we noticed many files that were deleted but are still held open (from lsof), for example:
java 193699 yarn 1082r REG 8,16 293715 0 93588652 /grid/sdb/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr35/block_1186014185 (deleted)
java 193699 yarn 1191r REG 8,80 292993 0 88474445 /grid/sdf/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr35/block_1186014091 (deleted)
java 193699 yarn 1205r REG 8,16 2303 0 93588671 /grid/sdb/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr35/block_1186014185_112276263.meta (deleted)
java 193699 yarn 1265r REG 8,32 23931 0 25962378 /grid/sdc/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr36/block_1186014275 (deleted)
java 193699 yarn 1273r REG 8,32 195 0 25962397 /grid/sdc/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr36/block_1186014275_112276353.meta (deleted)
java 193699 yarn 1307r REG 8,48 66713 0 61461179 /grid/sdd/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr36/block_1186014410 (deleted)
java 193699 yarn 1385r REG 8,48 531 0 61461193 /grid/sdd/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr36/block_1186014410_112276488.meta (deleted)
java 193699 yarn 1477r REG 8,80 2299 0 88474446 /grid/sdf/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr35/block_1186014091_112276169.meta (deleted)
java 193699 yarn 1754r REG 8,16 91051 0 93696129 /grid/sdb/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr37/block_1186014689 (deleted)
java 193699 yarn 1760r REG 8,16 719 0 93696130 /grid/sdb/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr37/block_1186014689_112276769.meta (deleted)
java 193699 yarn 1972r REG 8,48 37960 0 61447490 /grid/sdd/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr39/block_1186015148 (deleted)
java 193699 yarn 1976r REG 8,48 307 0 61447491 /grid/sdd/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr39/block_1186015148_112277228.meta (deleted)
To print only the PIDs that hold deleted files:
lsof +L1 | awk '{print $2}' | sort | uniq
12588
138025
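The same PID extraction can be sketched in Python, which is easier to extend (e.g. to filter on the "(deleted)" suffix before collecting PIDs). The sample lsof lines below are illustrative, not taken from a real system:

```python
# Sketch: extract the unique PIDs (second column) from `lsof +L1`-style
# output, keeping only entries marked "(deleted)". Sample data is made up.
sample = """\
java 193699 yarn 1082r REG 8,16 293715 0 93588652 /grid/sdb/some/block (deleted)
java 12588 hdfs 10r REG 8,16 100 0 123 /grid/sdc/some/block (deleted)
java 12588 hdfs 11r REG 8,16 200 0 124 /grid/sdd/some/block (deleted)
java 99999 hdfs 12r REG 8,16 300 0 125 /grid/sdd/still/present/block
"""

def unique_deleted_pids(lsof_output: str) -> list[str]:
    pids = {
        line.split()[1]
        for line in lsof_output.splitlines()
        if line.strip().endswith("(deleted)")
    }
    return sorted(pids)

print(unique_deleted_pids(sample))  # ['12588', '193699']
```

In practice you would feed it the real output, e.g. `subprocess.run(["lsof", "+L1"], capture_output=True, text=True).stdout` (skipping the header line).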
Since none of the files above exist on disk any more, e.g.
/grid/sdd/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir39/blk_1186015148_112277228.meta
we killed each of those PIDs, as in:
kill 12588
and so on.
After we killed all the PIDs, the CPU load average decreased as follows:
uptime
17:27:46 up 263 days, 3:39, 3 users, load average: 2.24, 4.61, 5.75
What causes a process to keep running and holding these files even though they were deleted?
And is it OK to kill such a process with:
kill PID
Created 01-13-2019 11:44 PM
lsof is used to list deleted files that still occupy disk space because of open file descriptors. The space is not immediately freed because a running process still holds an open file handle to the just-deleted file. After all, if a process is still using a file, you probably don't want the kernel to take the file away from under it.
On Linux, a filename is just a hard link: a directory entry pointing to an inode, which describes where the file's data resides (a hard drive, or even a RAM-backed filesystem). Each inode records the number of references to it: references can be filenames (there may be several pointing to the same inode), and in addition, every time the file is opened, the process holds another reference to the same inode.
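The hard-link side of this can be demonstrated with a short Python sketch (file names here are illustrative; it uses a throwaway temporary directory):

```python
# Sketch: a filename is a hard link; the inode tracks the link count
# (st_nlink). Creating a second name raises it; unlinking one lowers it.
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "block_data")     # hypothetical name
    with open(original, "w") as f:
        f.write("payload")
    links_initial = os.stat(original).st_nlink   # 1: a single directory entry

    alias = os.path.join(d, "block_alias")       # hypothetical second name
    os.link(original, alias)                     # second hard link, same inode
    links_after_link = os.stat(original).st_nlink  # 2: two names, one inode

    os.unlink(original)                          # drop one name...
    links_after_unlink = os.stat(alias).st_nlink   # 1: data survives via alias

print(links_initial, links_after_link, links_after_unlink)  # 1 2 1
```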
Space/memory is physically freed only when no references to the deleted file remain. So, while processes still have the files open, you shouldn't expect to get the space/memory back: it isn't free, it's being actively used. This is also one of the reasons applications should close files as soon as they finish using them.
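The open-descriptor side is easy to see as well: deleting an open file removes the name, but the process can keep reading through its descriptor until it closes it (or is killed). A minimal sketch:

```python
# Sketch: unlink() removes the filename, not the data; an open file
# descriptor keeps the inode alive until close() releases it.
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"block contents")

os.unlink(path)                      # the filename is gone...
assert not os.path.exists(path)      # ...no longer visible in the filesystem

os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 1024)             # ...but the data is still readable
print(data)                          # b'block contents'

os.close(fd)                         # only now can the kernel free the space
```

This is exactly the state lsof reports as "(deleted)": the kill worked because ending the process closed its descriptors, dropping the last references to those blocks.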
HTH
Created 01-14-2019 06:41 AM
@Geoffrey Shelton Okot, do you mean that we need to check the RAM on our DataNode machines? Each machine has 256G of memory, of which 198G is available. Or did you want us to check something else?