Created 01-13-2019 06:03 PM
We have a Hadoop cluster with DataNode machines.
We noticed that the CPU load average on the DataNode machines is high:
uptime
17:27:46 up 263 days, 3:39, 3 users, load average: 7.94, 6.66, 7.38
After a short check we noticed many files that were deleted but are still held open (from lsof), for example:
java 193699 yarn 1082r REG 8,16 293715 0 93588652 /grid/sdb/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr35/block_1186014185 (deleted)
java 193699 yarn 1191r REG 8,80 292993 0 88474445 /grid/sdf/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr35/block_1186014091 (deleted)
java 193699 yarn 1205r REG 8,16 2303 0 93588671 /grid/sdb/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr35/block_1186014185_112276263.meta (deleted)
java 193699 yarn 1265r REG 8,32 23931 0 25962378 /grid/sdc/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr36/block_1186014275 (deleted)
java 193699 yarn 1273r REG 8,32 195 0 25962397 /grid/sdc/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr36/block_1186014275_112276353.meta (deleted)
java 193699 yarn 1307r REG 8,48 66713 0 61461179 /grid/sdd/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr36/block_1186014410 (deleted)
java 193699 yarn 1385r REG 8,48 531 0 61461193 /grid/sdd/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr36/block_1186014410_112276488.meta (deleted)
java 193699 yarn 1477r REG 8,80 2299 0 88474446 /grid/sdf/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr35/block_1186014091_112276169.meta (deleted)
java 193699 yarn 1754r REG 8,16 91051 0 93696129 /grid/sdb/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr37/block_1186014689 (deleted)
java 193699 yarn 1760r REG 8,16 719 0 93696130 /grid/sdb/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr37/block_1186014689_112276769.meta (deleted)
java 193699 yarn 1972r REG 8,48 37960 0 61447490 /grid/sdd/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr39/block_1186015148 (deleted)
java 193699 yarn 1976r REG 8,48 307 0 61447491 /grid/sdd/hadoop/hdfs/data/TY/HK-428352611-43.21.3.46-1502127526112/TY/finalized/patr15/patr39/block_1186015148_112277228.meta (deleted)
To print only the PIDs that hold deleted files:
lsof +L1 | awk '{print $2}' | sort | uniq
12588
138025
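The same PID extraction can be sketched in Python, which is easier to extend (e.g. to filter on the "(deleted)" suffix before collecting PIDs). The sample lsof lines below are illustrative, not taken from a real system:

```python
# Sketch: extract the unique PIDs (second column) from `lsof +L1`-style
# output, keeping only entries marked "(deleted)". Sample data is made up.
sample = """\
java 193699 yarn 1082r REG 8,16 293715 0 93588652 /grid/sdb/some/block (deleted)
java 12588 hdfs 10r REG 8,16 100 0 123 /grid/sdc/some/block (deleted)
java 12588 hdfs 11r REG 8,16 200 0 124 /grid/sdd/some/block (deleted)
java 99999 hdfs 12r REG 8,16 300 0 125 /grid/sdd/still/present/block
"""

def unique_deleted_pids(lsof_output: str) -> list[str]:
    pids = {
        line.split()[1]
        for line in lsof_output.splitlines()
        if line.strip().endswith("(deleted)")
    }
    return sorted(pids)

print(unique_deleted_pids(sample))  # ['12588', '193699']
```

In practice you would feed it the real output, e.g. `subprocess.run(["lsof", "+L1"], capture_output=True, text=True).stdout` (skipping the header line).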
Since none of the files above exist on disk any more, e.g.
/grid/sdd/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir39/blk_1186015148_112277228.meta
we killed each of those PIDs, as in:
kill 12588
and so on.
After we killed all the PIDs, the CPU load average decreased as follows:
uptime
17:27:46 up 263 days, 3:39, 3 users, load average: 2.24, 4.61, 5.75
What causes a process to keep running and holding these files even though they were deleted?
And is it OK to kill such a process with:
kill PID
Created 01-13-2019 11:44 PM
lsof is used to list deleted files that still occupy disk space because of open file descriptors. The space is not immediately freed because a running process still holds an open file handle to the just-deleted file. After all, if a process is still using a file, you probably don't want the kernel to take the file away from under it.
On Linux, a filename is just a hard link: a directory entry pointing to an inode, which describes where the file's data resides (a hard drive, or even a RAM-backed filesystem). Each inode records the number of references to it: references can be filenames (there may be several pointing to the same inode), and in addition, every time the file is opened, the process holds another reference to the same inode.
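The hard-link side of this can be demonstrated with a short Python sketch (file names here are illustrative; it uses a throwaway temporary directory):

```python
# Sketch: a filename is a hard link; the inode tracks the link count
# (st_nlink). Creating a second name raises it; unlinking one lowers it.
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "block_data")     # hypothetical name
    with open(original, "w") as f:
        f.write("payload")
    links_initial = os.stat(original).st_nlink   # 1: a single directory entry

    alias = os.path.join(d, "block_alias")       # hypothetical second name
    os.link(original, alias)                     # second hard link, same inode
    links_after_link = os.stat(original).st_nlink  # 2: two names, one inode

    os.unlink(original)                          # drop one name...
    links_after_unlink = os.stat(alias).st_nlink   # 1: data survives via alias

print(links_initial, links_after_link, links_after_unlink)  # 1 2 1
```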
Space/memory is physically freed only when no references to the deleted file remain. So, while processes still have the files open, you shouldn't expect to get the space/memory back: it isn't free, it's being actively used. This is also one of the reasons applications should close files as soon as they finish using them.
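The open-descriptor side is easy to see as well: deleting an open file removes the name, but the process can keep reading through its descriptor until it closes it (or is killed). A minimal sketch:

```python
# Sketch: unlink() removes the filename, not the data; an open file
# descriptor keeps the inode alive until close() releases it.
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"block contents")

os.unlink(path)                      # the filename is gone...
assert not os.path.exists(path)      # ...no longer visible in the filesystem

os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 1024)             # ...but the data is still readable
print(data)                          # b'block contents'

os.close(fd)                         # only now can the kernel free the space
```

This is exactly the state lsof reports as "(deleted)": the kill worked because ending the process closed its descriptors, dropping the last references to those blocks.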
HTH
Created 01-14-2019 06:41 AM
@Geoffrey Shelton Okot, do you mean that we need to check the RAM on our DataNode machines? Each machine has 256G of memory, of which 198G is available. Or did you want us to check something else?