I upgraded Cloudera Manager from 5.5.3 to 5.10.0 and then upgraded CDH from 5.5.1 to 5.8.4. After these operations, I saw that the disk usage of all DataNodes on the Hosts->All Hosts page had increased. In the HDFS file browser and with CLI commands, almost every directory appears to be double its previous size, but I noticed no difference in file counts, types, names, etc. I see the same thing when I check disk usage from the Linux terminal. I am a little confused and need help figuring out what happened.
Curious to know whether reinstalling the same Cloudera Manager Server version that you were previously running solved the issue?
An update: I was mistaken on some values.
The sizes shown in the HDFS file browser and returned by hdfs dfsadmin -report are the expected values. But the Cloudera metrics and charts continue to show increasing values. The du -sch output on the dfs folders in the Linux terminal also gives large numbers. I also noticed that the increase started a couple of days before the upgrade I mentioned, so it's unlikely that something went wrong with the upgrade.
Recently another HDFS user informed us that they have been splitting large files into smaller ones to improve computing performance (??). That got me thinking: if they are splitting TBs of data into smaller files, most of them even smaller than the block size (128 MB), could that be causing usage on the file system to grow by more than 3x?
Is my estimate correct?
I guess breaking large files into smaller ones should not result in a space increase, because an HDFS block only takes up as much disk space as the data it actually holds.
If the block size is 128 MB and a file occupies only 50 MB of it, the remaining 78 MB is not reserved and stays available for other data.
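To put rough numbers on the point about block space, here is a minimal Python sketch. It assumes a replication factor of 3 (the HDFS default, not confirmed for this cluster) and uses made-up file sizes. The helper names block_count and raw_usage_mb are only for illustration, and checksum files and local filesystem overhead are ignored.

```python
import math

BLOCK_SIZE_MB = 128   # block size mentioned in the thread
REPLICATION = 3       # assumption: HDFS default replication factor

def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks a file of the given size is split into."""
    return math.ceil(file_size_mb / block_size_mb)

def raw_usage_mb(file_sizes_mb, replication=REPLICATION):
    """Approximate raw DataNode disk usage: actual data size times replication.

    A block only occupies as much disk as the data it holds, so the
    block size does not appear here. Checksum files and local
    filesystem overhead are ignored.
    """
    return sum(file_sizes_mb) * replication

one_large = [1000]        # hypothetical: a single 1000 MB file
many_small = [50] * 20    # hypothetical: same data as 20 files of 50 MB

for name, files in (("one large file", one_large), ("20 small files", many_small)):
    blocks = sum(block_count(size) for size in files)
    print(f"{name}: {blocks} blocks, {raw_usage_mb(files)} MB raw")
```

Under these assumptions both layouts use the same raw disk space. Only the number of blocks goes up, which means more metadata for the NameNode to track, not more space on the DataNodes.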