➤ It sounds like you are running into a common HDFS situation where metadata overhead, reserved space, and local-filesystem bookkeeping create a large gap between the size of your actual data and your disk utilization.
While 650 files at 4MB each is only about 2.6GB of data, the way HDFS stores those files on the physical disks (especially in small or test clusters) can consume far more space than you expect.
➤ Root Causes of the 100% Utilization
1. Reserved Space and "Non-DFS Used"
HDFS does not get the entire disk to itself. The dfs.datanode.du.reserved setting tells each DataNode to leave a portion of the disk free for the OS and non-Hadoop data, and anything else living on the same volume counts against it as well. If you are running on small disks (e.g., 20GB–50GB), the combination of your data, logs, and reserved space can push utilization to 100% quickly.
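To see what is actually configured on a node, you can query the effective value (run this on a DataNode with the client configuration in place; a value of 0 means nothing is reserved):
$ hdfs getconf -confKey dfs.datanode.du.reserved
$ hdfs getconf -confKey dfs.datanode.data.dir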
2. Local Filesystem Block Overheads
Your files are only 4MB each, but the underlying OS filesystem (EXT4 or XFS) uses its own block size (usually 4KB), and each HDFS block replica is stored as a blk_ file plus a checksum (.meta) file. The metadata for 650 individual files, their .meta files, and the edit logs on the NameNode become a "death by a thousand cuts" scenario on small disks.
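You can see this directly on a DataNode by counting the block and checksum files under its data directory. The path below is just an example; substitute whatever dfs.datanode.data.dir points to on your node:
# Example data directory -- adjust to your dfs.datanode.data.dir
$ find /hadoop/hdfs/data -name 'blk_*' | wc -l
$ du -sh /hadoop/hdfs/data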
3. Log Accumulation
Check /var/log/hadoop or your configured log directory. In HDFS 3.3.5, a cluster that is struggling for space logs a steady stream of heartbeat and disk-full warnings on the DataNodes and NameNode, and those logs themselves consume the remaining Non-DFS space, pushing the disk toward 100%.
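A quick way to check whether logs are a meaningful share of the problem (adjust the path to your configured log directory):
$ du -sh /var/log/hadoop/* | sort -h | tail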
➤ How to Tackle the Situation
Step 1: Identify Where the Space Is Going
Run the following command to see if the space is taken by HDFS data or other files:
$ hdfs dfsadmin -report
DFS Used: space occupied by HDFS block data (your 650 files and their replicas).
Non-DFS Used: space taken by logs, the OS, and other applications on the same volumes. If this is high, local files such as logs are the likely culprit.
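It can also help to compare the HDFS view of the storage with the operating system's view on a DataNode (the mount point below is an example):
$ hdfs dfs -df -h /
$ df -h /hadoop/hdfs/data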
Step 2: Clear Logs and Temporary Data
If "Non-DFS Used" is high, clear out the Hadoop log directory:
# Example path
rm -rf /var/log/hadoop/hdfs/*.log.*
rm -rf /var/log/hadoop/hdfs/*.out.*
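If you would rather only remove files past a certain age, a find-based sweep is a bit safer (again, substitute your own log path and retention period):
# Delete rotated logs older than 7 days
find /var/log/hadoop -type f \( -name '*.log.*' -o -name '*.out.*' \) -mtime +7 -delete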
Step 3: Adjust the Reserved Space
A DataNode stops accepting new blocks once its usable space (capacity minus dfs.datanode.du.reserved) runs out. If you are in a test environment and need to squeeze out more space, you can reduce the reserved space in hdfs-site.xml on the DataNodes (the example below reserves 1GB, i.e. 1073741824 bytes):
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>1073741824</value>
</property>
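The DataNode needs to pick up the new value (a restart is the simple way), after which you can re-check the report:
$ hdfs --daemon stop datanode
$ hdfs --daemon start datanode
$ hdfs dfsadmin -report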
Step 4: Combine Small Files (Long-term Fix)
HDFS is designed for large files; 650 files of 4MB each are classic "small files."
The Problem: Every file, regardless of size, takes up roughly 150 bytes of RAM on the NameNode and creates separate metadata entries.
The Solution: use hdfs dfs -getmerge (which concatenates the files into a single local file that you then re-upload) or a MapReduce/Spark job to combine these 650 files into 2 or 3 larger files (e.g., 1GB each); see the sketch below.
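A minimal sketch of the getmerge route, assuming your files live under a directory such as /data/incoming (both HDFS paths and the local temp path are placeholders):
# Concatenate the small files into one local file, then push it back up
$ hdfs dfs -getmerge /data/incoming /tmp/merged.dat
$ hdfs dfs -put /tmp/merged.dat /data/merged/merged.dat
# Once the merged copy is verified, remove the originals to free both disk space and NameNode metadata
$ hdfs dfs -rm -r /data/incoming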