Created 11-29-2016 11:59 AM
Hi guys,
While browsing the internet, I have read that the maximum number of files stored in HDFS equals the JVM's Integer.MAX_VALUE.
Can anyone confirm that the maximum number of files is this value (2,147,483,647)?
Created 11-29-2016 12:08 PM
Any file will be present inside a directory. As per the doc [1], the "dfs.namenode.fs-limits.max-directory-items" parameter defines the maximum number of items that a directory may contain.
[1] https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
Property | Default | Description
dfs.namenode.fs-limits.max-directory-items | 1048576 | Defines the maximum number of items that a directory may contain. A value of 0 will disable the check.
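If you want to check which value a cluster actually resolves, here is a minimal Java sketch (assuming a standard Hadoop client classpath with hdfs-site.xml on it; the class name is just for illustration):

import org.apache.hadoop.conf.Configuration;

// Minimal sketch: print the directory-item limit the client-side
// configuration resolves. Assumes core-site.xml/hdfs-site.xml are on the
// classpath; falls back to the documented default otherwise.
public class MaxDirItems {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        int maxDirItems = conf.getInt(
                "dfs.namenode.fs-limits.max-directory-items", // key from hdfs-default.xml
                1048576);                                     // documented default
        System.out.println("dfs.namenode.fs-limits.max-directory-items = " + maxDirItems);
    }
}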
Created 11-29-2016 12:20 PM
The following piece of code from the NameNode source explains the restriction:
// We need a maximum maximum because by default, PB limits message sizes
// to 64MB. This means we can only store approximately 6.7 million entries
// per directory, but let's use 6.4 million for some safety.
final int MAX_DIR_ITEMS = 64 * 100 * 1000;
Preconditions.checkArgument(
    maxDirItems > 0 && maxDirItems <= MAX_DIR_ITEMS,
    "Cannot set " + DFSConfigKeys.DFS_NAMENODE_MAX_DIRECTORY_ITEMS_KEY
        + " to a value less than 1 or greater than " + MAX_DIR_ITEMS);
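To spell out the arithmetic in that check (a quick sketch; the ~6.7 million figure is the comment's own estimate of what fits under protobuf's 64 MB message-size limit):

// Quick check of the hard cap enforced by the snippet above.
public class DirItemsCap {
    public static void main(String[] args) {
        final int MAX_DIR_ITEMS = 64 * 100 * 1000;
        // 6,400,000: deliberately below the ~6.7 million entries that
        // would fit in a 64 MB protocol-buffer message.
        System.out.println(MAX_DIR_ITEMS); // prints 6400000
    }
}

So the per-directory limit can be raised to at most 6.4 million items, but nothing stops you from spreading files across many directories.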
I do not see any Integer.MAX_VALUE limitation on HDFS as a whole, so that claim does not look right.
Created 11-29-2016 06:58 PM
I believe the only overall HDFS limit is determined by how much memory is available in the NameNode.
Created 11-29-2016 07:04 PM
Adding to @Tom Lyon's update: you can refer to the following link to check the NameNode memory requirements based on the number of files. It discusses up to 150-200 million files and the corresponding memory settings. However, I guess the file count can grow beyond that; I have not found any such limit documented yet.
Created 11-29-2016 10:07 PM
The maximum number of files in HDFS depends on the amount of memory available for the NameNode.
Each file object and each block object takes about 150 bytes of memory. For example, if you have 10 million files, each with one block, then you would need about 3 GB of memory for the NameNode.
With 10 million files, each using one block, we would have: 10 million file objects + 10 million block objects = 20 million objects * 150 bytes = 3,000,000,000 bytes = 3 GB of memory. Keep in mind the NameNode also needs memory for other processes, so to support 10 million files your NameNode will need considerably more than 3 GB.
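As a rough illustration, here is a back-of-the-envelope sketch of that estimate in Java (the ~150 bytes per object is the rule of thumb above; the class and method names are just for illustration):

// Back-of-the-envelope NameNode heap estimate using the ~150-bytes-per-object
// rule of thumb. This counts only file and block objects and ignores
// everything else the NameNode keeps on its heap.
public class NameNodeHeapEstimate {
    static long estimateBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile; // file objects + block objects
        return objects * 150L;                        // ~150 bytes per object
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(10_000_000L, 1L);  // 10M files, 1 block each
        System.out.printf("~%.1f GB for file/block metadata alone%n", bytes / 1e9);
        // Prints ~3.0 GB; a real NameNode needs considerably more heap than this.
    }
}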