Is the HDFS Small File Problem About Storage Capacity, Task Execution, or Both?

Hi there,

I have read some articles about the HDFS small file problem, but I do not know whether the problem concerns HDFS storage capacity, MapReduce task execution, or both.

For example, the article The Small File Problem says:

"Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory."

But if the small files are accessed infrequently and no tasks run over them, so they just sit there for the occasional lookup by path (e.g. we only store images and read the whole content of a file each time, with no line-by-line analysis), why do they still occupy the namenode's memory?

In @David Streever's answer to "How many files is too many on a modern HDP cluster?", he said, "I've seen several systems with 400+ million objects represented in the Namenode without issues". So I think the HDFS namenode should not use memory for files that are not used in tasks. Am I correct?
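
For reference, this is the kind of estimate I was trying for a figure like that, again using the same 150-bytes-per-object rule of thumb and assuming it scales linearly (which may be an oversimplification):

    # Hypothetical estimate for a cluster with 400+ million namenode objects
    # (files + directories + blocks), using the ~150-bytes-per-object rule of thumb.
    BYTES_PER_OBJECT = 150
    total_objects = 400_000_000
    heap_bytes = total_objects * BYTES_PER_OBJECT
    print(f"~{heap_bytes / 2**30:.0f} GiB of namenode heap")   # -> ~56 GiB

If it matters, the numbers I would plug in for my own cluster are the "files and directories" and "blocks" totals shown on the NameNode web UI (or in the summary printed by hdfs fsck /).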

Can anyone clarify this? :)

Thank you so much!