Is there a way to analyze small files (less than block size) periodically in HDFS? Can we automate it?

Explorer

Hi Team,

 

Is there a way to analyze small files and their paths on HDFS? Is there a way to know which user ID has the most small files?

 

Thanks in advance.

3 REPLIES

Master Guru
There are a few options:

You can grab the fsimage periodically with the 'hdfs dfsadmin -fetchImage' command and analyze its delimited or XML output via the 'hdfs oiv' tool. The metadata carries file lengths and ownership information that you can aggregate into a report with your record-processing software of choice.
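
For example, here is a rough sketch of that pipeline using awk as the record processor. The working directory is a placeholder, "small" is taken to mean smaller than the file's preferred block size, and the column positions are assumed from the usual Delimited header, so verify them against the header row of your own output:

# Pull the latest fsimage from the active NameNode into a local working directory
hdfs dfsadmin -fetchImage /tmp/fsimage_work

# Convert the binary fsimage into tab-delimited text with the offline image viewer
# (assumes only one fsimage_* file sits in the working directory)
hdfs oiv -p Delimited -i /tmp/fsimage_work/fsimage_* -o /tmp/fsimage_work/fsimage.tsv

# Count files smaller than their preferred block size, grouped by owner.
# Assumed columns: 2=Replication (0 for directories), 5=PreferredBlockSize,
# 7=FileSize, 11=UserName; NR>1 skips the header row.
awk -F'\t' 'NR > 1 && $2 > 0 && $7 < $5 { small[$11]++ }
            END { for (u in small) print small[u], u }' /tmp/fsimage_work/fsimage.tsv | sort -rn

Wrapping that in a script and scheduling it with cron (or your scheduler of choice) would give you the periodic, per-user small-files report you asked about.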

Cloudera Enterprise Reports Manager carries summary reports of watched directories: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_dg_reports.html

Cloudera Enterprise Navigator carries HDFS analytics that help show how your HDFS is being used: https://www.cloudera.com/documentation/enterprise/latest/topics/navigator_dashboard.html#concept_cnv...

Cloudera Enterprise Workload eXperience Manager (WXM) includes a small files reporting feature: https://www.cloudera.com/documentation/wxm/latest/topics/wxm_file_size_reporting.html

Explorer

Hi Harsh,

 

Thanks for the information and links. Can I have more details on the 'hdfs oiv' tool? How do I set it up and configure it to analyze small files? Is there any CDH document on this?

 

Thanks.

Master Guru
The OIV tool is documented at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html and includes some examples. Try its Delimited processor options on a copy of your HDFS fsimage file and check out the result.
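
As a minimal sketch, assuming the fsimage has already been fetched locally (the image file name below is a placeholder, and the delimiter and temp directory are optional choices):

# Produce pipe-delimited output; -t points the Delimited processor at a
# scratch directory, which helps when processing large images
hdfs oiv -p Delimited -delimiter '|' -t /tmp/oiv_tmp -i fsimage_0000000000012345678 -o fsimage.txt

# Check the header row to confirm the column layout before writing any aggregation
head -1 fsimage.txt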