Created on 06-01-2017 02:05 PM
Small File Offenders
This Perl script helps you identify the users with the most "small files". If you are on HDP 2.5+, you do not need a script like this. Why? Because HDP 2.5 ships with a Zeppelin notebook that helps you identify which users are contributing to small-file volume. This is part of SmartSense. Read more on that here. If you are on an older HDP version, you can take a look at this script...
Why Worry About Small Files?
The HDFS NameNode architecture, explained here, mentions that "the NameNode keeps an image of the entire file system namespace and file Blockmap in memory." This means that every file in HDFS adds pressure on the memory capacity of the NameNode process. Therefore, a larger max heap for the NameNode Java process will be required as the file system grows.
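To make that concrete, here is a rough back-of-the-envelope sketch (mine, not from the original article). A commonly cited rule of thumb is that each namespace object (file, directory, or block) costs on the order of 150 bytes of NameNode heap; the 20 million file count below is purely hypothetical:

#!/usr/bin/perl
# Rough NameNode heap estimate from the ~150 bytes/object rule of thumb.
use strict;
use warnings;

my $bytes_per_object = 150;          # approximate heap cost per namespace object
my $files            = 20_000_000;   # hypothetical: 20M files...
my $blocks           = $files;       # ...each small enough to occupy 1 block

my $heap_gb = ($files + $blocks) * $bytes_per_object / 2**30;
printf "~%.1f GB of heap just to hold the namespace\n", $heap_gb;
# Packing the same data into fewer, larger files shrinks $files (and usually
# $blocks), which is why small files are the main driver of NameNode heap.

That works out to roughly 5.6 GB of heap for the namespace alone; the same bytes stored in large files would need far fewer objects, hence far less heap.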
How to Use This Script
Before beginning, process the image file into TSV format, as shown in this example command:
hadoop oiv -i /hadoop/hdfs/namesecondary/current/fsimage_0000000000003951761 -o fsimage-delimited.tsv -p Delimited
Then pipe the output file (fsimage-delimited.tsv) into this program, e.g. cat fsimage-delimited.tsv | fsimage_users.pl
Note: For large fsimage files, you'll probably need to have a larger heap for oiv to run. Set max heap like this (adjust the value to something that makes sense for the host where you run the command):
export HADOOP_OPTS="-Xmx4096m $HADOOP_OPTS"
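If you are curious what the script does under the hood before running it, here is a minimal sketch (my illustration, not the repo's actual code) of the kind of per-user aggregation it performs. The column indexes assume the Delimited processor's usual order (Path, Replication, ModificationTime, AccessTime, PreferredBlockSize, BlocksCount, FileSize, NSQUOTA, DSQUOTA, Permission, UserName, GroupName); verify against your own TSV before relying on them:

#!/usr/bin/perl
# Minimal sketch of per-user aggregation over 'oiv -p Delimited' output.
# NOT the repo script; column indexes assume the usual Delimited order.
use strict;
use warnings;

my $small_file_bytes = 134217728;   # one 128 MB block, the cutoff for "small"
my %stats;                          # user => { size, files, small }

while (my $line = <>) {
    chomp $line;
    my @col = split /\t/, $line, -1;
    next if @col < 12 || $col[0] eq 'Path';    # skip a header row, if present
    my ($repl, $size, $user) = @col[1, 6, 10];
    next unless $repl =~ /^\d+$/ && $repl > 0; # directories show replication 0
    $stats{$user}{size}  += $size;
    $stats{$user}{files} += 1;
    $stats{$user}{small} += ($size < $small_file_bytes) ? 1 : 0;
}

# Worst offenders first, by small-file count
for my $user (sort { $stats{$b}{small} <=> $stats{$a}{small} } keys %stats) {
    printf "%s: %d small files (%d files, %d bytes total)\n",
        $user, $stats{$user}{small}, $stats{$user}{files}, $stats{$user}{size};
}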
Example
HW13177:~ clukasik$ ./fsimage_users.pl ./fsimage-delimited.tsv
Limiting output to top 10 items per list. A small file is considered anything less than 134217728. Edit the script to adjust these values.
Average File Size (bytes): 0; Users:
  hive (total size: 0; number of files: 12)
  yarn (total size: 0; number of files: 8)
  mapred (total size: 0; number of files: 7)
  hcat (total size: 0; number of files: 1)
  anonymous (total size: 0; number of files: 1)
Average File Size (bytes): 219.65; Users:
  ambari-qa (total size: 4393; number of files: 20)
Average File Size (bytes): 245.942307692308; Users:
  hbase (total size: 12789; number of files: 52)
Average File Size (bytes): 1096.625; Users:
  spark (total size: 8773; number of files: 8)
Average File Size (bytes): 34471873.6538462; Users:
  hdfs (total size: 896268715; number of files: 26)
Average File Size (bytes): 46705038.25; Users:
  zeppelin (total size: 186820153; number of files: 4)
Users with most small files:
  hbase: 52 small files
  hdfs: 23 small files
  ambari-qa: 20 small files
  hive: 12 small files
  spark: 8 small files
  yarn: 8 small files
  mapred: 7 small files
  zeppelin: 3 small files
  anonymous: 1 small files
  hcat: 1 small files
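As a quick cross-check on any script's output, you can tally raw per-user counts straight from the TSV with a one-liner (again assuming UserName is the 11th tab-delimited column; adjust the index to match your header). Note this counts every namespace object, directories included, so expect slightly higher numbers than the script reports:

perl -F'\t' -lane '$c{$F[10]}++ unless $F[0] eq "Path"; END { printf "%8d %s\n", $c{$_}, $_ for sort { $c{$b} <=> $c{$a} } keys %c }' fsimage-delimited.tsv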
Once you identify the top offenders, you will need to assess the root cause. It could be bad practices in the applications that write the data. Engage Hortonworks Professional Services for help tackling the problem!
Created on 06-01-2017 04:57 PM
Hi Craig, this is indeed a useful tool. Thanks!
AFAIK, HDFS snapshots could inflate the small-file count. Have you accounted for snapshots in your script, or are they already ruled out during the FSImage->TSV phase?