Prerequisites:

1. We should have the fsimage available to continue with this analysis.

2. Access to Git (to clone the source containing the script).

 

  • What are small files?
    A small file is one that is significantly smaller than the default Apache Hadoop HDFS block size (128 MB by default in CDH/CDP). Note that it is expected and inevitable to have some small files on HDFS: library JARs, XML configuration files, temporary staging files, and so on. The problems arise when small files become a significant part of your datasets.
  • Problems with small files and HDFS
    Small files are a common challenge in the Apache Hadoop world and, when not handled with care, they can lead to a number of complications. The Apache Hadoop Distributed File System (HDFS) was developed to store and process large datasets in the range of terabytes and petabytes. However, HDFS stores small files inefficiently, leading to wasted NameNode memory, excessive RPC calls, degraded block scanning throughput, and reduced application-layer performance.
  • Every file, directory, and block in HDFS is represented as an object in the NameNode's memory, each of which occupies roughly 150 bytes as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware; a billion files is certainly not feasible.

Furthermore, HDFS is not geared up to access small files efficiently: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from DataNode to DataNode to retrieve each small file, all of which is an inefficient data access pattern.
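
To make the 150-bytes-per-object rule of thumb above concrete, here is a rough back-of-the-envelope estimate as a shell sketch. The file, block, and directory counts are illustrative assumptions only, not measurements from any particular cluster:

# Back-of-the-envelope NameNode heap estimate using the ~150 bytes/object rule of thumb.
# The counts below are illustrative assumptions.
FILES=10000000        # 10 million files
BLOCKS=10000000       # one block per file
DIRS=1000000          # 1 million directories
echo "$(( (FILES + BLOCKS + DIRS) * 150 )) bytes of NameNode heap (roughly 3 GB)"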

  • Why Worry About Small Files?
    The HDFS NameNode architecture documentation mentions that "the NameNode keeps an image of the entire file system namespace and file Blockmap in memory." What this means is that every file in HDFS adds some pressure to the memory capacity of the NameNode process. Therefore, a larger max heap for the NameNode Java process will be required as the file system grows (see the quick check after this list).
  • Problems with small files and MapReduce
    Map tasks usually process a block of input at a time (using the default FileInputFormat). If the files are very small and there are a lot of them, then each map task processes very little input, and there are many more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into 16 blocks of 64 MB with 10,000 or so 100 KB files: the 10,000 files use one map each, and the job can be tens or hundreds of times slower than the equivalent one with a single input file.
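
As referenced above, a quick way to see how many directories and files the namespace already holds (assuming you have HDFS client access and permission to read the root path) is the standard count command:

# Count directories, files, and bytes under the root path.
# Output columns: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
hdfs dfs -count /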

 

1. Get the fsimage from the cluster you would like to analyze for small files.
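
One way to obtain the fsimage, assuming HDFS superuser privileges, is to download the latest checkpoint from the active NameNode; alternatively, copy it from the NameNode's local metadata directory (dfs.namenode.name.dir).

# Download the most recent fsimage from the active NameNode into /tmp (requires superuser privileges).
hdfs dfsadmin -fetchImage /tmp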

2. Process the fsimage into TSV format. To do so, we will use the "Delimited" processor of the Offline Image Viewer (OIV) tool.

Go to the perl directory of the cloned source at small_file_offenders/src/main/perl and run the OIV tool with the Delimited processor, pointing to your fsimage location. This will generate the TSV file required for running the Perl script.


perl]# hadoop oiv -i /path/XXXXX/current/fsimage_0000000000000019171 -o fsimage-delimited.tsv -p Delimited

You will see the generated TSV file:

perl]# ls -lart
-rwxr-xr-x. 1 root root   2335 May 24 15:11 fsimage_users.pl
-rw-r--r--. 1 root root 531226 May 24 15:12 fsimage-delimited.tsv

3. Now you can invoke the Perl script "fsimage_users.pl", pointing it at the TSV file.


perl]# ./fsimage_users.pl ./fsimage-delimited.tsv

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_CTYPE = "UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Limiting output to top 10 items per list. A small file is considered anything less than 134217728. Edit the script to adjust these values
Average File Size (bytes): 0; Users: 
hue (total size: 0; number of files: 535)
hdfs (total size: 0; number of files: 13)
mapred (total size: 0; number of files: 4)
spark (total size: 0; number of files: 3)
impala (total size: 0; number of files: 2)
solr (total size: 0; number of files: 1)
systest (total size: 0; number of files: 1)
schemaregistry (total size: 0; number of files: 1)
admin (total size: 0; number of files: 1)
kafka (total size: 0; number of files: 1)
Average File Size (bytes): 2526.42660550459; Users:
hbase (total size: 550761; number of files: 218)
Average File Size (bytes): 28531.875; Users:
livy (total size: 456510; number of files: 16)
Average File Size (bytes): 2438570.8308329; Users:
oozie (total size: 4713757416; number of files: 1933)
Average File Size (bytes): 4568935.69565217; Users:
hive (total size: 735598647; number of files: 161)
Average File Size (bytes): 209409621.25; Users:
yarn (total size: 1675276970; number of files: 8)
Average File Size (bytes): 469965475; Users:
tez (total size: 939930950; number of files: 2)

Users with most small files:
oozie: 1927 small files
hue: 535 small files
hbase: 218 small files
hive: 161 small files
livy: 16 small files
hdfs: 13 small files
yarn: 6 small files
mapred: 4 small files
spark: 3 small files
impala: 2 small files
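
As an optional cross-check, you can derive a similar per-user small-file count directly from the delimited TSV with awk. The sketch below assumes the default Delimited processor column layout (Path, Replication, ModificationTime, AccessTime, PreferredBlockSize, BlocksCount, FileSize, NSQUOTA, DSQUOTA, Permission, UserName, GroupName) and the same 134217728-byte threshold; adjust the column numbers if your Hadoop version emits a different header.

# Per-user count of files smaller than 128 MB. Skips the header row and directory
# entries (a directory's permission string starts with 'd').
awk -F'\t' 'NR > 1 && $10 !~ /^d/ && $7 < 134217728 { count[$11]++ }
            END { for (u in count) print u ": " count[u] " small files" }' fsimage-delimited.tsv | sort -t: -k2 -rn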
