This article describes steps to identify where most of the small files are located in a large HDFS cluster. Below are some related articles on the small file problem and how to analyze it:

https://community.hortonworks.com/articles/15104/small-files-in-hadoop.html

https://community.hortonworks.com/repos/105768/identifying-small-file-offenders.html

https://community.hortonworks.com/articles/142016/analyze-fsimage-from-a-non-smartsense-cluster.html

The SmartSense Activity Analyzer currently doesn't report small files by HDFS location, but it does show other useful information such as top N users with small files, trends, etc.

In a large HDFS cluster with a heavy workload, it is often hard to locate where the highest number of small files reside using 'fsck' or 'hdfs dfs -ls -R' output, as these commands can take a long time to return data and typically have to be repeated several times to narrow down to the desired result.

I have taken the approach below to spot the HDFS locations where most of the small files exist in a large cluster, so users can look into the data and find out where those files originate (for example, an incorrect table partition key).


Read fsimage and store in HDFS:

- Copy the fsimage file to a different location, for example by downloading the latest checkpoint from the NameNode with 'hdfs dfsadmin -fetchImage <local dir>'. (Note: please do not run the command below against the live fsimage file.)
hdfs oiv -p Delimited -delimiter "|" -t /tmp/tmpdir/ -i /fsimage_copy -o /fsimage.out
hdfs dfs -put /fsimage.out /user/ambari-qa/

Load to Hive:

Create Tables in Hive:

CREATE TABLE `fsimage_txt1`(
`path` varchar(2000),
`repl` int,
`mdate` date,
`atime` date,
`pblksize` int,
`blkcnt` bigint,
`fsize` bigint,
`nsquota` bigint,
`dsquota` bigint,
`permi` varchar(100),
`usrname` varchar(100),
`grpname` varchar(100))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://mycluster/apps/hive/warehouse/fsimage_txt'
TBLPROPERTIES (
'last_modified_by'='ambari-qa',
'last_modified_time'='1508277903',
'numFiles'='1',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='12458332430',
'transient_lastDdlTime'='1508277903');

CREATE TABLE `file_info1`(
`path` varchar(200),
`fsize` bigint,
`usrname` varchar(100),
`depth` int)
STORED AS ORC;

Load data into the Hive tables:

LOAD DATA INPATH '/user/ambari-qa/fsimage.out' INTO TABLE fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/') as path1, fsize ,usrname, 1 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/') as path1, fsize ,usrname, 2 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/') as path1, fsize ,usrname, 3 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/', split(path,'/')[4] , '/') as path1, fsize ,usrname, 4 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/', split(path,'/')[4] , '/', split(path,'/')[5] , '/') as path1, fsize ,usrname, 5 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/', split(path,'/')[4] , '/', split(path,'/')[5] , '/' ,split(path,'/')[6] , '/') as path1, fsize ,usrname, 6 from fsimage_txt1;
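As an optional sanity check (a minimal sketch using only the tables defined above), you can confirm that the raw fsimage rows loaded and that each depth level was populated before running the reports below:

-- Row count of the raw fsimage data that was loaded
select count(*) as total_rows from fsimage_txt1;
-- Rows generated per directory depth by the INSERT statements above
select depth, count(*) as rows_at_depth from file_info1 group by depth order by depth;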

Search for the directories with the highest number of small files:

-- The query below lists the depth-2 directories with the most small files (HDFS files smaller than ~30 MB)
select path, count(1) as cnt from file_info1 where fsize <= 30000000 and depth = 2 group by path order by cnt desc limit 20;
Sample output
-------------
/user/abc/     13400550
/hadoop/data/  10949499
...
/tmp/            340400
-- Take the directory with the most files (from the output above), or any directory of interest, and drill down using the 'depth' column
select path, count(1) as cnt from file_info1 where fsize <= 30000000 and path like '/user/abc/%' and depth = 3 group by path order by cnt desc limit 20;
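If it helps to confirm that these directories really hold many small files (rather than just a few files overall), the same report can also include total and average file size per directory; a minimal variation of the depth-2 query above:

-- Depth-2 report with file count plus total and average file size (average shown in MB)
select path, count(1) as cnt, sum(fsize) as total_bytes, round(avg(fsize)/1048576, 2) as avg_mb from file_info1 where fsize <= 30000000 and depth = 2 group by path order by cnt desc limit 20;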
Comments

@snukavarapu Thanks for the article, this worked great for me. Is this something you keep continuously updated? If so, what's your strategy for keeping the table updated?