This article describes how to identify where most of the small files are located in a large HDFS cluster. The articles below cover the small-files problem and ways to analyze it:

https://community.hortonworks.com/articles/15104/small-files-in-hadoop.html

https://community.hortonworks.com/repos/105768/identifying-small-file-offenders.html

https://community.hortonworks.com/articles/142016/analyze-fsimage-from-a-non-smartsense-cluster.html

SmartSense Activity Analyzer currently doesn't report small files by HDFS location, but it does show other useful information such as the top N users by small-file count, trends over time, etc.

In a large HDFS cluster under heavy workload, it is often hard to locate where the largest numbers of small files live using 'fsck' or 'hdfs dfs -ls -R' output: those commands can take a long time to return, and you typically have to repeat them several times to narrow down the answer.

The approach below pinpoints the HDFS locations that hold the most small files, so users can look into the data and find out how the files were produced (for example, an incorrect table partition key).


Read the fsimage and store the output in HDFS:

- Copy the fsimage file to a different location, then convert the copy to delimited text with the Offline Image Viewer and upload the result to HDFS. (Note: do not run the command below against the live fsimage file.)
hdfs oiv -p Delimited -delimiter "|" -t /tmp/tmpdir/ -i /fsimage_copy -o /fsimage.out
hdfs dfs -put /fsimage.out /user/ambari-qa/
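
If you would rather not copy the file out of the NameNode's metadata directory by hand, one alternative (a hedged suggestion, not part of the original steps) is to have the NameNode serve its latest checkpoint and then run the Offline Image Viewer against that copy:

# download the most recent fsimage from the NameNode into a local directory of your choice (example path)
hdfs dfsadmin -fetchImage /tmp/fsimage_copy_dir/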

Load to Hive:

Create Tables in Hive:

-- Staging table over the delimited fsimage output; the column order matches the
-- fields emitted by 'hdfs oiv -p Delimited'.
CREATE TABLE `fsimage_txt1`(
`path` varchar(2000),
`repl` int,
`mdate` date,
`atime` date,
`pblksize` int,
`blkcnt` bigint,
`fsize` bigint,
`nsquota` bigint,
`dsquota` bigint,
`permi` varchar(100),
`usrname` varchar(100),
`grpname` varchar(100))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION
'hdfs://mycluster/apps/hive/warehouse/fsimage_txt';

-- One row per (directory prefix, file) at each directory depth.
CREATE TABLE `file_info1`(
`path` varchar(200),
`fsize` bigint,
`usrname` varchar(100),
`depth` int)
STORED AS ORC;
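
As a hedged alternative (not part of the original steps): if you upload fsimage.out into its own HDFS directory, the staging table can be declared EXTERNAL over that directory, which makes the LOAD DATA step below unnecessary. A minimal sketch, assuming a hypothetical directory /user/ambari-qa/fsimage_dir/ that contains only fsimage.out:

-- hypothetical external staging table over the directory holding fsimage.out
CREATE EXTERNAL TABLE `fsimage_ext`(
`path` varchar(2000),
`repl` int,
`mdate` date,
`atime` date,
`pblksize` int,
`blkcnt` bigint,
`fsize` bigint,
`nsquota` bigint,
`dsquota` bigint,
`permi` varchar(100),
`usrname` varchar(100),
`grpname` varchar(100))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/ambari-qa/fsimage_dir/';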

Load data into the Hive tables:

LOAD DATA INPATH '/user/ambari-qa/fsimage.out' INTO TABLE fsimage_txt1;
-- Populate file_info1 with each file's directory prefix at depths 1 through 6
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/') as path1, fsize ,usrname, 1 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/') as path1, fsize ,usrname, 2 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/') as path1, fsize ,usrname, 3 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/', split(path,'/')[4] , '/') as path1, fsize ,usrname, 4 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/', split(path,'/')[4] , '/', split(path,'/')[5] , '/') as path1, fsize ,usrname, 5 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/', split(path,'/')[4] , '/', split(path,'/')[5] , '/' ,split(path,'/')[6] , '/') as path1, fsize ,usrname, 6 from fsimage_txt1;
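
The six statements above scan fsimage_txt1 once per depth. As an optional consolidation (a sketch only, assuming your Hive version accepts a computed pattern in regexp_extract), the same information can be produced in a single pass by exploding depths 1 through 6 and cutting each path prefix with a regular expression; unlike the statements above, it skips files that sit shallower than the requested depth instead of inserting rows with NULL paths:

-- single-pass alternative to the six INSERT statements above (sketch)
INSERT INTO TABLE file_info1
select concat(regexp_extract(path, concat('^((?:/[^/]+){', cast(d as string), '})'), 1), '/') as path1,
       fsize, usrname, d
from fsimage_txt1
LATERAL VIEW explode(array(1, 2, 3, 4, 5, 6)) depths AS d
where size(split(path, '/')) > d;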

Search for the directories with the highest number of small files:

-- The query below lists the depth-2 directories holding the most small files (HDFS files smaller than ~30 MB)
select path, count(1) as cnt from file_info1 where fsize <= 30000000 and depth = 2 group by path order by cnt desc limit 20;
Sample output
-------------
/user/abc/      13400550
/hadoop/data/   10949499
...
/tmp/           340400
-- Take the directory with the most files from the output above (or any directory of interest) and drill down one level using the 'depth' column
select path, count(1) as cnt from file_info1 where fsize <= 30000000 and path like '/user/abc/%' and depth = 3 group by path order by cnt desc limit 20;
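
It can also help to look at aggregate size alongside the count, to separate directories full of genuinely tiny files from directories that simply hold many files. A sketch using the same 30 MB threshold:

-- count, total bytes, and average file size of small files per depth-3 directory
select path, count(1) as cnt, sum(fsize) as total_bytes, avg(fsize) as avg_bytes from file_info1 where fsize <= 30000000 and depth = 3 group by path order by cnt desc limit 20;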
Comments

@snukavarapu Thanks for the article, this worked great for me. Is this something you keep continuously updated? If so, what's your strategy for keeping the table updated?