Created on 10-19-2017 08:13 PM
This article describes how to identify where most of the small files are located in a large HDFS cluster. Below are some related articles on the small-file problem and how to analyze it:
https://community.hortonworks.com/articles/15104/small-files-in-hadoop.html
https://community.hortonworks.com/repos/105768/identifying-small-file-offenders.html
https://community.hortonworks.com/articles/142016/analyze-fsimage-from-a-non-smartsense-cluster.html
SmartSense Activity Analyzer currently does not report small files by HDFS location, but it does show other useful information such as the top N users with small files, trends, and so on.
In a large HDFS cluster with a heavy workload, it is often hard to find where the highest number of small files live using 'fsck' or 'hdfs dfs -ls -R' output: those commands can take a long time to return, and you typically have to repeat them several times to narrow down the answer.
The approach below spots the HDFS locations that hold the most small files in a large cluster, so users can inspect the data and find out where the files originate (for example, an incorrectly chosen table partition key).
Read fsimage and store in HDFS:
- Copy the fsimage file to a different location, convert it to delimited text, and push the output into HDFS. (Note: please do not run the command below on the live fsimage file.)

hdfs oiv -p Delimited -delimiter "|" -t /tmp/tmpdir/ -i /fsimage_copy -o /fsimage.out
hdfs dfs -put /fsimage.out /user/ambari-qa/
Load to Hive:
Create Tables in Hive:
CREATE TABLE `fsimage_txt1`(
  `path` varchar(2000),
  `repl` int,
  `mdate` date,
  `atime` date,
  `pblksize` int,
  `blkcnt` bigint,
  `fsize` bigint,
  `nsquota` bigint,
  `dsquota` bigint,
  `permi` varchar(100),
  `usrname` varchar(100),
  `grpname` varchar(100))
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://mycluster/apps/hive/warehouse/fsimage_txt'
TBLPROPERTIES (
  'last_modified_by'='ambari-qa',
  'last_modified_time'='1508277903',
  'numFiles'='1',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='12458332430',
  'transient_lastDdlTime'='1508277903');

CREATE TABLE `file_info1`(
  `path` varchar(200),
  `fsize` bigint,
  `usrname` varchar(100),
  `depth` int)
STORED AS ORC;
Load data into the Hive tables:
LOAD DATA INPATH '/user/ambari-qa/fsimage.out' INTO TABLE fsimage_txt1;

INSERT INTO TABLE file_info1
  select concat('/', split(path,'/')[1], '/') as path1, fsize, usrname, 1 from fsimage_txt1;

INSERT INTO TABLE file_info1
  select concat('/', split(path,'/')[1], '/', split(path,'/')[2], '/') as path1, fsize, usrname, 2 from fsimage_txt1;

INSERT INTO TABLE file_info1
  select concat('/', split(path,'/')[1], '/', split(path,'/')[2], '/', split(path,'/')[3], '/') as path1, fsize, usrname, 3 from fsimage_txt1;

INSERT INTO TABLE file_info1
  select concat('/', split(path,'/')[1], '/', split(path,'/')[2], '/', split(path,'/')[3], '/', split(path,'/')[4], '/') as path1, fsize, usrname, 4 from fsimage_txt1;

INSERT INTO TABLE file_info1
  select concat('/', split(path,'/')[1], '/', split(path,'/')[2], '/', split(path,'/')[3], '/', split(path,'/')[4], '/', split(path,'/')[5], '/') as path1, fsize, usrname, 5 from fsimage_txt1;

INSERT INTO TABLE file_info1
  select concat('/', split(path,'/')[1], '/', split(path,'/')[2], '/', split(path,'/')[3], '/', split(path,'/')[4], '/', split(path,'/')[5], '/', split(path,'/')[6], '/') as path1, fsize, usrname, 6 from fsimage_txt1;
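A quick note on the split() indexing used above: because every path starts with '/', split(path,'/')[0] is an empty string, which is why the component indexes start at [1]. In Hive, an index past the last path component returns NULL, and concat() of a NULL component yields NULL, so entries shallower than a given depth simply produce a NULL path at that depth. The illustrative query below (the literal path is just an example, and it assumes a Hive version that supports SELECT without FROM, i.e. 0.13+) shows this behavior:

-- split() on a leading '/' gives an empty first element; an out-of-range index returns NULL
select split('/user/abc/file1','/')[0] as empty_elem,
       split('/user/abc/file1','/')[1] as d1,
       split('/user/abc/file1','/')[4] as past_end;

If the fsimage dump is very large, note that each of the six INSERT statements scans fsimage_txt1 separately; Hive's multi-table INSERT syntax (FROM fsimage_txt1 INSERT INTO TABLE ... SELECT ... INSERT INTO TABLE ... SELECT ...) can populate several depths in a single pass, provided your Hive version allows the same target table in more than one branch.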
Search for the directories with the highest number of small files:
-- The query below shows the depth-2 directories with the most small files (HDFS files smaller than 30 MB)
select path, count(1) as cnt
from file_info1
where fsize <= 30000000
  and depth = 2
group by path
order by cnt desc
limit 20;

Sample output
-------------
/user/abc/       13400550
/hadoop/data/    10949499
...
/tmp/            340400

-- Take the directory with the most files (from the output above), or one of interest, and drill down using the 'depth' column
select path, count(1) as cnt
from file_info1
where fsize <= 30000000
  and path like '/user/abc/%'
  and depth = 3
group by path
order by cnt desc
limit 20;
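Once a hot directory is identified, it can also help to look at size statistics alongside the raw counts, and to pull sample rows from the raw table to see which users and files are involved. Below is a minimal sketch using the same tables; the 30 MB threshold and '/user/abc/' path are only illustrative, and the fsize > 0 filter is an assumption that directory entries (and zero-byte files) appear with a size of 0 in the Delimited dump and should be skipped.

-- count, total bytes, and average file size per depth-2 directory (30 MB threshold is illustrative)
select path,
       count(1)   as small_cnt,
       sum(fsize) as total_bytes,
       avg(fsize) as avg_bytes
from file_info1
where fsize > 0              -- assumption: skip directory entries and zero-byte files
  and fsize <= 30000000
  and depth = 2
group by path
order by small_cnt desc
limit 20;

-- sample offending files and their owners under a directory of interest ('/user/abc/' is just an example)
select path, fsize, usrname
from fsimage_txt1
where path like '/user/abc/%'
  and fsize > 0
  and fsize <= 30000000
limit 20;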
Created on 08-28-2018 06:42 PM
@snukavarapu Thanks for the article, this worked great for me. Is this something you keep continuously updated? If so, what's your strategy for keeping the table updated?