1. Fetch the latest FS Image from the Active NameNode:

Check the NameNode directories property in Ambari to find where the Active NameNode keeps its image files, then copy the latest FS Image (e.g. fsimage_0000000001138083674) to a node with enough free disk space and memory.
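
A minimal sketch of locating and copying the newest image, assuming the NameNode directory reported by Ambari is /hadoop/hdfs/namenode and that the destination host (here called analysis-node) is a placeholder reachable over SSH:

# On the Active NameNode: pick the most recent fsimage file (skip the .md5 checksum files)
LATEST=$(ls -t /hadoop/hdfs/namenode/current/fsimage_0* | grep -v '\.md5$' | head -1)
# Copy it to the node that will run the analysis (analysis-node is a placeholder)
scp "$LATEST" analysis-node:/data/home/hdfs/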

2. Load the FS Image:

On the node where you copied the FS Image, run the commands below:

# Give the image viewer a large heap; the whole FS Image is loaded into memory
export HADOOP_OPTS="-Xms16000m -Xmx16000m $HADOOP_OPTS"
# Load the image with the Offline Image Viewer (oiv) and leave it running in the background
nohup hdfs oiv -i fsimage_0000000001138083674 -o fsimage_0000000001138083674.txt &

The above command loads the FS Image into oiv's default Web processor, which serves it through a temporary, read-only WebHDFS-compatible web server on port 5978 (the -o output file is only used by the other oiv processors, such as XML or Delimited).
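
Before generating the full report, you can verify the viewer is up with a quick listing against the temporary endpoint (this assumes the viewer is running on the same host with its default port, 5978):

# List the root of the loaded FS Image through the temporary WebHDFS endpoint
hdfs dfs -ls webhdfs://127.0.0.1:5978/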

3. Create "ls -R" report from the FS Image:

# Recursively list every path in the loaded FS Image and save the listing locally
nohup hdfs dfs -ls -R webhdfs://127.0.0.1:5978/ > /data/home/hdfs/lsrreport.txt &

This could take some time.

Copy the report from the local path /data/home/hdfs/lsrreport.txt into HDFS at /user/hdfs/lsr/lsrreport.txt, for example as sketched below.
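
A minimal sketch of that copy, assuming the hdfs user can write to /user/hdfs:

# Create the target directory in HDFS and upload the local report
hdfs dfs -mkdir -p /user/hdfs/lsr
hdfs dfs -put -f /data/home/hdfs/lsrreport.txt /user/hdfs/lsr/lsrreport.txt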

4. Analyze the ls -R output:

Create the required table, load the data, create a view, and analyze:

hive> add jar /usr/hdp/2.3.2.0-2950/hive/lib/hive-contrib.jar;
hive> CREATE EXTERNAL TABLE lsr (permissions STRING, replication STRING, owner STRING, ownergroup STRING, size STRING, fileaccessdate STRING, time STRING, file_path STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)");
hive> load data inpath '/user/hdfs/lsr/lsrreport.txt' overwrite into table lsr;
hive> create view lsr_view as select (case substr(permissions,1,1) when 'd' then 'dir' else 'file' end) as file_type,owner,cast(size as int) as size, fileaccessdate,time,file_path from lsr;
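
To confirm the RegexSerDe parsed the listing as expected, you can sample the view before running the analysis queries:

hive> select file_type, owner, size, file_path from lsr_view limit 5;
hive> select file_type, count(*) from lsr_view group by file_type;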

Query 1: Files < 1 MB (Top 100)

hive> select relative_size,fileaccessdate,file_path from (select (case size < 1048576 when true then 'small' else 'large' end) as relative_size,fileaccessdate,file_path from lsr_view where file_type='file') tmp where relative_size='small' limit 100;

Query 2: Files < 1 MB (Grouped by Path)

hive> select substr(file_path,1,45) ,count(*) from (select relative_size,fileaccessdate,file_path from (select (case size < 1048576 when true then 'small' else 'large' end) as relative_size,fileaccessdate,file_path from lsr_view where file_type='file') tmp where relative_size='small') tmp2 group by substr(file_path,1,45) order by 2 desc;

Query 3: Files < 1 KB (Grouped by Owner)

hive> select owner ,count(1) from (select (case size < 1024 when true then 'small' else 'large' end) as relative_size,fileaccessdate,owner  from lsr_view where file_type='file') tmp where relative_size='small' group by owner;

Query 4: Files < 1 KB (Grouped by Date)

hive> select fileaccessdate ,count(1) from (select (case size < 1024 when true then 'small' else 'large' end) as relative_size,fileaccessdate,owner  from lsr_view where file_type='file' ) tmp where relative_size='small' group by fileaccessdate;
Comments

Great article!

I was wondering, what does the -o option in "hdfs oiv -i" do?
