
1. Fetch the latest FS Image from the Active NameNode:

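Look up the NameNode directories property (dfs.namenode.name.dir) in Ambari. The images are stored in the current subdirectory of that location. Copy the latest one (for example, fsimage_0000000001138083674) to a node with enough free disk space and memory.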
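
Picking the latest image can be scripted. The following is a minimal sketch, assuming the image files follow the zero-padded fsimage_<transaction id> naming seen in the example above. The function name latest_fsimage and the example path are illustrative and not part of Hadoop.

```shell
# Print the name of the newest fsimage file in a directory.
# Transaction IDs are zero-padded, so a plain lexical sort orders them correctly.
# The pattern skips companion files such as fsimage_<txid>.md5.
latest_fsimage() {
  ls "$1" | grep -E '^fsimage_[0-9]+$' | sort | tail -n 1
}

# Example (hypothetical path; use the value of the NameNode directories property):
#   latest_fsimage /hadoop/hdfs/namenode/current
```

As an alternative to copying the file by hand, hdfs dfsadmin -fetchImage <local directory> downloads the most recent image from the NameNode.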

2. Load the FS Image:

On the node where you copied the FS Image, run the following commands:

export HADOOP_OPTS="-Xms16000m -Xmx16000m $HADOOP_OPTS"
nohup hdfs oiv -i fsimage_0000000001138083674 -o fsimage_0000000001138083674.txt &

The command above starts a temporary web server that exposes the FS Image through a read-only WebHDFS interface. Step 3 reaches it at 127.0.0.1:5978.
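
Before moving on, it can help to confirm that the viewer is answering. This is a minimal sketch, assuming the viewer listens on 127.0.0.1:5978 (the address used in step 3) and answers the standard WebHDFS LISTSTATUS operation. The names oiv_url and OIV_ADDR are hypothetical and chosen only for this example.

```shell
# Build a WebHDFS LISTSTATUS URL for the temporary image viewer.
# OIV_ADDR is a hypothetical override; it defaults to the address used in step 3.
oiv_url() {
  printf 'http://%s/webhdfs/v1%s?op=LISTSTATUS\n' "${OIV_ADDR:-127.0.0.1:5978}" "${1:-/}"
}

# Example: list the root directory once the viewer has finished loading the image.
#   curl -s "$(oiv_url /)"
```

A JSON listing of the root directory in the response suggests the image is loaded and step 3 can start.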

3. Create an "ls -R" report from the FS Image:

nohup hdfs dfs -ls -R webhdfs://127.0.0.1:5978/ > /data/home/hdfs/lsrreport.txt &

This could take some time.

Copy the report from /data/home/hdfs/lsrreport.txt on the local file system to /user/hdfs/lsr/lsrreport.txt in HDFS.
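
The copy can be done with the standard hdfs dfs commands. This is a minimal sketch, assuming the hdfs client on this node points at the live cluster. The helper name upload_report is hypothetical, and its defaults are the two paths named above.

```shell
# Upload the local report to the HDFS location used by the Hive load in step 4.
# upload_report is a hypothetical helper; both paths default to the ones in this article.
upload_report() {
  src="${1:-/data/home/hdfs/lsrreport.txt}"
  dest="${2:-/user/hdfs/lsr/lsrreport.txt}"
  # Create the target directory first, then overwrite any earlier copy of the report.
  hdfs dfs -mkdir -p "$(dirname "$dest")" &&
    hdfs dfs -put -f "$src" "$dest"
}

# Example:
#   upload_report
```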

4. Analyze the ls -R output:

Create the required table, load the data, create a view, and analyze:

hive> add jar /usr/hdp/2.3.2.0-2950/hive/lib/hive-contrib.jar;
hive> CREATE EXTERNAL TABLE lsr (permissions STRING, replication STRING, owner STRING, ownergroup STRING, size STRING, fileaccessdate STRING, time STRING, file_path STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)");
hive> load data inpath '/user/hdfs/lsr/lsrreport.txt' overwrite into table lsr;
hive> create view lsr_view as select (case substr(permissions,1,1) when 'd' then 'dir' else 'file' end) as file_type,owner,cast(size as bigint) as size, fileaccessdate,time,file_path from lsr;
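
To see how a report line maps onto the table columns and the file_type column of the view, the same split can be previewed locally before loading anything into Hive. This is a rough sketch, assuming the usual eight-field layout of hdfs dfs -ls -R output. The helper name lsr_preview is hypothetical, and unlike the regex above it keeps only the first word of a path that contains spaces.

```shell
# Print file_type, owner, size, date, time and path for each entry of an ls -R report.
# Lines with fewer than eight fields are skipped.
lsr_preview() {
  awk 'NF >= 8 {
    # The first character of the permission string marks directories, as in lsr_view.
    type = (substr($1, 1, 1) == "d") ? "dir" : "file"
    printf "%s\t%s\t%s\t%s\t%s\t%s\n", type, $3, $5, $6, $7, $8
  }' "$1"
}

# Example:
#   lsr_preview /data/home/hdfs/lsrreport.txt | head
```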

Query 1: Files < 1 MB (Top 100)

hive> select relative_size,fileaccessdate,file_path as total from (select (case size < 1048576 when true then 'small' else 'large' end) as relative_size,fileaccessdate,file_path from lsr_view where file_type='file') tmp where relative_size='small' limit 100;

Query 2: Files < 1 MB (Grouped by Path)

hive> select substr(file_path,1,45) ,count(*) from (select relative_size,fileaccessdate,file_path from (select (case size < 1048576 when true then 'small' else 'large' end) as relative_size,fileaccessdate,file_path from lsr_view where file_type='file') tmp where relative_size='small') tmp2 group by substr(file_path,1,45) order by 2 desc;

Query 3: Files < 1 KB (Grouped by Owner)

hive> select owner ,count(1) from (select (case size < 1024 when true then 'small' else 'large' end) as relative_size,fileaccessdate,owner  from lsr_view where file_type='file') tmp where relative_size='small' group by owner;

Query 4: Files < 1 KB (Grouped by Date)

hive> select fileaccessdate ,count(1) from (select (case size < 1024 when true then 'small' else 'large' end) as relative_size,fileaccessdate,owner  from lsr_view where file_type='file' ) tmp where relative_size='small' group by fileaccessdate;
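
The per-owner count can also be cross-checked against the raw report without Hive. This is a minimal sketch, assuming the local report is still available and uses the eight-field ls -R layout. The helper name small_files_by_owner and its optional size limit argument are hypothetical.

```shell
# Count files smaller than a size limit (default 1024 bytes) per owner.
# Directories (permission string starting with "d") are excluded.
small_files_by_owner() {
  report="$1"
  limit="${2:-1024}"
  awk -v limit="$limit" '
    NF >= 8 && substr($1, 1, 1) != "d" && ($5 + 0) < (limit + 0) { count[$3]++ }
    END { for (owner in count) printf "%s\t%d\n", owner, count[owner] }
  ' "$report" | sort
}

# Example:
#   small_files_by_owner /data/home/hdfs/lsrreport.txt 1024
```

If the numbers differ from the Hive result, lines that did not match the table's input.regex are a likely place to look.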
Comments

Great article!

I was wondering what the -o option in "hdfs oiv -i" does?