1. Fetch the latest FS Image from the Active NameNode:

Check the NameNode directories property in Ambari to find where the Active NameNode keeps its image files, then copy the latest FS Image (e.g. fsimage_0000000001138083674) to a node with enough free disk space and memory.
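
A minimal sketch of locating and copying the newest image, assuming the NameNode directory reported by Ambari is /hadoop/hdfs/namenode and that the destination host (here called analysis-node) is a placeholder reachable over SSH:

# On the Active NameNode: pick the most recent fsimage file (skip the .md5 checksum files)
LATEST=$(ls -t /hadoop/hdfs/namenode/current/fsimage_0* | grep -v '\.md5$' | head -1)
# Copy it to the node that will run the analysis (analysis-node is a placeholder)
scp "$LATEST" analysis-node:/data/home/hdfs/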

2. Load the FS Image:

On the node where you copied the FS Image, run the commands below:

# Give the image viewer a large heap; the whole FS Image is loaded into memory
export HADOOP_OPTS="-Xms16000m -Xmx16000m $HADOOP_OPTS"
# Load the image with the Offline Image Viewer (oiv) and leave it running in the background
nohup hdfs oiv -i fsimage_0000000001138083674 -o fsimage_0000000001138083674.txt &

The above command loads the FS Image into oiv's default Web processor, which serves it through a temporary, read-only WebHDFS-compatible web server on port 5978 (the -o output file is only used by the other oiv processors, such as XML or Delimited).
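
Before generating the full report, you can verify the viewer is up with a quick listing against the temporary endpoint (this assumes the viewer is running on the same host with its default port, 5978):

# List the root of the loaded FS Image through the temporary WebHDFS endpoint
hdfs dfs -ls webhdfs://127.0.0.1:5978/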

3. Create "ls -R" report from the FS Image:

# Recursively list every path in the loaded FS Image and save the listing locally
nohup hdfs dfs -ls -R webhdfs://127.0.0.1:5978/ > /data/home/hdfs/lsrreport.txt &

This could take some time.

Copy the report from the local path /data/home/hdfs/lsrreport.txt into HDFS at /user/hdfs/lsr/lsrreport.txt, for example as sketched below.
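
A minimal sketch of that copy, assuming the hdfs user can write to /user/hdfs:

# Create the target directory in HDFS and upload the local report
hdfs dfs -mkdir -p /user/hdfs/lsr
hdfs dfs -put -f /data/home/hdfs/lsrreport.txt /user/hdfs/lsr/lsrreport.txt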

4. Analyze the ls -R output:

Create the required table, load the data, create a view, and analyze:

hive> add jar /usr/hdp/2.3.2.0-2950/hive/lib/hive-contrib.jar;
hive> CREATE EXTERNAL TABLE lsr (permissions STRING, replication STRING, owner STRING, ownergroup STRING, size STRING, fileaccessdate STRING, time STRING, file_path STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)");
hive> load data inpath '/user/hdfs/lsr/lsrreport.txt' overwrite into table lsr;
hive> create view lsr_view as select (case substr(permissions,1,1) when 'd' then 'dir' else 'file' end) as file_type,owner,cast(size as int) as size, fileaccessdate,time,file_path from lsr;
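
To confirm the RegexSerDe parsed the listing as expected, you can sample the view before running the analysis queries:

hive> select file_type, owner, size, file_path from lsr_view limit 5;
hive> select file_type, count(*) from lsr_view group by file_type;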

Query 1: Files < 1 MB (Top 100)

hive> select relative_size,fileaccessdate,file_path from (select (case size < 1048576 when true then 'small' else 'large' end) as relative_size,fileaccessdate,file_path from lsr_view where file_type='file') tmp where relative_size='small' limit 100;

Query 2: Files < 1 MB (Grouped by Path)

hive> select substr(file_path,1,45) ,count(*) from (select relative_size,fileaccessdate,file_path from (select (case size < 1048576 when true then 'small' else 'large' end) as relative_size,fileaccessdate,file_path from lsr_view where file_type='file') tmp where relative_size='small') tmp2 group by substr(file_path,1,45) order by 2 desc;

Query 3: Files < 1 KB (Grouped by Owner)

hive> select owner ,count(1) from (select (case size < 1024 when true then 'small' else 'large' end) as relative_size,fileaccessdate,owner  from lsr_view where file_type='file') tmp where relative_size='small' group by owner;

Query 4: Files < 1 KB (Grouped by Date)

hive> select fileaccessdate ,count(1) from (select (case size < 1024 when true then 'small' else 'large' end) as relative_size,fileaccessdate,owner  from lsr_view where file_type='file' ) tmp where relative_size='small' group by fileaccessdate;
Comments

Great article!

I was wondering, what does the -o option in "hdfs oiv -i" do?
