
hadoop -count returning wrong result

Explorer

Hi All,

 

As part of our Cloudera BDR backup & restore validation, we use the command below to verify that the backed-up and restored files are the same.

 

hdfs dfs -count /data

 

Before starting the replication schedule, my /data directory in the source cluster contains 6982 directories and 10,887 files. Please see the result of the hdfs count command:

[user@example ~]$ hdfs dfs -count /data
6982 10,887 11897305288 /data

and

[user@example~]$ hdfs dfs -ls -R /data | wc -l
17869
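(For readability, -count also accepts -v, which prints the column header, and the same check can be run against the destination cluster by passing a full hdfs:// URI. The destination NameNode host, port and path below are placeholders, not from this setup.)

# -v prints the DIR_COUNT / FILE_COUNT / CONTENT_SIZE / PATHNAME header
$ hdfs dfs -count -v /data

# same check against the destination cluster, addressed by a full URI
# (the host, port and path here are placeholders)
$ hdfs dfs -count -v hdfs://dest-namenode:8020/data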

 

We ran the replication manually (via the distcp command line); due to a space crunch on the remote server, the distcp job failed. Then we ran the command below to check the hdfs count:

 

[user@example tmp]$ hdfs dfs -count /data
6982 21756 11940958360 /data

 

[user@example tmp]$ hdfs dfs -ls -R /data | wc -l
17869

 

There is a deviation compared to the file count before the operation; the file count has almost doubled. However, the ls -R result still gives the actual count (6982 + 10,887 = 17,869).

 

Ideally, the output of the hdfs dfs -count command should report 10,887 files and 6982 directories.

 

What could be the reason for this inconsistent result? We restarted the cluster suspecting some cache issue, but despite that, the counts mentioned above stayed the same.

 

Thanks in advance,

Kathik

6 REPLIES

Champion
If you have the enterprise version, you could download the disk usage report and check which folder has most of the files.

https://www.cloudera.com/documentation/enterprise/5-9-x/topics/cm_dg_disk_usage_reports.html


Hi, I think it is related to snapshots or hidden directories. Maybe distcp is preparing a snapshot, and since it failed, it left these temporary objects in HDFS.
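A quick way to test that hypothesis, assuming /data (or one of its parents) has been made snapshottable, e.g. by a BDR replication schedule:

# list directories that have snapshots enabled
$ hdfs lsSnapshottableDir

# snapshots live under a hidden .snapshot directory that hdfs dfs -ls -R does not traverse
$ hdfs dfs -ls /data/.snapshot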

Contributor

I encountered the same issue: hdfs dfs -count returns an incorrect file count. The directory has 76 files, but -count reports 77 files. The CONTENT_SIZE from -count matches the total sum of the individual files in the directory.

 

I think it is a bug in the -count operation that makes it report an incorrect file count.

Any comments from experts here?

 

$ hdfs dfs -count -v /PROJECTS/flume_data/dirname1/2018/11/27/12
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
           1           77              78855 /PROJECTS/flume_data/dirname1/2018/11/27/12

$ hdfs dfs -ls -R /PROJECTS/flume_data/dirname1/2018/11/27/12 | wc -l
76


$ hdfs dfs -du -s -x /PROJECTS/flume_data/dirname1/2018/11/27/12
78855  236565  /PROJECTS/flume_data/dirname1/2018/11/27/12


## manually sum the individual file sizes in the directory
$ hdfs dfs -du -x /PROJECTS/flume_data/dirname1/2018/11/27/12 | awk '{print $1}' | sed 's/$/+/g' | tr -d '\n' | sed 's/$/0\n/' | bc
78855

$ hdfs dfs -du -x /PROJECTS/flume_data/dirname1/2018/11/27/12 | wc -l
76

 

 

Champion

Do you have HA configured for the NameNode, by any chance? Also, double-check the results in the NameNode UI.

Contributor

The NameNode UI will provide the total number of files and directories. Is there a way we can see the number of files per directory using the NameNode UI?
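Not through the UI itself, but the NameNode's WebHDFS endpoint exposes the same per-directory content summary. A minimal sketch, assuming WebHDFS is enabled; the host is a placeholder, and the port is 9870 on Hadoop 3 (50070 on older releases):

$ curl -s "http://namenode-host:9870/webhdfs/v1/data?op=GETCONTENTSUMMARY"
# returns JSON with directoryCount, fileCount and length for /data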

 

 

New Contributor

I had exactly the same issue, and it turned out that the count also includes snapshots. To check whether that's the case, one can add the -x option to the count, e.g.:

 

hdfs dfs -count -v -h -x /user/hive/warehouse/my_schema.db/*
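Applied to the original question, comparing the default count with the -x count should show whether snapshots explain the jump from 10,887 to 21,756 files; hdfs dfs -ls -R never descends into the hidden .snapshot directory, which is why its total stayed at 17,869.

# default: snapshot contents are included in FILE_COUNT and CONTENT_SIZE
$ hdfs dfs -count -v /data

# -x excludes snapshot contents; this should line up with the ls -R totals
$ hdfs dfs -count -v -x /data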