Is it possible that HDFS utilization could be different in two different clusters?

Re: Is it possible that HDFS utilization could be different in two different clusters?

Guru

@Matt Foley:

I checked and could not find any difference at the local filesystem level. You can see the output below.

Cluster A output:

[aman@clustera ~]$ du -h foo.txt

4.0K    foo.txt

Cluster B output:

[aman@clusterb ~]$ du -h foo.txt

4.0K    foo.txt
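As a side note, if local block-size rounding were the concern, the apparent size and the allocated blocks can be compared directly. A minimal sketch, assuming GNU coreutils and using the same foo.txt as above:

[aman@clustera ~]$ du -h --apparent-size foo.txt   # bytes the file actually contains
[aman@clustera ~]$ du -h foo.txt                   # space consumed, rounded up to whole filesystem blocks
[aman@clustera ~]$ stat -c 'size=%s bytes, blocks=%b blocks of %B bytes' foo.txt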

Re: Is it possible that HDFS utilization could be different in two different clusters?

Ah well, worth a try :-)

Re: Is it possible that HDFS utilization could be different in two different clusters?

Can you do an ls -R as Artem asked? If you have 33 GB in the folder, there has to be more than a 4 KB foo file. Also, @Matt Foley, block size in the underlying file system, that is smart. Didn't think of that :-)
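In case it helps, here is a rough sketch of how the recursive listings could be compared between the two clusters (<path> is a placeholder for the directory in question, and one listing file has to be copied to the other host before diffing):

[aman@clustera ~]$ hdfs dfs -ls -R <path> > listing_clustera.txt
[aman@clusterb ~]$ hdfs dfs -ls -R <path> > listing_clusterb.txt
# columns 5 and 8 of the ls output are file size and path
[aman@clustera ~]$ diff <(awk '{print $5, $8}' listing_clustera.txt | sort) <(awk '{print $5, $8}' listing_clusterb.txt | sort)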


Re: Is it possible that HDFS utilization could be different in two different clusters?

Can you do a "hdfs dfs -count <path>" on both directories.Just to see if they both have the same amount of files and folders or if one of them has more files and therefore is larger in regards to the size

Re: Is it possible that HDFS utilization could be different in two different clusters?

Guru

@Jonas Straub: Yes, you are right. I ran count on both clusters and found roughly double the number of files and directories on cluster A. But when I did a hadoop fs -ls, I did not get the same numbers.

[aman@clustera ~]$ hdfs dfs -count /sample/data/datasets/files/sitecatdatapipeline/

1502 14852 64550284959 /sample/data/datasets/files/sitecatdatapipeline

[aman@clusterb ~]$ hdfs dfs -count /sample/data/datasets/files/sitecatdatapipeline/

585 7738 35444853870 /sample/data/datasets/files/sitecatdatapipeline

Re: Is it possible that HDFS utilization could be different in two different clusters?

Mentor

I agree with a lot of the ideas in the existing answers, i.e., further verifying that the directory contents are really the same at the file-system level (not just rows from Hive queries), and that the existence of snapshots may cause a difference in reported space consumption.
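If snapshots are the suspect, one way to check is sketched below; the paths are the ones from this thread, and availability of the -x flag depends on the Hadoop release, so treat that part as optional:

[aman@clustera ~]$ hdfs lsSnapshottableDir
# snapshots are hidden from normal listings, but .snapshot can be listed explicitly
# directly under a snapshottable directory:
[aman@clustera ~]$ hdfs dfs -ls /sample/data/datasets/files/sitecatdatapipeline/.snapshot
# on newer releases, -x excludes snapshot usage, so the difference between these
# two totals is the space attributable to snapshots:
[aman@clustera ~]$ hdfs dfs -du -s -h /sample/data/datasets/files/sitecatdatapipeline
[aman@clustera ~]$ hdfs dfs -du -s -h -x /sample/data/datasets/files/sitecatdatapipeline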

Re: Is it possible that HDFS utilization could be different in two different clusters?

Guru

@Artem Ervits: I went inside each file and directory and got the output below.

I am doing distcp with the overwrite option. So I think distcp is not doing a true overwrite; it is just adding space for that file, although the data is getting overwritten.

hadoop distcp -delete -overwrite ${HDFS_NM}/${SOURCE_DIR}/${tableName}/feed_date=${currDate} ${Target_NM}/${TARGET_DIR}/${tableName}/feed_date=${currDate}
[aman@clustera ~]$ hadoop fs -du -h /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl
91.7 M   /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl/feed_date=2016-02-04
[aman@clustera ~]$ hadoop fs -du -h  /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl/feed_date=2016-02-04
15.3 M  /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl/feed_date=2016-02-04/000000_0
15.3 M  /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl/feed_date=2016-02-04/000001_0
15.3 M  /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl/feed_date=2016-02-04/000002_0
[aman@clusterb ~]$ hadoop fs -du -h /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl
45.9 M  /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl/feed_date=2016-02-04
[aman@clusterb ~]$ hadoop fs -du -h  /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl/feed_date=2016-02-04
15.3 M  /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl/feed_date=2016-02-04/000000_0
15.3 M  /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl/feed_date=2016-02-04/000001_0
15.3 M  /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl/feed_date=2016-02-04/000002_0
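One thing that might explain the table directory on cluster A reporting roughly twice the size of its only partition is content that a plain listing does not show. A sketch of what could be checked on cluster A, using the paths from the output above (the .snapshot path only exists directly under a snapshottable directory):

[aman@clustera ~]$ hadoop fs -ls -R /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl
# anything besides the three feed_date=2016-02-04 files would account for the extra ~45 M
[aman@clustera ~]$ hadoop fs -ls /sample/data/datasets/files/sitecatdatapipeline/lowes_clickstream_kpis.db/cs_all_emailid_prod_lvl_cart_events_but_not_purchase_tbl/.snapshot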