Created on 09-03-2018 05:05 PM - edited 08-18-2019 02:05 AM
hi all
we have ambari cluster version 2.6.1 & HDP version 2.6.4
From the dashboard we can see that HDFS Disk Usage is almost 90%, but the datanode disks are only around 50% used.
So why does HDFS show 90% while the datanode disks are only at 50%?
/dev/sdc  20G   11G  8.7G  56%  /data/sdc
/dev/sde  20G   11G  8.7G  56%  /data/sde
/dev/sdd  20G   11G  9.0G  55%  /data/sdd
/dev/sdb  20G  8.9G   11G  46%  /data/sdb
Is this a tuning problem, or something else?
We also performed a rebalance from the Ambari GUI, but that didn't help.
Created on 09-05-2018 12:56 AM - edited 08-18-2019 02:05 AM
As the NameNode report and UI (including the Ambari UI) show that your DFS Used is reaching almost 87% to 90%, it would be really good if you can increase the DFS capacity.
In order to understand this in detail, note that: Non DFS Used = Configured Capacity - DFS Remaining - DFS Used.
You can refer to the following article, which explains the concepts of Configured Capacity, Present Capacity, DFS Used, DFS Remaining, and Non DFS Used in HDFS. The diagram there clearly explains these space parameters, treating HDFS as a single disk.
https://community.hortonworks.com/articles/98936/details-of-the-output-hdfs-dfsadmin-report.html
The article above is one of the best for understanding the DFS and non-DFS calculations and their remedies.
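As a quick worked sketch of that formula (the numbers below are illustrative, loosely modeled on a cluster of 16 x 20G disks; they are not taken from your actual report):

```shell
# Illustrative numbers only: 16 disks x 20G, expressed in MB.
configured=$((16 * 20 * 1024))   # Configured Capacity = 327680 MB
dfs_used=$((280 * 1024))         # DFS Used (~87.5%)   = 286720 MB
dfs_remaining=$((40 * 1024))     # DFS Remaining       = 40960 MB
non_dfs=$((configured - dfs_remaining - dfs_used))
echo "Non DFS Used: ${non_dfs} MB"
```

A Non DFS Used of 0 in this sketch mirrors what your report shows; on many clusters it is non-zero because other processes also write to the same mounts.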
You add capacity by giving dfs.datanode.data.dir more mount points or directories; the property lives in hdfs-site.xml (in Ambari, where exactly it appears in the configs page depends on the Ambari version, often under the advanced section). The more new disks you provide through the comma-separated list, the more capacity you will have. Preferably, every machine should have the same disk and mount-point structure.
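For illustration, an expanded dfs.datanode.data.dir in hdfs-site.xml might look like this (the fifth mount point, /data/sdf, is hypothetical, standing in for a newly added disk):

```xml
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- existing four disks plus a hypothetical new one, comma-separated -->
  <value>/data/sdb/hadoop/hdfs/data,/data/sdc/hadoop/hdfs/data,/data/sdd/hadoop/hdfs/data,/data/sde/hadoop/hdfs/data,/data/sdf/hadoop/hdfs/data</value>
</property>
```

Make the change through Ambari rather than editing the file by hand, so it is pushed to all datanodes consistently, and restart the DataNode services afterwards.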
Created on 09-04-2018 01:35 AM - edited 08-18-2019 02:05 AM
The HDFS dashboard metrics widget "HDFS Disk Usage" shows: "The percentage of distributed file system (DFS) used, which is a combination of DFS and non-DFS used."
So can you hover your mouse over the "HDFS Disk Usage" widget and check the values you see there for "DFS Used", "Non DFS Used", and "Remaining"?
Created on 09-04-2018 04:44 AM - edited 08-18-2019 02:05 AM
@Jay, this is what we got (so it is around 88% used). And regarding my question: how can it be 88% when the disks are at ~50%?
Created 09-04-2018 05:26 AM
What do you see when you run the following command?
# su - hdfs -c "hdfs dfsadmin -report | grep 'DFS Used%'"
(OR)
Please also check the "DFS Used" shown in the NameNode UI to verify whether Ambari is showing the same data or different: http://$ACTIVE_NAMENODE:50070/dfshealth.html#tab-overview
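The same counters the UI renders can also be pulled from the NameNode JMX endpoint (the host variable and port 50070 are the HDP 2.x defaults; the JSON below is a trimmed, illustrative sample of the response, not output from your cluster):

```shell
# Against a live cluster you would run:
#   curl -s "http://$ACTIVE_NAMENODE:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"
# Filtering a sample response for the fields the dashboard aggregates:
cat <<'EOF' | grep -E '"(Total|Used|NonDfsUsedSpace|Free)"'
{ "beans" : [ {
  "Total" : 343597383680,
  "Used" : 300647710720,
  "NonDfsUsedSpace" : 0,
  "Free" : 42949672960
} ] }
EOF
```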
Created 09-04-2018 05:25 AM
@Jay, I want to note that we have 4 datanode machines (4 worker machines), and each worker has 4 disks of 20G.
Created 09-04-2018 05:29 AM
@jay , we got the following results
su - hdfs -c "hdfs dfsadmin -report | grep 'DFS Used%'"
DFS Used%: 87.38%
DFS Used%: 88.48%
DFS Used%: 87.00%
DFS Used%: 84.70%
DFS Used%: 87.93%
Created 09-04-2018 05:43 AM
@Jay, here is more info from my datanodes about their disk usage:
            disk1  disk2  disk3  disk4
datanode1    47%    64%    48%    49%
datanode2    44%    53%    61%    44%
datanode3    55%    46%    45%    91%
datanode4    63%    45%    49%    46%
Created 09-04-2018 06:16 AM
@Jay do you need other info?
Created on 09-04-2018 09:36 AM - edited 08-18-2019 02:05 AM
Created 09-04-2018 10:05 AM
As "HDFS Disk Usage" shows: The percentage of distributed file system (DFS) used, which is a combination of DFS and non-DFS used.
The NameNode commands/UI show that DFS Used is around 87.06% and Non DFS Used is 0%.
That is almost the same as what Ambari is showing (almost 88%, DFS + Non DFS usage), so there seems to be no contradiction to me.
Please let us know what is the value you are expecting.
Created 09-04-2018 10:16 AM
@Jay, yes I agree, but how can the datanode disks be at ~50% capacity while HDFS shows 88%? And why are we not using the full size of the datanode disks? I really don't understand that.
Created 09-05-2018 09:01 PM
hi
per your request , this is the file
<name>dfs.datanode.data.dir</name>
<value>/data/sdb/hadoop/hdfs/data,/data/sdc/hadoop/hdfs/data,/data/sdd/hadoop/hdfs/data,/data/sde/hadoop/hdfs/data</value>
--
<name>dfs.datanode.data.dir.perm</name>
<value>750</value>
Created 09-06-2018 06:48 AM
Looks good to me. Just do one more check: what config is actually loaded into the NameNode's memory?
http://<active nn host>:50070/conf
and find "dfs.datanode.data.dir" there.
You should also share the logs with us. There is no point in going on with assumptions. 🙂
Created 09-06-2018 07:04 AM
This is the relevant info from the file (it's a long file, but I think you want to look at the relevant disks):
<value>/data/sdb/hadoop/hdfs/data,/data/sdc/hadoop/hdfs/data,/data/sdd/hadoop/hdfs/data,/data/sde/hadoop/hdfs/data</value>
<source>hdfs-site.xml</source>
</property>
<value>/data/sdb/hadoop/hdfs/data,/data/sdc/hadoop/hdfs/data,/data/sdd/hadoop/hdfs/data,/data/sde/hadoop/hdfs/data</value>
<source>hdfs-site.xml</source>
</property>
Created 09-06-2018 07:06 AM
About the logs, please remind me which logs you want to look at.
Created 09-06-2018 07:12 AM
Thanks for the confirmation. I need namenode and datanode log after HDFS service restart.
Created 09-06-2018 07:23 AM
Because the logs are huge, do you want me to search for a specific string in them?
Created 09-06-2018 07:28 AM
You can tail the namenode and datanode logs; you can also redirect the output to a temporary log file during the restart:
#tailf <namenode log> >/tmp/namenode-`hostname`.log
#tailf <datanode log> >/tmp/datanode-`hostname`.log
Created 09-04-2018 10:16 AM
HDFS splits and stores data in blocks. Each block is 64MB or 128MB by default, depending on your HDFS version. Consider a file of size 2MB stored in a block: the remaining 62MB (assuming the default 64MB block size) is not used by HDFS for other data. That means the HDFS used space here is 64MB, while the actual hard-disk used space is 2MB.
Hope this is relevant to what you are asking.
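To put numbers on that small-files example (purely illustrative: 1000 files of 2 MB each, the 64 MB default block size, and replication factor 3):

```shell
# Illustrative arithmetic for the small-files example above.
files=1000; file_mb=2; block_mb=64; repl=3
actual=$((files * file_mb * repl))      # MB actually written to the disks
per_block=$((files * block_mb * repl))  # MB if each file occupied a full block
echo "on-disk: ${actual} MB, block-accounted: ${per_block} MB"
```

The gap between the two figures is why a cluster full of tiny files can look far "fuller" in HDFS accounting than the raw disk usage suggests.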
Created 09-04-2018 10:32 AM
@Rabbit, so in our case, given that HDFS usage is 88%, the only option to get more HDFS space is to add a disk to each datanode? Am I right?
Created 09-04-2018 10:54 AM
Yes, possibly, @Michael Bronson. You could also check the "Trash" files size, as @Geoffrey Shelton Okot suggests.
In general it is advised not to store too many small files in HDFS; HDFS is best suited for storing large files.