Created 11-10-2017 10:54 AM
When we start the DataNode on one of the worker machines, we get:
ERROR datanode.DataNode (DataNode.java:secureMain(2691)) - Exception in secureMain org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 4, volumes configured: 5, volumes failed: 1, volume failures tolerated: 0
and this:
WARN checker.StorageLocationChecker (StorageLocationChecker.java:check(208)) - Exception checking StorageLocation [DISK]file:/grid/sdc/hadoop/hdfs/data/ org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: /xxxx/sdc/hadoop/hdfs/data
What are the steps needed to fix it?
Created 11-10-2017 12:13 PM
Check whether you have write permissions for '/xxxx/sdc/hadoop/hdfs/data'. Change the ownership to hdfs:hadoop:
chown hdfs:hadoop /xxxx/sdc/hadoop/hdfs/data
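For example, a quick check that the ownership is right and that the hdfs user can actually write there (paths are the ones from your log; the test file name is arbitrary):
# Show the owner and permissions of the data directory itself.
ls -ld /xxxx/sdc/hadoop/hdfs/data
# Try writing as the hdfs user; if this fails even with correct ownership,
# the disk or mount is the problem, not permissions.
sudo -u hdfs touch /xxxx/sdc/hadoop/hdfs/data/write_test
sudo -u hdfs rm -f /xxxx/sdc/hadoop/hdfs/data/write_test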
If you are okay with failed volumes, then you can change 'dfs.datanode.failed.volumes.tolerated' to 1. Another solution is to remove the above directory (/xxxx/sdc/hadoop/hdfs/data) from 'dfs.datanode.data.dir'.
Thanks,
Aditya
Created 11-10-2017 12:25 PM
Hi Aditya, it is not clear to me: if we change dfs.datanode.failed.volumes.tolerated to 1, then it will affect all worker machines, while we have a problem only on worker01. So do you mean that we need to change it to 1, restart the HDFS service, and then return it to 0?
Created 11-10-2017 12:27 PM
On the second approach: if we remove the folder /xxxx/sdc/hadoop/hdfs/data on the problematic worker and then restart the HDFS component on that worker, will it create the data folder again?
Created 11-10-2017 12:29 PM
Here are the permissions:
ls -ltr /xxxxx/sdc/hadoop/hdfs/data/
drwxr-xr-x. 3 hdfs hadoop 4096 current
-rw-r--r--. 1 hdfs hadoop 28 in_use.lock
Created 11-10-2017 12:31 PM
WARN checker.StorageLocationChecker (StorageLocationChecker.java:check(208)) - Exception checking StorageLocation [DISK]file:/grid/sdc/hadoop/hdfs/data/ org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: /xxxx/sdc/hadoop/hdfs/data
The above error can occur when the hard disk/filesystem has gone bad and the filesystem is in read-only mode. Remounting might help. Please check for any hardware errors, check the hard disk, and remount the volume.
It would also be good to check the "dfs.datanode.failed.volumes.tolerated" property in "/etc/hadoop/conf/hdfs-site.xml"; it sets the disk failure tolerance.
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
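For example, a quick way to check whether that volume has dropped to read-only (the /grid/sdc mount point is taken from the log above; adjust to the real mount point on your host):
# Look for "ro" in the mount options of the affected volume.
grep sdc /proc/mounts
# Check the kernel log for disk errors or a forced read-only remount.
dmesg | grep -i -E 'sdc|i/o error|read-only'
# Only after the hardware issue is addressed, try remounting read-write.
mount -o remount,rw /grid/sdc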
Created 11-10-2017 12:44 PM
hi Jay -
grep dfs.datanode.failed.volumes.tolerated /etc/hadoop/conf/hdfs-site.xml
<name>dfs.datanode.failed.volumes.tolerated</name>
This is already set in the XML file.
Created 11-10-2017 12:53 PM
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
dfs.datanode.failed.volumes.tolerated (default: 0): The number of volumes that are allowed to fail before a datanode stops offering service. By default any volume failure will cause a datanode to shutdown.
The default value is 0. Please set it to 1 and then try again, or fix the failed volume.
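To double-check the value that is actually in effect on that worker, something like this should help (the grep context size is arbitrary; hdfs getconf reads the client-side configuration):
# Show the property together with its value in the config file.
grep -A 1 'dfs.datanode.failed.volumes.tolerated' /etc/hadoop/conf/hdfs-site.xml
# Or query the configuration directly.
hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated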
Created 11-10-2017 12:56 PM
Hi Jay, I have an idea but I am not sure about it, so I need your advice. On the problematic worker we have an extra volume, sdg, and the bad volume is sdf. Maybe we need to unmount sdf, mount the volume sdg in its place, change the DataNode directories in the Ambari GUI from sdf to sdg, and then restart the HDFS component on the worker. What do you think?
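Roughly what I have in mind (a sketch only; device names, partition numbers, and mount points are guesses based on this thread and must be checked with lsblk/blkid first, and the DataNode should be stopped in Ambari before any of this):
# Unmount the bad volume (only after stopping the DataNode on this host).
umount /grid/sdf
# Mount the spare volume at its own mount point (or add it to /etc/fstab and run 'mount -a').
mount /dev/sdg1 /grid/sdg
# Prepare the DataNode directory on the new volume.
mkdir -p /grid/sdg/hadoop/hdfs/data
chown -R hdfs:hadoop /grid/sdg/hadoop/hdfs/data
# Then, in Ambari, replace the sdf path with the sdg path in the DataNode
# directories (dfs.datanode.data.dir) setting and restart the DataNode on this worker.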
Created 11-10-2017 12:35 PM
1) The first solution is to try changing the ownership of the directory and restarting. If this works, then there is no need to change anything.
2) If #1 doesn't work and you are okay with removing this volume, then remove the directory from "dfs.datanode.data.dir" and let the value of 'dfs.datanode.failed.volumes.tolerated' remain 0 (see the sketch after this list).
3) If you do not want to remove this volume and you are okay continuing with it failed, then set 'dfs.datanode.failed.volumes.tolerated' to 1.
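For option 2, a minimal sketch of the check and the change (the directory paths shown are assumptions; on an Ambari-managed cluster make the edit in the Ambari UI rather than in hdfs-site.xml by hand):
# Show the current comma-separated list of DataNode data directories.
hdfs getconf -confKey dfs.datanode.data.dir
# Example output (paths assumed):
#   /grid/sdb/hadoop/hdfs/data,/grid/sdc/hadoop/hdfs/data,/grid/sdd/hadoop/hdfs/data
# Remove the failed /grid/sdc entry from that list in the DataNode directories
# setting (dfs.datanode.data.dir) and restart the DataNode on the affected worker only.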