
Can't start DataNode from Ambari cluster


When we start the DataNode on one of the worker machines we get:

ERROR datanode.DataNode (DataNode.java:secureMain(2691)) - Exception in secureMain org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 4, volumes configured: 5, volumes failed: 1, volume failures tolerated: 0

and this:

WARN checker.StorageLocationChecker (StorageLocationChecker.java:check(208)) - Exception checking StorageLocation [DISK]file:/grid/sdc/hadoop/hdfs/data/ org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: /xxxx/sdc/hadoop/hdfs/data

What are the steps needed to fix it?

Michael-Bronson
1 ACCEPTED SOLUTION

Master Mentor

@Michael Bronson

WARN checker.StorageLocationChecker (StorageLocationChecker.java:check(208)) - Exception checking 
StorageLocation [DISK]file:/grid/sdc/hadoop/hdfs/data/ org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: /xxxx/sdc/hadoop/hdfs/data

The above error can occur when the hard disk/filesystem has gone bad and the filesystem is in read-only mode. Remounting might help. Please check for any hardware errors, then check the hard disk and remount the volume.
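A minimal sketch of those checks, assuming the failed volume is mounted at /grid/sdc (the device and mount point are assumptions, adjust them to your layout):

# see whether the filesystem is currently mounted read-only (look for "ro" in the mount options)
grep sdc /proc/mounts

# look for disk/filesystem errors reported by the kernel
dmesg | grep -iE 'error|xfs|ext4' | tail -20

# try remounting read-write; if the disk is really bad this will usually fail or flip back to read-only
mount -o remount,rw /grid/sdc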

Also, check the "dfs.datanode.failed.volumes.tolerated" property in "/etc/hadoop/conf/hdfs-site.xml"; it sets how many disk failures are tolerated.

<property>
     <name>dfs.datanode.failed.volumes.tolerated</name>
     <value>1</value>
</property> 
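On an Ambari-managed cluster this property is normally changed under HDFS > Configs rather than by editing the file by hand. Either way, a quick sketch of how to confirm the value the DataNode actually sees, run on the affected worker:

hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated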



13 REPLIES

Super Guru

@Michael Bronson

Check if you have write permissions for '/xxxx/sdc/hadoop/hdfs/data'. Change the ownership to hdfs:hadoop:

chown hdfs:hadoop /xxxx/sdc/hadoop/hdfs/data
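A quick sketch of a write test as the hdfs user, using the path from the error above (if the filesystem is mounted read-only this will fail even with correct ownership):

sudo -u hdfs touch /xxxx/sdc/hadoop/hdfs/data/.write_test && sudo -u hdfs rm /xxxx/sdc/hadoop/hdfs/data/.write_test
ls -ld /xxxx/sdc/hadoop/hdfs/data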

If you are okay with failed volumes, then you can change 'dfs.datanode.failed.volumes.tolerated' to 1. Another solution is to remove the above directory (/xxxx/sdc/hadoop/hdfs/data) from 'dfs.datanode.data.dir'.
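For the second option, 'dfs.datanode.data.dir' is a comma-separated list, so you only drop the failed entry. A sketch with hypothetical remaining directories (on Ambari you would edit the DataNode directories field rather than the file):

<property>
     <name>dfs.datanode.data.dir</name>
     <!-- hypothetical example: /xxxx/sdc/hadoop/hdfs/data removed from the comma-separated list -->
     <value>/grid/sda/hadoop/hdfs/data,/grid/sdb/hadoop/hdfs/data,/grid/sdd/hadoop/hdfs/data,/grid/sde/hadoop/hdfs/data</value>
</property>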

Thanks,

Aditya


Hi Aditya, this is not clear to me: if we change dfs.datanode.failed.volumes.tolerated, it will affect all the worker machines, and we have a problem only on worker01. So do you mean that we need to change it to 1, restart the HDFS service, and then return it to 0?

Michael-Bronson


On the second approach, if we remove the folder /xxxx/sdc/hadoop/hdfs/data on the problematic worker and then restart the HDFS component on that worker, will it create the data folder again?

Michael-Bronson


Here are the permissions:

ls -ltr /xxxxx/sdc/hadoop/hdfs/data/

drwxr-xr-x. 3 hdfs hadoop 4096 current

-rw-r--r--. 1 hdfs hadoop 28 in_use.lock

Michael-Bronson



Hi Jay,

grep dfs.datanode.failed.volumes.tolerated /etc/hadoop/conf/hdfs-site.xml

<name>dfs.datanode.failed.volumes.tolerated</name>

this is already set in the XML file.
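That grep only shows the <name> line; a sketch to print the configured value as well:

grep -A 1 dfs.datanode.failed.volumes.tolerated /etc/hadoop/conf/hdfs-site.xml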

Michael-Bronson

Master Mentor

@Michael Bronson

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml


dfs.datanode.failed.volumes.tolerated (default: 0): The number of volumes that are allowed to fail before a datanode stops offering service. By default any volume failure will cause a datanode to shutdown.

The default value is 0. Please set it to 1 and then try again, or fix the failed volume.
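To confirm which volume the DataNode considers failed before changing anything, grepping the DataNode log can help. A sketch, assuming the usual HDP log location (the path may differ on your cluster):

grep -iE 'DiskErrorException|failed volume|not writable' /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log | tail -20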



Hi Jay, I have an idea but I'm not sure about it, so I need your advice. On the problematic worker we have an extra volume, sdg, and the bad volume is sdf. So maybe we need to umount sdf, mount the volume sdg in its place, change the DataNode directories in the Ambari GUI from sdf to sdg, and then restart the HDFS component on the worker. What do you think?
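The proposed swap would look roughly like the sketch below; the device and mount-point names (/grid/sdf, /dev/sdg1, /grid/sdg) are assumptions, so adjust them before running anything:

# assuming /grid/sdf is the bad mount point and /dev/sdg1 is the spare partition
umount /grid/sdf
mkdir -p /grid/sdg
mount /dev/sdg1 /grid/sdg
mkdir -p /grid/sdg/hadoop/hdfs/data
chown -R hdfs:hadoop /grid/sdg/hadoop/hdfs
# then change the DataNode directories from the Ambari GUI and restart the HDFS components on this worker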

Michael-Bronson

Super Guru

@Michael Bronson,

1) The first solution is to try changing the ownership of the directory and restart. If this works, there is no need to change anything else.

2) If #1 doesn't work and you are okay with removing this volume, then remove the directory from "dfs.datanode.data.dir" and let the value of 'dfs.datanode.failed.volumes.tolerated' remain 0.

3) If you do not want to remove this volume and are okay continuing with this failed volume, then set 'dfs.datanode.failed.volumes.tolerated' to 1.
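Whichever option you pick, after restarting the DataNode a quick sanity check from the NameNode's point of view is possible; a sketch, where worker01.example.com is a placeholder for the real hostname:

hdfs dfsadmin -report | grep -A 10 worker01.example.com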