Created 04-23-2018 05:09 PM
We have an Ambari cluster, HDP version 2.6.0.1.
We have an issue on worker02, according to the log hadoop-hdfs-datanode-worker02.sys65.com.log:
2018-04-21 09:02:53,405 WARN checker.StorageLocationChecker (StorageLocationChecker.java:check(208)) - Exception checking StorageLocation [DISK]file:/grid/sdc/hadoop/hdfs/data/
org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: /grid/sdc/hadoop/hdfs/data
Note: from the Ambari GUI we can see that the DataNode on worker02 is down.
We can see the following in the log ("Directory is not writable: /grid/sdc/hadoop/hdfs/data"):
STARTUP_MSG: Starting DataNode
STARTUP_MSG: user = hdfs
STARTUP_MSG: host = worker02.sys65.com/23.87.23.126
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.7.3.2.6.0.3-8
STARTUP_MSG: build = git@github.com:hortonworks/hadoop.git -r c6befa0f1e911140cc815e0bab744a6517abddae; compiled by 'jenkins' on 2017-04-01T21:32Z
STARTUP_MSG: java = 1.8.0_112
************************************************************/
2018-04-21 09:02:52,854 INFO datanode.DataNode (LogAdapter.java:info(47)) - registered UNIX signal handlers for [TERM, HUP, INT]
2018-04-21 09:02:53,321 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sdb/hadoop/hdfs/data/
2018-04-21 09:02:53,330 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sdc/hadoop/hdfs/data/
2018-04-21 09:02:53,330 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sdd/hadoop/hdfs/data/
2018-04-21 09:02:53,331 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sde/hadoop/hdfs/data/
2018-04-21 09:02:53,331 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sdf/hadoop/hdfs/data/
2018-04-21 09:02:53,405 WARN checker.StorageLocationChecker (StorageLocationChecker.java:check(208)) - Exception checking StorageLocation [DISK]file:/grid/sdc/hadoop/hdfs/data/
org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: /grid/sdc/hadoop/hdfs/data
    at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:124)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:99)
    at org.apache.hadoop.hdfs.server.datanode.StorageLocation.check(StorageLocation.java:128)
    at org.apache.hadoop.hdfs.server.datanode.StorageLocation.check(StorageLocation.java:44)
    at org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker$1.call(ThrottledAsyncChecker.java:127)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2018-04-21 09:02:53,410 ERROR datanode.DataNode (DataNode.java:secureMain(2691)) - Exception in secureMain
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 4, volumes configured: 5, volumes failed: 1, volume failures tolerated: 0
    at org.apache.hadoop.hdfs.server.datanode.checker.StorageLocationChecker.check(StorageLocationChecker.java:216)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2583)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2492)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2539)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2684)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2708)
2018-04-21 09:02:53,411 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2018-04-21 09:02:53,414 INFO datanode.DataNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at worker02.sys65.com/23.87.23.126
************************************************************/
We checked that:
1. All files and folders under /grid/sdc/hadoop/hdfs/ are owned by hdfs:hadoop, and that is OK.
2. Disk sdc is mounted read-write (rw,noatime,data=ordered), and that is OK.
We suspect that the hard disk has gone bad; in that case, how can we verify it?
Please advise what other options there are to resolve this issue.
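For the disk check, we assume we could start with the kernel log and the SMART status of the suspect device (smartctl requires the smartmontools package on worker02), e.g.:

# dmesg | grep -i sdc      # recent I/O errors or a remount-read-only event
# smartctl -H /dev/sdc     # overall SMART health verdict (PASSED/FAILED)
# smartctl -a /dev/sdc     # full SMART attributes (reallocated/pending sectors, etc.)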
Created 04-25-2018 08:25 PM
The disk is already unusable, so go ahead and run fsck with the -y option to repair it 🙂 (see above).
Either way, you will have to replace that dirty disk anyway!
Created 04-23-2018 05:58 PM
Could you try unmounting and remounting that disk? Your disk could have gone bad and the filesystem is in read-only mode. Can you also set the failure tolerance to 1?
Using the Ambari UI --> HDFS --> Configs, filter for the property "dfs.datanode.failed.volumes.tolerated" and set it to 1.
Restart the stale HDFS services.
All should be in order.
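Once Ambari pushes the change, you can verify it on worker02 (on HDP the client config is typically /etc/hadoop/conf/hdfs-site.xml; adjust the path if yours differs):

# grep -A1 dfs.datanode.failed.volumes.tolerated /etc/hadoop/conf/hdfs-site.xml
# the line after the <name> entry should now read <value>1</value>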
Created 04-23-2018 07:39 PM
Dear Geoffrey, we rebooted twice a few weeks ago, but this did not help (when we reboot we do actually remount). About setting "dfs.datanode.failed.volumes.tolerated" to 1: we want to keep it at 0 (we do not want to lose one disk).
Created 04-23-2018 08:17 PM
There could be a couple of reasons; let's check the obvious first. Have you checked SELinux on this host? If not:
$ echo 0 > /selinux/enforce
$ cat /selinux/enforce   # should output "0"
"Read-only filesystem" is not a permissions issue. The mount has become read-only, either because of errors in the filesystem or problems in the device itself. If you run "grep sdc /proc/mounts" you should see it as "ro". There may be some clue as to why in the messages in /var/log/syslog.
Run a filesystem check (fsck); it will repair some of the errors. Execute the fsck on an unmounted filesystem to avoid any data corruption issues, e.g.:
# fsck /dev/sdc
That should repair the damage.
Created 04-24-2018 04:34 AM
Dear Geoffrey, /grid/sdc holds an HDFS filesystem; isn't fsck on that disk risky? See also - http://fibrevillage.com/storage/658-how-to-use-hdfs-fsck-command-to-identify-corrupted-files
Created 04-24-2018 04:48 AM
What do you think about the following steps to fix corrupted files (I took them from - https://stackoverflow.com/questions/19205057/how-to-fix-corrupt-hdfs-files )?
To determine which files are having problems (this ignores lines with nothing but dots and lines talking about replication):
hdfs fsck / | egrep -v '^\.+$' | grep -v eplica
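Alternatively, if our HDFS version supports it, I understand the corrupt blocks can also be listed directly:
hdfs fsck / -list-corruptfileblocks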
Created 04-24-2018 04:50 AM
Once you find a file that is corrupt:
hdfs fsck /path/to/corrupt/file -locations -blocks -files
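And, as a last resort (per that Stack Overflow answer), the unrecoverable files could be removed, although this permanently deletes their data:
hdfs fsck / -delete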
Created 04-24-2018 07:19 AM
Above you are trying to fix corrupt HDFS blocks!! With the default replication factor of 3, you should be okay; below is how to fix the local filesystem.
What is your filesystem type, ext4 or something else? You can run
# e2fsck -y /dev/sdc
but then you will not have an opportunity to validate the corrections being applied. On the other hand, if you run
# e2fsck -n /dev/sdc
you can see what would happen without anything actually being applied, and if you run
# e2fsck /dev/sdc
you'll be asked each time a significant correction needs to be applied.
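If you are not sure of the filesystem type, either of these (standard on most Linux distributions) should show it:
# lsblk -f /dev/sdc
# blkid /dev/sdc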
Created 04-24-2018 07:55 AM
Dear Geoffrey, the filesystem is ext4.
Created 04-24-2018 08:02 AM
Dear Geoffrey, as you know, before performing fsck /dev/sdc we must umount /grid/sdc (or umount -l /grid/sdc); only then can we run fsck /dev/sdc. So can you finally approve the following steps?
1. umount /grid/sdc (or umount -l /grid/sdc in case the device is busy)
2. fsck /dev/sdc
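In full, the sequence we have in mind (assuming /dev/sdc is mounted only at /grid/sdc via /etc/fstab, and the DataNode on worker02 stays stopped until the end) would be roughly:

# umount /grid/sdc     # or umount -l /grid/sdc if the device is busy
# fsck -y /dev/sdc     # repair the ext4 filesystem, answering yes to all prompts
# mount /grid/sdc      # remount using the /etc/fstab entry

and then start the DataNode on worker02 again from Ambari.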