Created 04-23-2018 05:09 PM
We have an Ambari cluster, HDP version 2.6.0.1.
We have an issue on worker02; according to its log - hadoop-hdfs-datanode-worker02.sys65.com.log:
2018-04-21 09:02:53,405 WARN checker.StorageLocationChecker (StorageLocationChecker.java:check(208)) - Exception checking StorageLocation [DISK]file:/grid/sdc/hadoop/hdfs/data/ org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: /grid/sdc/hadoop/hdfs/data
Note - from the Ambari GUI we can see that the DataNode on worker02 is down.
Around the error "Directory is not writable: /grid/sdc/hadoop/hdfs/data" we can see the following in the log:
STARTUP_MSG: Starting DataNode
STARTUP_MSG: user = hdfs
STARTUP_MSG: host = worker02.sys65.com/23.87.23.126
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.7.3.2.6.0.3-8
STARTUP_MSG: build = git@github.com:hortonworks/hadoop.git -r c6befa0f1e911140cc815e0bab744a6517abddae; compiled by 'jenkins' on 2017-04-01T21:32Z
STARTUP_MSG: java = 1.8.0_112
************************************************************/
2018-04-21 09:02:52,854 INFO datanode.DataNode (LogAdapter.java:info(47)) - registered UNIX signal handlers for [TERM, HUP, INT]
2018-04-21 09:02:53,321 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sdb/hadoop/hdfs/data/
2018-04-21 09:02:53,330 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sdc/hadoop/hdfs/data/
2018-04-21 09:02:53,330 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sdd/hadoop/hdfs/data/
2018-04-21 09:02:53,331 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sde/hadoop/hdfs/data/
2018-04-21 09:02:53,331 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sdf/hadoop/hdfs/data/
2018-04-21 09:02:53,405 WARN checker.StorageLocationChecker (StorageLocationChecker.java:check(208)) - Exception checking StorageLocation [DISK]file:/grid/sdc/hadoop/hdfs/data/
org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: /grid/sdc/hadoop/hdfs/data
        at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:124)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:99)
        at org.apache.hadoop.hdfs.server.datanode.StorageLocation.check(StorageLocation.java:128)
        at org.apache.hadoop.hdfs.server.datanode.StorageLocation.check(StorageLocation.java:44)
        at org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker$1.call(ThrottledAsyncChecker.java:127)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2018-04-21 09:02:53,410 ERROR datanode.DataNode (DataNode.java:secureMain(2691)) - Exception in secureMain
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 4, volumes configured: 5, volumes failed: 1, volume failures tolerated: 0
        at org.apache.hadoop.hdfs.server.datanode.checker.StorageLocationChecker.check(StorageLocationChecker.java:216)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2583)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2492)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2539)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2684)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2708)
2018-04-21 09:02:53,411 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2018-04-21 09:02:53,414 INFO datanode.DataNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at worker02.sys65.com/23.87.23.126
************************************************************/
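Side note: the "volume failures tolerated: 0" value in the trace comes from dfs.datanode.failed.volumes.tolerated in hdfs-site.xml; a quick way to see the current setting on the node (the config path below assumes the usual HDP location):

# show the property and its value on worker02 (path is the HDP default, adjust if yours differs)
grep -A 1 'dfs.datanode.failed.volumes.tolerated' /etc/hadoop/conf/hdfs-site.xml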
We checked that:
1. All files and folders under /grid/sdc/hadoop/hdfs/ are owned by hdfs:hadoop, and that is OK.
2. Disk sdc is mounted read-write (rw,noatime,data=ordered), and that is OK.
We suspect the hard disk has gone bad; in that case, how do we check that?
Please advise what other options there are to resolve this issue.
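For reference, these are the disk checks we are considering (assuming the smartmontools package is installed on worker02):

# kernel messages often show I/O errors or a read-only remount for a failing disk
dmesg | grep -i sdc
# SMART overall health plus the error counters that usually flag a dying drive
smartctl -H /dev/sdc
smartctl -a /dev/sdc | grep -i -E 'reallocated|pending|uncorrectable'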
Created 04-25-2018 07:35 PM
Any updates?
Created 04-25-2018 07:38 PM
Hi, Geoffrey
We are just waiting for your approval of the following steps:
1. umount /grid/sdc, or umount -l /grid/sdc in case the device is busy
2. fsck -y /dev/sdc
3. mount /grid/sdc
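Spelled out, the commands we intend to run (assuming /dev/sdc is the device behind /grid/sdc and it has an entry in /etc/fstab):

# the DataNode on worker02 is already down, so nothing from HDFS should hold the mount
umount /grid/sdc || umount -l /grid/sdc
# repair the filesystem, answering yes to all prompts
fsck -y /dev/sdc
# remount via the existing fstab entry and confirm it came back read-write
mount /grid/sdc
mount | grep sdc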
Created 04-25-2018 08:03 PM
Avahi is a system which facilitates service discovery on a local network via the mDNS/DNS-SD protocol suite. This enables you to plug your laptop or computer into a network and instantly see other people you can chat with, find printers to print to, or find files being shared. Compatible technology is found in Apple Mac OS X (branded Bonjour and sometimes Zeroconf).
The two big benefits of Avahi are name resolution & finding printers, but on a server, in a managed environment, it's of little value.
Unmounting and mounting filesystems is a common task, especially in Hadoop clusters; your SysOps team should have validated that, but the steps look correct to me.
Do a dry run with the command below to see what will be affected; that will give you a better picture.
# e2fsck -n /dev/sdc
The data will be reconstructed, since you have the default replication factor; you can later rebalance the HDFS data.
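As a rough sketch of that last part (the balancer threshold value here is only an example, adjust to taste):

# dry run first, while the disk is not in use by the DataNode
e2fsck -n /dev/sdc
# once the node is back, check the cluster state and rebalance as the hdfs user
su - hdfs -c "hdfs dfsadmin -report"
su - hdfs -c "hdfs balancer -threshold 10"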
Created 04-25-2018 08:05 PM
Yes, we already did that on one of the disks; please see - https://community.hortonworks.com/questions/189016/datanode-machine-worker-one-of-the-disks-have-fil...
Created 04-25-2018 08:25 PM
The disk is already unusable, so go ahead and run fsck with the -y option to repair it 🙂 see above.
Either way, you will have to replace that dirty disk anyway!
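For completeness, a rough way to confirm the DataNode is healthy again after the repair (or after a disk replacement); the log path below assumes the standard HDP location:

# restart the DataNode on worker02 from Ambari, then watch its log for volume errors
tail -f /var/log/hadoop/hdfs/hadoop-hdfs-datanode-worker02.sys65.com.log
# confirm worker02 re-registered with the NameNode and reports all of its volumes
su - hdfs -c "hdfs dfsadmin -report" | grep -A 10 worker02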