Created 04-23-2018 05:09 PM
We have an Ambari cluster, HDP version 2.6.0.1.
We have an issue on worker02, according to the log hadoop-hdfs-datanode-worker02.sys65.com.log:
2018-04-21 09:02:53,405 WARN checker.StorageLocationChecker (StorageLocationChecker.java:check(208)) - Exception checking StorageLocation [DISK]file:/grid/sdc/hadoop/hdfs/data/
org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: /grid/sdc/hadoop/hdfs/data
Note: from the Ambari GUI we can see that the DataNode on worker02 is down.
We can see the following in the log ("Directory is not writable: /grid/sdc/hadoop/hdfs/data"):
STARTUP_MSG: Starting DataNode
STARTUP_MSG: user = hdfs
STARTUP_MSG: host = worker02.sys65.com/23.87.23.126
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.7.3.2.6.0.3-8
STARTUP_MSG: build = git@github.com:hortonworks/hadoop.git -r c6befa0f1e911140cc815e0bab744a6517abddae; compiled by 'jenkins' on 2017-04-01T21:32Z
STARTUP_MSG: java = 1.8.0_112
************************************************************/
2018-04-21 09:02:52,854 INFO datanode.DataNode (LogAdapter.java:info(47)) - registered UNIX signal handlers for [TERM, HUP, INT]
2018-04-21 09:02:53,321 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sdb/hadoop/hdfs/data/
2018-04-21 09:02:53,330 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sdc/hadoop/hdfs/data/
2018-04-21 09:02:53,330 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sdd/hadoop/hdfs/data/
2018-04-21 09:02:53,331 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sde/hadoop/hdfs/data/
2018-04-21 09:02:53,331 INFO checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(107)) - Scheduling a check for [DISK]file:/grid/sdf/hadoop/hdfs/data/
2018-04-21 09:02:53,405 WARN checker.StorageLocationChecker (StorageLocationChecker.java:check(208)) - Exception checking StorageLocation [DISK]file:/grid/sdc/hadoop/hdfs/data/
org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: /grid/sdc/hadoop/hdfs/data
    at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:124)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:99)
    at org.apache.hadoop.hdfs.server.datanode.StorageLocation.check(StorageLocation.java:128)
    at org.apache.hadoop.hdfs.server.datanode.StorageLocation.check(StorageLocation.java:44)
    at org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker$1.call(ThrottledAsyncChecker.java:127)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2018-04-21 09:02:53,410 ERROR datanode.DataNode (DataNode.java:secureMain(2691)) - Exception in secureMain
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 4, volumes configured: 5, volumes failed: 1, volume failures tolerated: 0
    at org.apache.hadoop.hdfs.server.datanode.checker.StorageLocationChecker.check(StorageLocationChecker.java:216)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2583)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2492)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2539)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2684)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2708)
2018-04-21 09:02:53,411 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2018-04-21 09:02:53,414 INFO datanode.DataNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at worker02.sys65.com/23.87.23.126
************************************************************/
We checked that:
1. All files and folders under /grid/sdc/hadoop/hdfs/ are owned by hdfs:hadoop, and that is OK.
2. Disk sdc is mounted read-write (rw,noatime,data=ordered), and that is OK.
We suspect that the hard disk has gone bad; in that case, how can we verify it?
Please advise what other options there are to resolve this issue.
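For the disk check, we assume we could start with the kernel log and the SMART status of the suspect device (smartctl requires the smartmontools package on worker02), e.g.:

# dmesg | grep -i sdc      # recent I/O errors or a remount-read-only event
# smartctl -H /dev/sdc     # overall SMART health verdict (PASSED/FAILED)
# smartctl -a /dev/sdc     # full SMART attributes (reallocated/pending sectors, etc.)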
Created 04-25-2018 08:25 PM
The disk is already unusable, so go ahead and run fsck with the -y option to repair it 🙂 (see above).
Either way, you will have to replace that dirty disk anyway!
Created 04-23-2018 05:58 PM
Could you try unmounting and remounting that disk? Your disk could have gone bad and the filesystem is in read-only mode. Can you also set the failure tolerance to 1?
Using the Ambari UI --> HDFS --> Configs, filter for the property "dfs.datanode.failed.volumes.tolerated" and set it to 1.
Restart the stale HDFS services.
All should be in order.
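Once Ambari pushes the change, you can verify it on worker02 (on HDP the client config is typically /etc/hadoop/conf/hdfs-site.xml; adjust the path if yours differs):

# grep -A1 dfs.datanode.failed.volumes.tolerated /etc/hadoop/conf/hdfs-site.xml
# the line after the <name> entry should now read <value>1</value>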
Created 04-23-2018 07:39 PM
Dear Geoffrey, we rebooted twice a few weeks ago, but this did not help (when we reboot we do actually remount). About setting "dfs.datanode.failed.volumes.tolerated" to 1: we want to keep it at 0 (we do not want to lose one disk).
Created 04-23-2018 08:17 PM
There could be a couple of reasons; let's check the obvious first. Have you checked SELinux on this host? If not:
$ echo 0 > /selinux/enforce
$ cat /selinux/enforce   # should output "0"
"Read-only filesystem" is not a permissions issue. The mount has become read-only, either because of errors in the filesystem or problems in the device itself. If you run "grep sdc /proc/mounts" you should see it as "ro". There may be some clue as to why in the messages in /var/log/syslog.
Run a filesystem check (fsck); it will repair some of the errors. Execute the fsck on an unmounted filesystem to avoid any data corruption issues, e.g.:
# fsck /dev/sdc
That should repair the damage.
Created 04-24-2018 04:34 AM
Dear Geoffrey, /grid/sdc holds an HDFS filesystem; isn't fsck on that disk risky? See also - http://fibrevillage.com/storage/658-how-to-use-hdfs-fsck-command-to-identify-corrupted-files
Created 04-24-2018 04:48 AM
What do you think about the following steps to fix corrupted files (I took them from - https://stackoverflow.com/questions/19205057/how-to-fix-corrupt-hdfs-files )?
To determine which files are having problems (this ignores lines with nothing but dots and lines talking about replication):
hdfs fsck / | egrep -v '^\.+$' | grep -v eplica
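Alternatively, if our HDFS version supports it, I understand the corrupt blocks can also be listed directly:
hdfs fsck / -list-corruptfileblocks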
Created 04-24-2018 04:50 AM
Once you find a file that is corrupt:
hdfs fsck /path/to/corrupt/file -locations -blocks -files
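And, as a last resort (per that Stack Overflow answer), the unrecoverable files could be removed, although this permanently deletes their data:
hdfs fsck / -delete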
Created 04-24-2018 07:19 AM
Above you are trying to fix corrupt HDFS blocks!! With the default replication factor of 3, you should be okay; below is how to fix the local filesystem.
What is your filesystem type, ext4 or something else? You can run
# e2fsck -y /dev/sdc
but then you will not have an opportunity to validate the corrections being applied. On the other hand, if you run
# e2fsck -n /dev/sdc
you can see what would happen without anything actually being applied, and if you run
# e2fsck /dev/sdc
you'll be asked each time a significant correction needs to be applied.
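If you are not sure of the filesystem type, either of these (standard on most Linux distributions) should show it:
# lsblk -f /dev/sdc
# blkid /dev/sdc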
Created 04-24-2018 07:55 AM
Dear Geoffrey, the filesystem is ext4.
Created 04-24-2018 08:02 AM
Dear Geoffrey, as you know, before performing fsck /dev/sdc we must umount /grid/sdc (or umount -l /grid/sdc); only then can we run fsck /dev/sdc. So can you finally approve the following steps?
1. umount /grid/sdc (or umount -l /grid/sdc in case the device is busy)
2. fsck /dev/sdc
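In full, the sequence we have in mind (assuming /dev/sdc is mounted only at /grid/sdc via /etc/fstab, and the DataNode on worker02 stays stopped until the end) would be roughly:

# umount /grid/sdc     # or umount -l /grid/sdc if the device is busy
# fsck -y /dev/sdc     # repair the ext4 filesystem, answering yes to all prompts
# mount /grid/sdc      # remount using the /etc/fstab entry

and then start the DataNode on worker02 again from Ambari.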