Datanode Randomly Failing Volumes

Explorer

Hi guys, we see this error occurring randomly on one specific datanode. The datanode log shows:

2016-03-30 18:27:10,760 WARN  impl.FsDatasetImpl (FsVolumeList.java:checkDirs(244)) - Removing failed volume /disk1/hdfs/data/current:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not readable: /disk1/hdfs/data/current/BP-211597358-192.168.101.5-1431451310172/current/finalized/subdir35/subdir216
        at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:188)
        at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
        at org.apache.hadoop.util.DiskChecker.checkDirs(DiskChecker.java:88)
        at org.apache.hadoop.util.DiskChecker.checkDirs(DiskChecker.java:91)
        at org.apache.hadoop.util.DiskChecker.checkDirs(DiskChecker.java:91)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:309)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.checkDirs(FsVolumeImpl.java:792)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:242)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2030)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3153)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:242)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3186)
        at java.lang.Thread.run(Thread.java:745)
  • OS level: the directories and mount points are fine and can be read (cd or ls)
  • ACLs are set correctly
  • Hardware level: the volumes are healthy
  • `hdfs fsck /` reports no corrupt blocks
  • Ambari raises no alerts; only the NameNode UI reports the failure
  • It happens on a different, random volume each time
  • After a restart, the datanode can read the volumes again

What could cause this?
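
To back up the "random volume each time" observation with numbers, the failures can be tallied per volume from the datanode log. The following is a minimal sketch, assuming the WARN line format shown in the excerpt above; the function names `failed_volumes` and `count_by_volume` are made up for illustration and are not part of Hadoop.

```python
import re
from collections import Counter

# Matches the WARN line format from the log excerpt above (an assumption:
# other Hadoop versions may word or lay out this line differently).
FAILED_VOLUME = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\s+WARN\s+.*"
    r"Removing failed volume (?P<volume>\S+?):?\s*$"
)


def failed_volumes(log_lines):
    """Return (timestamp, volume) pairs, one per 'Removing failed volume' line."""
    events = []
    for line in log_lines:
        match = FAILED_VOLUME.match(line)
        if match:
            events.append((match.group("ts"), match.group("volume")))
    return events


def count_by_volume(events):
    """Count how often each volume was removed."""
    return Counter(volume for _, volume in events)
```

Passing an open log file to `failed_volumes` and the result to `count_by_volume` gives a per-volume tally. If the counts are spread evenly over all disks, a single failing disk becomes less likely as the cause.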

3 REPLIES

Re: Datanode Randomly Failing Volumes

Super Guru

@Ace - Can you please check the output of `dmesg` or `sudo /usr/sbin/smartctl -A /dev/[disk]` on the datanode where the exceptions occur (in the given example, for the /disk1/ mount point)? It is possible that the disk is intermittently becoming unresponsive or switching to read-only mode.
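
One way to catch a disk that has switched to read-only mode is to look at the mount options the kernel reports. The following is a rough sketch, assuming the Linux `/proc/mounts` layout (device, mount point, filesystem type, options, ...) and assuming the data disks are mounted under paths starting with `/disk`, as `/disk1` is in the log. The function name `read_only_mounts` is hypothetical.

```python
def read_only_mounts(mounts_text, prefixes=("/disk",)):
    """Return mount points under the given prefixes whose options include 'ro'."""
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if mount_point.startswith(tuple(prefixes)) and "ro" in options:
            read_only.append(mount_point)
    return read_only
```

On the affected node this could be called as `read_only_mounts(open("/proc/mounts").read())` right after a volume failure is logged. An empty list at that moment would argue against a read-only remount.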


Re: Datanode Randomly Failing Volumes

Explorer

Hi Kuldeep, thanks for the swift reply. It doesn't happen only on /disk1/; it also happens on the other disks. Here are the smartctl results:

===/dev/sda===
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.13.0-48-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net


=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       816         -
# 2  Short offline       Completed without error       00%       812         -


===/dev/sdb===
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.13.0-48-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net


=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       815         -
# 2  Short offline       Completed without error       00%       812         -


===/dev/sdc===
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.13.0-48-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net


=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       815         -
# 2  Short offline       Completed without error       00%       812         -


===/dev/sdd===
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.13.0-48-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net


=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       816         -
# 2  Short offline       Completed without error       00%       812         -


===/dev/sde===
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.13.0-48-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net


=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       816         -
# 2  Short offline       Completed without error       00%       812         -


===/dev/sdf===
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.13.0-48-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net


=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       815         -
# 2  Short offline       Completed without error       00%       812         -


===/dev/sdg===
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.13.0-48-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net


=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       815         -
# 2  Short offline       Completed without error       00%       812         -


===/dev/sdh===
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.13.0-48-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net


=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       816         -
# 2  Short offline       Completed without error       00%       812         -


===/dev/sdi===
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.13.0-48-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net


=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       815         -
# 2  Short offline       Completed without error       00%       812         -

Re: Datanode Randomly Failing Volumes

Rising Star

@Ace does this happen on one node only? If it happens on random disks and is resolved by restarting the datanode, you may have a bad controller. It definitely sounds like a hardware issue.
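
To see whether several disks become inaccessible at the same moment, which would point toward something shared such as a controller, the data directories can be probed from outside the datanode. The sketch below only approximates the kind of check behind the "Directory is not readable" message in the stack trace; it uses Python's `os.access` and is not Hadoop's DiskChecker. It should be run as the user that runs the datanode (often `hdfs`, which is an assumption here), and the function names are hypothetical.

```python
import os


def access_problems(directory):
    """Return the failed checks for one directory; an empty list means it looks fine."""
    if not os.path.isdir(directory):
        return ["not a directory"]
    checks = (("read", os.R_OK), ("write", os.W_OK), ("execute", os.X_OK))
    return [name for name, mode in checks if not os.access(directory, mode)]


def probe(data_dirs):
    """Map each problematic data directory to its list of failed checks."""
    results = {}
    for directory in data_dirs:
        problems = access_problems(directory)
        if problems:
            results[directory] = problems
    return results
```

Calling `probe` on all configured data directories at a regular interval and logging any non-empty result with a timestamp makes it possible to compare the failure times against `dmesg` output.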
