Member since
09-30-2014
31
Posts
13
Kudos Received
3
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3736 | 10-25-2016 07:02 AM |
 | 1067 | 10-17-2016 11:34 AM |
 | 2112 | 01-07-2016 12:46 PM |
06-28-2016
08:59 AM
Hello,

I am seeing an issue with fsimage files not being cleaned up in one of the "dfs.namenode.name.dir" directories. The setting of "dfs.namenode.name.dir" in our cluster is "/tmp/hadoop/hdfs/namenode,/var/hadoop/hdfs/namenode,/mnt/data/hadoop/hdfs/namenode". This fills up the /tmp partition on the host running the namenode.

Listing the contents of these directories shows that the /tmp directory contains many more fsimage files than the other two:

[me@node ~]$ ls -la /tmp/hadoop/hdfs/namenode/current | grep fsimage | wc -l
94
[me@node ~]$ ls -la /var/hadoop/hdfs/namenode/current | grep fsimage | wc -l
9
[me@node ~]$ ls -la /mnt/data/hadoop/hdfs/namenode/current | grep fsimage | wc -l
9

Looking at the namenode logs confirms that purging only seems to happen for /var and /mnt:

[me@node ~]$ grep NNStorageRetentionManager /var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log* | grep fsimage
/var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log.7:2016-06-27 19:50:25,462 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(225)) - Purging old image FSImageFile(file=/var/hadoop/hdfs/namenode/current/fsimage_0000000002281385227, cpktTxId=0000000002281385227)
/var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log.7:2016-06-27 19:50:25,640 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(225)) - Purging old image FSImageFile(file=/mnt/data/hadoop/hdfs/namenode/current/fsimage_0000000002281385227, cpktTxId=0000000002281385227)
/var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log.8:2016-06-27 18:38:58,921 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(225)) - Purging old image FSImageFile(file=/var/hadoop/hdfs/namenode/current/fsimage_0000000002280372072, cpktTxId=0000000002280372072)
/var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log.8:2016-06-27 18:38:59,102 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(225)) - Purging old image FSImageFile(file=/mnt/data/hadoop/hdfs/namenode/current/fsimage_0000000002280372072, cpktTxId=0000000002280372072)
/var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log.9:2016-06-27 17:34:31,800 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(225)) - Purging old image FSImageFile(file=/var/hadoop/hdfs/namenode/current/fsimage_0000000002279353884, cpktTxId=0000000002279353884)
/var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log.9:2016-06-27 17:34:31,992 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(225)) - Purging old image FSImageFile(file=/mnt/data/hadoop/hdfs/namenode/current/fsimage_0000000002279353884, cpktTxId=0000000002279353884)

Can anyone explain why only two directories are purged? I should mention that we are running namenode HA.

Best Regards
/Thomas
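
The per-directory comparison above can also be scripted, so that a name directory skipped by purging stands out. The following is only a minimal sketch: the class name FsImageCount and its method are made up for illustration, and it assumes image files are named fsimage_<txid> as in the log output above. Unlike "grep fsimage", it does not count the accompanying .md5 files, so its numbers will be lower than the ones shown.

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop.
final class FsImageCount {
    // Matches checkpoint image files such as fsimage_0000000002281385227,
    // but not the accompanying .md5 files.
    private static final Pattern IMAGE = Pattern.compile("fsimage_\\d+");

    /** Counts the entries in names that look like fsimage files. */
    static int countImages(String[] names) {
        int count = 0;
        if (names == null) {
            return count; // File.list() returns null if the directory cannot be read
        }
        for (String name : names) {
            if (IMAGE.matcher(name).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Each argument is one entry of dfs.namenode.name.dir.
        for (String nameDir : args) {
            File current = new File(nameDir, "current");
            System.out.println(current + ": " + countImages(current.list()) + " fsimage files");
        }
    }
}
```

It would be run with the three configured directories as arguments, for example: java FsImageCount /tmp/hadoop/hdfs/namenode /var/hadoop/hdfs/namenode /mnt/data/hadoop/hdfs/namenode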
Labels: Apache Hadoop
06-16-2016
07:03 AM
Hi Ravi,

I'm not sure I understand what you mean. Is there a tool that could detect our type of disk error and automatically remount the drive in read-only mode? Or are you talking about something like the fstab mount option "errors=remount-ro"? As I understand it, that option only means that if errors are encountered when the OS tries to mount the drive read-write, it should try to mount it read-only instead. That does not apply to our situation, since our machine is not just starting up; it has been up and running for a long while, and then disk errors started to occur. If you mean some other tool or configuration that can detect and remount while a system is running, please share a link.

Best Regards
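
Independently of how a remount gets triggered, one way to notice on a running Linux system that a filesystem has ended up read-only is to look at the mount options in /proc/mounts. The sketch below is an illustration only: the class name ReadOnlyMounts is made up, and it assumes the usual whitespace-separated /proc/mounts layout (device, mount point, type, options, dump, pass).

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration.
final class ReadOnlyMounts {
    /** Returns the mount points whose option list contains the exact option "ro". */
    static List<String> readOnlyMountPoints(List<String> procMountsLines) {
        List<String> result = new ArrayList<String>();
        for (String line : procMountsLines) {
            // Fields: device, mount point, filesystem type, options, dump, pass
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 4) {
                continue;
            }
            // Compare whole options so that "errors=remount-ro" is not mistaken for "ro".
            if (Arrays.asList(fields[3].split(",")).contains("ro")) {
                result.add(fields[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get("/proc/mounts"), StandardCharsets.UTF_8);
        for (String mountPoint : readOnlyMountPoints(lines)) {
            System.out.println("read-only: " + mountPoint);
        }
    }
}
```

Run periodically (for example from cron), this would report a data disk that has flipped to read-only; it does not perform any remounting itself.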
05-23-2016
06:51 AM
Hi Predrag, see my comment to Sagar above: our value of that setting is the default, i.e. zero.
05-23-2016
06:26 AM
Hi Ashnee. See my comment to Sagar above.
05-19-2016
12:37 PM
Yes, I agree that is exactly how it seems. There is no problem running ls directly on /mnt/data21.

[thomas.larsson@datavault-prod-data8 ~]$ ls -la /mnt/data21
total 28
drwxr-xr-x. 4 root root 4096 9 nov 2015 .
drwxr-xr-x. 26 root root 4096 9 nov 2015 ..
drwxr-xr-x. 4 root root 4096 28 jan 12.32 hadoop
drwx------. 2 root root 16384 6 nov 2015 lost+found
05-16-2016
12:12 PM
Hi Sagar, I think you misunderstand my question. My question was NOT "In what scenarios does a namenode consider a datanode dead?". It's more a question of why our datanode does not shut itself down when one of its disks is failing. I assumed that this is what should happen, since our setting of dfs.datanode.failed.volumes.tolerated is the default, i.e. zero.
05-16-2016
12:06 PM
A follow-up. I forgot to mention our Hadoop version: HDP 2.2.6.0, i.e. Hadoop 2.6.

I looked into the Hadoop code and found the org.apache.hadoop.util.DiskChecker class, which seems to be used by a monitoring thread to monitor the health of a datanode's disks. To try to verify that the datanode actually does not detect this error, I created a very simple Main class that just calls the DiskChecker.checkDirs method.

Main.java:

import java.io.File;

public class Main {
    public static void main(String[] args) throws Exception {
        // checkDirs throws DiskChecker.DiskErrorException if a directory check fails.
        org.apache.hadoop.util.DiskChecker.checkDirs(new File(args[0]));
    }
}

If I run this class on one of our problematic directories, nothing is detected:

[thomas.larsson@datavault-prod-data8 ~]$ /usr/jdk64/jdk1.7.0_67/bin/javac Main.java -cp /usr/hdp/2.2.6.0-2800/hadoop/hadoop-common.jar
[thomas.larsson@datavault-prod-data8 ~]$ sudo java -cp .:/usr/hdp/2.2.6.0-2800/hadoop/hadoop-common.jar:/usr/hdp/2.2.6.0-2800/hadoop/lib/* Main /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
However, trying to list the files in this subdir looks like this:

[thomas.larsson@datavault-prod-data8 ~]$ sudo ls -la /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir162: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir163: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir155: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir165: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir166: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir164: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir159: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir154: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir153: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir167: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir161: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir157: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir152: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir160: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir156: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir158: Input/output error
total 984
drwxr-xr-x. 258 hdfs hadoop 12288 13 dec 12.52 .
drwxr-xr-x. 258 hdfs hadoop 12288 22 nov 14.50 ..
drwxr-xr-x. 2 hdfs hadoop 4096 12 maj 18.12 subdir0
drwxr-xr-x. 2 hdfs hadoop 4096 12 maj 18.02 subdir1
...
drwxr-xr-x. 2 hdfs hadoop 4096 30 apr 19.21 subdir151
d?????????? ? ? ? ? ? subdir152
d?????????? ? ? ? ? ? subdir153
d?????????? ? ? ? ? ? subdir154
d?????????? ? ? ? ? ? subdir155
d?????????? ? ? ? ? ? subdir156
d?????????? ? ? ? ? ? subdir157
d?????????? ? ? ? ? ? subdir158
d?????????? ? ? ? ? ? subdir159
drwxr-xr-x. 2 hdfs hadoop 4096 12 maj 18.12 subdir16
d?????????? ? ? ? ? ? subdir160
d?????????? ? ? ? ? ? subdir161
d?????????? ? ? ? ? ? subdir162
d?????????? ? ? ? ? ? subdir163
d?????????? ? ? ? ? ? subdir164
d?????????? ? ? ? ? ? subdir165
d?????????? ? ? ? ? ? subdir166
d?????????? ? ? ? ? ? subdir167
drwxr-xr-x. 2 hdfs hadoop 4096 12 maj 18.30 subdir168
drwxr-xr-x. 2 hdfs hadoop 4096 12 maj 18.28 subdir169
...
So, it seems like this problem is undetectable by a datanode.
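
For comparison, a check that does surface these entries can be written with plain java.nio, without any Hadoop classes. This is a sketch only, and the class name TreeProbe is made up. One possible explanation for the silent DiskChecker result, not verified against the Hadoop 2.6 source here, is that java.io.File tests such as isDirectory() return false rather than throwing when an entry cannot be stat'ed; Files.walkFileTree, by contrast, reports such entries through visitFileFailed.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone probe, not a Hadoop API.
final class TreeProbe {
    /** Walks the tree under start and returns every entry that could not be read. */
    static List<Path> unreadableEntries(Path start) {
        final List<Path> failed = new ArrayList<Path>();
        try {
            Files.walkFileTree(start, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    // Called when attributes cannot be read or a directory cannot be opened.
                    failed.add(file);
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult postVisitDirectory(Path dir, IOException exc) {
                    // exc is non-null if listing the directory ended with an I/O error.
                    if (exc != null) {
                        failed.add(dir);
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (IOException e) {
            // Not expected, since the visitor handles failures itself; record the root to be safe.
            failed.add(start);
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Path> failed = unreadableEntries(Paths.get(args[0]));
        for (Path p : failed) {
            System.out.println("cannot read: " + p);
        }
        System.exit(failed.isEmpty() ? 0 : 1);
    }
}
```

Run against the subdir58 directory above, a probe like this would be expected to list subdir152 through subdir167, although that has not been tried on the affected host.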
05-16-2016
09:14 AM
2 Kudos
Hi.

We have encountered issues on our cluster that seem to be caused by bad disks. When we run "dmesg" on the datanode host we see warnings such as:

This should not happen!! Data will be lost
sd 1:0:20:0: [sdv] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 1:0:20:0: [sdv] Sense Key : Medium Error [current]
Info fld=0x2f800808
sd 1:0:20:0: [sdv] Add. Sense: Unrecovered read error
sd 1:0:20:0: [sdv] CDB: Read(10): 28 00 2f 80 08 08 00 00 08 00
end_request: critical medium error, dev sdv, sector 796919816
EXT4-fs (sdv1): delayed block allocation failed for inode 70660422 at logical offset 2049 with max blocks 2048 with error -5
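
Kernel messages like these can be scanned outside of Hadoop to find out which devices are reporting medium errors. The following is a minimal sketch under the assumption that the messages have the "critical medium error, dev <name>, sector <n>" form shown above; the class name MediumErrorScan is made up for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
final class MediumErrorScan {
    // Captures the device name from lines such as
    // "end_request: critical medium error, dev sdv, sector 796919816".
    private static final Pattern MEDIUM_ERROR =
            Pattern.compile("critical medium error, dev (\\w+), sector \\d+");

    /** Returns the sorted set of device names that appear in medium-error lines. */
    static Set<String> failingDevices(List<String> dmesgLines) {
        Set<String> devices = new TreeSet<String>();
        for (String line : dmesgLines) {
            Matcher m = MEDIUM_ERROR.matcher(line);
            if (m.find()) {
                devices.add(m.group(1));
            }
        }
        return devices;
    }

    public static void main(String[] args) throws Exception {
        // Reads kernel messages from standard input.
        List<String> lines = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            lines.add(line);
        }
        for (String dev : failingDevices(lines)) {
            System.out.println("medium errors on: " + dev);
        }
    }
}
```

It reads from standard input, so it could be used as: dmesg | java MediumErrorScan
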
In the datanode logs we see warnings such as:

2016-05-16 09:41:42,694 WARN util.Shell (DU.java:run(126)) - Could not get disk usage information
ExitCodeException exitCode=1: du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir162': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir163': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir155': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir165': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir166': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir164': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir159': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir154': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir153': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir167': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir161': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir157': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir152': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir160': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir156': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir158': Input/output error
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.fs.DU.run(DU.java:190)
at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:119)
at java.lang.Thread.run(Thread.java:745)

and:

2016-05-16 09:31:14,494 ERROR datanode.DataNode (DataXceiver.java:run(253)) - datavault-prod-data8.internal.machines:1019:DataXceiver error processing READ_BLOCK operation src: /x.x.x.x:55220 dst: /x.x.x7.x:1019
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1356445971-x.x.x.x-1430142563027:blk_1367398616_293808003
at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:431)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:229)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:493)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
at java.lang.Thread.run(Thread.java:745)
These errors/warnings do not, however, seem to be enough for the datanode to consider a volume as "failed" and shut itself down. Some consequences that we have seen when this happens are that it is impossible to scan an HBase region that is served by a regionserver on the same host as the datanode, and that MapReduce jobs get stuck accessing the host.

This brings me to my question: What is the requirement for a datanode to consider a volume as failed?

Best Regards
/Thomas
Labels: Apache Hadoop
02-22-2016
07:37 AM
@Wendy Foslien Perhaps you are having the same problem I had, see here: How to connect Kerberized Hive via ODBC and avoid the “No credentials cache found” error
01-07-2016
12:46 PM
1 Kudo
I found the source code, here: https://github.com/hortonworks/hive-release/releases