Member since: 08-16-2016
Posts: 48
Kudos Received: 9
Solutions: 4

My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 4892 | 12-28-2018 10:21 AM |
|  | 5999 | 08-28-2018 10:58 AM |
|  | 3312 | 10-18-2016 11:08 AM |
|  | 3877 | 10-16-2016 10:13 AM |
11-14-2017
02:15 PM
Sorry, I am not able to see your uploaded CM figure. From the stack trace it looks like the NameNode at 10.0.0.157 is down. Would you please check?
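If it helps, here is one way to check the NameNode's state from the command line (the nn1/nn2 service IDs and the 50070 web port are assumptions; adjust them to your cluster):

# Ask each NameNode for its HA state; the IDs come from dfs.ha.namenodes.<nameservice>
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# Or query the NameNode's JMX endpoint directly (this fails if the process is down)
curl 'http://10.0.0.157:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus'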
11-14-2017
01:52 PM
I am not sure about the CM warning, but in principle you should only add an odd number of ZooKeeper instances, e.g. 3, 5, or even 7. The RetryInvocationHandler warning should be unrelated to the ZooKeeper issue, though. Instead, it's probably because the first NameNode in your configuration is currently the standby NN. If you manually fail over, I think you wouldn't see the warning again. You might also want to enable command-line debug logs with the following command: export HADOOP_ROOT_LOGGER=DEBUG,console
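As a rough sketch (the nn1/nn2 service IDs below are assumptions; use the IDs from dfs.ha.namenodes.<nameservice> in your hdfs-site.xml):

# Confirm which NameNode is active and which is standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# Fail over from the currently active NN (here nn2) to nn1
hdfs haadmin -failover nn2 nn1
# Re-run a client command with debug logging to watch the retry behavior
export HADOOP_ROOT_LOGGER=DEBUG,console
hdfs dfs -ls /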
10-16-2017
10:18 AM
I reproduced the error by intentionally corrupting the _index file. If by "restore" you meant unarchiving the har file with the hdfs dfs -cp command, I found that it returns the same AIOOBE, so you won't be able to unarchive it. Your best bet is to download the _index file, manually repair it, put it back in place, and see how it goes. Meanwhile, I filed an Apache JIRA, HADOOP-14950, to handle the AIOOBE more gracefully, but it won't help fix your corrupt _index file.
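A rough sketch of that repair loop, assuming a placeholder archive path of /user/foo/archive.har:

# Pull the index file out of the archive directory
hdfs dfs -get /user/foo/archive.har/_index ./_index
# ... repair the malformed line(s) in a text editor ...
# Push the fixed copy back, overwriting the original
hdfs dfs -put -f ./_index /user/foo/archive.har/_index
# Then try reading through the har:// scheme again
hdfs dfs -ls har:///user/foo/archive.har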
10-16-2017
02:15 AM
If you still have the source file, try to archive it again and see if it still produces the same error. If so, I'd be interested in knowing what's inside that _index file.
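For reference, a minimal re-archive attempt could look like this (the paths and archive name are placeholders):

# Re-create the archive from the original source directory; compare its _index with the bad one
hadoop archive -archiveName test.har -p /user/foo/source /user/foo/dest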
10-16-2017
01:01 AM
1 Kudo
It looks like your har file is malformed. Inside the har file there is an index file called _index. Each line of the index is expected to be a <filename> <dir> pair, and the latter part of the line appears to have been lost.
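To see what I mean, you can dump the index directly (the path is a placeholder for your archive):

# Print the archive's index file; each entry should be a complete line
hdfs dfs -cat /user/foo/archive.har/_index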
09-19-2017
09:23 AM
Hi @BellRizz, thanks for offering the workaround. Which version of CDH do you have? From the description, it sounds like a race condition between reader and writer, and I suspect it is caused by HDFS-11056 (Concurrent append and read operations lead to checksum error) or HDFS-11160 (VolumeScanner reports write-in-progress replicas as corrupt incorrectly). While the summary of HDFS-11160 seems to suggest otherwise, I have seen customers hit this issue with concurrent reads and writes.
06-30-2017
06:34 AM
Interesting story. The decommission process will not complete until every block has at least one good replica on other DNs (a good replica is one that is not stale and sits on a DataNode that is not being decommissioned or already decommissioned). The DirectoryScanner in a DataNode scans the entire data directory, reconciling inconsistencies between the in-memory block map and the on-disk replicas, so it will eventually pick up the added replica; it's just a matter of time.
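If you want to watch the progress, two hedged pointers (the -decommissioning option and the property name below are from memory; verify them against your Hadoop version):

# List DataNodes that are still in the decommissioning state
hdfs dfsadmin -report -decommissioning
# Check how often the DirectoryScanner runs (seconds; the default is 21600, i.e. 6 hours)
hdfs getconf -confKey dfs.datanode.directoryscan.interval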
06-29-2017
02:51 PM
Have you tried restarting the DN you copied the blocks to? Also, try forcing a full block report: hdfs dfsadmin -triggerBlockReport <datanode_host:ipc_port>
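For example (dn1.example.com is a placeholder, and 50020 is the typical DataNode IPC port on CDH 5-era clusters; check dfs.datanode.ipc.address on yours):

# Ask this DataNode to send a full block report to the NameNode right away
hdfs dfsadmin -triggerBlockReport dn1.example.com:50020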
06-01-2017
09:16 AM
The following HDFS configuration can be updated to alleviate the issue:

<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <value>45000</value>
  <description>
    Timeout for the actual monitorHealth() calls.
  </description>
</property>

I suggest bumping ha.health-monitor.rpc-timeout.ms from 45000 (milliseconds) to 90000 and seeing if it helps.
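For instance, the bumped property would look like the sketch below; whether it goes into core-site.xml directly or through a CM safety valve depends on how you manage the configuration:

<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <value>90000</value>
</property>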