Support Questions

zing · ‎04-04-2017

Hello.

A MapReduce job couldn't start because a file could not be read.

When I tried to access the file, I got the following error:

org.apache.hadoop.ipc.RemoteException(java.lang.ArrayIndexOutOfBoundsException): java.lang.ArrayIndexOutOfBoundsException

    at org.apache.hadoop.ipc.Client.call(Client.java:1466)
    at org.apache.hadoop.ipc.Client.call(Client.java:1403)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy11.getListing(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:559)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy12.getListing(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2080)

What I tried:

(1) sudo -u hdfs hdfs fsck /

fsck stops just before the problem file, and the result is "FAILED":

/services/chikayo-dsp-bidder/click/hive/day=20170403/13.fluentd01.sv.infra.log 244412 bytes, 1 block(s):  OK
/services/chikayo-dsp-bidder/click/hive/day=20170403/13.fluentd02.sv.infra.log 282901 bytes, 1 block(s):  OK
/services/chikayo-dsp-bidder/click/hive/day=20170403/13.fluentd03.sv.infra.log 280334 bytes, 1 block(s):  OK
/services/chikayo-dsp-bidder/click/hive/day=20170403/14.fluentd01.sv.infra.log 258240 bytes, 1 block(s):  OK
FSCK ended at Mon Apr 03 18:16:08 JST 2017 in 3074 milliseconds
null


Fsck on path '/services/chikayo-dsp-bidder' FAILED

(2) sudo -u hdfs hdfs dfsadmin -report

Configured Capacity: 92383798755328 (84.02 TB)
Present Capacity: 89209585066072 (81.14 TB)
DFS Remaining: 19736633480052 (17.95 TB)
DFS Used: 69472951586020 (63.19 TB)
DFS Used%: 77.88%
Under replicated blocks: 0
Blocks with corrupt replicas: 2
Missing blocks: 0
Missing blocks (with replication factor 1): 0
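To watch this counter over time, the value can be pulled out of the report with a small helper. This is only a sketch: the function name `corrupt_replica_count` is made up for illustration, and it assumes the report keeps the `Blocks with corrupt replicas:` label shown above.

```shell
# Reads `hdfs dfsadmin -report` output on stdin and prints the number that
# follows the "Blocks with corrupt replicas:" label.
corrupt_replica_count() {
  awk -F': *' '/^[[:space:]]*Blocks with corrupt replicas:/ { print $2 }'
}

# Example (run as the hdfs superuser, as in the command above):
#   sudo -u hdfs hdfs dfsadmin -report | corrupt_replica_count
```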

The problem file has since been restored automatically, and "Blocks with corrupt replicas" is now 0.

Questions:

(1) Can I restore a file in this state manually?

(2) What triggers the automatic restore?

Thank you.

weichiu · ‎04-04-2017

Hi, it appears to be a bug, and I am interested in understanding it further. I did a quick search, and it doesn't seem to have been reported previously in the Apache Hadoop Jira.

Would you be able to look at the Active NameNode log and search for

ArrayIndexOutOfBoundsException

The client-side log doesn't print the server-side stack trace, so it's impossible to tell where the exception was thrown. The NameNode log should contain the full stack trace, which will help find where it originated.
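A sketch of that search, assuming a Cloudera Manager style log location under /var/log/hadoop-hdfs. The path and the function name `find_exception` are assumptions, so adjust them to your cluster:

```shell
# Print every match of an exception name in a log file, with line numbers,
# 2 lines of leading context and 20 lines of trailing context (where the
# stack trace would be, if one was logged).
find_exception() {
  local log_file="$1"
  local pattern="${2:-ArrayIndexOutOfBoundsException}"
  grep -n -B 2 -A 20 -- "$pattern" "$log_file"
}

# Example (the glob is an assumed log location, not taken from this thread):
#   for f in /var/log/hadoop-hdfs/*NAMENODE*.log.out*; do
#     find_exception "$f"
#   done
```

The glob also picks up rotated files, which matters if the exception is older than the current log.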

zing · ‎04-04-2017

Thanks for your reply.

The logs have already been rotated, so I cannot find the exception message.

Since the error happens often, I will post the exception message later.

Thank you.

zing · ‎04-05-2017

Hello.

The error happened again, but there is no stack trace in the active NameNode log.

2017-04-06 08:08:01,571 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent number of corrupt replicas for blk_1124785595_195687655 blockMap has 0 but corrupt replicas map has 1
2017-04-06 08:08:01,571 WARN org.apache.hadoop.hdfs.web.resources.ExceptionHandler: INTERNAL_SERVER_ERROR
java.lang.ArrayIndexOutOfBoundsException
2017-04-06 08:08:01,716 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: updatePipeline(block=BP-396578656-10.1.24.1-1398308945648:blk_1124823444_195767215, newGenerationStamp=195767253, newLength=12040, newNodes=[10.1.24.24:50010, 10.1.24.55:50010, 10.1.24.25:50010], clientName=DFSClient_NONMAPREDUCE_1100616919_56)
2017-04-06 08:08:01,717 INFO BlockStateChange: BLOCK* Removing stale replica from location: [DISK]DS-bc2b3178-d3e5-49a4-9bc6-189804bf833e:NORMAL:10.1.24.24:50010
2017-04-06 08:08:01,717 INFO BlockStateChange: BLOCK* Removing stale replica from location: [DISK]DS-0fdfc364-08c8-4f90-b20e-151c332060b6:NORMAL:10.1.24.55:50010
2017-04-06 08:08:01,717 INFO BlockStateChange: BLOCK* Removing stale replica from location: [DISK]DS-17ad7233-40c6-4a68-a4a6-449c975c27ef:NORMAL:10.1.24.25:50010

The file is written by fluentd through WebHDFS.

Versions:

CDH 5.10.0-1.cdh5.10.0.p0.41
fluentd 0.10.50
fluent-plugin-webhdfs 0.2.1

Thank you.

weichiu · ‎04-06-2017

Got it.
The warning message "Inconsistent number of corrupt replicas" suggests you may have encountered the bug described in HDFS-9958 (BlockManager#createLocatedBlocks can throw NPE for corruptBlocks on failed storages.)
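To see which blocks are affected, the block IDs can be pulled out of those warnings. A minimal sketch, assuming the message format matches the log line quoted above; `inconsistent_blocks` is a made-up helper name:

```shell
# Reads NameNode log lines on stdin and prints each distinct block ID named in
# an "Inconsistent number of corrupt replicas" warning, prefixed by how many
# times it was reported.
inconsistent_blocks() {
  grep 'Inconsistent number of corrupt replicas for' \
    | sed -E 's/.*corrupt replicas for (blk_-?[0-9]+_[0-9]+).*/\1/' \
    | sort | uniq -c
}

# Example (the log file name is a placeholder):
#   inconsistent_blocks < /path/to/namenode.log
```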

HDFS-9958 is fixed in a number of CDH versions:
CDH5.5.6
CDH5.7.4 CDH5.7.5 CDH5.7.6
CDH5.8.2 CDH5.8.3 CDH5.8.4
CDH5.9.0 CDH5.9.1
CDH5.10.0 CDH5.10.1

Unfortunately, since you're already on CDH5.10.0, this appears to be a new bug with the same symptom.

I can file an Apache Hadoop jira on your behalf for this bug report. The Cloudera Community forum is supposed to be a troubleshooting site, and bug reports should be sent to Apache Hadoop so that more people can look into it.

zing · ‎04-11-2017

Hi.

Thanks for your reply.

I have 3 Hadoop clusters on the same version, but this error happens on only one of them.

The version of fluentd that sends data to that cluster is different from the one used for the other clusters.

First, I will upgrade fluentd.

After the upgrade, I will let you know the result.

Thank you.

weichiu · ‎04-11-2017

Can you try to restart NameNode and see if it helps?

The symptom matches HDFS-10788: https://issues.apache.org/jira/browse/HDFS-10788. I initially thought HDFS-10788 was resolved by HDFS-9958, but apparently that's not the case.

weichiu · ‎04-11-2017

If restarting the NameNode doesn't help, see if you can raise the NameNode log level to DEBUG and post the NameNode log (or you can send it to me privately at weichiu at cloudera dot com).
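One way to raise the level without a restart is `hadoop daemonlog -setlevel`, which changes a logger's level at runtime through the daemon's HTTP port. A minimal sketch, assuming the NameNode web port is the CDH 5 default of 50070 and that the BlockManager logger (the class that printed the warning above) is the one of interest. The host name and the function name are placeholders:

```shell
# Set the runtime log level of the BlockManager logger on a NameNode.
#   $1: NameNode HTTP address, e.g. namenode.example.com:50070 (placeholder)
#   $2: log level, defaults to DEBUG
set_blockmanager_log_level() {
  local nn_http_addr="$1"
  local level="${2:-DEBUG}"
  hadoop daemonlog -setlevel "$nn_http_addr" \
    org.apache.hadoop.hdfs.server.blockmanagement.BlockManager "$level"
}

# Example:
#   set_blockmanager_log_level namenode.example.com:50070 DEBUG
#   ...reproduce the error, collect the log, then go back to INFO:
#   set_blockmanager_log_level namenode.example.com:50070 INFO
```

A level set this way is not persistent and is lost when the NameNode restarts. Limiting DEBUG to one logger is a choice made here to keep the log volume down; it is narrower than raising the level for the whole NameNode.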

ernieg3 · ‎08-24-2017

I ran into this issue myself. I was able to resolve it like this:

hadoop fs -setrep 2 /hdfs/path/to/file
hadoop fs -setrep 3 /hdfs/path/to/file

After changing the replication factor, I was able to access the file again.
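For anyone who needs to repeat this on several files, the same two steps can be wrapped in a function. This is only a sketch: `bounce_replication` is a made-up name, and it reads the file's current factor with `hadoop fs -stat %r` instead of hard-coding 2 and 3.

```shell
# Lower a file's replication factor by one, then restore the original value.
#   $1: HDFS path of the affected file
bounce_replication() {
  local target="$1"
  local current
  # %r prints the file's replication factor.
  current=$(hadoop fs -stat %r "$target") || return 1
  if [ "$current" -lt 2 ]; then
    echo "replication factor is $current; refusing to lower it" >&2
    return 1
  fi
  hadoop fs -setrep "$((current - 1))" "$target" || return 1
  hadoop fs -setrep "$current" "$target"
}

# Example:
#   bounce_replication /hdfs/path/to/file
```

The guard against a factor below 2 is an added precaution, so the helper never asks for zero replicas.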

zing · ‎08-27-2017

Thank you.

The errors disappeared after I ran the commands.

That's very useful.

Cannot get a file on HDFS because of "java.lang.ArrayIndexOutOfBoundsException"