04-04-2017 01:38 AM
A MapReduce job couldn't start because one of its input files could not be read.
When I tried to access the file, I got the following error:
org.apache.hadoop.ipc.RemoteException(java.lang.ArrayIndexOutOfBoundsException): java.lang.ArrayIndexOutOfBoundsException
    at org.apache.hadoop.ipc.Client.call(Client.java:1466)
    at org.apache.hadoop.ipc.Client.call(Client.java:1403)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy11.getListing(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:559)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy12.getListing(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2080)
(1) sudo -u hdfs hdfs fsck /
fsck stops just before the broken file, and the result is "FAILED":
/services/chikayo-dsp-bidder/click/hive/day=20170403/13.fluentd01.sv.infra.log 244412 bytes, 1 block(s): OK
/services/chikayo-dsp-bidder/click/hive/day=20170403/13.fluentd02.sv.infra.log 282901 bytes, 1 block(s): OK
/services/chikayo-dsp-bidder/click/hive/day=20170403/13.fluentd03.sv.infra.log 280334 bytes, 1 block(s): OK
/services/chikayo-dsp-bidder/click/hive/day=20170403/14.fluentd01.sv.infra.log 258240 bytes, 1 block(s): OK
FSCK ended at Mon Apr 03 18:16:08 JST 2017 in 3074 milliseconds
null
Fsck on path '/services/chikayo-dsp-bidder' FAILED
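As a side note, the broken paths can be pulled out of an fsck report directly instead of reading the whole listing. A minimal sketch, assuming the default "path size bytes, N block(s): STATUS" line format; the helper name is made up here:

```shell
# Hypothetical helper: read an fsck report on stdin and print only the
# paths flagged CORRUPT or MISSING, so they can be inspected one by one.
list_bad_paths() {
  grep -E ': *(CORRUPT|MISSING)' | awk '{print $1}'
}

# Usage (run as the hdfs superuser):
#   sudo -u hdfs hdfs fsck /services/chikayo-dsp-bidder -files -blocks | list_bad_paths
# fsck also has a built-in shortcut for this:
#   sudo -u hdfs hdfs fsck / -list-corruptfileblocks
```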
(2) sudo -u hdfs hdfs dfsadmin -report
Configured Capacity: 92383798755328 (84.02 TB)
Present Capacity: 89209585066072 (81.14 TB)
DFS Remaining: 19736633480052 (17.95 TB)
DFS Used: 69472951586020 (63.19 TB)
DFS Used%: 77.88%
Under replicated blocks: 0
Blocks with corrupt replicas: 2
Missing blocks: 0
Missing blocks (with replication factor 1): 0
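For reference, the corrupt-replica counter can be watched on its own instead of scanning the whole report each time. A small sketch; the helper name is my own:

```shell
# Hypothetical helper: pull just the "Blocks with corrupt replicas" value
# out of a dfsadmin report supplied on stdin.
report_corrupt_count() {
  awk -F': ' '/Blocks with corrupt replicas/ {print $2}'
}

# Usage:
#   sudo -u hdfs hdfs dfsadmin -report | report_corrupt_count
```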
Now the broken file has been restored automatically, and "Blocks with corrupt replicas" is back to 0.
(1) Can I restore such a broken file manually?
(2) What triggers the automatic restore?
04-04-2017 06:21 AM
Hi, this appears to be a bug, and I am interested in understanding it further. I did a quick search and it doesn't seem to have been reported previously on the Apache Hadoop Jira.
Would you be able to look at the Active NameNode log and search for the exception?
The client side of the log doesn't print the full stack trace, so it's impossible to know where this exception was thrown. The NameNode log should contain the entire stack trace, which will help find where it originated.
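To make that search easier, something like this could pull the exception plus the frames that follow it out of the log; both the helper name and the context window are my own guesses:

```shell
# Hypothetical helper: print every occurrence of a pattern in a log read
# from stdin, plus the 30 lines that follow it; 30 is a guess at a window
# wide enough to hold a full stack trace.
extract_trace() {
  grep -A 30 "$1"
}

# Usage, with the log path adjusted to your installation:
#   extract_trace ArrayIndexOutOfBoundsException < /var/log/hadoop-hdfs/NAMENODE.log
```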
04-04-2017 06:14 PM
Thanks for your reply.
The logs have already rotated, so I cannot find the exception message.
Since the error happens often, I will post the exception message the next time it occurs.
04-05-2017 05:45 PM
The error has happened again, but there is no stack trace in the active NameNode logs.
2017-04-06 08:08:01,571 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent number of corrupt replicas for blk_1124785595_195687655 blockMap has 0 but corrupt replicas map has 1
2017-04-06 08:08:01,571 WARN org.apache.hadoop.hdfs.web.resources.ExceptionHandler: INTERNAL_SERVER_ERROR java.lang.ArrayIndexOutOfBoundsException
2017-04-06 08:08:01,716 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: updatePipeline(block=BP-396578656-10.1.24.1-1398308945648:blk_1124823444_195767215, newGenerationStamp=195767253, newLength=12040, newNodes=[10.1.24.24:50010, 10.1.24.55:50010, 10.1.24.25:50010], clientName=DFSClient_NONMAPREDUCE_1100616919_56)
2017-04-06 08:08:01,717 INFO BlockStateChange: BLOCK* Removing stale replica from location: [DISK]DS-bc2b3178-d3e5-49a4-9bc6-189804bf833e:NORMAL:10.1.24.24:50010
2017-04-06 08:08:01,717 INFO BlockStateChange: BLOCK* Removing stale replica from location: [DISK]DS-0fdfc364-08c8-4f90-b20e-151c332060b6:NORMAL:10.1.24.55:50010
2017-04-06 08:08:01,717 INFO BlockStateChange: BLOCK* Removing stale replica from location: [DISK]DS-17ad7233-40c6-4a68-a4a6-449c975c27ef:NORMAL:10.1.24.25:50010
The file is written through WebHDFS by Fluentd.
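Since the failing call is getListing and the error surfaces through the WebHDFS ExceptionHandler, it may be reproducible outside Fluentd with a plain LISTSTATUS request. A sketch, with the NameNode host and port as placeholders:

```shell
# Hypothetical helper: build the WebHDFS LISTSTATUS URL for a path.
# "$1" is the NameNode host:port (placeholder), "$2" is the HDFS path.
webhdfs_ls_url() {
  printf 'http://%s/webhdfs/v1%s?op=LISTSTATUS' "$1" "$2"
}

# Usage: per the log above, listing a broken directory should come back
# as HTTP 500 INTERNAL_SERVER_ERROR with the exception in the body.
#   curl -i "$(webhdfs_ls_url namenode.example.com:50070 /services/chikayo-dsp-bidder/click/hive)"
```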
04-06-2017 07:31 AM
04-11-2017 01:12 AM
Thanks for the reply.
I have three Hadoop clusters running the same version, but this error happens on only one of them.
The version of Fluentd that sends data to that cluster is different from the other clusters.
First, I will upgrade Fluentd.
After the upgrade, I will report the result.
04-11-2017 02:33 AM
04-11-2017 02:34 AM
08-24-2017 11:32 AM
I ran into this issue myself. I was able to resolve it like this:
hadoop fs -setrep 2 /hdfs/path/to/file
hadoop fs -setrep 3 /hdfs/path/to/file
After changing the replication factor, I was able to access the file again.
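If several files are affected at once, the same replication bounce can be scripted. A sketch assuming the cluster's default replication factor is 3, as in the commands above; the helper name and the DRY_RUN knob are my own additions:

```shell
# Hypothetical helper: for each HDFS path read from stdin, drop the
# replication factor to 2 and raise it back to 3, forcing the NameNode
# to re-replicate the blocks. DRY_RUN=1 only echoes the commands.
setrep_bounce() {
  while read -r f; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "hadoop fs -setrep 2 $f"
      echo "hadoop fs -setrep 3 $f"
    else
      hadoop fs -setrep 2 "$f"
      hadoop fs -setrep 3 "$f"
    fi
  done
}

# Usage (bad_files.txt holds one HDFS path per line):
#   setrep_bounce < bad_files.txt
```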