Created 10-15-2017 09:25 PM
Hello All,
I have a .har file on HDFS and I am trying to list the files it archived, but I get the error below on a CDH 5.9.2 cluster.
[user1@usnbka700p ~]$ hdfs dfs -ls har:///user/user1/HDFSArchival/Output1/Archive-13-10-2017-03-10.har
-ls: Fatal internal error
java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.hadoop.fs.HarFileSystem$HarStatus.<init>(HarFileSystem.java:597)
at org.apache.hadoop.fs.HarFileSystem$HarMetaData.parseMetaData(HarFileSystem.java:1201)
at org.apache.hadoop.fs.HarFileSystem$HarMetaData.access$000(HarFileSystem.java:1098)
at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:166)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:382)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:235)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:218)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:102)
at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
However, I can see the size of the .har file:
hdfs dfs -du -s -h /user/user1/HDFSArchival/Output1/Archive-13-10-2017-03-10.har
16.5 G 49.5 G /user/user1/HDFSArchival/Output1/Archive-13-10-2017-03-10.har
Also, hdfs dfs -ls works for other .har files, as shown below.
hdfs dfs -ls har:///user/user1/HDFSArchival/Output1/Archive-12-10-2017-07-10.har
Found 1 items
drwxr-xr-x - user1 user1 0 2017-10-12 07:12 har:///user/user1/HDFSArchival/Output1/Archive-12-10-2017-07-10.har/ArchivalTemp
Can you please advise?
Thanks,
Priya
Created 10-16-2017 01:01 AM
It looks like your har file is malformed. Inside the har file there is an index file called _index.
Each line of the index file is expected to contain a <filename> <dir> pair, and the latter part of the line appears to have been lost.
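You can sanity-check a downloaded _index copy offline. Here is a rough sketch; the field layout is an assumption based on how HarFileSystem splits each line on spaces and expects at least a name and a type, which is consistent with the AIOOBE at index 1 in your stack trace:

```python
# Sketch: flag _index lines that would trip HarFileSystem's parser.
# Assumption: each line is space-separated and needs at least two fields
# (encoded path and "dir"/"file"); a line with a lone field would cause
# the ArrayIndexOutOfBoundsException: 1 seen in the question.

def find_bad_index_lines(text):
    """Return (line_number, line) pairs with fewer than two fields."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line and len(line.split(" ")) < 2:
            bad.append((lineno, line))
    return bad

# Hypothetical sample content for illustration:
sample = "\n".join([
    "%2F dir none 0 0 sub1",          # well-formed directory entry
    "%2Fsub1 file part-0 0 1024",     # well-formed file entry
    "%2Fsub1%2Ftruncated",            # truncated line -> would cause AIOOBE
])
print(find_bad_index_lines(sample))  # -> [(3, '%2Fsub1%2Ftruncated')]
```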
Created 10-16-2017 10:18 AM
I reproduced the error by intentionally corrupting the _index file.
If by "restore" you mean unarchiving the har file with the hdfs dfs -cp command, I find it returns the same AIOOBE, so you won't be able to unarchive it either.
Your best bet is to download the _index file, manually repair it, replace the _index file, and see how it goes.
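As a sketch of those repair steps (paths taken from the question; adjust to your archive, and note that what "repaired" means depends on what was lost from each line):

```shell
# Hypothetical repair workflow for the corrupt _index file.
HAR=/user/user1/HDFSArchival/Output1/Archive-13-10-2017-03-10.har

# 1. Download the _index file from inside the .har directory.
hdfs dfs -get "$HAR/_index" ./_index

# 2. Repair malformed lines locally with an editor
#    (each line needs at least a name and a type field).

# 3. Push the repaired copy back, overwriting the broken one.
hdfs dfs -put -f ./_index "$HAR/_index"

# 4. Retry the listing.
hdfs dfs -ls "har://$HAR"
```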
Meanwhile, I filed an Apache JIRA, HADOOP-14950, to handle the AIOOBE more gracefully, but that won't help fix your corrupt _index file.