Created 11-29-2017 01:17 AM
Hello,
I'm running HDP 2.6 and attempting to use distcp to copy from a much older Hadoop cluster into HDP, so I'm running the distcp utility on the target cluster and accessing the source cluster via hftp://&lt;host&gt;:&lt;port&gt;/&lt;path&gt;. For example:
hadoop distcp -i -log /distcp/logpath hftp://oldhadoop.hostname:50070/path/ /newpath
In the source path there is a file with a space in its name 'Email Address.json', and while distcp is building the copy listing, it appears to fail to decode the name properly (stack trace is below).
17/11/28 16:17:45 INFO tools.DistCp: DistCp job log path: /distcp/logpath
Exception in thread "pool-5-thread-1" java.lang.AssertionError: Failed to decode URI: /path/Email Address.json
	at org.apache.hadoop.util.ServletUtil.decodePath(ServletUtil.java:128)
	at org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.startElement(HftpFileSystem.java:446)
	at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
	at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
	at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:465)
	at org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.listStatus(HftpFileSystem.java:484)
	at org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.listStatus(HftpFileSystem.java:492)
	at org.apache.hadoop.hdfs.web.HftpFileSystem.listStatus(HftpFileSystem.java:499)
	at org.apache.hadoop.tools.SimpleCopyListing$FileStatusProcessor.getFileStatus(SimpleCopyListing.java:535)
	at org.apache.hadoop.tools.SimpleCopyListing$FileStatusProcessor.processItem(SimpleCopyListing.java:576)
	at org.apache.hadoop.tools.util.ProducerConsumer$Worker.run(ProducerConsumer.java:190)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
I was hoping that the -i (ignore failures) option would skip errors while building the file listing as well as during the copy phase, but that doesn't appear to be the case. Is there any way to exclude certain file names from the file listing, and/or another way to work around this issue?
Created 11-29-2017 02:45 PM
@Joe Karau What is the exact HDP version you are using?
In 2.6 the -filters option should be available to exclude certain files. It is documented as "The path to a file containing a list of pattern strings, one string per line, such that paths matching the pattern will be excluded from the copy. Supports regular expressions specified by java.util.regex.Pattern." However, I'm not sure whether the filtering is applied before the point where the exception is thrown. Can you give it a try?
If it doesn't work, unfortunately I think the easiest way to fix this is to specify only the "correct" files to be copied.
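As a sketch of what that could look like (the filters file path is hypothetical; the hosts and paths are the ones from the question): each line of the filters file is a java.util.regex.Pattern, so a pattern like `.* .*` would exclude any source path containing a space, which covers 'Email Address.json'.

```shell
# Hypothetical filters file: one java.util.regex.Pattern per line.
# This pattern excludes any path that contains a space character.
cat > /tmp/distcp-filters.txt <<'EOF'
.* .*
EOF

# Then pass it to distcp (same invocation as in the question, plus -filters):
#   hadoop distcp -i -log /distcp/logpath \
#     -filters /tmp/distcp-filters.txt \
#     hftp://oldhadoop.hostname:50070/path/ /newpath

# Quick sanity check that the pattern matches the problematic name:
echo '/path/Email Address.json' | grep -qE "$(cat /tmp/distcp-filters.txt)" && echo excluded
```

Note that the pattern is matched against the full source path, so it would also exclude directories whose names contain spaces, along with everything under them.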
Created 12-07-2017 07:13 PM
Thank you, the -filters option worked like a charm.