DistCp is failing when creating the file listing and encountering a file name with a space

New Contributor

Hello,

I'm running HDP 2.6 and attempting to use distcp to copy from a much older Hadoop cluster into HDP, so I'm running the distcp utility on the target cluster and accessing the source cluster via hftp://<host>:<port>/<path>. For example:

hadoop distcp -i -log /distcp/logpath hftp://oldhadoop.hostname:50070/path/ /newpath

In the source path there is a file with a space in its name, 'Email Address.json', and while distcp is building the copy listing it appears to fail to decode the name properly (stack trace below).

17/11/28 16:17:45 INFO tools.DistCp: DistCp job log path: /distcp/logpath

Exception in thread "pool-5-thread-1" java.lang.AssertionError: Failed to decode URI: /path/Email Address.json

	at org.apache.hadoop.util.ServletUtil.decodePath(ServletUtil.java:128)
	at org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.startElement(HftpFileSystem.java:446)
	at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
	at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
	at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:465)
	at org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.listStatus(HftpFileSystem.java:484)
	at org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.listStatus(HftpFileSystem.java:492)
	at org.apache.hadoop.hdfs.web.HftpFileSystem.listStatus(HftpFileSystem.java:499)
	at org.apache.hadoop.tools.SimpleCopyListing$FileStatusProcessor.getFileStatus(SimpleCopyListing.java:535)
	at org.apache.hadoop.tools.SimpleCopyListing$FileStatusProcessor.processItem(SimpleCopyListing.java:576)
	at org.apache.hadoop.tools.util.ProducerConsumer$Worker.run(ProducerConsumer.java:190)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

I was hoping that the -i (ignore failures) option would ignore errors during the file-listing phase as well as the copy phase, but that doesn't appear to be the case. Is there any way to exclude certain file names from the file listing, and/or any other way to work around this issue?

2 Replies

Expert Contributor

@Joe Karau What is the exact HDP version you are using?

In HDP 2.6 the -filters option should be available to exclude certain files. It is documented as "The path to a file containing a list of pattern strings, one string per line, such that paths matching the pattern will be excluded from the copy. Supports regular expressions specified by java.util.regex.Pattern." However, I'm not certain the filtering is applied before the point where this exception is thrown. Can you give it a try?
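For example, something like the following (the filters file path here is only a placeholder; as far as I know the file is read from the local filesystem of the node running distcp, and each pattern is matched against the full source path). Put this single line in /tmp/distcp-filters.txt to exclude any path containing whitespace:

.*\s.*

Then run:

hadoop distcp -i -filters /tmp/distcp-filters.txt -log /distcp/logpath hftp://oldhadoop.hostname:50070/path/ /newpath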

If that doesn't work, unfortunately I think the easiest workaround is to specify only the "correct" files to be copied, as in the example below.
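For instance (the file names here are just placeholders), distcp accepts multiple source paths, so the problem file can simply be left off the command line:

hadoop distcp -i -log /distcp/logpath hftp://oldhadoop.hostname:50070/path/file1.json hftp://oldhadoop.hostname:50070/path/file2.json /newpath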

New Contributor

Thank you, the -filters option worked like a charm.