I need to take a list of HDFS directories and copy the contents of those directories to another HDFS cluster using distcp. The problem is getting it to recursively create the directories automatically. These are large partitioned files, and the available options seem to preserve structure only one level deep. Can anyone provide an example?
Falcon is not available in my environment, unfortunately. Is there no way to do this without it? This must come up fairly often with partitioned HDFS files and ORC.
I must have been unclear. We definitely want to use distcp and cannot use Falcon for administrative reasons. The problem is that I can't get fully recursive behavior with distcp. There's probably a way to do it, but I'm having trouble getting it to build the full depth of the directory tree on the target when it goes more than one level deep.
Hello @Peter Coates. I expect DistCp to copy recursively and fully preserve the nested directory structure from the source, not stop one level deep. Do you have a specific example where you are not seeing recursive behavior?
Aha. The problem turns out to be with the multiple directories named in the source-list file. You can have many sources, but only one target. The behavior I was looking for was for distcp to make a separate tree for each input directory under the target. This seems not to be the way distcp works, but it's easy to script around it.
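One way to script around it, sketched as a minimal bash loop: run one distcp per source directory, appending the last path component to the target so each source gets its own tree. The cluster addresses and table paths here are hypothetical examples, not from the original thread.

```shell
#!/usr/bin/env bash
# Sketch: one distcp invocation per source directory, so each source
# tree is recreated under its own subdirectory of the target.
# Namenode addresses and paths below are illustrative only.

# Print the distcp command for one source dir; drop the "echo" to run it.
copy_tree() {
  local src=$1 target=$2
  local name
  name=$(basename "$src")   # last path component, e.g. "table1"
  echo hadoop distcp "$src" "$target/$name"
}

TARGET="hdfs://nn2:8020/bar/foo"
for src in hdfs://nn1:8020/data/table1 hdfs://nn1:8020/data/table2; do
  copy_tree "$src" "$TARGET"
done
```

With the `echo` in place this only prints the commands for review (e.g. `hadoop distcp hdfs://nn1:8020/data/table1 hdfs://nn2:8020/bar/foo/table1`); remove it to execute them.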
Assuming you have a moderate number of files, have you tried the option below:
bash$ hadoop distcp2 -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo
Where srclist contains the source paths, one per line (you can populate this file with a recursive listing).
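For illustration, a srclist file is plain text with one HDFS path per line; the paths below are hypothetical. On a real cluster you might derive them from a directory listing, e.g. `hdfs dfs -ls /data | awk '{print $8}'`.

```shell
# Create an example srclist file (paths are illustrative only).
cat > srclist <<'EOF'
hdfs://nn1:8020/data/table1
hdfs://nn1:8020/data/table2
EOF
```

Each path listed in the file is copied recursively under the single target directory given on the command line.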
Please let me know if this works.