How to distcp a partitioned multi-level directory while preserving structure

Rising Star

I need to take a list of HDFS directories and copy the contents of those directories to another HDFS cluster using distcp. The problem is getting the directories created recursively and automatically on the target. These are large partitioned files, and the options I have found seem to preserve structure only one level deep. Can anyone provide an example?

9 REPLIES

Master Mentor

@Peter Coates

Look into Apache Falcon to automate this.

Also, Falcon mirroring is fantastic: http://falcon.apache.org/Recipes.html

Rising Star

Falcon is not available in my environment, unfortunately. Is there no way to do this without it? This must come up fairly often with partitioned HDFS files and ORC.

Master Mentor

@Peter Coates Falcon calls distcp.

You can leverage that. I know you mentioned in your comment that you are looking for something other than distcp.

You can install Falcon fairly easily.

Rising Star

I must have been unclear. We definitely want to use distcp and cannot use Falcon for admin reasons. The problem is that I can't get fully recursive behavior with distcp. There's probably a way to do it, but I'm having trouble getting it to build the full depth of the directories on the target when the source goes more than one level deep.


Hello @Peter Coates. I expect DistCp to copy recursively and fully preserve the nested directory structure from the source, not stop one level deep. Do you have a specific example where you are not seeing recursive behavior?

Rising Star

Aha. The problem turns out to be with listing multiple directories in the file that names the sources. You can have many sources, but only one target. The behavior I was looking for was for distcp to make a separate tree for each input directory under the target. This seems not to be the way distcp works, but it's easy to script around it.
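
For anyone landing here later, a minimal sketch of one way to script around it: run one distcp per source directory, with the target path derived from the source path so each source keeps its own tree. The function names target_for and plan_copies are hypothetical, and the sketch assumes the source list is a local text file with one URI per line. It only prints the commands so they can be reviewed first. The -update flag is used because, as I read the DistCp guide, it copies the contents of the source directory into the named target rather than nesting the source directory under it; check that against your Hadoop version.

```shell
#!/usr/bin/env bash
# Sketch only: print one distcp command per source directory so that each
# source keeps its own subtree under the target root. Nothing is executed.

# target_for SRC_URI DST_ROOT  (hypothetical helper)
# Strip the scheme and authority from SRC_URI and append the remaining path
# to DST_ROOT, e.g. hdfs://nn1:8020/foo/dir1/a -> DST_ROOT/foo/dir1/a
target_for() {
  local src="$1" dst_root="$2"
  local rel="${src#*://*/}"            # drop "scheme://host:port/"
  printf '%s/%s\n' "${dst_root%/}" "$rel"
}

# plan_copies SRCLIST_FILE DST_ROOT  (hypothetical helper)
# SRCLIST_FILE is assumed to be a local file with one source URI per line;
# blank lines are skipped.
plan_copies() {
  local srclist="$1" dst_root="$2" src
  while IFS= read -r src || [ -n "$src" ]; do
    [ -z "$src" ] && continue
    printf 'hadoop distcp -update %s %s\n' \
      "$src" "$(target_for "$src" "$dst_root")"
  done < "$srclist"
}
```

Usage would be something like `plan_copies srclist.txt hdfs://nn2:8020/bar/foo`, then piping the output to bash once the printed commands look right.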


Hi,

@Peter Coates

Assuming you have a moderate number of files, have you tried the option below?

bash$ hadoop distcp2 -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo

Where srclist contains (you can populate this file by recursive listing)

hdfs://nn1:8020/foo/dir1/a

hdfs://nn1:8020/foo/dir2/b
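
A rough sketch of populating srclist from a recursive listing, in case it helps. The function name dirs_at_depth is hypothetical. It assumes the usual `hdfs dfs -ls -R` output layout (permissions first, path last) and paths without spaces, and it keeps only directories at one chosen depth, since distcp already recurses into each source and listing both a parent and its children would copy data twice.

```shell
#!/usr/bin/env bash
# dirs_at_depth PREFIX DEPTH  (hypothetical helper)
# Reads `hdfs dfs -ls -R` output on stdin and prints PREFIX followed by the
# path of every directory that has exactly DEPTH path components.
# Assumes the path is the last whitespace-separated field on each line.
dirs_at_depth() {
  local prefix="$1" depth="$2"
  awk -v prefix="$prefix" -v depth="$depth" '
    $1 ~ /^d/ {                            # directory entries only
      path = $NF
      n = split(path, parts, "/") - 1      # leading "/" gives an empty first part
      if (n == depth) print prefix path
    }'
}
```

For the layout above this could be run as `hdfs dfs -ls -R /foo | dirs_at_depth hdfs://nn1:8020 3 > srclist.txt`, and the resulting file then uploaded with `hdfs dfs -put` so that -f can read it.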

More info here:

https://hadoop.apache.org/docs/r1.2.1/distcp2.html

Please let me know if this works.

Thanks