Created 02-02-2016 08:47 PM
I need to take a list of HDFS directories and copy their contents to another HDFS cluster using distcp. The problem is getting the directory tree created recursively on the target. These are large partitioned files, and the available options seem to preserve structure only one level deep. Can anyone provide an example?
Created 02-02-2016 08:48 PM
Look into Apache Falcon to automate this.
Also, Falcon mirroring is fantastic: http://falcon.apache.org/Recipes.html
Created 02-03-2016 04:50 PM
Falcon is not available in my environment, unfortunately. Is there no way to do this without it? This must come up fairly often with partitioned HDFS files and ORC.
Created 02-03-2016 04:52 PM
@Peter Coates Falcon calls distcp under the hood, so you can leverage that. I know you mentioned in your comment that you are looking for something other than distcp.
You can install Falcon fairly easily.
Created 02-03-2016 07:41 PM
I must have been unclear. We definitely want to use distcp and cannot use Falcon for admin reasons. The problem is that I can't get fully recursive behavior with distcp. There's probably a way to do it, but I'm having trouble getting it to build the full depth of the directories on the target when the tree goes more than one level deep.
Created 02-06-2016 03:12 PM
@Peter Coates I think you missed my point 😛
Please look into HDFS mirroring in Falcon http://hortonworks.com/hadoop-tutorial/mirroring-datasets-between-hadoop-clusters-with-apache-falcon...
Created 02-03-2016 06:33 PM
Hello @Peter Coates. I expect DistCp to copy recursively and fully preserve the nested directory structure from the source, not stop one level deep. Do you have a specific example where you are not seeing recursive behavior?
Created 02-05-2016 09:34 PM
Aha. The problem turns out to be with naming multiple source directories in the source-list file. You can have many sources, but only one target, and the behavior I was looking for would be for distcp to create a separate tree for each input directory under the target. That is not how distcp works, but it's easy to script around it.
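To illustrate the "script around it" approach: a minimal sketch that runs one distcp per source directory, so each source keeps its own subtree under the target. The source paths reuse the nn1/nn2 hosts from the examples in this thread; the /backup target root is a made-up placeholder, and the commands are echoed as a dry run rather than executed.

```shell
#!/bin/sh
# Hypothetical target root; replace with your own destination.
TARGET=hdfs://nn2:8020/backup

# One distcp invocation per source directory, each into its own
# subtree under $TARGET. Echoed here (dry run) instead of executed.
for SRC in hdfs://nn1:8020/foo/dir1 hdfs://nn1:8020/foo/dir2; do
  NAME=$(basename "$SRC")   # e.g. dir1, dir2
  echo hadoop distcp -update "$SRC" "$TARGET/$NAME"
done
```

Remove the `echo` to actually run the copies; `-update` makes reruns skip files that already match on the target.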
Created 04-20-2016 03:01 AM
Hi,
@Peter Coates
Assuming you have a moderate number of files, did you try the option below:
bash$ hadoop distcp2 -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo
Where srclist contains (you can populate this file with a recursive listing):
hdfs://nn1:8020/foo/dir1/a
hdfs://nn1:8020/foo/dir2/b
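One way to populate srclist is to pipe a recursive listing through awk, keeping only the path column (the last field of `hadoop fs -ls -R` output). A sketch, with the actual hadoop command shown as a comment and the awk step demonstrated on two sample listing lines matching the example above:

```shell
# On a real cluster you would run something like:
#   hadoop fs -ls -R hdfs://nn1:8020/foo | awk '{print $NF}' > srclist
# Here the awk step is demonstrated on canned -ls -R style lines.
printf '%s\n' \
  'drwxr-xr-x - user group 0 2016-02-03 12:00 hdfs://nn1:8020/foo/dir1/a' \
  'drwxr-xr-x - user group 0 2016-02-03 12:00 hdfs://nn1:8020/foo/dir2/b' \
  | awk '{print $NF}' > srclist   # keep only the trailing path field
cat srclist
```

Note that `-ls -R` lists every file and subdirectory, so you may want to filter the output down to just the top-level directories you intend to copy.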
More info here:
https://hadoop.apache.org/docs/r1.2.1/distcp2.html
Please let me know if this works.
Thanks