How to distcp a partitioned multi-level directory while preserving structure
Labels: Apache Hadoop
Created ‎02-02-2016 08:47 PM
I need to take a list of HDFS directories and copy the contents of those directories to another HDFS cluster using distcp. The problem is getting the directory tree recreated recursively and automatically. These are large partitioned files, and the available means seem to preserve structure only one level deep. Can anyone provide an example?
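For concreteness, this is roughly the shape of what I'm running now (a sketch with placeholder paths; the list file simply names the partitioned directories I want copied, one per line):
# sources.txt lives on HDFS and lists the source directories, one per line
hadoop distcp -f hdfs://nn1:8020/user/me/sources.txt hdfs://nn2:8020/backup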
Created ‎02-02-2016 08:48 PM
Look into Apache Falcon to automate this.
Also, Falcon mirroring is fantastic: http://falcon.apache.org/Recipes.html
Created ‎02-03-2016 04:50 PM
Falcon is not available in my environment, unfortunately. Is there no way to do this without it? This must come up fairly often with partitioned HDFS files and ORC.
Created ‎02-03-2016 04:52 PM
@Peter Coates Falcon calls distcp.
You can leverage that. I know you mentioned in your comment that you are looking for something other than distcp.
You can install Falcon fairly easily.
Created ‎02-03-2016 07:41 PM
I must have been unclear. We definitely want to use distcp and cannot use Falcon for admin reasons. The problem is that I can't get fully recursive behavior with distcp. There's probably a way to do it, but I'm having trouble getting it to build the full depth of the directories on the target when the tree goes more than one level deep.
Created ‎02-06-2016 03:12 PM
@Peter Coates I think you missed my point 😛
Please look into HDFS mirroring in Falcon http://hortonworks.com/hadoop-tutorial/mirroring-datasets-between-hadoop-clusters-with-apache-falcon...
Created ‎02-03-2016 06:33 PM
Hello @Peter Coates. I expect DistCp to copy recursively and fully preserve the nested directory structure from the source, not stop one level deep. Do you have a specific example where you are not seeing recursive behavior?
Created ‎02-05-2016 09:34 PM
Aha. The problem turns out to be with the multiple source directories named in the source-list file. You can have many sources, but only one target. The behavior I was looking for was for distcp to build a separate tree for each input directory under the target. That isn't how distcp works, but it's easy to script around it.
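For anyone who finds this later, the workaround looks roughly like the following (a sketch only, not my exact script; srclist.txt, the nn1/nn2 addresses, and the /backup target are placeholders, and it assumes Hadoop 2.x for mkdir -p): run one distcp per source directory, recreating each directory's parent path under the target first.
#!/usr/bin/env bash
# One distcp per source directory, so each subtree keeps its full path under the target.
while read -r src; do
  rel=${src#hdfs://nn1:8020/}                                    # path relative to the source namenode
  hadoop fs -mkdir -p "hdfs://nn2:8020/backup/$(dirname "$rel")" # recreate the parent directories
  hadoop distcp "$src" "hdfs://nn2:8020/backup/$rel"             # copy this directory's whole subtree
done < srclist.txt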
Created ‎04-20-2016 03:01 AM
Hi,
@Peter Coates
Assuming you have a moderate number of files, did you try the option below?
bash$ hadoop distcp2 -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo
Where srclist contains (you can populate this file with a recursive listing):
hdfs://nn1:8020/foo/dir1/a
hdfs://nn1:8020/foo/dir2/b
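If it helps, one way to populate srclist with a recursive listing (a sketch; /foo is the example path above, and in practice you would filter the list down to just the directories you actually want copied so the sources don't overlap):
# Keep only directory entries (lines starting with 'd') and prefix the source namenode
hadoop fs -ls -R /foo | awk '/^d/ {print "hdfs://nn1:8020" $NF}' > srclist.local
# The -f option reads the list from a URI, hdfs://nn1:8020/srclist in the example above
hadoop fs -put srclist.local hdfs://nn1:8020/srclist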
More info here:
https://hadoop.apache.org/docs/r1.2.1/distcp2.html
Please let me know if this works.
Thanks
