
How would you download (copy) a directory with WebHDFS API?


I'm looking at https://hadoop.apache.org/docs/current/hadoop-proj... and I can't find an easy way to copy a whole folder.

Do I have to list the contents of the folder and download the files one by one?

1 ACCEPTED SOLUTION


Downloading an entire directory would be a recursive operation that walks the entire sub-tree, downloading each file it encounters in that sub-tree. The WebHDFS REST API alone doesn't implement any such recursive operations. (The recursive=true option for DELETE is a different case, because it's just telling the NameNode to prune the whole sub-tree. There isn't any need to traverse the sub-tree and return results to the caller along the way.) Recursion is something that would have to be implemented on the client side by listing the contents of a directory, and then handling the children returned for that directory.
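For illustration, the listing step can be done with the LISTSTATUS operation. This sketch assumes an unsecured cluster with the NameNode HTTP server on localhost:50070 (matching the example below) and a made-up "/user/chris" path; the response is trimmed to the relevant fields. Each child carries a "type" field that says whether it is a FILE or a DIRECTORY, which is what a client-side walk would branch on:

> curl -s "http://localhost:50070/webhdfs/v1/user/chris?op=LISTSTATUS"
{"FileStatuses":{"FileStatus":[
  {"pathSuffix":"file1","type":"FILE","length":6, ...},
  {"pathSuffix":"subdir1","type":"DIRECTORY","length":0, ...}
]}}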

Depending on what you need to do, it might be sufficient to use the "hdfs dfs -copyToLocal" CLI command using a path with the "webhdfs" URI scheme and a wildcard. Here is an example:

> hdfs dfs -ls webhdfs://localhost:50070/file*
-rw-r--r--   3 chris supergroup          6 2015-12-15 10:13 webhdfs://localhost:50070/file1
-rw-r--r--   3 chris supergroup          6 2015-12-15 10:13 webhdfs://localhost:50070/file2

> hdfs dfs -copyToLocal webhdfs://localhost:50070/file*

> ls -lrt file*
-rw-r--r--+ 1 chris  staff     6B Dec 16 10:23 file2
-rw-r--r--+ 1 chris  staff     6B Dec 16 10:23 file1

In this example, the "hdfs dfs -copyToLocal" command made a WebHDFS HTTP call to the NameNode to list the contents of "/". It then filtered the returned results by the glob pattern "file*". Based on those filtered results, it then sent a series of additional HTTP calls to the NameNode and DataNodes to get the contents of file1 and file2 and write them locally.
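For comparison, here is roughly what one of those per-file transfers looks like when issued directly with curl (same assumed localhost:50070 NameNode as above): the OPEN operation replies with an HTTP 307 redirect to a DataNode, and the -L flag follows that redirect and streams the file content to a local file.

> curl -L -o file1 "http://localhost:50070/webhdfs/v1/file1?op=OPEN"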

This isn't a recursive solution though. Wildcard glob matching is only sufficient for matching a static pattern and walking to a specific depth in the tree. It can't fully discover and walk the whole sub-tree. That would require custom application code.
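As a rough sketch of what that custom application code could look like (assuming an unsecured cluster, the NameNode HTTP server on localhost:50070, and a local python for JSON parsing; the source and destination paths are made up), the following shell function lists each directory with LISTSTATUS, recurses into child directories, and downloads each file with OPEN:

#!/usr/bin/env bash
# Sketch only: recursively copy a WebHDFS directory tree to the local filesystem.
NN="http://localhost:50070"   # NameNode HTTP address (assumption)

download_dir() {
  local hdfs_path="$1" local_path="$2"
  mkdir -p "$local_path"
  # LISTSTATUS returns the children of the directory as JSON.
  curl -s "$NN/webhdfs/v1$hdfs_path?op=LISTSTATUS" |
    python -c '
import json, sys
for s in json.load(sys.stdin)["FileStatuses"]["FileStatus"]:
    print(s["type"] + " " + s["pathSuffix"])' |
    while read -r type name; do
      if [ "$type" = "DIRECTORY" ]; then
        download_dir "$hdfs_path/$name" "$local_path/$name"
      else
        # OPEN redirects to a DataNode; -L follows the redirect and streams the bytes.
        curl -s -L -o "$local_path/$name" "$NN/webhdfs/v1$hdfs_path/$name?op=OPEN"
      fi
    done
}

download_dir /user/chris/data ./data   # example source and destination (made up)

A real implementation would also need to handle authentication, errors, and unusual file names, which this sketch ignores.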


5 REPLIES

Master Mentor

add "recursive" switch

curl -i -X DELETE "http://<host>:<port>/webhdfs/v1/<path>?op=DELETE
                              [&recursive=<true|false>]"


@Artem Ervits Looks like he is asking for a way to copy the contents of a whole directory rather than delete it.

Master Mentor

I'm aware of that; this was the only available example. Just add the recursive=true switch to the command you want to execute.



Thank you! I will play with it.