I need to copy a compressed archive (.tar.gz) from Amazon S3 into HDFS and, in the process, uncompress it so that the files and subdirectories contained in the archive are created on HDFS. I can store the archive on S3 as either .zip or .tar.gz.
What is the best way to achieve this while avoiding multiple hops (for example, downloading to local disk first)? I have looked at tools such as s3distcp, but they seem to handle only individually compressed files, not archives. Any help would be appreciated.
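
To make the "no intermediate hop" idea concrete, here is a rough sketch of the kind of single-pass extraction being asked about. It is only an illustration under stated assumptions, not a known feature of s3distcp or any Hadoop tool: `extract_tar_stream` and `open_dest` are hypothetical names, `src` stands for any readable binary stream (for instance the body of an S3 GET response), and `open_dest` stands for whatever function opens a file for writing on HDFS. It relies on Python's `tarfile` stream mode (`"r|gz"`), which reads the archive front to back without seeking. Note that .zip keeps its index at the end of the file, so it generally cannot be streamed this way.

```python
import io
import posixpath
import shutil
import tarfile


def extract_tar_stream(src, open_dest):
    """Unpack a .tar.gz read sequentially from `src`, one file at a time.

    src:       any binary file-like object with read() (assumed: an S3 object body).
    open_dest: hypothetical callable(path) -> writable binary file object
               (assumed: a thin wrapper around an HDFS client).
    Returns the list of relative paths that were written.
    """
    written = []
    # "r|gz" treats the input as a forward-only stream: no seek, no local copy.
    with tarfile.open(fileobj=src, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # directories are implied by file paths; links are skipped
            path = posixpath.normpath(member.name)
            # Refuse entries that would escape the destination root.
            if path.startswith("/") or path == ".." or path.startswith("../"):
                raise ValueError(f"unsafe path in archive: {member.name}")
            # In stream mode each member must be read before advancing to the next.
            data = archive.extractfile(member)
            with open_dest(path) as out:
                shutil.copyfileobj(data, out)
            written.append(path)
    return written
```

The destination is passed in as a callable so the same loop would work with any HDFS client library, and subdirectories fall out of the member paths as long as the writer creates parent directories as needed.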