S3 to S3 movement/deletion of files is slow

Explorer

Hello,

Recently I ran into performance issues while moving files from one directory to another on S3. Several posts online confirm the same behavior. Compared to HDFS (where a move is just a pointer change on the NameNode), S3 actually copies the files from one partition to another.
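To make the difference from HDFS concrete, here is a minimal sketch of what a "move" amounts to on S3: one copy plus one delete per object. It assumes a client object exposing `copy_object` and `delete_object` with boto3-style keyword arguments (for example `boto3.client("s3")`); the helper names `move_object` and `move_prefix`, the thread pool, and the `workers` parameter are illustrative assumptions, not something taken from this thread.

```python
from concurrent.futures import ThreadPoolExecutor


def move_object(client, bucket, src_key, dst_key):
    """S3 has no rename: copy to the new key, then delete the old one."""
    client.copy_object(
        Bucket=bucket,
        Key=dst_key,
        CopySource={"Bucket": bucket, "Key": src_key},
    )
    client.delete_object(Bucket=bucket, Key=src_key)
    return dst_key


def move_prefix(client, bucket, keys, src_prefix, dst_prefix, workers=8):
    """Move every key under src_prefix to dst_prefix, several at a time.

    `keys` is the already-listed set of object keys (hypothetical input);
    keys outside src_prefix are left untouched.
    """
    def _move(key):
        new_key = dst_prefix + key[len(src_prefix):]
        return move_object(client, bucket, key, new_key)

    selected = [k for k in keys if k.startswith(src_prefix)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the results in the same order as `selected`
        return list(pool.map(_move, selected))
```

Because every object costs a copy and a delete, the total time grows with the number and size of the files, which is why a move that is instant on HDFS can be slow here. Running several copies at once, as in `move_prefix`, may reduce wall-clock time, but that is untested in this thread.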

I checked with the AWS team as well. They said that if we distribute the files across different directories/buckets, S3 will internally spread the data over different partitions (since partitioning is based on hashing), which could improve performance. Examples are given below.

Example 1:

s3://bucket1/data/file1

s3://bucket1/data/file2

s3://bucket1/data/file3

s3://bucket1/data/file4

Example 2:

s3://bucket1/0a/file1

s3://bucket1/1a/file2

s3://bucket1/2a/file3

s3://bucket1/3a/file4

According to them, if a move command is run on the files, the second layout should perform better. With only a few files I have not seen much difference. I will try it with bulk data.
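One way to produce keys shaped like Example 2 is to derive a short prefix from a hash of the file name. The sketch below is only an illustration: the function name `prefixed_key`, the use of MD5, and the two-character width are my assumptions, not anything AWS specified.

```python
import hashlib


def prefixed_key(filename, width=2):
    """Return '<hash-prefix>/<filename>' so keys spread across prefixes.

    The prefix is the first `width` hex characters of the MD5 digest of
    the file name (an arbitrary choice made for this sketch).
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return f"{digest[:width]}/{filename}"


for name in ["file1", "file2", "file3", "file4"]:
    print(f"s3://bucket1/{prefixed_key(name)}")
```

The mapping is deterministic, so a reader can recompute the prefix from the file name when it needs to find the object again.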

Any other suggestions?

Thanks

Shubham


Re: S3 to S3 movement/deletion of files is slow

Mentor

Have you tried using s3a? https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html Qubole also just published an article on improving the performance of directory listings: https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/ Maybe that can help?


Re: S3 to S3 movement/deletion of files is slow

Explorer

@Artem Ervits I tried s3a as well, and it took the same time. I am in touch with Qubole about this and will post if I get any information from them. Thanks for the info!
