Recently I faced performance issues while moving files from one directory to another in S3. I checked multiple posts online, and they confirmed the same. Unlike HDFS (where a move is just a metadata change on the NameNode), S3 has no real rename: it actually copies each object to the new key and then deletes the original.
I checked with the AWS team as well. They said that if we distribute files across different directories/buckets, S3 will internally spread the data across different partitions (partitioning is based on a hash of the key prefix), which could improve performance. Examples are given below.
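To illustrate the prefix idea, here is a minimal sketch (the 2-character prefix length and the helper name are my own assumptions, not AWS's exact recommendation): derive a short hash prefix from each key so that writes fan out across S3's internal partitions instead of piling onto one lexicographic range.

```python
import hashlib

def prefixed_key(key: str) -> str:
    """Prepend a 2-hex-char hash prefix (hypothetical scheme) so that
    lexicographically close keys land under different S3 partitions."""
    prefix = hashlib.md5(key.encode("utf-8")).hexdigest()[:2]
    return f"{prefix}/{key}"

# Keys that would otherwise share the same "logs/2017-01-..." prefix
# now start with different hash prefixes:
print(prefixed_key("logs/2017-01-01/part-0000"))
print(prefixed_key("logs/2017-01-02/part-0000"))
```

The trade-off is that listing a "directory" now requires enumerating every hash prefix, so this mainly helps write/move-heavy workloads.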
According to them, when a move command is executed, the second approach (distributed prefixes) should perform better. For a few files I have not seen much difference; I will try with bulk data.
Any other suggestions?
Have you tried using s3a? https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html Qubole also just published an article on improving the performance of listing directories: https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/ Maybe that can help?
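If you go the s3a route, a typical bulk copy looks something like the sketch below (bucket name and paths are placeholders; `fs.s3a.access.key`/`fs.s3a.secret.key` can also come from the environment or instance profile, so treat this as a config fragment rather than a ready-to-run command):

```shell
# Bulk copy from HDFS into S3 over the s3a connector,
# raising the connection/thread pool for more parallel uploads.
hadoop distcp \
  -D fs.s3a.connection.maximum=100 \
  -D fs.s3a.threads.max=20 \
  hdfs:///data/src \
  s3a://my-bucket/dest/
```

distcp runs the copy as a parallel MapReduce job, which tends to matter far more for bulk data than it does for a handful of files.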