Created 09-26-2016 10:49 PM
Can anyone explain exactly what's going on here? When running "set hive.tez.exec.print.summary=true;" with large Hive queries over S3, the job is only about half done when Hive/Tez prints all the job stats as if the job were complete. The following is the final log line (slightly obfuscated), and the move it describes takes as long as the query itself.
INFO : Moving data to: s3a://xxxxxxxxxxx/incoming/mha/poc/.hive-staging_hive_2016-09-26_17-49-00_060_4187715327928xxxxxx-3/-ext-10000 from s3a://xxxxxxxxxxxx/incoming/mha/poc/.hive-staging_hive_2016-09-26_17-49-00_060_4187715327928xxxxxx-3/-ext-10002
What is the reason for the data being moved? If the same thing happens with HDFS, it's not noticeable, probably because it's just moving pointers around, but on S3 it seems to be actually moving the data. (a) Is this true, and (b) why the movement?
Created 09-27-2016 01:04 AM
This is not specific to "hive.tez.exec.print.summary=true", which merely prints the summary details of the DAG. In this case, the DAG itself ran a lot faster, and the delay you are observing is due to the file movement from S3 to S3 as part of the job's final cleanup activity.
Hive moves the job output to its final location, and this activity is carried out in the Hive client. In S3, a rename is a "copy + delete" operation, so even though the rename happens on the AWS side, it takes time proportional to the amount of data churned out by the job. In HDFS, a rename is a much cheaper operation, which is why you do not observe this delay there. An alternative is to write the data to local HDFS and then move it to S3 via distcp.
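A minimal sketch of that workaround, assuming a hypothetical NameNode endpoint, bucket, and paths (adjust all of them to your cluster): point the query's output at an HDFS location first, then push the finished result to S3 in a single parallel pass.

# Hypothetical paths; replace with your own. The query output lands on HDFS,
# where Hive's final rename is a cheap metadata operation, and distcp then
# copies the completed result to S3 as a parallel MapReduce job.
hadoop distcp hdfs://namenode:8020/tmp/poc_output s3a://my-bucket/incoming/mha/poc/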
Created 09-27-2016 02:15 PM
Thanks. That's what I thought: it's negligible in HDFS but not always trivial in S3 because it's a copy + delete. Interesting idea about using distcp to transfer the data. Not sure whether that would actually help with EBS backing HDFS, but it's worth a try.
Created 09-27-2016 03:14 PM
This brings up an issue. When the S3-to-S3 moves occur, does the data cross the local LAN link, or does the move happen entirely within the S3 infrastructure? That is, if you copy a NAS-backed file on a server, it is read in across the LAN and then written out again. S3 isn't a NAS in that sense, but is that what it does, or does S3 move the data around on its own networks when the move is S3-to-S3? This matters because the network is probably our limiting resource for our query types.
Created 09-28-2016 08:22 PM
@Peter Coates: There is no local download and upload (distcp does that, which is bad). This makes more sense if you think of S3 as a sharded key-value store rather than a NAS. The filename is the key, so whenever the key changes, the data moves from one shard to another; the command will not return successfully until the KV store has finished moving the data between those shards. That is a data operation, not a metadata operation, although it can be fast in scenarios where the change of key does not result in a shard change.
In a FileSystem like HDFS, the block IDs of the data are independent of the name of the file: the name maps to an Inode, and the Inode maps to the blocks. So a rename happens entirely within metadata, thanks to the extra indirection of the Inode.
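A quick way to see the difference yourself, as a sketch with hypothetical paths (substitute directories you actually have): time the same rename against HDFS and against S3A.

# Hypothetical paths. On HDFS this returns almost immediately, since only the
# name-to-Inode mapping changes. On S3A, the client issues a server-side COPY
# for every object under the prefix and then deletes the originals, so the
# elapsed time scales with the amount of data.
time hadoop fs -mv hdfs://namenode:8020/tmp/big_dataset hdfs://namenode:8020/tmp/big_dataset_renamed
time hadoop fs -mv s3a://my-bucket/tmp/big_dataset s3a://my-bucket/tmp/big_dataset_renamed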