Member since: 08-29-2016 · Posts: 24 · Kudos Received: 8 · Solutions: 0
02-13-2017 07:27 PM
1 Kudo
Buckets are sharded internally: the more load you put on the same part of a bucket, the less bandwidth you apparently get from each request. This sharding is based on the object key (the filename): the more diverse your filenames, the better the bandwidth is likely to be. This is mentioned in the AWS docs, but they are deliberately vague about what actually happens, so that they keep the freedom to change the sharding policy without complaints about backwards compatibility. You will also get different bandwidth numbers depending on the network capacity of your VM, and a faster read rate than write rate. When Netflix talk about their performance, assume multiple buckets and many, many readers.
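As a rough illustration of the idea, assuming the AWS SDK for Java; the bucket name and key layout below are made up. Prepending a short hash to each key keeps uploads from piling onto one lexicographic range (for example a common date prefix):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.io.File;

public class PrefixedUpload {
    // Hypothetical bucket name, purely for illustration.
    private static final String BUCKET = "my-example-bucket";

    // Derive a short hash prefix from the logical name so that keys spread
    // across S3's internal partitions instead of all sharing one prefix.
    static String shardedKey(String logicalName) {
        String prefix = Integer.toHexString(logicalName.hashCode() & 0xff);
        return prefix + "/" + logicalName;
    }

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String logicalName = "logs/2017/02/13/part-00042.gz";
        // Uploads under e.g. "a3/logs/2017/02/13/..." instead of a hot date prefix.
        s3.putObject(BUCKET, shardedKey(logicalName), new File("part-00042.gz"));
    }
}
```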
11-02-2016 02:00 PM
@Peter Coates - look for the parameters fs.s3a.multipart.threshold and fs.s3a.multipart.size
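For illustration, here is one way those two properties might be set programmatically through Hadoop's Configuration before opening the S3A filesystem; the bucket name, paths, and byte values are only placeholders (the same properties can of course be set in core-site.xml instead):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class S3AMultipartSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Start a multipart upload for files larger than ~128 MB (illustrative value).
        conf.set("fs.s3a.multipart.threshold", "134217728");
        // Upload in ~64 MB parts once multipart kicks in (illustrative value).
        conf.set("fs.s3a.multipart.size", "67108864");

        // Hypothetical bucket name; any s3a:// URI reachable from the cluster would do.
        FileSystem fs = FileSystem.get(new URI("s3a://my-example-bucket/"), conf);
        fs.copyFromLocalFile(new Path("/tmp/local-data.bin"),
                             new Path("s3a://my-example-bucket/data/local-data.bin"));
    }
}
```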
10-05-2016 07:04 PM
2 Kudos
@Peter Coates Views in Hive today are purely logical. That means there is no physical data lying around (at least not today) for a view. So when you create a view, all you are really doing is making it easier to write future queries on top of that view, or, in your case, creating views to help with compliance and policies. Once a view is created, you can create access policies for that view to control who should have access to it. This is in addition to the policies you may have at the table level. Of course, if someone has access to T1 and T2, then restricting view permissions is quite meaningless. In short, no data is left lying around for a view after a query completes (almost). I have seen a scenario where temp files created by Hive during a query were not deleted because the query failed. Check this link. The following link should answer your question in more detail: https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization
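As a hedged sketch of that workflow, using the Hive JDBC driver; the HiveServer2 host, database, table, column, and role names are all hypothetical, and the GRANT assumes SQL Standard Based Authorization is enabled (Ranger policies would instead be defined outside of HiveQL):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ComplianceView {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver and connect to a hypothetical HiveServer2.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver2-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // The view is purely logical: only its definition is stored;
            // no data is materialized when it is created or queried.
            stmt.execute("CREATE VIEW IF NOT EXISTS customer_masked AS " +
                         "SELECT id, name FROM customers");

            // Grant access on the view without exposing the base table.
            stmt.execute("GRANT SELECT ON customer_masked TO ROLE analysts");
        }
    }
}
```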
09-28-2016 08:22 PM
@Peter Coates: There is no local download and upload (distcp does that, which is bad).
This makes more sense if you think of S3 as a sharded key-value store (instead of a NAS). The filename is the key, so whenever the key changes, the data moves from one shard to another. The command will not return successfully until the KV store has finished moving the data between those shards; this is a data operation, not a metadata operation, though it can be fairly fast in scenarios where the change of key does not result in a shard change.
In a FileSystem like HDFS, the block IDs of the data are independent of the name of the file: the name maps to an inode, and the inode maps to the blocks. So the rename is entirely within metadata, thanks to the extra indirection of the inode.
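To make the contrast concrete, a rough sketch assuming the AWS SDK for Java and the Hadoop FileSystem API; the bucket, key, and path names are invented. The S3 "rename" has to copy the object bytes, while the HDFS rename only touches NameNode metadata:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameCostContrast {
    public static void main(String[] args) throws Exception {
        // Hypothetical bucket and key names, for illustration only.
        String bucket = "my-example-bucket";

        // On S3 a "rename" is really a server-side copy of the object data
        // to the new key followed by a delete of the old key, so its cost
        // grows with the size of the object.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        s3.copyObject(bucket, "staging/part-00000", bucket, "final/part-00000");
        s3.deleteObject(bucket, "staging/part-00000");

        // On HDFS a rename only rewires the name-to-inode mapping in the
        // NameNode; the blocks never move, so the cost is independent of size.
        FileSystem hdfs = FileSystem.get(new Configuration());
        hdfs.rename(new Path("/staging/part-00000"), new Path("/final/part-00000"));
    }
}
```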
09-25-2016 03:21 PM
I feared as much. Thank you for your suggestion--I think it will work for us, since this is a cloud cluster and we can archive to S3, obviating the need to use heterogeneous storage for its intended purpose. However, I would like to suggest a Jira ticket to add a storage class for this purpose. There are significant use cases where it would be useful to know that a subset of your data is confined to specific drives, (a) without the restrictions of the existing policies and (b) without abusing a storage class for this purpose.
09-20-2016 04:14 PM
Do you mean 50 Mbps per mapper or for the cluster as a whole? (I assume you mean the former, as the latter would imply almost two days to read a TB of S3 data: 1 TB is about 8x10^12 bits, and at 50 Mbps that is roughly 160,000 seconds, or about 44 hours.) Assuming you do mean 50 Mbps per mapper, what is the limit on S3 throughput to the whole cluster? That's the key information. Do you have a ballpark number for this?