Member since: 08-29-2016 · Posts: 24 · Kudos Received: 8 · Solutions: 0
02-13-2017 07:27 PM
1 Kudo
Buckets are sharded internally: the more load you put on the same part of a bucket, the less bandwidth you apparently get from each request. This sharding is based on the object key (the filename): the more diverse your filenames, the better the bandwidth is likely to be. This is mentioned in the AWS docs, but they are deliberately vague about what actually happens, so that they keep the freedom to change the sharding policy without complaints about backwards compatibility. You will also get different bandwidth numbers depending on the network capacity of your VM, and a faster read rate than write rate. When Netflix talk about their performance, assume multiple buckets and many, many readers.
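As a rough illustration of the idea, assuming the AWS SDK for Java; the bucket name and key layout below are made up. Prepending a short hash to each key keeps uploads from piling onto one lexicographic range (for example a common date prefix):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.io.File;

public class PrefixedUpload {
    // Hypothetical bucket name, purely for illustration.
    private static final String BUCKET = "my-example-bucket";

    // Derive a short hash prefix from the logical name so that keys spread
    // across S3's internal partitions instead of all sharing one prefix.
    static String shardedKey(String logicalName) {
        String prefix = Integer.toHexString(logicalName.hashCode() & 0xff);
        return prefix + "/" + logicalName;
    }

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String logicalName = "logs/2017/02/13/part-00042.gz";
        // Uploads under e.g. "a3/logs/2017/02/13/..." instead of a hot date prefix.
        s3.putObject(BUCKET, shardedKey(logicalName), new File("part-00042.gz"));
    }
}
```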
11-02-2016 02:00 PM
@Peter Coates - look for the parameters fs.s3a.multipart.threshold and fs.s3a.multipart.size
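For illustration, here is one way those two properties might be set programmatically through Hadoop's Configuration before opening the S3A filesystem; the bucket name, paths, and byte values are only placeholders (the same properties can of course be set in core-site.xml instead):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class S3AMultipartSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Start a multipart upload for files larger than ~128 MB (illustrative value).
        conf.set("fs.s3a.multipart.threshold", "134217728");
        // Upload in ~64 MB parts once multipart kicks in (illustrative value).
        conf.set("fs.s3a.multipart.size", "67108864");

        // Hypothetical bucket name; any s3a:// URI reachable from the cluster would do.
        FileSystem fs = FileSystem.get(new URI("s3a://my-example-bucket/"), conf);
        fs.copyFromLocalFile(new Path("/tmp/local-data.bin"),
                             new Path("s3a://my-example-bucket/data/local-data.bin"));
    }
}
```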
10-05-2016 07:04 PM
2 Kudos
@Peter Coates Views in Hive today are purely logical. That means there is no physical data lying around (at least not today) for a view. So when you create a view, all you are really doing is making it easier to write future queries on top of that view, or, in your case, creating views to help with compliance and policies. Once a view is created, you can create access policies for that view to control who should have access to it. This is in addition to the policies you may have at the table level. Of course, if someone has access to T1 and T2, then restricting view permissions is quite meaningless. In short, no data is left lying around for a view after a query completes (almost). I have seen a scenario where temp files created by Hive during a query were not deleted because the query failed. Check this link. The following link should answer your question in more detail: https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization
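As a hedged sketch of that workflow, using the Hive JDBC driver; the HiveServer2 host, database, table, column, and role names are all hypothetical, and the GRANT assumes SQL Standard Based Authorization is enabled (Ranger policies would instead be defined outside of HiveQL):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ComplianceView {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver and connect to a hypothetical HiveServer2.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver2-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // The view is purely logical: only its definition is stored;
            // no data is materialized when it is created or queried.
            stmt.execute("CREATE VIEW IF NOT EXISTS customer_masked AS " +
                         "SELECT id, name FROM customers");

            // Grant access on the view without exposing the base table.
            stmt.execute("GRANT SELECT ON customer_masked TO ROLE analysts");
        }
    }
}
```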
09-28-2016 08:22 PM
@Peter Coates: There is no local download and upload (distcp does that, which is bad).
This makes more sense if you think of S3 as a sharded key-value store (instead of a NAS). The filename is the key, so whenever the key changes, the data moves from one shard to another. The command will not return successfully until the KV store has finished moving the data between those shards; this is a data operation, not a metadata operation, though it can be fairly fast in scenarios where the change of key does not result in a shard change.
In a FileSystem like HDFS, the block IDs of the data are independent of the name of the file: the name maps to an inode, and the inode maps to the blocks. So the rename is entirely within metadata, thanks to the extra indirection of the inode.
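To make the contrast concrete, a rough sketch assuming the AWS SDK for Java and the Hadoop FileSystem API; the bucket, key, and path names are invented. The S3 "rename" has to copy the object bytes, while the HDFS rename only touches NameNode metadata:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameCostContrast {
    public static void main(String[] args) throws Exception {
        // Hypothetical bucket and key names, for illustration only.
        String bucket = "my-example-bucket";

        // On S3 a "rename" is really a server-side copy of the object data
        // to the new key followed by a delete of the old key, so its cost
        // grows with the size of the object.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        s3.copyObject(bucket, "staging/part-00000", bucket, "final/part-00000");
        s3.deleteObject(bucket, "staging/part-00000");

        // On HDFS a rename only rewires the name-to-inode mapping in the
        // NameNode; the blocks never move, so the cost is independent of size.
        FileSystem hdfs = FileSystem.get(new Configuration());
        hdfs.rename(new Path("/staging/part-00000"), new Path("/final/part-00000"));
    }
}
```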
09-25-2016 03:21 PM
I feared as much. Thank you for your suggestion--I think it will work for us, since this is a cloud cluster and we can archive to S3, obviating the need to use heterogeneous storage for its intended purpose. However, I would like to suggest a Jira ticket to add a storage class for this purpose. There are significant use cases where it would be useful to know that a subset of your data is confined to specific drives, (a) without the restrictions of the existing policies and (b) without abusing a storage class for this purpose.
09-20-2016 04:14 PM
Do you mean 50 Mbps per mapper or for the cluster as a whole? (I assume you mean the former, as the latter would imply almost two days to read a TB of S3 data: 1 TB is about 8x10^12 bits, and at 50 Mbps that is roughly 160,000 seconds, or about 44 hours.) Assuming you do mean 50 Mbps per mapper, what is the limit on S3 throughput to the whole cluster? That's the key information. Do you have a ballpark number for this?