About coatespt

coatespt · ‎02-10-2017

Tom--yes, that all sounds right to me. Great answers. It's quite remarkable that they can do this when you consider the implications. It's easy to see how it's done on the network, as they can allocated bandwidth with a fixed usage guarantee and a known capacity on the link. What is mystifying is how it is possible for the physical infrastructure of the volumes to support this, as the the amount of work they do is highly variable depending on specific workload. Whatever is behind the volumes is clearly not your grandpa's NAS, as it can flatten out huge random demand peaks from many users.

coatespt · ‎02-09-2017

Tom--thanks for the reply. With regard to the EBS, yes-all the nodes in question are EBS optimized. The critical question is, do ALL those optimized instances share a single 10Gb network to the EBS hosts? As near as I can tell empirically, as the number of writers or readers increases, the total throughput approaches a limit of about 1/3 of a 10Gb network asymptotically. No matter how many readers or writers, that's the limit. The docs are clear that "optimized" puts a node's EBS on a different LAN but do not seem to indicate whether that alternate network is also 10Gb/sec. Or possibly there is some fancier topology with multiple alternate LANs, etc. The same general behavior happens with S3, but with the limit about 105MB/sec. instead of 350MB/sec. It takes about 8 or 10 threads to hit the limit. Individual readers never go over about 12MB/sec. My numbers are consistent with benchmarks that I've found online, including the asymptotic thing. But I don't think that can really be the limit--Netflix claims they query a petabyte of S3 daily. It would have to be a very long day!

coatespt · ‎02-05-2017

I cannot get any more than about 105 MB/sec into or out of S3 on an AWS cluster with 8 big nodes. (S3 is in the same region.) It takes many parallel requests to get even this much--each individual mapper seems to cap out at barely more than single digit MBPS. It seems to be a limit on the VPC, but there must be a way to increase this, as I read about people processing petabytes of s3 data. Can anyone offer any suggestions on how to configure for reasonable S3 performance? Also, can anyone shed light on how the networking works underneath? Is the S3 traffic coming over the same 10Gb LAN that the instances are using? Does EBS traffice go over the LAN too? Total EBS traffic seems to be in practice limited to about 350MB/sec across the cluster. Does EBS use the same LAN as inter-node traffic? If so, it would seem to be impossible that you could ever exceed a total of 1.25 MB/sec of disk I/O for everything. That can't be right, given the size of clusters I hear about. What's going on?

coatespt · ‎11-02-2016

I'm using insert-into to write data up to S3, but it's writing very large files--0.8GB to 1.8 GB plus one of just a few K. I've tried tez.grouping.max-size and min-size, but neither seems to limit either the min or the max size of the files that are generated. I've also tried controlling the number of mappers and reducers, but to no avail.

coatespt · ‎10-05-2016

Say you have a two tables connected with a meaningless key, i.e., every row from one of the tables corresponds 1:1 to a row in the other table. Maybe the pairs of rows started life in a single table, but it was convenient to store them in two for some reason. To use the data in its complete form you create a view, V, that unites the data, i.e., the join of T1 and T2 where T1.key=T2.key. When you use this view in a query, the result is as if you actually did this join to create a temp table called V, then used V in the query. What you see seems to be different. Hive is free to rearrange the operations more efficiently so long as the result is the same. What does it actually do and how is the strategy determined? This has some implications for us because we'd like to split our data up along lines of requirements for security. What are the implications of this strategy--is the data in it's joined form ever left laying around somewhere exposed?

coatespt · ‎09-27-2016

This brings up an issue. When the S3->S3 moves occur, does the data move across the local LAN link or does this occur entirely within the S3 infrastructure. I.e., if you copy, a NAS-backed file on a server it is read in across the LAN and then written out again. S3 isn't a NAS in that sense--but is this what it does, or does S3 move the data around on its own networks when the move is S3-S3? This matters because network is probably our limiting resource with our query types. @Rajesh Balamohan

coatespt · ‎09-27-2016

Thanks. That's what I thought---it's negligible in HDFS but not always trivial in S3 because it's a copy+delete. Interesting idea about using distcp to transfer the data. Not sure if that would actually help with EBS backing HDFS but it's worth a try.

coatespt · ‎09-26-2016

Can anyone explain exactly what's going on here? When running "set hive.tez.exec.print.summary=true;" with large hive queries over S3, the job is only about half over when Hive/Tez prints all the job stats as if the job is complete. But the following is the final line (slightly obfuscated) and the copy takes as long as the query itself. INFO : Moving data to: s3a://xxxxxxxxxxx/incoming/mha/poc/.hive-staging_hive_2016-09-26_17-49-00_060_4187715327928xxxxxx-3/-ext-10000 from s3a://xxxxxxxxxxxx/incoming/mha/poc/.hive-staging_hive_2016-09-26_17-49-00_060_4187715327928xxxxxx-3/-ext-10002 What is the reason for the data being moved? If the same thing happens with HDFS it's not noticeably, probably because it's just moving pointers around, but on S3 it seems to be actually moving the data. (a) is this true and (b) why the movement?

coatespt · ‎09-25-2016

I feared as much. Thank you for your suggestion--I think it work for us, as this is a cloud cluster, and we can archive to S3, obviating the need to use heterogeneous storage for its intended purpose. However, I would like to suggest a Jira ticket to add a storage class for this purpose. There are significant use-cases where it would be useful to know that a subset of your data is confined to specific drives (a) without the restrictions of the existing policies (b) without abusing a storage class for this purpose.

coatespt · ‎09-23-2016

My existing EBS volumes are transparently encrypted. I added an extra volume that is not encrypted. Now I want to be able to control where HDFS writes a file. I think it must be possible because heterogeneous storage policies tell HDFS where to write. How can I do this?

Online	Offline
Last Visited	‎02-12-2019 02:08 PM

Member Since	‎08-29-2016 06:20 PM
Last Visited	‎02-12-2019 02:08 PM
Posts	24
Kudos received	8

Cloudera Community

Re: What are the cluster-wide bandwidth limitation...

Re: What are the cluster-wide bandwidth limitation...

What are the cluster-wide bandwidth limitations on...

How can I control the size of the blocks that hive...

What's really happening in a Hive view that joins ...

Re: set hive.tez.exec.print.summary=true causes od...

Re: set hive.tez.exec.print.summary=true causes od...

set hive.tez.exec.print.summary=true causes odd be...

Re: How to write HDFS data to a specific device

How to write HDFS data to a specific device