I'm using Hive's fs.s3a.proxy.host and fs.s3a.proxy.port settings to send data through an HTTP proxy. Outbound data goes through the proxy, but reads do not: they come directly from S3. Is this a known bug? Is there some other setting I'm missing? I'm using Hue as the client. Is it possible there is some kind of cache in the path?
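For reference, a minimal core-site.xml (or hive-site.xml) fragment for these two properties looks like the sketch below; the host name and port are placeholders, not values from this thread:

```xml
<!-- Route all S3A traffic (reads and writes) through an HTTP proxy.
     proxy.example.com:3128 is a placeholder; substitute your proxy's address. -->
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>3128</value>
</property>
```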
Can anyone shed light on the logic behind Hive's decision about where to send a data operation? Our proxy never sees S3 read requests, while S3 clearly does receive and answer them. Since this all takes place within AWS, is it possible that some hidden network chicanery redirects the requests?
Once the fs.s3a.proxy.host and fs.s3a.proxy.port settings are configured, S3A configures the S3 client in the AWS SDK to route all HTTP requests through the proxy server. This is not set up in a way that differentiates write traffic from read traffic, so I can't think of any explanation for seeing different behavior for different kinds of operations. Is it possible that the proxy configuration hasn't been propagated to all relevant processes (Hive CLI, metastore, YARN containers, etc.), and it's just a coincidence that some of these processes do the writing and some do the reading?
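One quick way to test the propagation theory is to ask each host what value it actually resolves for those keys. As a sketch, assuming the Hadoop client is on the PATH and reads the same configuration directory as the process you're checking:

```shell
# Print the effective proxy settings as resolved from the local
# Hadoop configuration (core-site.xml etc.) on this machine.
hdfs getconf -confKey fs.s3a.proxy.host
hdfs getconf -confKey fs.s3a.proxy.port
```

If either command comes back empty on a given host, that environment never received the proxy settings, which would explain proxy-less traffic from whatever runs there.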
Chris, thanks. I hadn't seen your reply. I had only been trying simple selects, which reliably fail, but once I complicated the query enough that it triggered a Tez job, it behaved correctly every time. Of course, there is no reason to run a distributed operation when all of the results go to a single client, so it seems that Hive disregards the HDFS proxy settings in that case. I think this is the only time the proxy is any of Hive's direct concern, because normally all those interactions are delegated to the mappers and reducers, are they not? BTW, it seems that anything that forces a Tez job causes the proxy to be used, including an insert into HDFS, which is what matters in this case. But shouldn't there be a Hive setting for the proxies too?
To clarify, it makes no sense for HiveServer2 to run a MapReduce job for a simple `select *`, because even if it did, it would still have to fetch the results into one process to merge them. So to save time in this one case, HS2 makes the S3 API calls itself instead of handing the work off to one or more worker processes. There seems to be a bug in Hive, however, in that this path does not honor the HDFS fs.s3a.proxy.host and fs.s3a.proxy.port settings, and instead talks directly to S3. I am told that the Hive setting hive.fetch.task.conversion=none should force a distributed job even when HS2 would otherwise shortcut the process and read directly.
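If that workaround holds, disabling the fetch-task shortcut for a session is a one-liner; the table name below is a placeholder, and the setting can also be made global in hive-site.xml:

```sql
-- Disable the fetch-task optimization so even simple selects run as a
-- distributed job, which should pick up the S3A proxy settings.
SET hive.fetch.task.conversion=none;
SELECT * FROM my_s3_table;  -- my_s3_table is a hypothetical S3-backed table
```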
Hive's code uses the same S3A connector as Tez, so it should be reading the same proxy settings, even from inside its main process. That S3A code always reads them, which makes me think the settings simply aren't being passed in.