Support Questions

hive s3a performance issue

Explorer

Hello,

I have created an external table pointing to an S3 location.

When I run a query through Hive or Beeline, it takes a very long time to return a result. The file format I use is ORC.

To give some perspective: an external table created over an ORC object in HDFS returns results in under 30 seconds, while the same query takes more than 30 minutes when the object is in S3.

I also see many occurrences of the error below in the task logs, and I suspect this is where the time goes: the same error repeats constantly in the log for more than 30 minutes.

2017-03-16 02:41:28,153 [INFO] [TezChild] |http.AmazonHttpClient|: Unable to execute HTTP request: Read timed out
java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:170)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
	at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
	at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
	at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
	at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
	at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:66)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
	at org.apache.http.impl.client.DefaultRequestDirector.createTunnelToTarget(DefaultRequestDirector.java:902)
	at org.apache.http.impl.client.DefaultRequestDirector.establishRoute(DefaultRequestDirector.java:821)
	at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:647)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
	at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:91)
	at org.apache.hadoop.fs.s3a.S3AInputStream.seek(S3AInputStream.java:115)
	at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
	at org.apache.hadoop.hive.ql.io.orc.MetadataReader.readStripeFooter(MetadataReader.java:111)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:245)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:831)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:802)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1013)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1046)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1101)
	at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.next(VectorizedOrcInputFormat.java:120)
	at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.next(VectorizedOrcInputFormat.java:54)
	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:141)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:328)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
1 ACCEPTED SOLUTION

Re: hive s3a performance issue

Explorer

Thanks for the reply. I am on HDP 2.4.2 with Hive 1.2.1.2.4, and the cluster runs on EC2 instances. I found the cause: it was a proxy issue. The cluster was initially configured with the fs.s3a.proxy.host and fs.s3a.proxy.port properties. We have since added an S3 endpoint to the route table, so the cluster no longer needs a proxy to reach S3. I removed those two properties from the HDFS configuration XML file, and that resolved the performance issue.
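For reference, the removed settings would have looked roughly like this in the Hadoop configuration XML (the property names are the standard s3a connector ones; the host and port values below are placeholders, not the actual ones used):

```xml
<!-- Removing these two properties lets the s3a connector talk to S3
     directly (via the VPC route-table endpoint) instead of routing
     every request through the proxy. -->
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxy.example.internal</value> <!-- placeholder hostname -->
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>8080</value> <!-- placeholder port -->
</property>
```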



Re: hive s3a performance issue

Contributor

Can you share the HDP/HDC version and the version of the s3a connector being used? If they are recent, you can enable "fs.s3a.experimental.input.fadvise=random" for ORC datasets, which reduces the number of connections established and torn down against S3.

I am not sure whether you are accessing S3 from EC2 instances or on-prem, but if the machine itself has network connectivity problems reaching S3, the easiest options would be to eliminate that node or fix the network inconsistency.
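As a sketch, that setting would go into the Hadoop configuration XML like this (the property name is the one recent Hadoop s3a releases document; on older connector versions it may not exist, so treat this as an assumption to verify against your version):

```xml
<!-- "random" fadvise keeps the HTTP connection open across backward
     seeks instead of aborting and reopening it, which suits the
     seek-heavy read pattern of ORC stripe/footer access. -->
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
```

It can also be tried per-session in Hive (`set fs.s3a.experimental.input.fadvise=random;`) before applying it cluster-wide.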

Re: hive s3a performance issue

Good to hear it is fixed. In future, have a look at this list of common causes of this exception in Hadoop. In core Hadoop networking we automatically add a link to that page, plus more diagnostics (e.g. the destination hostname:port), to socket exceptions; maybe I should see whether we can wrap the exceptions coming up from the ASF libraries too.
