Support Questions

hive s3a performance issue

Explorer

Hello,

I have created an external table pointing to an S3 location.

When I run a query through Hive or Beeline, it takes a very long time to return a result. The file format I use is ORC.

To give some perspective: an external table created over an ORC object in HDFS returns results in under 30 seconds, while the same query takes more than 30 minutes when the object is in S3.

I also see many occurrences of the error below in the task logs, and I suspect this is where the time goes: the same error repeats constantly in the log for more than 30 minutes.

2017-03-16 02:41:28,153 [INFO] [TezChild] |http.AmazonHttpClient|: Unable to execute HTTP request: Read timed out
java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:170)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
	at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
	at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
	at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
	at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
	at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:66)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
	at org.apache.http.impl.client.DefaultRequestDirector.createTunnelToTarget(DefaultRequestDirector.java:902)
	at org.apache.http.impl.client.DefaultRequestDirector.establishRoute(DefaultRequestDirector.java:821)
	at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:647)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
	at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:91)
	at org.apache.hadoop.fs.s3a.S3AInputStream.seek(S3AInputStream.java:115)
	at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
	at org.apache.hadoop.hive.ql.io.orc.MetadataReader.readStripeFooter(MetadataReader.java:111)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:245)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:831)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:802)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1013)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1046)
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1101)
	at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.next(VectorizedOrcInputFormat.java:120)
	at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.next(VectorizedOrcInputFormat.java:54)
	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
	at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
	at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:141)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:328)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
1 ACCEPTED SOLUTION

Re: hive s3a performance issue

Explorer

Thanks for the reply. I am on HDP 2.4.2 with Hive 1.2.1.2.4, and the cluster runs on EC2 instances. I found the cause: it was a proxy issue. The cluster was initially configured with the fs.s3a.proxy.host and fs.s3a.proxy.port properties. We have since added an S3 endpoint to the route table, so the cluster no longer needs a proxy to reach S3. I removed those two properties from the HDFS configuration XML file, and that resolved the performance issue.
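For reference, the removed settings would have looked roughly like this in the Hadoop configuration XML (the property names are the standard s3a connector ones; the host and port values below are placeholders, not the actual ones used):

```xml
<!-- Removing these two properties lets the s3a connector talk to S3
     directly (via the VPC route-table endpoint) instead of routing
     every request through the proxy. -->
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxy.example.internal</value> <!-- placeholder hostname -->
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>8080</value> <!-- placeholder port -->
</property>
```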



Re: hive s3a performance issue

Contributor

Can you share the HDP/HDC version and the version of the s3a connector being used? If they are recent, you can enable "fs.s3a.experimental.input.fadvise=random" for ORC datasets, which reduces the number of connections established and torn down against S3.

I am not sure whether you are accessing S3 from EC2 instances or on-prem, but if the machine itself has network connectivity problems reaching S3, the easiest options would be to eliminate that node or fix the network inconsistency.
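As a sketch, that setting would go into the Hadoop configuration XML like this (the property name is the one recent Hadoop s3a releases document; on older connector versions it may not exist, so treat this as an assumption to verify against your version):

```xml
<!-- "random" fadvise keeps the HTTP connection open across backward
     seeks instead of aborting and reopening it, which suits the
     seek-heavy read pattern of ORC stripe/footer access. -->
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
```

It can also be tried per-session in Hive (`set fs.s3a.experimental.input.fadvise=random;`) before applying it cluster-wide.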

Re: hive s3a performance issue

Good to hear it is fixed. In future, have a look at this list of common causes of this exception in Hadoop. In core Hadoop networking we automatically add a link to that page, plus more diagnostics (e.g. the destination hostname:port), to socket exceptions; maybe I should see whether we can wrap the exceptions coming up from the ASF libraries too.
