Created 03-16-2017 03:47 AM
Hello,
I have created an external table pointing to an S3 location. When I run a query with Hive or Beeline, it takes a very long time to retrieve the result. The file format I use is ORC.
To give some perspective: an external table created over an ORC object in HDFS returns results in under 30 seconds, while the same query takes more than 30 minutes when the object is in S3.
I also see a lot of the error below in the task logs (and I think that is where the time goes, since the same error repeats constantly in the log for more than 30 minutes):
2017-03-16 02:41:28,153 [INFO] [TezChild] |http.AmazonHttpClient|: Unable to execute HTTP request: Read timed out
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
    at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
    at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:66)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
    at org.apache.http.impl.client.DefaultRequestDirector.createTunnelToTarget(DefaultRequestDirector.java:902)
    at org.apache.http.impl.client.DefaultRequestDirector.establishRoute(DefaultRequestDirector.java:821)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:647)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
    at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:91)
    at org.apache.hadoop.fs.s3a.S3AInputStream.seek(S3AInputStream.java:115)
    at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
    at org.apache.hadoop.hive.ql.io.orc.MetadataReader.readStripeFooter(MetadataReader.java:111)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:245)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:831)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:802)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1013)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1046)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1101)
    at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.next(VectorizedOrcInputFormat.java:120)
    at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.next(VectorizedOrcInputFormat.java:54)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:141)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:328)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Created 03-16-2017 06:15 PM
@bhavik shah Maybe this will be helpful: http://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.11.1/bk_hdcloud-aws/content/s3-hive...
@stevel @Rajesh Balamohan may also have suggestions.
Created 03-20-2017 03:57 AM
Can you share the details of the HDP/HDC version and the s3a connector version being used? If they are recent versions, you can enable "fs.s3a.experimental.input.fadvise=random" for ORC datasets, which reduces the number of connections established and broken against S3.
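For example, on builds where this property is available (it came in with the Hadoop 2.8-era s3a connector; some HDP releases backport it), it can be set in core-site.xml. A minimal sketch:

<!-- Hint to the s3a connector that the workload does random-access reads
     (ORC/Parquet-style seeks), so it avoids draining or re-opening the
     whole object on every backward seek. -->
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>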
I am not sure whether you are using EC2 instances or on-prem machines to access S3. But if the machine itself has network connectivity issues to S3, the easiest option may be to eliminate that node or fix the network inconsistency.
Created 03-20-2017 03:06 PM
Thanks for the reply. I am on HDP 2.4.2 with Hive 1.2.1.2.4, and the cluster is built on EC2 instances. I found the cause: it was a proxy issue. When the cluster was initially set up, the fs.s3a.proxy.host and fs.s3a.proxy.port variables were configured. We have now set up an S3 endpoint in the route table, so the cluster no longer goes through the proxy to connect to S3. I removed those two variables from the HDFS configuration XML, and that resolved the performance issue.
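For reference, these are the two entries that were removed. A sketch only; the host and port values shown here are placeholders, not the actual setup:

<!-- Removed: with an S3 VPC endpoint in the route table, no proxy is needed. -->
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxy.example.com</value> <!-- placeholder -->
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>8080</value> <!-- placeholder -->
</property>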
Created 03-21-2017 01:10 PM
Good to hear it is fixed. In the future, have a look at this list of causes of this exception, which commonly surface in Hadoop. In the core Hadoop networking code we automatically add a link to that page, plus more diagnostics (e.g. the destination hostname:port), to socket exceptions... maybe I should see if somehow we can wrap the exceptions coming up from the ASF libraries too.