Created on 05-06-2018 11:26 AM - edited 09-16-2022 06:11 AM
I have my Hive jobs scheduled through oozie and I randomly get:
ERROR : Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)' org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client. at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:64) at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114) at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:111) at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:99) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:214) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1977) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1690) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1422) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1206) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1201) at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:237) at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:88) at org.apache.hive.service.cli.operation.SQLOperation$3$1.run(SQLOperation.java:293) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920) at org.apache.hive.service.cli.operation.SQLOperation$3.run(SQLOperation.java:306) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection. at com.google.common.base.Throwables.propagate(Throwables.java:156) at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:124) at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80) at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:100) at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:96) at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:68) at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62) ... 22 more Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection. at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37) at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:108) ... 27 more Caused by: java.util.concurrent.TimeoutException: Timed out waiting for client connection. at org.apache.hive.spark.client.rpc.RpcServer$2.run(RpcServer.java:141) at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ... 1 more ERROR : Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)' org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client. at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:64) at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114) at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:111) at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:99) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:214) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1977) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1690) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1422) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1206) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1201) at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:237) at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:88) at org.apache.hive.service.cli.operation.SQLOperation$3$1.run(SQLOperation.java:293) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920) at org.apache.hive.service.cli.operation.SQLOperation$3.run(SQLOperation.java:306) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection. at com.google.common.base.Throwables.propagate(Throwables.java:156) at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:124) at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80) at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:100) at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:96) at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:68) at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62) ... 22 more Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection. at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37) at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:108) ... 27 more Caused by: java.util.concurrent.TimeoutException: Timed out waiting for client connection. at org.apache.hive.spark.client.rpc.RpcServer$2.run(RpcServer.java:141) at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ... 1 more ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask INFO : Completed executing command(queryId=hive_20180506233232_2a9bc92f-d830-435b-91df-eafd2236a9fc); Time taken: 90.051 seconds Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask (state=08S01,code=1) Closing: 0: jdbc:hive2://ip-172-31-4-192.ap-south-1.compute.internal:10000/default Intercepting System.exit(2) Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.Hive2Main], exit code [2]
What can I do to fix this? It happens randomly even at times when workload is not much
Cluster Confi:
Total vCores: 60
Total memory: 205 GB
3 worker nodes in my cluster are m4.4x Large ec2 instances, 2 m4.2x large ec2 instance with Yarn (Node manager)
Container memory is set to 8GB in Yarn config
spark.executor.cores=1(since I will have multiple users running queries and also multiple queries running at same time)
spark java heap size = 5GB
Memory overhead = 2GB
Is there something I can do to improve the situation?
Created 05-11-2018 02:50 AM
Hi @sim6
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
It looks like the rpc times out waiting for resource getting available on spark side. Given that it is random indicates that this error might be happening when the cluster does not have enough resource and nothing permanently wrong with the cluster as such.
For testing you can explore the following timeout values and see if that helps:
hive.spark.client.connect.timeout=30000ms (default 1000ms)
hive.spark.client.server.connect.timeout=300000ms (default 90000ms)
You'd need to set it up in the Hive Safety Value using the steps below, so that it takes effect for all the spark queries:
Go to Cloudera Manager home page
click through "Hive" service
click "Configuration"
search for "Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml"
enter the following in the XML text field:
<property>
<name>hive.spark.client.connect.timeout</name>
<value>30000ms</value>
</property>
<property>
<name>hive.spark.client.server.connect.timeout</name>
<value>300000ms</value>
</property
Restart Hive services to allow changes to take effect then run the query again to test.
Let us know how it goes.
Created 05-17-2018 11:45 PM
Hello,
Thank you for your response.
I have set the timeout to the recommended value as directed in the response. However, I still get
ERROR : Failed to monitor Job[ 8] with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(java.util.concurrent.TimeoutException)' org.apache.hadoop.hive.ql.metadata.HiveException: java.util.concurrent.TimeoutException at org.apache.hadoop.hive.ql.exec.spark.status.impl.RemoteSparkJobStatus.getSparkJobInfo(RemoteSparkJobStatus.java:153) at org.apache.hadoop.hive.ql.exec.spark.status.impl.RemoteSparkJobStatus.getState(RemoteSparkJobStatus.java:82) at org.apache.hadoop.hive.ql.exec.spark.status.RemoteSparkJobMonitor.startMonitor(RemoteSparkJobMonitor.java:80) at org.apache.hadoop.hive.ql.exec.spark.status.impl.RemoteSparkJobRef.monitorJob(RemoteSparkJobRef.java:60) at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:109) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:214) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1977) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1690) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1422) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1206) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1201) at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:237) at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:88) at org.apache.hive.service.cli.operation.SQLOperation$3$1.run(SQLOperation.java:293) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920) at org.apache.hive.service.cli.operation.SQLOperation$3.run(SQLOperation.java:306) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:49) at org.apache.hadoop.hive.ql.exec.spark.status.impl.RemoteSparkJobStatus.getSparkJobInfo(RemoteSparkJobStatus.java:150) ... 23 more ERROR : Failed to monitor Job[ 8] with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(java.util.concurrent.TimeoutException)' org.apache.hadoop.hive.ql.metadata.HiveException: java.util.concurrent.TimeoutException at org.apache.hadoop.hive.ql.exec.spark.status.impl.RemoteSparkJobStatus.getSparkJobInfo(RemoteSparkJobStatus.java:153) at org.apache.hadoop.hive.ql.exec.spark.status.impl.RemoteSparkJobStatus.getState(RemoteSparkJobStatus.java:82) at org.apache.hadoop.hive.ql.exec.spark.status.RemoteSparkJobMonitor.startMonitor(RemoteSparkJobMonitor.java:80) at org.apache.hadoop.hive.ql.exec.spark.status.impl.RemoteSparkJobRef.monitorJob(RemoteSparkJobRef.java:60) at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:109) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:214) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1977) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1690) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1422) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1206) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1201) at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:237) at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:88) at org.apache.hive.service.cli.operation.SQLOperation$3$1.run(SQLOperation.java:293) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920) at org.apache.hive.service.cli.operation.SQLOperation$3.run(SQLOperation.java:306) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:49) at org.apache.hadoop.hive.ql.exec.spark.status.impl.RemoteSparkJobStatus.getSparkJobInfo(RemoteSparkJobStatus.java:150) ... 23 more ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask INFO : Completed executing command(queryId=hive_20180518120808_12c3175f-865c-47d4-9748-29853259b4aa); Time taken: 120.892 seconds Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask (state=08S01,code=1) Closing: 0: jdbc:hive2://ip-172-31-4-192.ap-south-1.compute.internal:10000/default Intercepting System.exit(2) Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.Hive2Main], exit code [2]
Created on 07-20-2018 09:53 AM - edited 07-20-2018 09:58 AM
did you resolve this?
I'm am seeing simliar issues with timeouts as jobs stack on in queue on yarn and dont get a handle/session (unassigned). Clients disconnect and error out. im tickering with timeout settings which seem to address longer processes having to wait.. but if a job is still in queue (cluster is busy) perhaps the client isn't handling correctly?
Traceback (most recent call last):
File "/home/testenv/test/phoenix/cleanse/cleanseDetection.py", line 286, in process_cd_rule
self.execute_hive_query(query, cgid, cd_id)
File "/home/testenv/test/phoenix/cleanse/cleanseDetection.py", line 723, in execute_hive_query
self.hive_session.execute(sql)
File "/home/testenv/test/msense/scripts/hive/hiveClasses/HiveDB.py", line 92, in execute
self.__cursor.execute(statement)
File "/usr/lib/python2.7/site-packages/impala/hiveserver2.py", line 304, in execute
self._wait_to_finish() # make execute synchronous
File "/usr/lib/python2.7/site-packages/impala/hiveserver2.py", line 380, in _wait_to_finish
raise OperationalError(resp.errorMessage)
OperationalError: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
Created on 08-21-2018 04:46 AM - edited 08-21-2018 04:47 AM
explained well here https://issues.apache.org/jira/browse/HIVE-12650
Try setting this at the code level or in hive-site.xml
hive.spark.client.server.connect.timeout
hive.spark.client.connect.timeout
My busy cluster has jobs queued for upto 10 mins at times, so I'm going to try setting the above timeout values to 900000 and monitor. Will share the results to see if it reduces the failures for my Hive on Spark jobs
Created 04-10-2019 08:31 AM
Hi,
Any update on this? Is this issue resolved? if yes, please let us know the solution, we are also facing the same issue.