Created 06-09-2020 08:53 AM
I have a Hive table in ORC format. I used Zeppelin to query the table through the jdbc(hive) interpreter:
SELECT `timestamp`, url FROM events where id='f9e43fc7b' ORDER BY `timestamp` DESC
However, the query fails with the error below:
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1591402457216_0009_2_01, diagnostics=[Vertex vertex_1591402457216_0009_2_01 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: evt initializer failed, vertex=vertex_1591402457216_0009_2_01 [Map 1], java.lang.OutOfMemoryError: Java heap space
at java.util.regex.Matcher.<init>(Matcher.java:225)
at java.util.regex.Pattern.matcher(Pattern.java:1093)
at org.apache.hadoop.hive.ql.io.AcidUtils$BucketMetaData.parse(AcidUtils.java:318)
at org.apache.hadoop.hive.ql.io.AcidUtils$BucketMetaData.parse(AcidUtils.java:332)
at org.apache.hadoop.hive.ql.io.AcidUtils.parseBucketId(AcidUtils.java:367)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.determineSplitStrategy(OrcInputFormat.java:2331)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.determineSplitStrategies(OrcInputFormat.java:2306)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1811)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:522)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:777)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
]Vertex killed, vertexName=Reducer 2, vertexId=vertex_1591402457216_0009_2_02, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1591402457216_0009_2_02 [Reducer 2] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:401)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:266)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
at org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:718)
at org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:801)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:103)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:633)
at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
at org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any idea regarding this error? Any resource configuration I should look into?
Created on 06-10-2020 03:30 AM - edited 06-10-2020 06:09 AM
FYI, my table is partitioned by Year + Month + Day.
Total file size in HDFS is 10 TB.
Total record count is 21 billion.
We have 8 data nodes in HDP.
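For reference, the query above filters only on id, so with this partition layout it presumably has to plan splits over every partition of the table. A minimal sketch of the same query restricted to a single day is below. The partition column names (`year`, `month`, `day`) and the date values are assumptions for illustration; the real names come from the table definition.

```sql
-- Hypothetical partition columns `year`, `month`, `day`; replace with the
-- table's actual partition column names and types.
SELECT `timestamp`, url
FROM events
WHERE id = 'f9e43fc7b'
  AND `year` = 2020
  AND `month` = 6
  AND `day` = 9          -- example date only
ORDER BY `timestamp` DESC;
```

If the partition predicates match the real partition columns, Hive should only need to list and split the files under that one partition instead of the full 10 TB.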
Created 06-10-2020 03:42 AM
@ShuwnYuan That's a pretty big query. If YARN does not have enough memory available for the containers running the query, it will fail with this error. You will need to increase the Tez container size in the Hive configuration.
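As a rough sketch of what that could look like at the session level, the settings below raise the Tez task container size and its JVM heap. The numbers are illustrative assumptions, not recommendations; they have to fit within what YARN can allocate per container on the cluster. The last setting is included because the stack trace shows the failure during split generation (HiveSplitGenerator.initialize), which, as far as I know, runs in the Tez application master rather than in a task container, so the AM memory may be the more relevant knob here.

```sql
-- Illustrative values only; size them to the memory YARN can actually grant.
-- Memory (MB) for each Tez task container:
SET hive.tez.container.size=4096;
-- Task JVM heap, commonly set to roughly 80% of the container size:
SET hive.tez.java.opts=-Xmx3276m;
-- Memory (MB) for the Tez application master. This may only take effect
-- when a new Tez session is started:
SET tez.am.resource.memory.mb=8192;
```

On an Ambari-managed cluster the same properties can be changed in the Hive and Tez configuration pages instead, followed by a restart of the affected services.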
Created 06-10-2020 06:29 AM
@stevenmatison Thanks for the prompt reply.
I'm using HDP-3.0.1.0 with Ambari. Here's my current Hive config:
Tez Container Size: 3072 MB
HiveServer2 Heap Size: 4096 MB
Memory: 819.2 MB
Data per Reducer: 2042.9 MB
They are mostly the default values.
Do they make sense?
Any suggestions on which to increase or decrease for optimum performance?