Created 06-09-2020 08:42 AM
I have a Hive Table in ORC format, which I tried to query via Beeline:
SELECT `timestamp`, url FROM events WHERE id='0ef3c9ba6cb5' ORDER BY `timestamp` DESC;
However, this simple query failed with:
INFO : Compiling command(queryId=hive_20200605073915_22eb45aa-25f6-419a-9b55-57a0d98e3dac): select `timestamp`, url from events where partyid='0:3pu60uagp0:db698229-272e-4a1c-a18a-0ef3c9ba6cb5' order by `timestamp` desc
INFO : Warning: Map Join MAPJOIN[16][bigTable=?] in task 'Map 1' is a cross product
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:timestamp, type:bigint, comment:null), FieldSchema(name:url, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20200605073915_22eb45aa-25f6-419a-9b55-57a0d98e3dac); Time taken: 0.724 seconds
INFO : Executing command(queryId=hive_20200605073915_22eb45aa-25f6-419a-9b55-57a0d98e3dac): select `timestamp`, url from events where partyid='0:3pu60uagp0:db698229-272e-4a1c-a18a-0ef3c9ba6cb5' order by `timestamp` desc
INFO : Query ID = hive_20200605073915_22eb45aa-25f6-419a-9b55-57a0d98e3dac
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Subscribed to counters: [] for queryId: hive_20200605073915_22eb45aa-25f6-419a-9b55-57a0d98e3dac
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: select `timestamp`, url fr...desc (Stage-1)
INFO : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO : Status: Running (Executing on YARN cluster with App id application_1586459578755_0105)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------
VERTICES: 00/00 [>>--------------------------] 0% ELAPSED TIME: 9428.98 s
----------------------------------------------------------------------------------------------
ERROR : Status: Failed------------------------] 0% ELAPSED TIME: 9426.66 s
ERROR : Application application_1586459578755_0105 failed 2 times due to ApplicationMaster for attempt appattempt_1586459578755_0105_000002 timed out. Failing the application.
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Application application_1586459578755_0105 failed 2 times due to ApplicationMaster for attempt appattempt_1586459578755_0105_000002 timed out. Failing the application.
INFO : Completed executing command(queryId=hive_20200605073915_22eb45aa-25f6-419a-9b55-57a0d98e3dac); Time taken: 9433.73 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Application application_1586459578755_0105 failed 2 times due to ApplicationMaster for attempt appattempt_1586459578755_0105_000002 timed out. Failing the application. (state=08S01,code=2)
Any clue what is this error about? Other queries take long too if they ever succeed. Any resource configuration I should look into?
Created on 06-10-2020 03:30 AM - edited 06-10-2020 06:31 AM
FYI my table is partition by Year + Month + Day.
Total file size in HDFS is 10TB.
Total records is 21 Billion records.
We have 8 data nodes in HDP.
I'm using HDP-3.0.1.0 with Ambari. Here's my current Hive config:
Tez Container Size: 3072 MB
HiveServer2 Heap Size: 4096 MB
Memory: 819.2 MB
Data per Reducer: 2042.9 MB
They are mostly the default values.
Do they make sense?
Any suggestion on which to increase / decrease for optimum performance?