Member since
10-02-2018
3
Posts
0
Kudos Received
0
Solutions
12-06-2018
09:47 PM
on Hive (version 3.1.0.3.0.1.0-187) I get an exception when i use a udaf along with a count distinct. I have tried it with multiple udaf so i think it is not limited to my udaf implementation. NOT WORKING SELECT
hll_estimate(c1) c_2,
count(distinct c1) c_3
FROM
(
select 1 as c1
union all
select 1 as c1
union all
select 2 as c1
)t1 EXCEPTION Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"_col0":1},"value":{"_col0":"abc.0.68:s24_AAAYAQEAAAACd9wX"}}
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:372)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:294)
... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public boolean com.zzz.qqq.yyy.xxx$Evaluator.merge(java.lang.String) with arguments {abc.0.68:s24_AAAYAQEAAAACd9wX}:argument type mismatch
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1111)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge$GenericUDAFBridgeEvaluator.merge(GenericUDAFBridge.java:176)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:216)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:641)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(GroupByOperator.java:880)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:724)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:790)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:363)
... 19 more
Caused by: java.lang.IllegalArgumentException: argument type mismatch
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1091)
... 26 more
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1543959275889_0020_4_03 [Reducer 3] killed/failed due to:OWN_TASK_FAILURE]Vertex killed, vertexName=Reducer 4, vertexId=vertex_1543959275889_0020_4_04, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_1543959275889_0020_4_04 [Reducer 4] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1 (state=08S01,code=2)
PLAN INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing command(queryId=hive_20181206174558_f56287a1-5438-4598-9f44-6e3e22a6d176); Time taken: 0.004 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| Plan optimized by CBO. |
| |
| Vertex dependency in root stage |
| Map 1 <- Union 2 (CONTAINS) |
| Map 5 <- Union 2 (CONTAINS) |
| Map 6 <- Union 2 (CONTAINS) |
| Reducer 3 <- Union 2 (SIMPLE_EDGE) |
| Reducer 4 <- Reducer 3 (CUSTOM_SIMPLE_EDGE) |
| |
| Stage-0 |
| Fetch Operator |
| limit:-1 |
| Stage-1 |
| Reducer 4 |
| File Output Operator [FS_13] |
| Group By Operator [GBY_19] (rows=1 width=16) |
| Output:["_col0","_col1"],aggregations:["hll_estimate(VALUE._col0)","count(VALUE._col1)"] |
| <-Reducer 3 [CUSTOM_SIMPLE_EDGE] |
| PARTITION_ONLY_SHUFFLE [RS_18] |
| Group By Operator [GBY_17] (rows=1 width=192) |
| Output:["_col0","_col1"],aggregations:["hll_estimate(_col1)","count(_col0)"] |
| Group By Operator [GBY_16] (rows=1 width=188) |
| Output:["_col0","_col1"],aggregations:["hll_estimate(VALUE._col0)"],keys:KEY._col0 |
| <-Union 2 [SIMPLE_EDGE] |
| <-Map 1 [CONTAINS] |
| Reduce Output Operator [RS_24] |
| PartitionCols:_col0 |
| Group By Operator [GBY_23] (rows=1 width=188) |
| Output:["_col0","_col1"],aggregations:["hll_estimate(_col0)"],keys:_col0 |
| Select Operator [SEL_21] (rows=1 width=4) |
| Output:["_col0"] |
| TableScan [TS_20] (rows=1 width=10) |
| <-Map 5 [CONTAINS] |
| Reduce Output Operator [RS_29] |
| PartitionCols:_col0 |
| Group By Operator [GBY_28] (rows=1 width=188) |
| Output:["_col0","_col1"],aggregations:["hll_estimate(_col0)"],keys:_col0 |
| Select Operator [SEL_26] (rows=1 width=4) |
| Output:["_col0"] |
| TableScan [TS_25] (rows=1 width=10) |
| <-Map 6 [CONTAINS] |
| Reduce Output Operator [RS_34] |
| PartitionCols:_col0 |
| Group By Operator [GBY_33] (rows=1 width=188) |
| Output:["_col0","_col1"],aggregations:["hll_estimate(_col0)"],keys:_col0 |
| Select Operator [SEL_31] (rows=1 width=4) |
| Output:["_col0"] |
| TableScan [TS_30] (rows=1 width=10) |
| |
+----------------------------------------------------+
A query without distinct works fine 0: jdbc:hive2://localhost:10000> SELECT
. . . . . . . . . . . . . . . .> hll_estimate(c1) c_2,
. . . . . . . . . . . . . . . .> count ( c1) c_3
. . . . . . . . . . . . . . . .> FROM
. . . . . . . . . . . . . . . .> (
. . . . . . . . . . . . . . . .> select 1 as c1
. . . . . . . . . . . . . . . .> union all
. . . . . . . . . . . . . . . .> select 1 as c1
. . . . . . . . . . . . . . . .> union all
. . . . . . . . . . . . . . . .> select 2 as c1
. . . . . . . . . . . . . . . .> )t1;
INFO : Compiling command(queryId=hive_20181206182613_b986f93e-fc9e-4fdb-a814-dd5e59ce5365): SELECT
hll_estimate(c1) c_2,
count ( c1) c_3
FROM
(
select 1 as c1
union all
select 1 as c1
union all
select 2 as c1
)t1
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c_2, type:bigint, comment:null), FieldSchema(name:c_3, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20181206182613_b986f93e-fc9e-4fdb-a814-dd5e59ce5365); Time taken: 0.243 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20181206182613_b986f93e-fc9e-4fdb-a814-dd5e59ce5365): SELECT
hll_estimate(c1) c_2,
count ( c1) c_3
FROM
(
select 1 as c1
union all
select 1 as c1
union all
select 2 as c1
)t1
INFO : Query ID = hive_20181206182613_b986f93e-fc9e-4fdb-a814-dd5e59ce5365
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Subscribed to counters: [] for queryId: hive_20181206182613_b986f93e-fc9e-4fdb-a814-dd5e59ce5365
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: SELECT
hll_estimate(c1) c_2,
count ...c1
)t1 (Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1543959275889_0021)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 3 ...... container SUCCEEDED 1 1 0 0 0 0
Map 4 .......... container SUCCEEDED 1 1 0 0 0 0
Map 5 .......... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 04/04 [==========================>>] 100% ELAPSED TIME: 8.73 s
----------------------------------------------------------------------------------------------
INFO : Completed executing command(queryId=hive_20181206182613_b986f93e-fc9e-4fdb-a814-dd5e59ce5365); Time taken: 13.706 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
+------+------+
| c_2 | c_3 |
+------+------+
| 2 | 3 |
+------+------+
... View more
Labels:
10-04-2018
04:49 PM
Hey @Naresh P R , thanks for the response. HIVE-6348 should solve the issue by removing the order by from the subquery, however the ticket seems to be treating it as an optimization. Would you know why the data corruption is happening?
... View more
10-03-2018
01:13 AM
EDIT : A more minimal failure case at the bottom
I have a table with definition
CREATE TABLE `testtable`(
`color_gbakc2` string,
`noq_empty_gbakc3` bigint,
`color_gbakc1` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://<ip>:8020/apps/hive/warehouse/testtable'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='1',
The data in it looks like
select * from testTable
testtable.color_gbakc2,testtable.noq_empty_gbakc3,testtable.color_gbakc1
"",45456,
Black,15681,Black
Blue,6203,Blue
Multi,6196,Multi
Red,7716,Red
Silver,5408,Silver
White,908,White
Yellow,7599,Yellow
I have a working query
SELECT t_6.noq_empty noq_empty,
t_6.color color
FROM
(SELECT t_5.color_gbakc2 color,
t_5.noq_empty_gbakc3 noq_empty
FROM
(SELECT testtable.color_gbakc2 color_gbakc2,
testtable.noq_empty_gbakc3 noq_empty_gbakc3,
testtable.color_gbakc1 color_gbakc1
FROM testtable testtable
WHERE testtable.color_gbakc2 IN ('Red',
'Blue',
'Green') ) t_5
WHERE t_5.color_gbakc2 IN ('Red',
'Blue')
ORDER BY noq_empty ASC ) t_6
It returns the expected results
noq_empty
color
6203
Blue
7716
Red
But when I add an order by clause I get unexpected result
SELECT t_6.noq_empty noq_empty,
t_6.color color
FROM
(SELECT t_5.color_gbakc2 color,
t_5.noq_empty_gbakc3 noq_empty
FROM
(SELECT testtable.color_gbakc2 color_gbakc2,
testtable.noq_empty_gbakc3 noq_empty_gbakc3,
testtable.color_gbakc1 color_gbakc1
FROM testtable testtable
WHERE testtable.color_gbakc2 IN ('Red',
'Blue',
'Green') ) t_5
WHERE t_5.color_gbakc2 IN ('Red',
'Blue')
ORDER BY noq_empty ASC ) t_6
ORDER BY color
I get un expected results. Please see the attachment. Running HDP 2.6.5. Is this a known issue? ******EDIT 1******** NOT WORKING SELECT
t6.amt amt, t6.color color
FROM
(SELECT t5.color color,
t5.c1 amt
FROM
(SELECT
t1.c1 c1,
t1.c2 AS color from
(SELECT 7716 AS c1, "Red" AS c2
UNION SELECT 6203 AS c1, "Blue" AS c2) t1
WHERE t1.c2 IN ('Red',
'Blue',
'Green')) t5
WHERE t5.color IN ('Red',
'Blue')
ORDER BY amt ASC) t6
ORDER BY color
Changing the order of columns in outer most select WORKS SELECT
t6.color color,t6.amt amt
FROM
(SELECT t5.color color,
t5.c1 amt
FROM
(SELECT
t1.c1 c1,
t1.c2 AS color from
(SELECT 7716 AS c1, "Red" AS c2
UNION SELECT 6203 AS c1, "Blue" AS c2) t1
WHERE t1.c2 IN ('Red',
'Blue',
'Green')) t5
WHERE t5.color IN ('Red',
'Blue')
ORDER BY amt ASC) t6
ORDER BY color
... View more
Labels: