- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Order by over subquery return wrong results
- Labels:
-
Apache Hive
-
Apache Tez
Created on ‎10-03-2018 01:13 AM - edited ‎08-17-2019 09:53 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
EDIT : A more minimal failure case at the bottom
I have a table with definition
CREATE TABLE `testtable`(
`color_gbakc2` string,
`noq_empty_gbakc3` bigint,
`color_gbakc1` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://<ip>:8020/apps/hive/warehouse/testtable' TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}', 'numFiles'='1', |
The data in it looks like
select * from testTable
testtable.color_gbakc2,testtable.noq_empty_gbakc3,testtable.color_gbakc1 "",45456, Black,15681,Black Blue,6203,Blue Multi,6196,Multi Red,7716,Red Silver,5408,Silver White,908,White Yellow,7599,Yellow |
I have a working query
SELECT t_6.noq_empty noq_empty, t_6.color color FROM (SELECT t_5.color_gbakc2 color, t_5.noq_empty_gbakc3 noq_empty FROM (SELECT testtable.color_gbakc2 color_gbakc2, testtable.noq_empty_gbakc3 noq_empty_gbakc3, testtable.color_gbakc1 color_gbakc1 FROM testtable testtable WHERE testtable.color_gbakc2 IN ('Red', 'Blue', 'Green') ) t_5 WHERE t_5.color_gbakc2 IN ('Red', 'Blue') ORDER BY noq_empty ASC ) t_6
It returns the expected results
noq_empty | color |
---|---|
6203 | Blue |
7716 | Red |
But when I add an order by clause I get unexpected result
SELECT t_6.noq_empty noq_empty, t_6.color color FROM (SELECT t_5.color_gbakc2 color, t_5.noq_empty_gbakc3 noq_empty FROM (SELECT testtable.color_gbakc2 color_gbakc2, testtable.noq_empty_gbakc3 noq_empty_gbakc3, testtable.color_gbakc1 color_gbakc1 FROM testtable testtable WHERE testtable.color_gbakc2 IN ('Red', 'Blue', 'Green') ) t_5 WHERE t_5.color_gbakc2 IN ('Red', 'Blue') ORDER BY noq_empty ASC ) t_6 ORDER BY color
I get un expected results. Please see the attachment.
Running HDP 2.6.5.
Is this a known issue?
******EDIT 1********
NOT WORKING
SELECT t6.amt amt, t6.color color FROM (SELECT t5.color color, t5.c1 amt FROM (SELECT t1.c1 c1, t1.c2 AS color from (SELECT 7716 AS c1, "Red" AS c2 UNION SELECT 6203 AS c1, "Blue" AS c2) t1 WHERE t1.c2 IN ('Red', 'Blue', 'Green')) t5 WHERE t5.color IN ('Red', 'Blue') ORDER BY amt ASC) t6 ORDER BY color
Changing the order of columns in outer most select
WORKS
SELECT t6.color color,t6.amt amt FROM (SELECT t5.color color, t5.c1 amt FROM (SELECT t1.c1 c1, t1.c2 AS color from (SELECT 7716 AS c1, "Red" AS c2 UNION SELECT 6203 AS c1, "Blue" AS c2) t1 WHERE t1.c2 IN ('Red', 'Blue', 'Green')) t5 WHERE t5.color IN ('Red', 'Blue') ORDER BY amt ASC) t6 ORDER BY color
Created ‎10-04-2018 09:06 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
HIVE-6348 should resolve this issue. I don't see any workaround for this issue. This fix is available in HDP-3
Created ‎10-04-2018 09:06 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
HIVE-6348 should resolve this issue. I don't see any workaround for this issue. This fix is available in HDP-3
Created ‎10-04-2018 04:49 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hey @Naresh P R , thanks for the response.
HIVE-6348 should solve the issue by removing the order by from the subquery, however the ticket seems to be treating it as an optimization.
Would you know why the data corruption is happening?
Created ‎10-04-2018 05:24 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
When applying order by twice, hive assumes column selection order is same on both inner & outer query which does order by on wrong column assuming wrong column datatype. You can rewrite your query having outer & inner query projection as same.
SELECT t_6.color color, t_6.noq_empty noq_empty FROM (SELECT t_5.color_gbakc2 color, t_5.noq_empty_gbakc3 noq_empty FROM (SELECT testtable.color_gbakc2 color_gbakc2, testtable.noq_empty_gbakc3 noq_empty_gbakc3, testtable.color_gbakc1 color_gbakc1 FROM testtable testtable WHERE testtable.color_gbakc2 IN ('Red', 'Blue', 'Green') ) t_5 WHERE t_5.color_gbakc2 IN ('Red', 'Blue') ORDER BY noq_empty ASC ) t_6 ORDER BY color
Fortunately HIVE-6348 is a plan optimizer & its eliminating this issue also.
If my answer helped you, accept the answer. It will help others in the community.
