Recently, we faced a scenario where a Hive query was returning incorrect records.
The join condition produced the correct record count, but a certain column always returned the values of the first column in the select clause.
select a.a1, a.a2,
case when b.b1 = 'Hello' then 'H' else 'Z' end as b2
from alpha a LEFT OUTER JOIN beta b on b.id = a.id;
This was supposed to give results like:
1 apple H fruit
2 banana Z fruit
3 cat H animal
Instead it gave results like:
1 apple Z fruit
2 banana Z fruit
3 cat Z animal
This is because b.b1 was somehow fetching the value of a.a1, and the CASE expression therefore evaluated to 'Z'. In fact, when I later removed a.a1 from the select clause to debug, it picked up the a.a2 values instead, since a.a2 had now become the first element in the select clause.
The same query worked when we were not using the ORC format.
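For context, a sketch of how the ORC-backed table might have been defined. The DDL below is hypothetical (the post does not show the actual table definitions); the relevant detail is the STORED AS ORC clause, since vectorized execution in Hive primarily kicks in for ORC data:

```sql
-- Hypothetical definition of the "beta" table; column names taken
-- from the query in the question, types assumed for illustration.
CREATE TABLE beta (
  id INT,
  b1 STRING
)
STORED AS ORC;
```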
After deep investigation, HWX Support recommended we set the following property to false:
Lo and behold, this solved the issue.
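The post does not name the property, but in Hive, vectorized query execution is controlled by hive.vectorized.execution.enabled, so the recommended fix presumably looked something like the following (an assumption on my part, not confirmed by the original post):

```sql
-- Disable vectorized execution for the session (or set it in hive-site.xml
-- to apply cluster-wide). This trades performance for correctness here.
set hive.vectorized.execution.enabled=false;

-- Reduce-side vectorization can also be toggled separately if needed:
set hive.vectorized.execution.reduce.enabled=false;
```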
Thus, I want to understand how enabling or disabling a property related to vectorization could alter the result.
I understand that enabling and disabling vectorization can have a significant impact on performance, but I cannot understand why the result set would be different.
The reasons can vary. To narrow it down, check the datatype of the field. There are cases where a timestamp is stored as a long value in an intermediate layer while the Hive query is being processed. In your case I don't think the datatype is long, but it's still worth checking. Also compare the explain plan of the query before and after enabling vectorization: if vectorization is enabled, you should see the notation "vectorized" in the explain plan. I faced the same issue with a timestamp column, but when I disabled vectorization it worked fine for me.
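To look for that "vectorized" notation, run EXPLAIN on the query from the question; with vectorization enabled on ORC data, the plan's operators are annotated with "Execution mode: vectorized". The sketch below reuses the question's tables; note that on Hive 2.3 and later there is also an EXPLAIN VECTORIZATION variant that reports in more detail why an operator was or was not vectorized:

```sql
-- Inspect the plan; look for "Execution mode: vectorized" in the output.
EXPLAIN
select a.a1, a.a2,
       case when b.b1 = 'Hello' then 'H' else 'Z' end as b2
from alpha a LEFT OUTER JOIN beta b on b.id = a.id;

-- Hive 2.3+ only: detailed per-operator vectorization report.
EXPLAIN VECTORIZATION DETAIL
select a.a1, a.a2,
       case when b.b1 = 'Hello' then 'H' else 'Z' end as b2
from alpha a LEFT OUTER JOIN beta b on b.id = a.id;
```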