Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hive : How does Vectorization work ?

Hive : How does Vectorization work ?

I am using Hive 1.2 on HDP 2.3.

Recently, we faced a scenario where we were getting incorrect records from a hive query.

Basically, the join condition would give the correct count of records however a certain column would always gave the values from first column in the select clause.

Ex:

select a.a1, a.a2, 

case when b.b1 is 'Hello' then'H' else 'Z' as b2, 

b.b3 

from alpha a LEFT OUTER JOIN beta b on b.id=a.id;

This was supposed to give results like :

1   apple     H    fruit

2   banana    Z    fruit

3   cat       H    animal

Instead it gave results like:

1   apple     Z    fruit

2   banana    Z    fruit

3   cat       Z    animal

This is because b.b1 was somehow fetching the value for a.a1 and there by based on the CASE statement it set the value as 'Z'. Infact, I later removed the a.a1 from select clause to debug, it now picked up a.a2 values as this now became the first element in the select clause.

The same query worked when we were not using ORC format.

After deep investigation, HWX Support recommended we set the following property to false:

hive.vectorized.execution.mapjoin.native.enabled

Lo and behlod, this solved this issue.

Thus, I want to understand how enabling or disabling a property related to vectorization could alter the result.

I understand enabling & disabling vectorization can have significant impact on performance but cannot understand why the result set would be different

3 REPLIES 3

Re: Hive : How does Vectorization work ?

@Dinesh Chitlangia

The reasons are many. To narrow it down check the datatype of the field. There are cases when timestamp will be stored as long value in intermediate layer while processing the hive query. in your case i don't think the datatype is long but still give a try. And check the explain plan of the query before and after enabling the vectorization. If it is enabled then you should be able to see the notation vectorized in the explain plan. I faced the same issue with timestamp but when I disabled vectorization then it worked fine for me.

Re: Hive : How does Vectorization work ?

@Bala Vignesh N V - Thank you for your inputs. I am aware of the timestamp issue however in this case the no columns involve timestamp/date

Re: Hive : How does Vectorization work ?

Don't have an account?
Coming from Hortonworks? Activate your account here