Hive Vectorization error/doubt

Rising Star

Hello experts,

One of our clients hit the error below while running a left outer join (the goal was to flatten the data):

error "Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:193, Vertex vertex_1461710542055_2118_9_00 [Map 3] killed/failed due to:OWN_TASK_FAILURE]Vertex killed, vertexName=Reducer 2,"

We suggested a workaround: set the parameter below to false, because vectorization has a problem in certain cases where a partitioned table has had a DDL change.

"set hive.vectorized.execution.enabled=false"

That resolved the issue.

My question: what is the benefit of this parameter? My understanding is that vectorization reduces CPU usage, so does setting it to 'false' mean we lose performance? How can we add new attributes to a large partitioned table without degrading performance or having queries fail as they did earlier? Suggestions for best practices are welcome.

Thanks Mayank

4 REPLIES
Re: Hive Vectorization error/doubt

Super Guru

@mkataria From the Apache Hive site:

By default it is set to false.

Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time. This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop. These loops compile in a streamlined way that uses relatively few instructions and finishes each instruction in fewer clock cycles, on average, by effectively using the processor pipeline and cache memory. A detailed design document is attached to the vectorized query execution JIRA, at https://issues.apache.org/jira/browse/HIVE-4160.
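To make the description above concrete, here is a minimal sketch in plain Java contrasting row-at-a-time processing with block-at-a-time processing over column arrays. It is not Hive's actual implementation: the class name `VectorizedSketch`, both method names, and the price/quantity columns are hypothetical and exist only for illustration. The 1024-row block size is the one figure taken from the quoted text.

```java
// Illustrative only: a toy model of the idea, not Hive's vectorization code.
class VectorizedSketch {
    // Block size mentioned in the quoted Hive documentation.
    static final int BATCH_SIZE = 1024;

    // Row-at-a-time: every row is a generic Object[]; each access needs a
    // cast and unboxing inside the inner loop.
    static long sumRowAtATime(Object[][] rows, int priceCol, int qtyCol, long minQty) {
        long sum = 0;
        for (Object[] row : rows) {
            long qty = (Long) row[qtyCol];
            if (qty >= minQty) {
                sum += (Long) row[priceCol];
            }
        }
        return sum;
    }

    // Block-at-a-time: each column is a primitive array. Within a block, the
    // filter pass records the qualifying positions, then the aggregate pass
    // runs a tight loop over those positions only.
    static long sumVectorized(long[] price, long[] qty, long minQty) {
        long sum = 0;
        int[] selected = new int[BATCH_SIZE];
        for (int start = 0; start < price.length; start += BATCH_SIZE) {
            int n = Math.min(BATCH_SIZE, price.length - start);
            int selectedCount = 0;
            for (int i = 0; i < n; i++) {          // filter pass
                if (qty[start + i] >= minQty) {
                    selected[selectedCount++] = start + i;
                }
            }
            for (int j = 0; j < selectedCount; j++) { // aggregate pass
                sum += price[selected[j]];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 3000; // spans more than one block
        long[] price = new long[n];
        long[] qty = new long[n];
        Object[][] rows = new Object[n][];
        for (int i = 0; i < n; i++) {
            price[i] = i;
            qty[i] = i % 10;
            rows[i] = new Object[] {price[i], qty[i]};
        }
        long a = sumRowAtATime(rows, 0, 1, 5);
        long b = sumVectorized(price, qty, 5);
        System.out.println("results match: " + (a == b));
    }
}
```

Both methods compute the same answer; the difference is that the second one works on primitive arrays in fixed-size blocks, which is the property the documentation credits for the lower CPU cost.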

Using Vectorized Query Execution

Enabling vectorized execution

To use vectorized query execution, you must store your data in ORC format, and set the following variable as shown in Hive SQL (see Configuring Hive):

set hive.vectorized.execution.enabled = true;

Vectorized execution is off by default, so your queries only utilize it if this variable is turned on. To disable vectorized execution and go back to standard execution, do the following:

set hive.vectorized.execution.enabled = false;

Additional configuration variables for vectorized execution are documented in Configuration Properties – Vectorization.

Re: Hive Vectorization error/doubt

Super Guru

@mkataria Could you also post the log file? There may be additional errors that help identify the issue.

Re: Hive Vectorization error/doubt

Rising Star

@Sunile Manjee I will reproduce the error and get back to you with the logs.

Re: Hive Vectorization error/doubt

New Contributor

@mkataria

Hi,

How does vectorization help with group by, order by, and join?

My basic understanding is that it reduces CPU usage by processing 1024 rows at a time, with each column stored as a vector.

1) Is there anything else that vectorization can do for performance tuning and query optimization?

2) How is it faster, then, in terms of distribute by, group by, order by, and join?

3) How does vectorization collect data from mappers and reducers?

4) How does it work internally, and why does it only process the ORC file format?

Thanks in advance!

Regards,

Dada Karade
