About alex.behm

alex.behm · ‎06-29-2015

There is no enforcement of key constraints or auto increment in Impala. Please understand that Impala and SQL Server are quite different technologies, each with its own unique set of advantages and disadvantages. What kind of operations are you doing based on auto increment?

alex.behm · ‎06-29-2015

This feature is currently not available. May I ask what your use case is?

alex.behm · ‎06-29-2015

Can you try enabling "use native query"? The driver will send the query to Impala verbatim (sometimes the driver may make some changes to the SQL). http://www.cloudera.com/content/cloudera/en/documentation/connectors/latest/PDF/Cloudera-JDBC-Driver-for-Impala-Install-Guide.pdf

alex.behm · ‎06-25-2015

Makes sense. I appreciate your thorough question, and I completely agree that we should point out this expression-substitution behavior in the performance guide. It's not the first time it has come up, and I'd imagine it will not be the last 🙂

Btw, if you really really want to get the materialization behavior with an inline view without an ORDER BY, then you can apply the following terrible hack.

Original query:

select a, b, c from (select f(x) as a, f(y) as b, f(z) as c from mytable) v

Modified query to force materialization of inline view:

select a, b, c from (select f(x) as a, f(y) as b, f(z) as c from mytable union all select NULL, NULL, NULL from mytable where false) v

The "union all" will force materialization, but the second union operand will be dropped due to the "false" predicate. Obviously, that behavior is implementation-defined and subject to change at any time, so it would be wise not to rely on it.

alex.behm · ‎06-24-2015

Sorry for the wait. Here's what I think is happening. Impala deals with inline views by substituting the select-list expressions from the inline view into the parent query block. What that means in your case is that many of the expensive expressions inside your inline view are executed multiple times in the slow non-ORDER-BY version of your query. For example, every reference to "setup_time" in the outer select list is replaced by the corresponding expression from the inline view, i.e.

setup_time --> case when regexp_extract(...) then hours_add(...) else setup_time_ts end setup_time

As a result, not only are those expensive expressions executed only at the coordinator, but they are also evaluated multiple times because they are referenced multiple times in the outer select list.

In the ORDER BY version of the query, this redundant expression evaluation is avoided because the ORDER BY materializes its input. So while the same inline view expression substitution takes place, the outer references are substituted with materialized column references (i.e., the expensive expression is only evaluated once), i.e.

setup_time --> materialized column produced by the ORDER BY

Hope this makes sense!

Alex
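To make the difference concrete, here is a minimal Python sketch of the two evaluation strategies described above. It is only a toy model, not Impala code: `expensive`, `run_substituted`, and `run_materialized` are hypothetical names, and `expensive` stands in for a costly select-list expression. The sketch counts how often the expression is evaluated when every outer reference re-runs it, versus when it is computed once per row and the stored value is reused.

```python
def expensive(x):
    # Hypothetical stand-in for a costly inline-view expression.
    return x * 2


def make_counted(fn):
    """Wrap fn and keep a running count of how many times it is called."""
    calls = {"n": 0}

    def wrapped(x):
        calls["n"] += 1
        return fn(x)

    return wrapped, calls


def run_substituted(rows, refs):
    """Every outer reference re-evaluates the expression (substitution)."""
    f, calls = make_counted(expensive)
    out = [tuple(f(r) for _ in range(refs)) for r in rows]
    return out, calls["n"]


def run_materialized(rows, refs):
    """The expression is evaluated once per row and the result is reused."""
    f, calls = make_counted(expensive)
    out = []
    for r in rows:
        value = f(r)  # evaluated once, then read by every outer reference
        out.append(tuple(value for _ in range(refs)))
    return out, calls["n"]


if __name__ == "__main__":
    rows = [1, 2, 3]
    print(run_substituted(rows, refs=4)[1])
    print(run_materialized(rows, refs=4)[1])
```

Both functions return the same rows; only the number of evaluations differs, growing with the number of outer references in the substituted case.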

alex.behm · ‎06-23-2015

Actually, since there's buffering on both the sender and receiver sides, I don't see how there could be a 10x difference between the queries. I believe there is a much simpler explanation. Stay tuned for another response.

alex.behm · ‎06-23-2015

Thank you for posting such a detailed description. Your observation regarding expression evaluation is correct: Impala evaluates the expressions lazily. To summarize:

- In the slow version without ORDER BY, the SCAN sends the raw data to the coordinator, which then evaluates all expressions, including those from your inline view.
- In the fast version with ORDER BY, the expressions from the inline view are evaluated and materialized at the SORT NODE, i.e., in parallel on all nodes.

Now, you had already observed this, and you asked how this can explain the 10x difference when you'd only expect a 3x difference based on the 3x increased parallelism. The answer is that in the slow version the entire query execution is CPU bound by the single coordinator node. Impala's execution engine is streaming, so the coordinator will apply backpressure to the stream sender, and in turn to the SCANs, if it cannot process the rows quickly enough (which in this case it obviously cannot). This means that while the coordinator is still processing a batch of rows, the SCANs will not be able to make progress (it's not quite as simple as this, but it explains the mechanics).
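The backpressure mechanism can be illustrated with a small tick-based simulation. This is a deliberately simplified sketch, not a model of Impala's actual exchange layer: the function name `simulate`, the single bounded buffer, and the tick costs are all assumptions made for illustration. A producer (the "scan") can emit one batch per tick, but only if the buffer has room; a consumer (the "coordinator") needs a fixed number of ticks per batch.

```python
from collections import deque


def simulate(num_batches, buffer_capacity, consumer_ticks_per_batch):
    """Return (total_ticks, producer_stalled_ticks) for a bounded buffer."""
    buffer = deque()
    produced = consumed = 0
    stalled_ticks = 0
    ticks = 0
    busy = 0  # ticks of work left on the batch the consumer holds

    while consumed < num_batches:
        ticks += 1

        # Consumer: pick up a new batch if idle, then do one tick of work.
        if busy == 0 and buffer:
            buffer.popleft()
            busy = consumer_ticks_per_batch
        if busy > 0:
            busy -= 1
            if busy == 0:
                consumed += 1

        # Producer: emit a batch if there is room, otherwise stall.
        if produced < num_batches:
            if len(buffer) < buffer_capacity:
                buffer.append(produced)
                produced += 1
            else:
                stalled_ticks += 1

    return ticks, stalled_ticks


if __name__ == "__main__":
    print(simulate(10, buffer_capacity=2, consumer_ticks_per_batch=1))
    print(simulate(10, buffer_capacity=2, consumer_ticks_per_batch=3))
```

When the consumer keeps up, the producer never stalls. When the consumer is slower than the producer, the buffer fills and the producer spends most ticks waiting, so the total runtime is set by the consumer alone.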

alex.behm · ‎05-22-2015

That change only affects values being parsed inside a scan node. In your example you are casting a literal - the query option will have no effect on that.

alex.behm · ‎05-22-2015

In that case, there's still a chance you can get your desired behavior. When scanning text data, Impala does have an option to abort on any parsing error encountered, e.g., if you declared a column as INT, but a particular text value could not be parsed as an INT in the scan. You can enable this behavior with a query option, see: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_abort_on_error.html

alex.behm · ‎05-20-2015

Thanks for the explanation, that makes sense. I'm afraid that Impala currently doesn't behave like that, although I can certainly see how it would make sense in some scenarios. One possible way to work around this limitation is to filter rows with IS NOT NULL for those interesting casts.

Member Since	‎10-16-2013 11:04 AM
Last Visited	‎05-10-2018 06:52 PM
Posts	307
Kudos received	77

Cloudera Community

Re: External Table from Parquet folder returns emp...

Re: Impala SQL for KUDU does not work

Re: Impalad logs diskspace full

Re: Impala round function does not return expected...

Re: Is Impala a proces engine when I use kudu?

Re: Is it possiable Auto Increment columns in Impa...

Re: Is it possiable Auto Increment columns in Impa...

Re: ERROR: SELECT query reserved (but escaped) col...

Re: Performance Reduced after Removing ORDER BY cl...

Re: Performance Reduced after Removing ORDER BY cl...

Re: Performance Reduced after Removing ORDER BY cl...

Re: Performance Reduced after Removing ORDER BY cl...

Re: Incorrect CAST as INTEGER

Re: Incorrect CAST as INTEGER

Re: Incorrect CAST as INTEGER