Member since
10-16-2013
307
Posts
77
Kudos Received
59
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
11284 | 04-17-2018 04:59 PM | |
6232 | 04-11-2018 10:07 PM | |
3578 | 03-02-2018 09:13 AM | |
22367 | 03-01-2018 09:22 AM | |
2672 | 02-27-2018 08:06 AM |
06-29-2015
11:26 AM
There is no enforcement of key constraints or auto increment in Impala. Please understand that Impala and SQL Server are quite different technologies each with their own unique set of advantades/disadvantages. What kind of operations are you doing based on auto increment?
... View more
06-29-2015
10:49 AM
This feature is currrently not available. May I ask what your use case is?
... View more
06-29-2015
10:48 AM
1 Kudo
Can you try enabling "use native query"? The driver will send the query to Impala verbatim (sometimes the driver may make some changes to the SQL). http://www.cloudera.com/content/cloudera/en/documentation/connectors/latest/PDF/Cloudera-JDBC-Driver-for-Impala-Install-Guide.pdf
... View more
06-25-2015
03:05 PM
Makes sense. I appreciate your thorough question, and I completely agree that we should point out this expression-substitution behavior in the performance guide. It's not the first time it has come, and I'd imagine it will not be the last 🙂 Btw, if you really really want to get the materialization behavior with an inline view without an ORDER BY, then you can apply the following terrible hack. Original query: select a, b, c from (select f(x) as a, f(y) as b, f(z) as c from mytable) v Modified query to force materialization of inline view: select a, b, c from (select f(x) as a, f(y) as b, f(z) as c from mytable union all select NULL, NULL, NULL from mytable where false) v The "union all" will force materialization, but the second union operand will be dropped due to the "false" predicate. Obviously, that behavior is implementation defined and subject to change any time, so it would be wise not to rely on it.
... View more
06-24-2015
02:24 PM
1 Kudo
Sorry for the wait. Here's what I think is happening. Impala deals with inline views by substituting the select-list expressions from the inline view in the parent query block. What that means in your case, is that many of the expensive expressions inside your inline view are executed multiple times in the slow non-ORDER-BY version of your query. For example, every reference to "setup_time" in the outer select list is replaced by the corresponding expression from the inline view, i.e. setup_time --> case when regexp_extract(...) then hours_add(...) else setup_time_ts end setup_time As a result, not only are those expensive expressions only executed at the coordinator, but the expensive expressions are evaluated multiple times because they are referenced multiple times in the outer select list. In the ORDER BY version of the query, this redundant expression evaluation is avoided because the ORDER BY materialized its input, so while the same inline view expression substitutions takes place, the outer references are substituted with materialized column references (i.e., the expensive expression is only evaluated once), i.e. setup_time --> materialized column produced by the ORDER BY Hope this makes sense! Alex
... View more
06-23-2015
09:29 AM
Actually, since there's buffering on both sender/receiver sides I don't see how there could be a 10x difference between the queries. I believe ther is a much simpler explanation. Stay tuned for another response.
... View more
06-23-2015
09:17 AM
1 Kudo
Thank you for posting such a detalied description. Your observation regarding expression evaluation is correct: Impala evaluates the expressions lazily. To summarize: - In the slow version without ORDER BY, the SCAN sends the raw data to the coordinator which then evaluates all expressions including those from your inline view. - In the fast version with ORDER BY, the expressions from the inline view are evaulated and materialized at the SORT NODE, i.e., in paralell on all nodes. Now, you had already observed this and you asked how this can explain the 10x difference whereas you'd only expect a 3x difference based on the 3x increased paralellizm. The answer is that in the slow version the entire query execution is CPU bound by the single coordinator node. Impala's execution engine is streaming, so the coordinator will apply backpressure to the stream sender and in turn the SCANs, if it cannot process the rows quickly enough (which in this case it obviously cannot). So it means while the coordinator is still processing a batch rows, the SCANs will not be able to make progress (it's not quite as simple as this, but it explains the mechanichs).
... View more
05-22-2015
02:09 PM
That change only affects values being parsed inside a scan node. In your example you are casting a literal - the query option will have no effect on that.
... View more
05-22-2015
11:14 AM
In that case, there's still a chance you can get the your desired behavior. When scanning text data, Impala does have an option to abort on any parsing error encountered, e.g., if you declared a column as INT, but a particular text value could not be parsed as an INT in the scan. You can enable this behavior with a query option, see: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_abort_on_error.html
... View more
05-20-2015
10:05 PM
Thanks for the explanation, that makes sense. I'm afraid that Impala currently doesn't behave like that, although I can certainly see how it would make sense in some scenarios. One possible way to workaround this limitation is to filter rows with IS NOT NULL for those interesting casts.
... View more