Member since: 07-29-2015
Posts: 535
Kudos Received: 140
Solutions: 103
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 4530 | 12-18-2020 01:46 PM |
 | 2826 | 12-16-2020 12:11 PM |
 | 1904 | 12-07-2020 01:47 PM |
 | 1482 | 12-07-2020 09:21 AM |
 | 972 | 10-14-2020 11:15 AM |
03-07-2019
09:01 AM
You'd probably do better having this conversation on dev@impala.apache.org; that's where a lot of this kind of discussion happens. I can give a quick answer: no, you can't build Impala on aarch64 without modifications; it's x86-64 only at the moment. I imagine most of the third-party code works on aarch64, but I haven't tried it. It would take a bit of legwork to track down all the places that assume x86-64 (intrinsics like you mentioned, but also some places in query compilation where we assume the x86-64 calling convention). The good news is that aarch64 is little-endian and has good LLVM support, which removes two major obstacles.
03-05-2019
09:54 AM
Impala is a streaming SQL engine, so query execution can happen at the same time as rows are returned to the client. In your case, we don't scan the whole table, stage the rows somewhere, and then return them to the client; rather, Impala returns rows to the client while it's scanning the table. The bottleneck is likely in the client or the network. Impyla is not particularly fast at parsing incoming rows and converting them into Python objects; the Impala server is much, much faster. There's also a known issue where latency between the client and the server can inflate the time taken to return rows: https://issues.apache.org/jira/browse/IMPALA-1618.
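The streaming model above can be sketched in plain Python (this is an illustration only, not Impala or Impyla code): a generator stands in for the table scan, and the consumer receives the first row before the scan has produced the rest.

```python
# Illustration of streaming execution: the "scan" is a hypothetical stand-in
# for Impala reading a table. Events record the interleaving of producer and
# consumer, showing the client sees row 0 before rows 1 and 2 exist.
events = []

def scan_table(n):
    """Pretend table scan: yields rows one at a time instead of materializing all."""
    for i in range(n):
        events.append(f"produced {i}")
        yield i

first_row = None
for row in scan_table(3):
    if first_row is None:
        first_row = row
        events.append("client saw first row")

print(events)
# ['produced 0', 'client saw first row', 'produced 1', 'produced 2']
```

The same interleaving is why iterating an Impyla cursor row by row can start returning results before the query has finished scanning.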
03-05-2019
09:50 AM
Impala is not designed for traditional OLTP and doesn't have transaction support that would line up with what TPC-C expects.
02-13-2019
11:47 AM
I don't think dictionary encoding makes a difference to the effectiveness of min-max stats, because the data will still be in the file in the same order regardless.
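The point that ordering, not encoding, is what matters can be shown with a small sketch (illustrative Python, not Parquet internals): min/max ranges per chunk stand in for row-group stats, and only sorting narrows them.

```python
# Illustration: per-chunk min/max "stats" depend on row order. Unsorted data
# gives wide, overlapping ranges that can't prune anything; sorted data gives
# narrow, disjoint ranges regardless of how values are encoded on disk.
def chunk_stats(values, chunk_size):
    """Return (min, max) for each fixed-size chunk of values."""
    return [(min(c), max(c)) for c in
            (values[i:i + chunk_size] for i in range(0, len(values), chunk_size))]

data = [5, 1, 9, 3, 7, 2, 8, 4]
print(chunk_stats(data, 4))          # [(1, 9), (2, 8)] -- overlapping, useless for pruning
print(chunk_stats(sorted(data), 4))  # [(1, 4), (5, 9)] -- narrow, disjoint ranges
```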
02-12-2019
10:36 AM
I'm not sure that parquet-cpp has any built-in way to sort data; your client code might have to do the sorting before feeding it to parquet-cpp.
02-12-2019
08:21 AM
1 Kudo
The external tool that you are using would have to support ordering the data by those columns. E.g. if you're using Hive, it supports SORT BY. If you're writing the data from some custom code, that code would need to sort it before writing it to Parquet.
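Sorting in the client before writing might look like this (a minimal Python sketch; the column names are made up, and the actual Parquet write call is out of scope here):

```python
# Hypothetical example: sort row data by the columns you filter on before
# handing it to a Parquet writer, so each row group's min/max stats stay narrow.
from operator import itemgetter

# rows of (event_date, user_id, value) -- illustrative column names
rows = [
    ("2019-02-12", 42, 1.5),
    ("2019-02-10", 7, 0.3),
    ("2019-02-12", 7, 2.2),
    ("2019-02-10", 42, 0.9),
]

# Sort by event_date, then user_id, before writing to Parquet.
rows_sorted = sorted(rows, key=itemgetter(0, 1))
print(rows_sorted[0])  # ('2019-02-10', 7, 0.3)
```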
02-08-2019
03:59 PM
I unfortunately don't know too many of the details of LDAP. Impala doesn't do anything sophisticated to create the directories: it just calls mkdir() with S_IRWXU|S_IRWXG|S_IRWXO to create the impala-scratch subdirectory and any missing parent directories.
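In Python terms, the directory creation described above looks roughly like this (a sketch, not Impala's actual C++ code; the paths are made up for illustration):

```python
# Python analogue of mkdir() with S_IRWXU|S_IRWXG|S_IRWXO (i.e. mode 0777),
# creating any missing parent directories as well. Note the effective
# permissions are still reduced by the process umask.
import os
import stat
import tempfile

mode = stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO  # == 0o777
base = tempfile.mkdtemp()                          # stand-in for a scratch dir
scratch = os.path.join(base, "impala-scratch")     # hypothetical layout
os.makedirs(scratch, mode=mode, exist_ok=True)     # creates missing parents too
print(os.path.isdir(scratch))  # True
```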
02-08-2019
08:51 AM
1 Kudo
I'll assume that you have some directories configured and passed in via the --scratch_dirs argument (you can check the debug page on port 25000 or the impalad.INFO log to confirm the flag's value). Then what likely happened is that the directories weren't usable for some reason. Any errors that prevent using the directories are logged at startup.
02-07-2019
09:41 AM
1 Kudo
If you want to do the implicit join between the table and the nested collection, you need to reference the nested collection using the alias that you used for the table. Otherwise the top-level table and the nested collection are treated as independent table references, and the query means "return the cartesian product of the tables". I.e. you want to rewrite as follows:

select rta.transaction_purchase_id, rta.cigarette_transaction_flag,
       rta.non_cig_merch_transaction_flag, bow.item
from wdl_atomic.retail_transaction_attribute rta,
     rta.retail_offering_material_group_distinct_list bow
where rta.fiscal_period_id = 2019001;

That will solve your issue.
01-10-2019
08:56 AM
The Cloudera Manager queries page has the bytes spilled to disk as one of the metrics it tracks per query. Also in CM, there's a "Cluster utilization report" that has some aggregate information about how much data is spilled to disk over longer time windows. In addition, if you're looking at the scratch files themselves, the query ID is embedded in the file name (although that's an implementation detail and could change in the future).