Member since: 07-29-2015
Posts: 463
Kudos Received: 121
Solutions: 87
My Accepted Solutions
Views | Posted
---|---
52 | 11-27-2019 11:18 AM
133 | 11-06-2019 10:08 AM
98 | 10-23-2019 02:10 PM
282 | 07-24-2019 04:28 PM
322 | 06-18-2019 02:38 PM
04-05-2019
08:53 AM
1 Kudo
You can apply memory limits at two levels. At the Impala daemon level, a limit caps the total memory consumption of the process - in part so that it doesn't exceed the physical memory available, but also so that it leaves memory for other services running on the host.

You can (and should) also apply memory limits at the query level via the MEM_LIMIT query option (the one we were talking about). That controls how much of the process memory limit a single query can get - e.g. SET MEM_LIMIT=2gb in impala-shell caps that query at 2GB. If you're using admission control, you can configure query memory limits that get applied to all queries in a resource pool.

It would be weird if running a query caused the Impala daemon memory limit to change, and I'm not sure what you would even expect to happen if you ran two queries at the same time.

I don't know if this helps, but I gave a talk recently that summarised some of the concepts here. There are slides linked from here - https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/73000

By the way, only allocating 1GB to each Impala daemon is a bad idea for a production deployment - that's simply not enough to run a lot of more complex queries on larger data sets, particularly if you are running multiple concurrent queries. We have some sizing guidelines - https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html#concept_usf_qln_3bb
04-04-2019
01:49 PM
I just tested with ClouderaImpalaJDBC-2.6.4.1005 and it works for me with the following JDBC URL. I can see in the query profile that it takes effect.

static final String DB_URL = "jdbc:impala://localhost:21050/functional_parquet;mem_limit=3gb";

From the profile:

Query Options (set by configuration): MEM_LIMIT=3221225472
03-29-2019
10:59 AM
1 Kudo
Hi @ChineduLB, there is no real difference between Impala and Hive tables - Impala and Hive should be able to read and write the same tables, including partitioned tables, etc.
03-29-2019
10:58 AM
I filed a JIRA with the Apache project so that there's more visibility into this issue: https://issues.apache.org/jira/browse/IMPALA-8373
03-26-2019
10:19 AM
Impala expects your UDF code and dependencies to be in a single .so, so you'd have to statically link any libraries you depend on.
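To make that concrete, here's a minimal sketch - the function name, the helper library, and the g++ command are made up for illustration, and the exact build flags will depend on your toolchain:

// my_udf.cc - everything the UDF needs must end up inside this one .so.
#include <impala_udf/udf.h>

using namespace impala_udf;

// Trivial example UDF; imagine it calling into a third-party helper library.
IntVal MyFn(FunctionContext* context, const IntVal& arg) {
  if (arg.is_null) return IntVal::null();
  return IntVal(arg.val + 1);
}

// Hypothetical build: link the dependency statically (a .a archive built
// with -fPIC) so its code is baked into the shared object:
//   g++ -shared -fPIC my_udf.cc /path/to/libhelper.a -o my_udf.so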
03-25-2019
03:40 PM
1 Kudo
This isn't possible unless you include a timestamp or sequence number in every record. There's no concept of an order of rows built into Hive or Impala.
03-25-2019
12:36 AM
void FunnelInit(FunctionContext* context, StringVal* val) {
  EventLogs* eventLogs = new EventLogs();
  val->ptr = (uint8_t*) eventLogs;
  // Exit on failed allocation. Impala will fail the query after some time.
  if (val->ptr == NULL) {
    *val = StringVal::null();
    return;
  }
  val->is_null = false;
  val->len = sizeof(EventLogs);
}

I did another scan and the memory management in the above function is also slightly problematic - the memory attached to the intermediate StringVal would be better allocated from the Impala UDF interface so that Impala can track the memory consumption. E.g. see https://github.com/cloudera/impala-udf-samples/blob/bc70833/uda-sample.cc#L76 .

I think the real issue though is the EventLogs data structure and the lack of a Serialize() function. It's a somewhat complex nested structure with the string and vector. In order for the UDA to work, you need a Serialize() function that flattens the intermediate result into a single StringVal. This is pretty unavoidable since Impala needs to be able to send the intermediate values over the network and/or write them to disk, and Impala doesn't know enough about your data structure to do it automatically. Our docs mention this here: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_udf.html#udafs

Putting it into practice is a bit tricky. One working example is the implementation of reservoir sampling in Impala itself. Unfortunately I think it's a little over-complicated: https://github.com/apache/impala/blob/df53ec/be/src/exprs/aggregate-functions-ir.cc#L1067

The general pattern for complex intermediate values is to have a "header" that lets you determine whether the intermediate value is currently serialized, then either the deserialized representation, or the serialized representation after the "header" using a flexible array member or similar - https://en.wikipedia.org/wiki/Flexible_array_member. The Serialize() function converts the representation by packing any nested structures into a single StringVal with the header in front. Then other functions can switch back to the deserialized representation. Or you can sometimes be clever and avoid the conversion (that's what the reservoir sample function above is doing, and part of why it's overly complex). Anyway, a really rough illustration of the idea is as follows:

struct DeserializedValue {
  ...
};

struct IntermediateValue {
  bool serialized;
  union {
    DeserializedValue val;
    char buf[0];
  };

  StringVal Serialize() {
    if (serialized) {
      // Just copy the serialized representation to the output StringVal.
    } else {
      // Flatten val into an output StringVal.
    }
  }

  void DeserializeIfNeeded() {
    if (serialized) {
      // Unpack buf into val.
    }
  }
};

Just as a side note, the use of the C++ built-in vector and string in the intermediate value can be problematic if they're large, since Impala doesn't account for the memory involved. But that's very much a second-order problem compared to it not working at all.
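To make the tracked-allocation and Serialize() ideas a bit more concrete, here's a minimal sketch with a deliberately simplified state struct - FunnelState and its field are made up for illustration, and I've skipped the header/union machinery; the StringVal(context, len) constructor is the udf.h way to allocate a buffer that Impala tracks:

#include <cstdint>
#include <cstring>
#include <new>
#include <vector>
#include <impala_udf/udf.h>

using namespace impala_udf;

// Hypothetical simplified intermediate state: just a list of event times.
struct FunnelState {
  std::vector<int64_t> event_times;
};

void FunnelInit(FunctionContext* context, StringVal* val) {
  FunnelState* state = new (std::nothrow) FunnelState();
  if (state == NULL) {
    *val = StringVal::null();
    return;
  }
  val->is_null = false;
  val->ptr = reinterpret_cast<uint8_t*>(state);
  val->len = sizeof(FunnelState);
}

// Flatten the state into one self-contained, context-allocated buffer so
// Impala can ship it over the network or spill it to disk.
StringVal FunnelSerialize(FunctionContext* context, const StringVal& val) {
  if (val.is_null) return StringVal::null();
  FunnelState* state = reinterpret_cast<FunnelState*>(val.ptr);
  size_t bytes = state->event_times.size() * sizeof(int64_t);
  StringVal result(context, bytes);  // memory tracked by Impala
  if (bytes > 0) memcpy(result.ptr, state->event_times.data(), bytes);
  delete state;  // we own the heap struct; Impala owns 'result'
  return result;
}

The Merge() and Finalize() functions then need to understand the flattened layout, which is where the header byte from the pattern above earns its keep.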
03-22-2019
08:38 PM
delete src.ptr;

That is a bug that will definitely cause Impala to crash if you run the UDA enough times. Impala manages that memory and it's not valid to free it yourself! The Impala runtime automatically manages memory for StringVal inputs.
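If the intent was to hold onto the input's bytes, the safe pattern is to copy them into context-tracked memory - a rough sketch only, where FunnelUpdate and the surrounding state are stand-ins for your UDA:

#include <cstdint>
#include <cstring>
#include <impala_udf/udf.h>

using namespace impala_udf;

void FunnelUpdate(FunctionContext* context, const StringVal& src, StringVal* dst) {
  if (src.is_null) return;
  // Never: delete src.ptr; - the runtime owns input buffers.
  // To keep the bytes beyond this call, copy them into memory allocated
  // through the FunctionContext so Impala tracks it:
  uint8_t* copy = context->Allocate(src.len);
  if (copy == NULL) return;  // allocation failure is recorded on the context
  memcpy(copy, src.ptr, src.len);
  // ... stash 'copy' in the intermediate state; release it later with
  // context->Free(copy), never with delete/free.
}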
03-14-2019
10:31 AM
I think you're probably running into this issue: https://issues.apache.org/jira/browse/IMPALA-8109

It would help to provide "SHOW FILES" output for the table and the Impala version that you're running (i.e. the output of "select version()").
03-14-2019
10:29 AM
What file format are you using? Can you attach an Impala query profile from the query?
03-07-2019
09:14 AM
Yeah, we need to make some changes in Impala to optimise this case (large SELECT result sets) better; some of that work is already in Impala. If you're doing large extracts of data, it's often better to do a "CREATE TABLE AS SELECT" into a text table and download those files directly from the filesystem, if that's possible.
03-07-2019
09:12 AM
1 Kudo
The query profile and/or execution summary is the best reference for this. Parallelism for Parquet files depends on the number of HDFS blocks (which is usually the same as the number of Parquet files), so if your tables only have one HDFS block each you may not get parallelism.
03-07-2019
09:02 AM
Oh, the best reference for building Impala is the Apache wiki. https://cwiki.apache.org/confluence/display/IMPALA/Building+native-toolchain+from+scratch+and+using+with+Impala is a bit more hidden and covers how to build the third-party dependencies.
03-07-2019
09:01 AM
You'd probably do better having a conversation about this on dev@impala.apache.org - that's where a lot of this kind of discussion happens. I can give a quick answer: no, you can't build Impala on aarch64 without modifications; it's x86-64 only at the moment. I imagine most of the third-party code works on aarch64, but I haven't tried it. It would require a bit of legwork to track down all the places that assume x86-64 (intrinsics like you mentioned, but also some places in query compilation where we assume the x86-64 calling convention). The good news is that aarch64 is little-endian and has good LLVM support, which removes two major obstacles.
03-05-2019
09:54 AM
Impala is a streaming SQL engine, so query execution can happen at the same time as rows are returned to the client. In your case, we don't scan the whole table, stage the rows somewhere, and then return them to the client - rather, Impala returns rows to the client at the same time as it's scanning the table. The bottleneck is likely in the client or the network. Impyla is not particularly fast at parsing incoming rows and converting them into Python objects; the Impala server is much, much faster. There's also a known issue where latency between the client and the server can affect the time taken to return rows: https://issues.apache.org/jira/browse/IMPALA-1618
03-05-2019
09:50 AM
Impala is not designed for traditional OLTP and doesn't have transaction support that would line up with what TPC-C expects.
02-28-2019
05:13 PM
I believe it's a limitation of the LEAD/LAG implementation that the second argument (the offset) has to be a constant - e.g. LEAD(ts, 2) is allowed, but LEAD(ts, some_column) is not.
02-27-2019
06:03 PM
@amiroh Others might find it easier to help if you include the SQL you are running and the error message you encountered. If you can't share the exact SQL because of sensitive column or table names, simplifying the query and renaming columns would be ideal.
02-13-2019
11:47 AM
I don't think dictionary encoding makes a difference to the effectiveness of min-max stats, because the data is still going to be in the file in the same order regardless.
02-12-2019
10:36 AM
I'm not sure that parquet-cpp has any built-in way to sort data - your client code might have to do the sorting before feeding it to parquet-cpp.
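I.e. something along these lines on the client side - a sketch only, where the Row layout is a placeholder and the actual parquet-cpp writer calls are omitted:

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical in-memory row; 'key' is the column you want the Parquet
// min-max statistics to be selective on.
struct Row {
  int64_t key;
  // ... other columns ...
};

// Sort each batch by 'key' before handing it to the parquet-cpp writer,
// so row groups and pages cover narrow, mostly non-overlapping key ranges.
void SortBatch(std::vector<Row>* rows) {
  std::sort(rows->begin(), rows->end(),
            [](const Row& a, const Row& b) { return a.key < b.key; });
}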
02-12-2019
08:21 AM
1 Kudo
The external tool that you are using would have to support ordering the data by those columns. E.g. if you're using Hive, it supports SORT BY. If you're writing the data from custom code, that code would need to sort it before writing it to Parquet.
02-08-2019
03:59 PM
I unfortunately don't know too many of the details of LDAP. Impala doesn't do anything sophisticated to create the directories - it just calls mkdir() with S_IRWXU|S_IRWXG|S_IRWXO to create the impala-scratch subdirectory and any missing parent directories.
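In other words, effectively the following for each missing directory (the path here is just an example):

#include <sys/stat.h>
#include <sys/types.h>

int main() {
  // 0777 permissions, further restricted by the process umask.
  return mkdir("/data/1/impala/impala-scratch", S_IRWXU | S_IRWXG | S_IRWXO);
}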
02-08-2019
08:51 AM
1 Kudo
I'll assume that you have some directories configured and passed in via the --scratch_dirs argument (you can check the debug page on port 25000 or the impalad.INFO log to confirm the flag value). Then what likely happened is that the directories weren't usable for some reason. Any errors that prevent using the directories are logged at startup.
02-07-2019
04:02 PM
3 Kudos
The observability should get better in CDH 6.1 - queued queries are cancellable and have more information available in the profile about why they were queued. There isn't an aggregate limit on concurrency across pools. There is a limit on the number of connections to each impalad (--fe_service_threads), but since the queries were submitted, that proves the client was at least connected! One possibility is that, if you have "Maximum memory" set on your resource pools, memory-based admission control is limiting admission based on available memory. Another possibility is that the queries in the CREATED state aren't queued in admission control, but are rather in planning, e.g. blocked waiting to load metadata. Do you have any profiles from the queries that were queued for a long time?
02-07-2019
09:41 AM
1 Kudo
If you want to do the implicit join between the table and the nested collection, you need to reference the nested collection using the alias that you used for the table. Otherwise the top-level table and the nested collection are treated as independent table references and the query means "return the cartesian product of the tables". I.e. you want to rewrite as follows:

select rta.transaction_purchase_id, rta.cigarette_transaction_flag,
       rta.non_cig_merch_transaction_flag, bow.item
from wdl_atomic.retail_transaction_attribute rta,
     rta.retail_offering_material_group_distinct_list bow
where rta.fiscal_period_id = 2019001;

That will solve your issue.
01-17-2019
05:14 PM
Yeah looks like that is it! If the queue has space the query would just get queued instead.
01-17-2019
04:51 PM
There's a big "Refresh Dynamic Resource Pools" button in the bottom-left of the "Impala Admission Control" screen when the configs are stale.
01-17-2019
02:25 PM
@Daggers Is it possible that you didn't refresh the admission control pool configurations after changing them?
01-10-2019
08:56 AM
The Cloudera Manager queries page has the bytes spilled to disk as one of the metrics it tracks per query. Also in CM, there's a "Cluster utilization report" that has some aggregate information about how much data is spilled to disk over longer time windows. Also, if you're looking at the scratch files themselves the query ID is embedded in the file name (although that's an implementation detail and could change in the future).
01-09-2019
12:57 PM
They're used for spill to disk - see https://www.cloudera.com/documentation/enterprise/latest/topics/impala_scalability.html#spill_to_disk