Member since: 07-29-2015
Posts: 535
Kudos Received: 141
Solutions: 103
My Accepted Solutions
| Views | Posted |
|---|---|
| 8901 | 12-18-2020 01:46 PM |
| 5897 | 12-16-2020 12:11 PM |
| 4637 | 12-07-2020 01:47 PM |
| 2797 | 12-07-2020 09:21 AM |
| 1925 | 10-14-2020 11:15 AM |
04-05-2019 08:53 AM
1 Kudo
You can apply memory limits at two levels. At the Impala daemon level, the limit caps the total memory consumption of the process (in part so that it doesn't exceed the physical memory available, but also so that it leaves memory available for other services running on the host). You can (and should) also apply memory limits at the query level via the MEM_LIMIT query option (the one we were talking about). That controls how much of the process memory limit a single query can get. E.g. if you're using admission control, you can configure query memory limits that get applied to all queries in a resource pool.

It would be weird if running a query resulted in the Impala daemon memory limit changing, and I'm not sure what you would even expect to happen if you ran two queries at the same time.

I don't know if this helps, but I gave a talk recently that summarised some of the concepts here. There are slides linked from here - https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/73000

By the way, allocating only 1GB to each Impala daemon is a bad idea for a production deployment - that's simply not enough to run a lot of more complex queries on larger data sets, particularly if you are running multiple concurrent queries. We have some sizing guidelines - https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html#concept_usf_qln_3bb
04-04-2019 01:49 PM
I just tested with ClouderaImpalaJDBC-2.6.4.1005 and it works for me with the following JDBC URL. I can see in the query profile that it takes effect.

static final String DB_URL = "jdbc:impala://localhost:21050/functional_parquet;mem_limit=3gb";

From the profile:

Query Options (set by configuration): MEM_LIMIT=3221225472
03-29-2019 10:59 AM
1 Kudo
Hi @ChineduLB, there is no real difference between Impala and Hive tables - Impala and Hive should be able to read and write the same tables, including partitioned tables, etc.
03-26-2019 10:19 AM
Impala expects your UDF code and its dependencies to be in a single .so, so you'd have to statically link any libraries you depend on.
03-25-2019 03:40 PM
1 Kudo
This isn't possible unless you include a timestamp or sequence number in every record. There's no concept of an order of rows built into Hive or Impala.
03-25-2019 12:36 AM
void FunnelInit(FunctionContext* context, StringVal* val) {
  EventLogs* eventLogs = new EventLogs();
  val->ptr = (uint8_t*) eventLogs;
  // Exit on failed allocation. Impala will fail the query after some time.
  if (val->ptr == NULL) {
    *val = StringVal::null();
    return;
  }
  val->is_null = false;
  val->len = sizeof(EventLogs);
}

I did another scan, and the memory management in the above function is also slightly problematic - the memory attached to the intermediate StringVal would be better allocated through the Impala UDF interface so that Impala can track the memory consumption. E.g. see https://github.com/cloudera/impala-udf-samples/blob/bc70833/uda-sample.cc#L76

I think the real issue, though, is the EventLogs data structure and the lack of a Serialize() function. It's a somewhat complex nested structure with the string and vector. In order for the UDA to work, you need a Serialize() function that flattens the intermediate result into a single StringVal. This is pretty unavoidable, since Impala needs to be able to send the intermediate values over the network and/or write them to disk, and Impala doesn't know enough about your data structure to do it automatically. Our docs do mention this here: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_udf.html#udafs

Putting it into practice is a bit tricky. One working example is the implementation of reservoir sampling in Impala itself, though unfortunately I think it's a little over-complicated: https://github.com/apache/impala/blob/df53ec/be/src/exprs/aggregate-functions-ir.cc#L1067

The general pattern for complex intermediate values is to have a "header" that lets you determine whether the intermediate value is currently serialized, followed by either the deserialized representation or the serialized representation after the "header", using a flexible array member or similar - https://en.wikipedia.org/wiki/Flexible_array_member. The Serialize() function converts the representation by packing any nested structures into a single StringVal with the header in front. Other functions can then switch back to the deserialized representation. In some cases you can be clever and avoid the conversion entirely (that's what the reservoir sample function above is doing, and part of why it's overly complex).

Anyway, a really rough illustration of the idea is as follows:

struct DeserializedValue {
  ...  // members with the nested string/vector data
};

struct IntermediateValue {
  bool serialized;
  union {
    DeserializedValue val;
    char buf[0];  // flexible array member holding the serialized bytes
  };

  StringVal Serialize() {
    if (serialized) {
      // Just copy the serialized representation to the output StringVal.
    } else {
      // Flatten val into an output StringVal.
    }
  }

  void DeserializeIfNeeded() {
    if (serialized) {
      // Unpack buf into val.
    }
  }
};

Just as a side note, the use of the C++ builtin vector and string in the intermediate value can be problematic if they're large, since Impala doesn't account for the memory involved. But that's very much a second-order problem compared to it not working at all.
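To make the memory-tracking point above concrete, here's a minimal sketch of what the init function could look like if the intermediate buffer were allocated through the FunctionContext rather than with new, following the pattern in the uda-sample.cc link. This is just an illustration, not drop-in code: FunnelInit mirrors your function name and IntermediateValue is the hypothetical struct from the sketch above.

void FunnelInit(FunctionContext* context, StringVal* val) {
  // Allocate through the UDF interface so Impala can track the memory.
  val->ptr = context->Allocate(sizeof(IntermediateValue));
  // Handle allocation failure; Impala will fail the query after some time.
  if (val->ptr == NULL) {
    *val = StringVal::null();
    return;
  }
  val->is_null = false;
  val->len = sizeof(IntermediateValue);
  IntermediateValue* v = reinterpret_cast<IntermediateValue*>(val->ptr);
  v->serialized = true;  // start out in the (empty) serialized representation
}

Memory obtained this way counts against the query's memory limit, and it has to be handed back via context->Free() (e.g. once the final result has been produced) rather than with delete.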
03-22-2019 08:38 PM
delete src.ptr;

That is a bug that will definitely cause Impala to crash if you run the UDA enough times. Impala manages that memory and it's definitely not valid to free it yourself! The Impala runtime will automatically manage memory for StringVal inputs.
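For contrast, memory that the UDA itself allocated via context->Allocate() does need to be released, typically in the finalize function after the result has been copied out. A rough sketch of the ownership rule, assuming an intermediate value allocated through the FunctionContext as discussed above (FunnelFinalize and the result-building step are placeholders, not your actual code):

StringVal FunnelFinalize(FunctionContext* context, const StringVal& val) {
  if (val.is_null) return StringVal::null();
  // ... compute the final result bytes from the intermediate value ...
  StringVal result = StringVal::CopyFrom(context, val.ptr, val.len);
  // Free only memory this UDA allocated via context->Allocate();
  // never free the StringVal inputs that Impala passes in.
  context->Free(val.ptr);
  return result;
}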
03-07-2019 09:14 AM
Yeah, we need to make some changes in Impala to handle this case (large SELECT result sets) better; some of that work is in progress. If you're doing large extracts of data, it's often better to do a "CREATE TABLE AS SELECT" into a text table and download the resulting files directly from the filesystem, if that's possible.
03-07-2019 09:02 AM
Oh, the best reference for building Impala is the Apache wiki. https://cwiki.apache.org/confluence/display/IMPALA/Building+native-toolchain+from+scratch+and+using+with+Impala is a bit more hidden and covers how to build the third-party dependencies.
03-07-2019 09:01 AM
You'd probably do better having a conversation about this on dev@impala.apache.org - that's where a lot of this kind of discussion happens. I can give a quick answer, though: no, you can't build Impala without modifications on aarch64; it's x86-64 only at the moment. I imagine most of the third-party code works on aarch64, but I haven't tried it. It would require a bit of legwork to track down all the places that assume x86-64 (intrinsics like you mentioned, but also some places in query compilation where we assume the x86-64 calling convention). The good news is that aarch64 is little-endian and has good LLVM support, which removes two major obstacles.