About Tim Armstrong

Tim Armstrong · ‎12-15-2020

You can limit the aggregate memory that any one pool will consume. There isn't exactly a priority option, though, since there's no ability to pre-empt queries once they are running.

Tim Armstrong · ‎12-14-2020

Impala can query views. Computing table stats on tables accessed by Impala queries is necessary to get the best performance, particularly for complex queries. That's probably not the cause of whatever your user saw, but you need to include at least a query profile or the query status error message for us to be able to give any tips.

Tim Armstrong · ‎12-14-2020

You want to enable memory-based admission control - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html#admission_control . Without that enabled, memory reservation for queries is best effort - queries just run and get whatever memory they ask for until memory is exhausted. With it enabled, queries will get allocated specific amounts of memory and will get queued when memory is low. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_rm_example.html is a good starting point. I'd recommend setting a minimum and maximum memory limit, probably a minimum of ~1GB and a maximum of whatever you're comfortable with a single query being given. I also gave a talk a while ago that gives an overview of some things - https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/detail/73000.html. That all said, scheduling is based on data locality/affinity - the read of each input file is scheduled on a node with a local replica of that file. There's also affinity to bias scheduling towards a single replica, so that the same data is read on the same node as much as possible. This minimizes network traffic and maximizes use of the OS buffer cache (i.e. maximizes the likelihood of reading the data from memory instead of disk).
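To make the admit-or-queue behaviour a bit more concrete, here is a minimal toy model in Python. It is only a sketch of the idea, not Impala's implementation: the class name `ToyPool`, its parameters, and the strict first-in-first-out queue are all assumptions made for illustration. Each query's memory estimate is clamped to the pool's minimum and maximum, and a query is queued when admitting it would push the pool past its aggregate limit.

```python
from collections import deque

GB = 1024 ** 3


class ToyPool:
    """Toy model of memory-based admission for one pool (hypothetical, not Impala code)."""

    def __init__(self, pool_mem_limit, min_query_mem, max_query_mem):
        self.pool_mem_limit = pool_mem_limit  # aggregate memory the pool may reserve
        self.min_query_mem = min_query_mem    # floor applied to every query
        self.max_query_mem = max_query_mem    # ceiling applied to every query
        self.running = {}                     # query id -> bytes reserved
        self.queued = deque()                 # (query id, bytes to reserve), FIFO

    def _clamp(self, estimate):
        # Keep the per-query reservation within [min_query_mem, max_query_mem].
        return max(self.min_query_mem, min(estimate, self.max_query_mem))

    def _reserved(self):
        return sum(self.running.values())

    def submit(self, query_id, mem_estimate):
        """Admit the query if it fits and nothing is waiting, otherwise queue it."""
        mem = self._clamp(mem_estimate)
        if not self.queued and self._reserved() + mem <= self.pool_mem_limit:
            self.running[query_id] = mem
            return "admitted"
        self.queued.append((query_id, mem))
        return "queued"

    def finish(self, query_id):
        """Release a finished query's memory, then admit queued queries that now fit."""
        del self.running[query_id]
        while self.queued and self._reserved() + self.queued[0][1] <= self.pool_mem_limit:
            qid, mem = self.queued.popleft()
            self.running[qid] = mem


if __name__ == "__main__":
    pool = ToyPool(pool_mem_limit=10 * GB, min_query_mem=1 * GB, max_query_mem=4 * GB)
    print(pool.submit("q1", GB // 10))  # tiny estimate, raised to the 1GB minimum
    print(pool.submit("q2", 50 * GB))   # huge estimate, capped at the 4GB maximum
    print(pool.submit("q3", 4 * GB))
    print(pool.submit("q4", 2 * GB))    # 9GB already reserved, so this one waits
    pool.finish("q2")                   # frees 4GB, which lets q4 start
    print(sorted(pool.running))
```

The point of the clamp is that one query with a wild estimate can neither starve itself nor take the whole pool, which is why a minimum and a maximum are both worth setting.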

Tim Armstrong · ‎12-10-2020

Great news!

Tim Armstrong · ‎12-08-2020

Glad to help! I'm excited about the S3 changes just because they simplify ingestion so much. I'll add a disclaimer here in case other people read the solution. There's *some* potential for performance impact when disabling s3guard for S3-based tables with large partition counts, just because of the difference in implementation - retrieving the listing from DynamoDB may be quicker than retrieving it from S3 in some scenarios.

Tim Armstrong · ‎12-07-2020

If you have objects that have been deleted in S3 but are showing up in file listings after refreshing the table (which sounds like the case since you dropped and recreated the table), it's possible that there's some inconsistency between the state in s3guard and the state in S3. https://docs.cloudera.com/runtime/7.0.2/cloud-data-access/topics/cr-cda-s3guard-operational-issues.html has some background on s3guard. I'm not an s3guard expert (it's a layer Impala builds on) so I don't have much to add about how you would debug/address this beyond what we have in the docs there. One option to consider is to disable s3guard to avoid it entirely. Very recently S3 improved its consistency model to address the main problems s3guard solved (https://aws.amazon.com/s3/consistency/), so you could try disabling s3guard for that bucket to see if it solves the problem.

Tim Armstrong · ‎12-07-2020

Slide 17 here has some rules of thumb - https://blog.cloudera.com/latest-impala-cookbook/ . Can you mention what version you're running and whether you have any other non-standard configs set, e.g. load_catalog_in_background? We made some improvements in this area and have added some options in more recent versions.

Tim Armstrong · ‎12-07-2020

These are good questions that come up frequently. https://docs.cloudera.com/runtime/7.2.2/administering-kudu/topics/kudu-security-trusted-users.html discusses the issue. In summary, Hive/Impala tables (i.e. those with entries in the Hive Metastore) are authorized in the same way, regardless of whether backing storage is HDFS, S3, Kudu, HBase, etc - the SQL interface does the authorization to confirm that the end user has access to the table, columns, etc, then the service accesses the storage as the privileged user (Impala in this case). In this model, if you create an external Kudu table in Impala and give permissions to a user to access the table via Impala, then they will have permissions to access the data in the underlying Kudu table. The thing that closes the loophole here is that creating the external Kudu table requires very high privileges - ALL permission on SERVER - a regular user can't create an external Kudu table pointed at an arbitrary Kudu cluster or table.

Tim Armstrong · ‎11-30-2020

@PyMeH that's not right. The Impala JDBC driver does use the HS2 protocol - JDBC is the Java language interface and HS2 is the client-server network protocol. You should be able to use impersonation with JDBC. You'd need to configure Impala to allow a particular user to delegate - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_delegation.html . Then there is a DelegationUID option for the driver that I believe specifies the user to delegate to - https://docs.cloudera.com/documentation/other/connectors/impala-jdbc/latest/Cloudera-JDBC-Driver-for-Impala-Install-Guide.pdf

Tim Armstrong · ‎11-12-2020

Impala would probably give you the fastest response time. Personally, I would write a script (Python or whatever) that fetched the queries and just ran them one by one. You could try to combine the queries in various ways if you really cared about reducing latency (I'm not sure that any of these alternatives would make a massive difference, but maybe some amount). E.g. the following would require only a single scan of the table (although it might be more expensive because you don't get filtering from the where clause):
select count(case when <where clause 1> then 1 end), count(case when <where clause 2> then 1 end) from MyTable
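For illustration, here is a rough Python sketch of both approaches side by side. It uses the standard-library `sqlite3` module as a stand-in so that it runs anywhere; it does not talk to Impala, and the helper names, the demo table contents, and the predicates are all made up for the example. The `count(case when ... then 1 end)` form works because `count` ignores the NULL that the `case` yields when the predicate doesn't match.

```python
import sqlite3


def make_demo_table():
    """Build a small in-memory table (hypothetical data, sqlite3 standing in for Impala)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MyTable (x INTEGER, y TEXT)")
    conn.executemany(
        "INSERT INTO MyTable VALUES (?, ?)",
        [(1, "a"), (5, "b"), (10, "a"), (20, "c")],
    )
    return conn


def count_separately(conn, table, predicates):
    """Run one count(*) query per predicate, i.e. one table scan each."""
    # The predicates are pasted into the SQL text, so they must come from a trusted source.
    return [
        conn.execute(f"SELECT count(*) FROM {table} WHERE {p}").fetchone()[0]
        for p in predicates
    ]


def count_in_one_scan(conn, table, predicates):
    """Fold all predicates into one query using count(CASE WHEN ... THEN 1 END)."""
    columns = ", ".join(f"count(CASE WHEN {p} THEN 1 END)" for p in predicates)
    return list(conn.execute(f"SELECT {columns} FROM {table}").fetchone())


if __name__ == "__main__":
    conn = make_demo_table()
    predicates = ["x > 4", "y = 'a'"]
    print(count_separately(conn, "MyTable", predicates))
    print(count_in_one_scan(conn, "MyTable", predicates))
```

Both functions return the same counts in the same order, so the combined form can be swapped in without changing whatever consumes the results. Whether it is actually faster depends on how selective the individual where clauses are.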

Member Since	‎07-29-2015 04:07 PM
Last Visited	‎02-11-2021 06:07 PM
Posts	535
Kudos received	141

Cloudera Community

Re: Impala Queries which were previously working a...

Re: Impala queries are not distributing to all the...

Re: impala - `recover partitions` points to old da...

Re: impala catalog server JVM

Re: Impala - On-demand metadata

Re: Impala queries are not distributing to all the...

Re: Impala query time out's

Re: Impala queries are not distributing to all the...

Re: GET_COLUMS when launching queries through ODBC

Re: impala - `recover partitions` points to old da...

Re: impala - `recover partitions` points to old da...

Re: impala catalog server JVM

Re: Kudu-impala security

Re: Does Impala support Impersonation?

Re: Run Multiple Count Operation On Data Table