Member since
07-29-2015
535
Posts
140
Kudos Received
103
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
5798 | 12-18-2020 01:46 PM | |
3729 | 12-16-2020 12:11 PM | |
2645 | 12-07-2020 01:47 PM | |
1892 | 12-07-2020 09:21 AM | |
1224 | 10-14-2020 11:15 AM |
12-15-2020
08:59 PM
You can limit the aggregate memory that any one pool will consume. There isn't exactly a priority option (there's no ability to pre-empt queries once they are running)
... View more
12-14-2020
08:38 AM
Impala can query views. Computing table stats on tables accessed by Impala queries is necessary to get the best performance, particularly for complex queries. That's probably not the cause of whatever your user saw, but you need to include a query profile or the query status error message at least for us to give any tips about.
... View more
12-14-2020
08:35 AM
1 Kudo
You want to enable memory-based admission control - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html#admission_control . Without that enabled memory reservation for queries is best effort - queries just run and get whatever memory they ask for until memory is exhausted. With it enabled queries will get allocated specific amounts of memory and queries will get queued when memory is low. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_rm_example.html is a good starting point. I'd recommend setting a minimum and maximum memory limit, probably a minimum of ~1GB and a maximum of whatever you're comfortably with a single query being given. I also gave a talk a while ago that gives an overview of some things - https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/detail/73000.html. That all said, scheduling is based on data locality/affinity - the read of each input file is scheduled on a node with local replica of that file. There's also affinity to bias scheduling towards a single replica, so that the same data is read on the same node as much as possible. This minimizes network traffic and maximizes use of the OS buffer cache (i.e. maximises likelihood of reading the data from memory instead of disk).
... View more
12-10-2020
04:28 PM
Great news!
... View more
12-08-2020
09:36 AM
Glad to help! I'm excited about the S3 changes just cause it simplifies ingestion so much. I add a disclaimer here in case other people read the solution. There's *some* potential for performance impact when disabling s3guard for S3-based tables with large partition counts, just because of the difference in implementation - retrieving the listing from dynamodb may be quicker than retrieving it from S3 in some scenarios.
... View more
12-07-2020
01:47 PM
If you have objects that have been deleted in S3 but are showing up in file listings after refreshing the table (which sounds like the case since you dropped and recreated the table), it's possible that there's some inconsistency between the state in s3guard and the state in s3. https://docs.cloudera.com/runtime/7.0.2/cloud-data-access/topics/cr-cda-s3guard-operational-issues.html has some background on s3guard. I'm not an s3guard expert (it's a layer Impala builds on) so don't have much to add about how you would debug/address this beyond what we have in the docs there. One option to consider is to disable s3guard to avoid it entirely. Very recently S3 improved its consistency model to address the main problems s3guard solved (https://aws.amazon.com/s3/consistency/), so you could try disabling s3guard for that bucket to see if it solves the problem.
... View more
12-07-2020
09:21 AM
Slide 17 here has some rules of thumb - https://blog.cloudera.com/latest-impala-cookbook/ Can you mention what version you're running and whether you have any other non-standard configs set, e.g. load_catalog_in_background. We made some improvements in this area and have added some options in more recent versions.
... View more
12-07-2020
09:17 AM
These are good questions that come up frequently. https://docs.cloudera.com/runtime/7.2.2/administering-kudu/topics/kudu-security-trusted-users.html discusses the issue. In summary, Hive/Impala tables (i.e. those with entries in the Hive Metastore) are authorized in the same way, regardless of whether backing storage is HDFS, S3, Kudu, HBase, etc - the SQL interface does the authorization to confirm that the end user has access to the table, columns, etc, then the service accesses the storage as the privileged user (Impala in this case). In this model, if you create an external Kudu table in Impala and give permissions to a user to access the table via Impala, then they will have permissions to access the data in the underlying Kudu table. The thing that closes the loophole here is that creating the external Kudu table requires very high privileges - ALL permission on SERVER - a regular user can't create an external Kudu table pointed at an arbitrary Kudu cluster or table.
... View more
11-30-2020
09:09 AM
@PyMeH that's not right. The Impala JDBC driver does use the HS2 protocol - JDBC is the java language interface and HS2 is the client-server network protocol. You should be able to use impersonation with JDBC. You'd need to configure Impala to allow a particular user to delegate - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_delegation.html Then there is a DelegationUID option for the driver that I believe specifies the user to delegate to - https://docs.cloudera.com/documentation/other/connectors/impala-jdbc/latest/Cloudera-JDBC-Driver-for-Impala-Install-Guide.pdf
... View more
11-12-2020
08:05 PM
Impala would probably give you the fastest response time. Personally, I would write a script (Python on whatever) that fetched the queries and just ran them one by one. You could try to combine together the queries in various ways if you really cared about reducing latency (I'm not sure that any of these alternatives would make a massive difference, but maybe some amount). E.g. the following would require only a single scan of the table (although it might be more expensive cause you don't have filtering from the where clause). Select count(case when <where clause 1> then 1 end), count(case when <where clause 2> then 1 end)
from MyTable
... View more