Member since: 07-29-2015
Posts: 535
Kudos Received: 140
Solutions: 102
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 1993 | 12-18-2020 01:46 PM |
|  | 1355 | 12-16-2020 12:11 PM |
|  | 838 | 12-07-2020 01:47 PM |
|  | 779 | 12-07-2020 09:21 AM |
|  | 455 | 10-14-2020 11:15 AM |
03-02-2022
12:37 PM
Apparently not. The old CDH model seems to be gone with the introduction of CDP, which appears to use a purely subscription-based model (i.e. without the open-source distribution co-existing alongside it, as it did with CDH). Of course, most components in CDP are still open source; the question concerns CDP as a whole (as it did CDH before), not individual components.
04-30-2021
06:42 AM
@JasonBourne - if you have the same issue, here's a GitHub issue discussing it and linking to a pull request with a fix: https://github.com/cloudera/thrift_sasl/issues/28. You can see in the commits (https://github.com/cloudera/thrift_sasl/commits/master) that they are testing a new release with the fix, but it looks like it's not quite done yet. Hopefully soon.
01-20-2021
09:38 AM
There's a 64 KB limit on strings in Kudu, but otherwise you can store any binary data in them. https://docs.cloudera.com/documentation/kudu/5-10-x/topics/kudu_known_issues.html#schema_design_limitations
01-19-2021
09:45 AM
Upgrading to a newer version of Impala will solve most scalability issues you'd see on Impala 2.9, largely because of the improvements described in https://blog.cloudera.com/scalability-improvement-of-apache-impala-2-12-0-in-cdh-5-15-0/.
12-22-2020
06:24 AM
@Tim Armstrong Thanks for helping out here. My apologies for the misunderstanding with respect to the packing information.
12-21-2020
09:01 AM
We have some background on schema evolution in Parquet in the docs - https://docs.cloudera.com/runtime/7.2.2/impala-reference/topics/impala-parquet.html (see "Schema Evolution for Parquet Tables"). Some of the details are specific to Impala, but the concepts are the same across engines that use Parquet tables, including Hive and Spark. At a high level, you can think of the data files as immutable while the table schema evolves. If you add a new column at the end of the table, for example, that updates the table schema but leaves the Parquet files unchanged. When the table is queried, the table schema and the Parquet file schema are reconciled, and the new column's values will all be NULL. If you want to modify existing rows to include new non-NULL values, that requires rewriting the data, e.g. with an INSERT OVERWRITE statement for a partition or a CREATE TABLE ... AS SELECT to create an entirely new table. Keep in mind that traditional Parquet tables are not optimized for workloads with updates - Apache Kudu in particular, and also transactional tables in Hive 3+, support row-level updates more conveniently and efficiently. We definitely don't require rewriting the whole table every time you want to add a column; that would be impractical for large tables!
12-18-2020
06:33 AM
We have restarted nearly every component of the affected HDFS cluster, and Impala performance has improved. Sadly, that doesn't explain the underlying issue.
12-16-2020
12:11 PM
1 Kudo
In that case - scheduling of remote reads - for Kudu it's based on distributing the work for each scan across nodes as evenly as possible. We randomize the assignment somewhat to even things out, but the distribution is not based on resource availability; i.e. we generate the schedule and then wait for the resources to become available on the nodes we picked. I understand that reversing that (i.e. finding available nodes first, then distributing work onto them) would be desirable in some cases, but there are pros and cons to doing that. For remote reads from filesystems/object stores, on more recent versions, we do something a bit different - each file has affinity to a set of executors, and we try to schedule it on those so that we're more likely to get hits in the remote data cache.
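As a rough illustration of the affinity idea (this is a toy sketch, not Impala's actual scheduler; executor names and the replica count are made up), the key property is that a file deterministically maps to the same small set of executors, so repeated queries tend to hit the same remote data cache:

```python
import hashlib

def preferred_executors(file_name, executors, replicas=3):
    # Rank executors by a stable hash of (file, executor). The same file
    # always maps to the same top-N executors, so repeated scans of that
    # file are likely to land where its blocks are already cached.
    ranked = sorted(
        executors,
        key=lambda e: hashlib.sha1((file_name + e).encode()).hexdigest())
    return ranked[:replicas]

execs = ["exec1", "exec2", "exec3", "exec4", "exec5"]
print(preferred_executors("s3://bucket/part-0.parquet", execs))
```

The trade-off mentioned above still applies: this picks nodes first and waits for resources on them, rather than picking whichever nodes happen to be free.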
12-15-2020
08:59 PM
You need to run COMPUTE STATS on the base tables referenced by the views - COMPUTE STATS directly on a view isn't supported.
12-10-2020
04:28 PM
Great news!
12-08-2020
09:36 AM
Glad to help! I'm excited about the S3 changes just because they simplify ingestion so much. I'll add a disclaimer here in case other people read the solution: there's *some* potential for a performance impact when disabling S3Guard for S3-based tables with large partition counts, simply because of the difference in implementation - retrieving the listing from DynamoDB may be quicker than retrieving it from S3 in some scenarios.
12-07-2020
09:21 AM
Slide 17 here has some rules of thumb - https://blog.cloudera.com/latest-impala-cookbook/. Can you mention what version you're running and whether you have any other non-standard configs set, e.g. load_catalog_in_background? We made some improvements in this area and added some options in more recent versions.
12-07-2020
09:17 AM
These are good questions that come up frequently. https://docs.cloudera.com/runtime/7.2.2/administering-kudu/topics/kudu-security-trusted-users.html discusses the issue. In summary, Hive/Impala tables (i.e. those with entries in the Hive Metastore) are authorized in the same way regardless of whether the backing storage is HDFS, S3, Kudu, HBase, etc. - the SQL interface does the authorization to confirm that the end user has access to the table, columns, etc., then the service accesses the storage as the privileged user (Impala in this case). In this model, if you create an external Kudu table in Impala and give a user permission to access the table via Impala, then they will have permission to access the data in the underlying Kudu table. The thing that closes the loophole here is that creating the external Kudu table requires very high privileges - ALL permission on SERVER - so a regular user can't create an external Kudu table pointed at an arbitrary Kudu cluster or table.
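The flow above can be sketched in a few lines of toy Python (everything here is hypothetical - the grant table, the user names, and the StorageClient class are illustrative stand-ins for what Ranger/Sentry and the storage layer actually do):

```python
class StorageClient:
    """Hypothetical storage handle; real access would go to Kudu/HDFS/S3."""
    def read(self, table, as_user):
        return f"rows of {table} (read as {as_user})"

# Table-level grants, as the SQL layer's authorization service would
# track them (illustrative data).
GRANTS = {"alice": {"sales.kudu_ext"}}

def select_from(end_user, table, storage):
    # Step 1: the SQL service authorizes the *end user* against the table.
    if table not in GRANTS.get(end_user, set()):
        raise PermissionError(f"{end_user} has no SELECT privilege on {table}")
    # Step 2: the service then reads the backing storage as the privileged
    # service account - not as the end user - regardless of whether the
    # data lives in HDFS, S3, Kudu, or HBase.
    return storage.read(table, as_user="impala")
```

So the storage layer only ever trusts the service account; the per-user decision happens entirely at the SQL layer, which is why creating external tables pointed at arbitrary storage has to be tightly restricted.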
11-30-2020
09:14 AM
@Tim Armstrong any hints on how to configure the JDBC connection to use impersonation? Assuming I use the recommended Cloudera drivers, can you send a code snippet that invokes a simple SQL query on behalf of some user? Thanks!
11-19-2020
10:13 AM
You can use an expression instead of a query. In the expression, your query should be something like this: ="SELECT A.COL1, A.COL2 FROM schema.tableName A WHERE A.COL1 = '" & Parameters!parameterName.Value & "'" Notice the quotation marks around the parameter (" and ') and the equals sign (=) at the beginning. You should create the fields manually (use the query designer without parameters and let SSRS do the Refresh Fields task).
11-13-2020
09:35 PM
Could you give a working example of this in Spark 2.4 using a Scala DataFrame? I can't seem to find the correct syntax... val result = dataFrame.select(count(when(col("col_1") === "val_1" && col("col_2") === "val_2", 1)))
10-23-2020
10:28 AM
1 Kudo
https://issues.apache.org/jira/browse/IMPALA-8454 is the Apache Impala JIRA.
10-15-2020
04:13 AM
@Tim Armstrong it worked like a charm after changing the gcc version. Thanks!
10-15-2020
03:30 AM
Tried executing commit() and setting the timeout, but no effect:

import pypyodbc
connection = pypyodbc.connect(DSN="", Schema="dbname", autocommit=True)
cursor = connection.cursor()
query = """INSERT INTO schema.table VALUES ('val1', 'val2')"""
cursor.execute(query)
cursor.commit()
connection.close()
10-14-2020
11:15 AM
1 Kudo
On-demand metadata does not exist in CDH 5.14.4. There was a technical preview version in CDH 5.16+ and CDH 6.1+ that had all the core functionality but did not perform optimally for all workloads and had some other limitations. After we got feedback and experience with the feature, we made various tweaks and fixes, and in CDH 6.3 we removed the technical preview caveat - https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_metadata.html - and there are some important tweaks in patch releases after that (i.e. 6.3.3). It is enabled by default in the latest versions of CDP. So basically, if you want to experiment and see if it meets your needs, CDH 5.16+ works, but CDH 6.3.3+ or CDP has the latest and greatest.
10-14-2020
07:10 AM
Hi Tim, your suggestion was very helpful, and I have a good understanding now; I am accepting it as the solution. I just have one more thing to ask: to fix the issue of the query using up the resources, is it better to increase the Impala Daemon Memory Limit (mem_limit)? What do you suggest?
10-14-2020
02:26 AM
Thanks for your input. We are running our stage cluster in "non-production mode" using the embedded Postgres. This Postgres is also used by the two Hive Metastore servers, and the database is hosted on the same node as the other Cloudera services. When this host freezes, the Impala INSERT queries freeze as well. We were surprised to see that there seems to be no timeout between the Hive Metastore servers and their backing database (Postgres), and no error either. This probably also happens when backed by an external Postgres or MySQL database, although we haven't tested that. I wonder if this might be solved by a newer CDH version. We are currently looking into upgrading and would very much like to do so for other reasons as well.
09-28-2020
07:58 AM
Hi @PauloRC @Tim Armstrong, this might be a performance regression, but it is also in general a performance inefficiency with a specific planner data structure. A correctness fix for IMPALA-8386 may have introduced this perf regression in 3.2.1; IMPALA-9358 may resolve this issue, but I don't think it's available in any CDH 6.3 release yet. @PauloRC one thing to try which might mitigate the issue is to run your view query with SET ENABLE_EXPR_REWRITES=false to see if that helps.
09-23-2020
01:31 AM
Thank you for your reply Tim. Just to clarify, security-wise, are we better off with our current configuration (default), with sentry service disabled, or with sentry enabled in testing mode? You mentioned that sentry in testing mode does not authenticate the clients, but in the documentation it is mentioned that testing mode uses weaker authentication mechanisms. We need this in order to prevent our analysts from doing accidental writes, drops, etc. on the data. Our cluster is in a secure isolated environment.
09-21-2020
09:57 AM
1 Kudo
This is definitely a bug. Thanks for the clear report and reproduction. It's not IMPALA-7957 but is somewhat related. This is new to us, so I filed https://issues.apache.org/jira/browse/IMPALA-10182 to track it. It looks like it can only happen when you have a UNION ALL, plus subqueries where the same column appears twice in the select list, plus NULL values in those columns. You can work around the issue by removing the duplicated entries in the subquery select list. E.g. the following query is equivalent and returns the expected results:

SELECT
MIN(t_53.c_41) c_41,
CAST(NULL AS DOUBLE) c_43,
CAST(NULL AS BIGINT) c_44,
t_53.c2 c2,
t_53.c2 c3s0,
t_53.c4 c4,
t_53.c4 c5s0
FROM
( SELECT
t.productsubcategorykey c_41,
t.productline c2,
t.productsubcategorykey c4
FROM
as_adventure.t1 t
WHERE
true
GROUP BY
2,
3 ) t_53
GROUP BY
4,
5,
6,
7
UNION ALL
SELECT
MIN(t_53.c_41) c_41,
CAST(NULL AS DOUBLE) c_43,
CAST(NULL AS BIGINT) c_44,
t_53.c2 c2,
t_53.c2 c3s0,
t_53.c5s0 c4,
t_53.c5s0 c5s0
FROM
( SELECT
t.productsubcategorykey c_41,
t.productline c2,
t.productsubcategorykey c5s0
FROM
as_adventure.t1 t
WHERE
true
GROUP BY
2,
3) t_53
GROUP BY
4,
5,
6,
7;
08-23-2020
02:19 PM
You need to cast one of the branches of the CASE expression to a type compatible with the other one. The problem is that both decimal types have the maximum precision (38) but different scales, and neither can be converted automatically to the other without potentially losing precision. A lot of the decimal behaviour, such as the result types of expressions, was changed in CDH 6 (and upstream Apache Impala 3.0). https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_decimal.html has a lot of related information.
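To see why neither conversion is safe, here's a small sketch using Python's decimal module rather than Impala (the values are made up): narrowing the scale must drop fractional digits, which is exactly the kind of silent precision loss an implicit cast would risk.

```python
from decimal import Decimal, Inexact, localcontext

# Converting a value from scale 5 to scale 2 (e.g. DECIMAL(38,5) ->
# DECIMAL(38,2)) drops fractional digits. With the Inexact trap enabled,
# the loss raises an error instead of rounding silently - mirroring why
# the engine refuses to pick a conversion for you.
lossy = False
with localcontext() as ctx:
    ctx.traps[Inexact] = True
    try:
        Decimal("1.23456").quantize(Decimal("0.01"))
    except Inexact:
        lossy = True
print(lossy)  # True
```

Going the other way (widening the scale) is just as stuck at precision 38: there are no spare digits left for the extra fractional places, so integer digits could overflow. Hence the explicit CAST.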
08-15-2020
01:03 PM
I think the reality now is that both are great technologies and the overlap in use cases is pretty big - there are a lot of SQL workloads where either can work. I just wanted to clarify a few points: Impala does support querying complex types from Parquet - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_complex_types.html. We are also working on a transparent query retry feature in Impala that should be released soon.
07-29-2020
10:19 AM
Yes, we should be able to prune based on range partitions. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_kudu.html#kudu_partitioning has some examples of how to set up a table with both range and hash partitions, and you can specify arbitrary timestamp ranges for the partitions. You can see in the Impala EXPLAIN plan whether your WHERE predicates were converted into Kudu pushdown predicates (they're labelled "kudu predicates").
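The pruning itself is just interval overlap. Here's a toy sketch (not Kudu's implementation; partition names and bounds are illustrative) of how a predicate range eliminates partitions before any data is scanned:

```python
def prune_partitions(partitions, lo, hi):
    # Keep only range partitions whose [start, end) interval can contain
    # rows matching a predicate on [lo, hi); the rest are never scanned.
    return [p for p in partitions if p["start"] < hi and p["end"] > lo]

# Monthly range partitions on an event-time column (illustrative values:
# think of 6/7/8 as encoded month boundaries).
parts = [{"name": "2020-06", "start": 6, "end": 7},
         {"name": "2020-07", "start": 7, "end": 8},
         {"name": "2020-08", "start": 8, "end": 9}]

# A predicate like "WHERE ts >= 7 AND ts < 8" should scan only July.
print([p["name"] for p in prune_partitions(parts, 7, 8)])  # ['2020-07']
```

When the EXPLAIN plan shows your timestamp predicates as Kudu pushdown predicates, this kind of elimination is what you're getting on the Kudu side.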
07-28-2020
10:48 AM
Ahh, 5.11 - there have been so many Impala improvements since then! This happens when the Impala daemon can't load the initial catalog (i.e. database and table metadata). The catalog and statestore roles are both involved in catalog loading, so if the Impala daemon isn't able to communicate with those roles, or they are not started or healthy, that could lead to these symptoms. You should be able to see in Cloudera Manager whether they're started and whether any warnings or errors are being flagged. It might also just be that the catalog is slow to load (maybe there's a lot of metadata, or something else is unhealthy). You would need to look at the logs of the Impala daemon you're connecting to, and maybe the catalog, to see what it's doing and why it's slow. I know this doesn't address your immediate problem, but we've seen a lot of these metadata/catalog problems go away with later versions - CDH 5.16 or CDH 6+ - and particularly by moving to a dedicated coordinator/executor topology - https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/impala_dedicated_coordinator.html.
07-24-2020
01:21 PM
The row counts reflect the state of the partition or table as of the last time its stats were updated by COMPUTE STATS in Impala (or ANALYZE in Hive), or the last time the stats were updated manually via an ALTER TABLE. (There are also other cases where stats are updated - e.g. they can be gathered automatically by Hive - but those are a few examples.) One scenario where this could happen is if a partition was dropped since the last COMPUTE STATS run. The stats can generally be out of sync with the number of rows in the underlying table - we don't use them for answering queries, just for query optimization, so it's fine if they're a little inaccurate. If you want to know the accurate counts, you can run queries like:

select count(*) from table;
select count(*) from table where business_date = "13/05/2020" and tec_execution_date = "13/05/2020 20:08";