Member since: 11-17-2017
Posts: 76
Kudos Received: 7
Solutions: 6
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1706 | 05-11-2020 01:31 AM |
| | 370 | 04-14-2020 03:48 AM |
| | 2354 | 02-04-2020 01:29 AM |
| | 550 | 10-17-2019 01:26 AM |
| | 1427 | 09-24-2019 01:46 AM |
02-15-2021
03:01 AM
Hi @jayGenesis,

Impala supports simple bind authentication in CDH 6.3. From the documentation, for reference:

LDAP BaseDN (--ldap_baseDN): Replaces the username with a distinguished name (DN) of the form uid=userid,ldap_baseDN. (This is equivalent to a Hive option.)

LDAP Pattern (--ldap_bind_pattern): This is the most general option; it replaces the username with the string ldap_bind_pattern, where every instance of the string #UID is replaced with userid. For example, an ldap_bind_pattern of "user=#UID,OU=foo,CN=bar" with a username of henry constructs a bind name of "user=henry,OU=foo,CN=bar".

This means that with the mentioned base DN configured, Impala sends a bind request to the LDAP server with the user DN uid=<username>,ou=users,dc=ldap,dc=xxx,dc=com and the supplied password; if this user does not exist, the authentication fails. Does the mentioned user exist in the LDAP directory?
05-11-2020
08:00 AM
Hi @HimaV,

Tableau uses its own configuration method and can overwrite the Windows System DSN settings. How was the DSN configured? Additionally, checking the trace-level logs can give further information on why the connection failed; the 'Configuring Logging Options on Windows' chapter in the ODBC documentation describes how to set the log level.
05-11-2020
03:52 AM
Hi @lev,

I assume you are looking for a way to authenticate with the Impala JDBC driver similar to the Hive JDBC method described in the chapter 'Using a Hadoop Delegation Token' here. Impala JDBC/ODBC does not support this method; the current way to authenticate is to pass either the keytab or an LDAP password to the application, for example with a shell script that does the initialization.
05-11-2020
01:31 AM
1 Kudo
Hi @parthk,

This is a tough question, because when discussing S3 access multiple components come into the picture:

First and foremost S3: S3 Select only supports the CSV and JSON formats at the moment, while Impala/Hive in general favors the columnar storage formats Parquet/ORC. With just a couple of fields to filter on, a partitioning strategy could possibly achieve similar results with Parquet/ORC; I have not tested this, it would need performance tests on the datasets.

Secondly, Impala/Hive connects to S3 with the Hadoop S3A client, which is in the hadoop-aws module. An experimental S3 Select feature doc can be found here.

Lastly, the third-party component has to support it as well. I spent some time on the AWS Hive S3 Select support and it seems to be a closed-source INPUTFORMAT solution; I could not find 'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat' anywhere. Digging a bit more I found that upstream Hive does not support S3 Select either, the upstream Jira is HIVE-21112.

I hope this 10,000-foot view helps, it is hard to answer questions about the future.
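As a rough sketch of the partitioning idea, assuming a hypothetical table and S3 path (names are examples only): filtering on the partition column lets Impala prune partitions instead of scanning the whole dataset.

-- Hypothetical partitioned Parquet table on S3 (bucket/path are placeholders)
CREATE TABLE events (id BIGINT, payload STRING)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET
LOCATION 's3a://my-bucket/warehouse/events/';

-- Only the matching partition directory is read, similar in spirit to pushing the filter down
SELECT count(*) FROM events WHERE event_date = '2020-05-01';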
04-14-2020
03:48 AM
Hi @Jumo,

CDH 6.3.x is packaged with Impala 3.2; the packaging details can be found on this page. The 2.6.9 Impala ODBC driver can be used with CDH 6.3.x. I understand that the recommendation can be confusing and I have reached out internally to update the documentation.
02-07-2020
01:29 AM
Hi @drgenious,

With the given details the issue is a bit blurry to me. To be able to advise you effectively, could you share some additional details so we can gain further context on the issue you are facing?

- Could you describe the data ingestion process, i.e. how the data reaches the storage layer (HDFS/S3/HBase)?
- Could you elaborate more on the parsing process? Which components are involved and how is it being done?
- Which part of the process is failing, and what error message do you see when this happens?
02-04-2020
05:13 AM
Hi @tomtobin,

Cloudera ships the Simba drivers as the native drivers of the Cloudera platform; the Simba drivers are not open source. What issues were you experiencing with the drivers, and with which version?
02-04-2020
04:51 AM
Hi @jaya123,

In the cases where I have seen ImpalaThriftAPICallFailed before, it was due to a connection timeout between the client and Impala. Through ODBC/JDBC the connection can become inactive while Impala is executing the query; if the client then tries to use a closed connection, the call fails. The client might also be trying to close the connection. Possible causes of the connection termination:

- A load balancer terminates the idle connection
- The driver closes the connection because SocketTimeout is reached

The TRACE level driver logs can help to identify how and when the connection was terminated. The next steps could be:

- Enable TRACE level driver logging; the log level and the log path have to be configured, please see our documentation here. This configuration is often client specific.
- Open the connection logs and look for the ImpalaThriftAPICallFailed message.
- Check the earlier messages; there could be other errors, or the connection was probably closed just before the client request. The timestamps should help identify which timeout was reached; the SocketTimeout is 30s by default.

The timeouts could be reached because of slow query execution or because the client did not close the query. As the dashboard is refreshing, it is probably because the client does not close the query. Just in case, the query speed should be checked; if that is fine, then the socket timeout could be increased a bit to give the client time to close the query.
02-04-2020
01:29 AM
1 Kudo
Hi @kentlee406,

From the images it looks like Kudu is not installed on the QuickStart VM:

- The Kudu service cannot be seen in the cluster services list
- Impala cannot see any Kudu service on the config page

Could you try adding the Kudu service to the cluster? Please see the steps in our documentation here.
11-13-2019
02:44 AM
The current newest CDH release, 6.3.2, ships a patched Hive 2.1.1. With a major release difference I believe there will be both HMS schema differences and HMS API differences. Depending on the use case, during the POC period the data/metadata could be migrated to the CDH cluster to work on the performance there. Later, when Impala is well-tried, a workflow could be built where each cluster works on the tasks that are the most suitable for its components.
11-12-2019
12:55 AM
Hi @pauljoshiva,

In theory it should be possible, however CDH and HDP releases are not tested together and ship with different Hive Metastore versions; the unified release will be CDP. I can see 2 possible approaches, please note that I have not tried these and there might be skeletons in the closet:

1. Using the CDH HMS binaries to connect to the central HMS backend database. The main problem could be the HMS schema, which can differ between releases, especially between major releases: for example HDP 3.x is shipped with Hive 3 and HDP 2.6.x with Hive 2, while CDH 6.x is packaged with a patched Hive 2, although some Hive 3 fixes can be available in CDH 6 as well. The metastore schema compatibility between releases can be verified with the Metastore Schema tool; this could quickly rule out the feasibility of this option. Also, DBTokenStore should be enabled for both HMS.
2. Pointing Impala to use the HDP HMS. There might be API differences between the HMS binaries that could cause unexpected Impala behavior. This can be mitigated by picking versions as close as possible; however, due to the nature of the CDH Hive release, as it is patched with newer fixes, there could still be differences.

Additionally, I would recommend creating a backup of the databases that can be affected and contain important metadata.
11-11-2019
08:54 AM
1 Kudo
Hi @mrmikewhitman,

Based on the error message it appears to be a certificate issue. I would start by verifying with openssl that the certificate is valid, and by checking whether connecting to Impala with impala-shell works. Additionally, with a proxy installed there are further requirements, please see them here.
11-11-2019
05:58 AM
Hi @Asad,

Impala does not fully support Unicode characters at the moment; please see the 'Character sets' chapter of our documentation here for more information. Could you advise whether the data is stored in UTF-8?
11-11-2019
05:39 AM
1 Kudo
Hi @Rahulwp,

The Impala Assignment Locality check verifies whether recent I/O tasks are operating on local data, therefore many things can cause the problem. The simplest one is that the HDFS DataNodes are not co-located with the ImpalaD roles, or some roles might be stopped. The error can also appear if Impala daemons are crashing, but in that case other health issues should appear as well. However, the error message says: "bad: 100.00% of assignments operating on local data", which is the expected behavior. Based on the error message I would recommend checking:

- The Impala Assignment Locality threshold in Cloudera Manager; it might have been configured to alert on 100%. Defaults can be found here: Impala Assignment Locality
- Whether the Impala daemon/HDFS DataNode roles are co-located.
10-17-2019
01:26 AM
Hi @ChineduLB,

UDFs let you code your own application logic for processing column values during an Impala query; adding a refresh/invalidate to a UDF could cause unexpected behavior during value processing. The general recommendation for INVALIDATE METADATA/REFRESH is to execute it after the ingestion has finished. This way the Impala user does not have to worry about the staleness of the metadata. There is a blog post on how to handle "Fast Data" and make it available to Impala in batches: https://blog.cloudera.com/how-to-ingest-and-query-fast-data-with-impala-without-kudu/ Additionally, just wanted to mention that INVALIDATE METADATA/REFRESH can be executed from beeline as well, you just need to connect from beeline to Impala; this blog post has the details: https://www.ericlin.me/2017/04/how-to-use-beeline-to-connect-to-impala/
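A minimal sketch of the recommended pattern, assuming a hypothetical table my_db.events loaded by an external ingestion job:

-- Run from the ingestion workflow (or from beeline connected to Impala) once the batch has landed
REFRESH my_db.events;
-- Or, if tables or partitions were created/dropped outside of Impala:
INVALIDATE METADATA my_db.events;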
10-11-2019
06:59 AM
Hi @Shruhti,

This is indeed odd; my first assumption would be that the 'select 1' queries are triggered silently by a client application such as a BI tool, maybe to check or keep the connection alive. It might be worth checking the trace-level driver logs, which could verify whether the queries are coming from a tool/application. This can be done by changing the driver log level, which is described here for ODBC. Additionally, the query profile contains a Network Address as well; this should help confirm whether the source of the query is valid.
10-11-2019
06:32 AM
Hi @Nisha2019,

This example looks like a snippet from our documentation here. Just above this example DESCRIBE statement there is a sample CREATE TABLE query that generates this table schema, please see below. As for ingesting data into these tables, Impala currently does not support writing data with complex type columns; Loading Data Containing Complex Types describes this in more detail, and some more information can be found in the Complex type considerations chapter. Hive does not support inserting values into a Parquet complex type one by one either, but there are two solutions:

1. Creating a temporary table with values, then transforming it to the Parquet complex type with Hive; please see our documentation here for sample queries: Constructing Parquet Files with Complex Columns Using Hive. A small sketch of this approach is shown after the sample table below.
2. Using an INSERT INTO ... SELECT <values> query for inserting records one by one; reference queries can be found in the description of IMPALA-3938. Please note that this will generate a separate file for each record; these files occasionally need to be compacted.

CREATE TABLE struct_demo
(
id BIGINT,
name STRING,
-- A STRUCT as a top-level column. Demonstrates how the table ID column
-- and the ID field within the STRUCT can coexist without a name conflict.
employee_info STRUCT < employer: STRING, id: BIGINT, address: STRING >,
-- A STRUCT as the element type of an ARRAY.
places_lived ARRAY < STRUCT <street: STRING, city: STRING, country: STRING >>,
-- A STRUCT as the value portion of the key-value pairs in a MAP.
memorable_moments MAP < STRING, STRUCT < year: INT, place: STRING, details: STRING >>,
-- A STRUCT where one of the fields is another STRUCT.
current_address STRUCT < street_address: STRUCT <street_number: INT, street_name: STRING, street_type: STRING>, country: STRING, postal_code: STRING >
)
STORED AS PARQUET;
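A minimal sketch of the Hive staging-table approach, using a simplified hypothetical target table (struct_demo_simple and the staging table are examples, not part of the documentation):

-- Simplified target table with a single STRUCT column
CREATE TABLE struct_demo_simple
(
  id BIGINT,
  employee_info STRUCT < employer: STRING, id: BIGINT >
)
STORED AS PARQUET;

-- Flat staging table holding the raw values
CREATE TABLE struct_demo_staging (id BIGINT, employer STRING, emp_id BIGINT);
INSERT INTO struct_demo_staging VALUES (1, 'Cloudera', 100);

-- Run in Hive: named_struct() builds the STRUCT value while copying into the Parquet table
INSERT INTO TABLE struct_demo_simple
SELECT id, named_struct('employer', employer, 'id', emp_id)
FROM struct_demo_staging;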
09-24-2019
01:46 AM
1 Kudo
@eMazarakis, later releases do not support the asterisk either; it is treated as a literal. The expressions that are available can be found here in the chapter 'To drop or alter multiple partitions'. Previously I was referring to the intention behind "part_col='201801*'": it suggests that the desired outcome of this expression would be to remove all data from January 2018 in one operation. However, as that is not possible in CDH 5.9, I was proposing to choose a different partition strategy if multiple partitions have to be dropped frequently and the size of the data allows it. For example, if after ingestion only 1 analytic query is executed on the data, then the days have to be dropped one by one, which is 32 operations. Therefore, if the size of the data allows, the number of operations could be reduced to 2 with a different partition strategy where the table is partitioned by month.
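A small sketch of the difference, assuming hypothetical partition columns (part_col holds days as yyyyMMdd strings, part_month holds months as yyyyMM strings):

-- Daily partitions on CDH 5.9: every day of the month has to be dropped separately
ALTER TABLE historical_data DROP PARTITION (part_col='20180101');
ALTER TABLE historical_data DROP PARTITION (part_col='20180102');
-- ... repeated for each remaining day of January 2018

-- Monthly partitions: the whole month is removed with a single statement
ALTER TABLE historical_data DROP PARTITION (part_month='201801');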
09-23-2019
01:27 AM
@ravikumashi, if the table is altered by another user, the next Impala query is likely to fail. Parquet stores the schema internally per column chunk; I assume changing this schema would mess up the addressing in the Parquet file, please see the file format details here. I can see two possible solutions, please note that I am unaware of the use case:

Option 1: Creating a new table with the new schema and re-creating the Parquet files in Impala with INSERT INTO ... SELECT. At the end of this operation new Parquet files with the new schema will be created in the new location.
Option 2: With UNION the two schemata can be merged. For this the new/old data has to be split into two tables. Additionally, a VIEW could be created to hide this abstraction (see the sketch after the examples below).

Examples for the above two options, let me know if these are suitable for this use case:

-- Test tables
CREATE TABLE parquet_1 (id STRING, value DECIMAL(2,2));
INSERT INTO parquet_1 VALUES ('1', 0.11);
CREATE TABLE parquet_2 (id STRING, value DECIMAL(4,3));
INSERT INTO parquet_2 VALUES ('2', 1.111);

-- Option 1
INSERT INTO TABLE parquet_2 SELECT * FROM parquet_1;

-- Option 2
SELECT * FROM parquet_1 UNION ALL SELECT * FROM parquet_2;
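A minimal sketch of the VIEW mentioned in Option 2 (the view name is just an example):

-- Hides the UNION of the old- and new-schema tables from the querying users
CREATE VIEW parquet_merged AS
SELECT * FROM parquet_1
UNION ALL
SELECT * FROM parquet_2;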
09-17-2019
01:12 AM
@eMazarakis, apologies, I meant that part_col='201801*' intends to remove a whole month. If possible, it might be worth reconsidering the partition strategy, or the drop operation could be done separately in Hive. From CDH 5.10+ partition expressions can be specified, please see my response here for details.
09-16-2019
08:18 AM
Hi @eMazarakis,

Prior to IMPALA-1654, the AlterTableDropPartitionStmt worked with a PartitionSpec, which is a collection of partition key/value pairs. From Impala 2.8+ and CDH 5.10+, IMPALA-1654 is available and the AlterTableDropPartitionStmt uses a PartitionSet, which can also be a partition expression. In CDH 5.9 and earlier the partitions have to be specified manually in the ALTER TABLE DROP PARTITION statement.
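A small illustration of the expression form available from Impala 2.8+/CDH 5.10+, assuming a hypothetical string partition column part_col in yyyyMMdd format:

-- A single partition expression covers all daily partitions of January 2018
ALTER TABLE historical_data DROP PARTITION (part_col BETWEEN '20180101' AND '20180131');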
09-16-2019
07:51 AM
Hi @ravikumashi,

This issue happens because individual INSERT statements open new Parquet files, which means that the new file is created with the new schema. Although Hive is able to read Parquet files where the schema has a different precision than the table metadata, this feature is still under development in Impala, please see IMPALA-7087. The Parquet schema can be checked with "parquet-tools schema"; it is deployed with CDH and in this case should give output similar to this:

# Pre-Alter
# parquet-tools schema f04187260a7a7eb5-b49e8a5000000000_938468120_data.0.parq
message schema {
  optional fixed_len_byte_array(4) prec (DECIMAL(7,2));
}

# Post-Alter
# parquet-tools schema f14ce5c3b035caed-8f4c958400000000_1136483050_data.0.parq
message schema {
  optional fixed_len_byte_array(4) prec (DECIMAL(9,6));
}
09-16-2019
01:45 AM
Hi @eMazarakis,

Thank you for the additional information. Altering multiple partitions was implemented in IMPALA-1654; this feature is available from Impala 2.8+, which is part of CDH 5.10+. Although I am not aware of the workflow and the amount of data behind a partition, this specific expression, part_col='201801*', removes a whole month. If these requests are frequent and the workload/workflow allows it, re-partitioning based on months could be a feasible workaround.
09-05-2019
01:59 AM
Hi @eMazarakis,

Multiple partitions can be dropped with the following syntax:

alter table historical_data drop partition (year = 1996 and month between 1 and 6);

Please see our ALTER TABLE Statement documentation for more details; the multiple partition drop can be found in the section 'To drop or alter multiple partitions'.
12-11-2018
12:33 AM
Was Impala delegation configured for MicroStrategy?
12-05-2018
03:15 AM
Thank you for the report. This resembles IMPALA-3983, however that has been fixed in CDH 5.10.2 and the temporary jars are removed at the end of the extractFunctions method. The output of lsof shows that the files were deleted, but the space might not be freed from disk as catalogd is keeping the files open. A possible root cause is that another process removes the files before catalogd could remove them.
12-03-2018
09:54 AM
Hi,

The queries can be collected through the Cloudera Manager API, as CM automatically collects Impala queries; this way there is no need to visit every Coordinator node one by one. The usage can be found in the Cloudera Manager API documentation, while further endpoints can be found here. I believe one of the endpoints you are looking for is impalaQueries:

/clusters/{clusterName}/services/{serviceName}/impalaQueries
12-03-2018
09:01 AM
2 Kudos
Impala checks the file formats here based on this enumeration. Currently skipping complex columns in scans is not supported for Avro.
12-03-2018
01:26 AM
This looks like IMPALA-6973, Impala is checking 'auth_to_local' for the user authentication but not for the delegated user. As per the JIRA the workaround is to use uppercase when specifying the <user allowed to delegate> for Impala. What do you think?