Member since
01-07-2020
64
Posts
1
Kudos Received
0
Solutions
04-03-2024
02:08 AM
Hi @drgenious, first, please test your script outside of Oozie. If it works outside of Oozie, it should work from Oozie as well. As for the error "No module named impala.dbapi", there may be a version dependency issue between impyla and its related libraries; see https://github.com/cloudera/impyla/issues/227
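As a quick way to run the "outside of Oozie" check, here is a minimal sketch that reports whether the interpreter can locate impala.dbapi. The helper name module_available is made up for illustration; the script only checks importability and does not connect to Impala. Run it with the same Python interpreter that the Oozie action would use.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the current interpreter can locate module `name`."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (for example "impala") is missing
        # or fails to import.
        return False


if __name__ == "__main__":
    # Show which interpreter is in use; Oozie may pick a different one.
    print("interpreter:", sys.executable)
    for mod in ("impala", "impala.dbapi"):
        status = "found" if module_available(mod) else "MISSING"
        print(mod, "->", status)
```

If the module is reported as found here but the Oozie action still fails, the action is probably running under a different interpreter or environment.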
02-21-2024
06:43 AM
Got to know from the dev team that they have modified the column definition. We ran the MSCK repair table and we are able to run the select distinct query. Vertex errors may not relate to memory issues. Hope this helps the community.
11-21-2023
05:43 AM
The error message indicates a mismatch between the expected schema for the column 'db.table.parameter_11' and the actual schema found in the Parquet file 'hdfs:/path/table/1_data.0.parq'. The column is expected to be a STRING, but the Parquet schema declares it as an optional int64 (integer) column. To resolve this issue, you'll need to investigate and potentially correct the schema mismatch. Here are some steps you can take:
1) Verify the Expected Schema: Check the definition of the 'db.table.parameter_11' column in the Impala metadata or Hive metastore. Ensure that it is defined as a STRING type.
2) Inspect the Parquet File Schema: You can use tools like parquet-tools to inspect the schema of the Parquet file directly. Run the following command in the terminal:
parquet-tools schema 1_data.0.parq
Look for the 'db.table.parameter_11' column and check its data type in the Parquet schema.
3) Compare Expected vs. Actual Schema: Compare the expected schema for 'db.table.parameter_11' with the actual schema found in the Parquet file. Identify any differences in data types.
4) Investigate Data Inconsistencies: If there are data inconsistencies, investigate how they might have occurred. It's possible that there was a schema evolution or a mismatch during the data writing process.
5) Resolve Schema Mismatch: Depending on your findings, you may need to correct the schema mismatch. This could involve updating the metadata in Impala or Hive to match the actual schema, or rewriting the Parquet file with the expected schema.
6) Update Impala Statistics: After resolving the schema mismatch, it's a good practice to update Impala statistics for the affected table. This can be done using the COMPUTE STATS command in Impala (COMPUTE STATS followed by the table name). This step ensures that Impala has up-to-date statistics for query optimization.
If the data type in the Parquet schema is incorrect, you may need to investigate how the data was written and whether there were any issues during that process. Correcting the schema mismatch and updating Impala statistics should help resolve the issue.
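The "compare expected vs. actual schema" step can be sketched in a few lines. This is an illustration only: it assumes the parquet-tools schema output lists one field per line in the form "optional int64 parameter_11;", the sample text is made up, and the COMPATIBLE table is a small, non-exhaustive assumption about which Parquet physical types match which table column types.

```python
import re

# Assumed field-line shape: "<repetition> <physical type> <name> ...;"
FIELD_RE = re.compile(r"^\s*(optional|required|repeated)\s+(\w+)\s+(\w+)")

# Illustrative, non-exhaustive mapping: table type -> accepted Parquet types.
COMPATIBLE = {
    "STRING": {"binary"},
    "BIGINT": {"int64"},
    "INT": {"int32"},
    "DOUBLE": {"double"},
    "BOOLEAN": {"boolean"},
}


def parse_schema(text):
    """Map column name -> Parquet physical type from schema dump text."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(3)] = match.group(2)
    return fields


def find_mismatches(expected, parquet_fields):
    """Return (column, table_type, parquet_type) for each conflicting column."""
    mismatches = []
    for column, table_type in expected.items():
        table_type = table_type.upper()
        physical = parquet_fields.get(column)
        # Skip columns missing from the file and types not in the mapping.
        if physical is None or table_type not in COMPATIBLE:
            continue
        if physical not in COMPATIBLE[table_type]:
            mismatches.append((column, table_type, physical))
    return mismatches


if __name__ == "__main__":
    # Made-up sample resembling a schema dump for the file in question.
    sample = "message schema {\n  optional int64 parameter_11;\n}\n"
    for row in find_mismatches({"parameter_11": "STRING"}, parse_schema(sample)):
        print("column %s: table expects %s, file has %s" % row)
```

Any column reported here is a candidate for the metadata or file-level fix described in the steps above.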
12-18-2022
05:16 AM
@drgenious This is an OS-level issue that will need to be addressed by the system admin. The bottom line here is that thrift-0.9.2 needs to be uninstalled. Any of the following could be the cause:
1) Multiple Python versions.
2) Multiple pip versions.
3) Broken installation.
Solution 1:
- Create a Python virtual environment and run impala-shell from it:
virtualenv venv -p python2
cd venv
source bin/activate
(venv) impala-shell
Solution 2:
(i) Remove the easy-install.pth files in:
/usr/lib/python2.6/site-packages/
/usr/lib64/python2.6/site-packages/
(ii) Try running impala-shell again.
If you found that the provided solution(s) assisted you with your query, please take a moment to log in and click Accept as Solution below each response that helped.
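Before removing anything, it can help to see where thrift copies and easy-install.pth files actually live. The sketch below scans a list of directories (sys.path by default) for entries named after a package, including versioned entries such as eggs, and for easy-install.pth files. The function names are hypothetical, and the name matching is a simple heuristic, not how pip or setuptools resolve packages.

```python
import os
import sys


def find_installs(package, paths=None):
    """List entries in `paths` named `package` or starting with `package-`."""
    hits = []
    for base in (sys.path if paths is None else paths):
        if not base or not os.path.isdir(base):
            continue
        try:
            entries = sorted(os.listdir(base))
        except OSError:
            continue  # unreadable directory, skip it
        for entry in entries:
            if entry == package or entry.startswith(package + "-"):
                hits.append(os.path.join(base, entry))
    return hits


def find_easy_install_pth(paths=None):
    """List easy-install.pth files found directly inside `paths`."""
    found = []
    for base in (sys.path if paths is None else paths):
        candidate = os.path.join(base, "easy-install.pth") if base else ""
        if candidate and os.path.isfile(candidate):
            found.append(candidate)
    return found


if __name__ == "__main__":
    for path in find_installs("thrift"):
        print("thrift candidate:", path)
    for path in find_easy_install_pth():
        print("pth file:", path)
```

More than one thrift candidate in the output suggests the multiple-installation situation described above; run the script with the same interpreter that impala-shell uses.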
11-23-2022
01:00 AM
@drgenious
1. Impala is generally faster. Impala does not use YARN. Impala keeps catalog data locally, so it fetches that information faster. Impala's backend is built in C++, which is very fast.
2. Impala is not fault tolerant. It is best suited for ad hoc queries, while ETL is best suited for Hive, as Hive is fault tolerant. If a query fails due to a network or disk failure, Hive will retry, but Impala will fail.
3. For streaming/ingestion such as a Kafka flow, you need to put the data in EXTERNAL tables, not in managed (ACID) tables. Managed tables can be used if you want to alter the data with Update/Delete.
Please let me know if you have any queries. Please click "Accept As Solution" if your query is answered.
07-04-2022
03:16 AM
@drgenious, Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
01-08-2022
07:17 PM
Can you take a look at the links below on creating an ORC table with Snappy compression?
https://community.cloudera.com/t5/Support-Questions/Data-Compression-Doesn-t-work-in-ORC-with-SNAPPY-Compression/td-p/172151
https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97110
https://community.cloudera.com/t5/Community-Articles/Performance-Comparison-b-w-ORC-SNAPPY-and-ZLib-in-hive-ORC/ta-p/246948
01-08-2022
04:40 AM
@drgenious You can use set hive.merge.tezfiles=true; to fix the file merge issue.
11-11-2021
09:14 AM
Hello @drgenious, thanks for using Cloudera Community. We hope the response by @balajip was helpful. Additionally, we wish to share a few details:
Your question is about how to make a query faster. Ideally, Impala uses parallelism to execute a query in fragments across executors. As such, the first step should be to review the Impala Query Profile of the SQL to identify the time taken at each phase of execution. Refer to [1] and [2] for a few links on the Impala Query Profile. Once the phase taking the most time is identified, fine-tune accordingly. Simply adding Impala executor daemons or using a dedicated coordinator may not help unless the SQL's slow fragment(s) are identified.
Kindly review and let us know if you have any further questions in the post.
Regards, Smarak
[1] https://cloudera.ericlin.me/2018/09/impala-query-profile-explained-part-1/
[2] https://docs.cloudera.com/runtime/7.2.10/impala-reference/topics/impala-profile.html
10-12-2021
01:54 AM
Hello @drgenious, please check the link below [0].
[0] https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_metrics_impala_daemon_resource_pool.html#concept_gif_9en_yk