07-01-2021
11:23 AM
1 Kudo
@drgenious First, Impala shares metadata (data about data) with the Hive Metastore (HMS). Impala uses HDFS caching to provide performance and scalability benefits in production environments where Impala queries and other Hadoop jobs operate on quantities of data much larger than the physical RAM on the DataNodes, making it impractical to rely on the Linux OS cache, which only keeps the most recently used data in memory. Data read from the HDFS cache also avoids the overhead of checksumming and memory-to-memory copying involved when using data from the Linux OS cache.

That said, when you restart Impala you discard all of the cached metadata (table locations, permissions, query execution plans, statistics) that makes it efficient, which explains why your queries are so slow right after the restart. Impala is very efficient when it reads data that is pinned in memory through HDFS caching: it takes advantage of the HDFS API and reads the data from memory rather than from disk, whether the data files are pinned using Impala DDL statements or using the command-line mechanism where you specify HDFS paths. There is no better source of Impala information than Cloudera, so I urge you to take the time to read the documentation below to pin the option in your memory 🙂

Using HDFS Caching with Impala
Configuring HDFS Caching for Impala

There are two other options you should consider that are far less expensive than restarting Impala (I can't imagine you have more than 70 DataNodes):

INVALIDATE METADATA is an asynchronous operation that simply discards the loaded metadata from the catalog and coordinator caches. After that operation, the catalog and all the Impala coordinators only know about the existence of databases and tables and nothing more; metadata loading for tables is triggered by any subsequent queries.

REFRESH reloads the metadata synchronously. REFRESH is more lightweight than doing a full metadata load after a table has been invalidated, but it cannot detect changes in block locations triggered by operations like the HDFS balancer, which causes remote reads during query execution with negative performance implications.

The INVALIDATE METADATA statement marks the metadata for one or all tables as stale. The next time the Impala service runs a query against a table whose metadata has been invalidated, Impala reloads the associated metadata before the query proceeds. Because this is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, prefer REFRESH over INVALIDATE METADATA when possible. INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive and other Hive clients such as SparkSQL:

Metadata of existing tables changes.
New tables are added, and Impala will use the tables.
The SERVER or DATABASE level Sentry privileges are changed.
Block metadata changes, but the files remain the same (HDFS rebalance).
UDF jars change.
Some tables are no longer queried, and you want to remove their metadata from the catalog and coordinator caches to reduce memory requirements.
No INVALIDATE METADATA is needed when the changes are made by impalad. I hope that explains the why and gives you options other than restarting Impala. If you know which table you want to query, run this beforehand with a qualified db_name.table_name; it has saved my data scientists a lot of time, and encapsulating it in their scripts is a good practice:

INVALIDATE METADATA [[db_name.]table_name]

Recomputing the statistics is another solution:

COMPUTE STATS <table name>;

The COMPUTE STATS statement gathers information about the volume and distribution of data in a table and all associated columns and partitions. The information is stored in the Hive metastore database and used by Impala to help optimize queries. Hope that enlightens you.
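To make that concrete, here is a minimal sketch assuming a hypothetical table sales_db.transactions (substitute your own db.table); wrapped in impala-shell, it drops straight into a script:

impala-shell -q "INVALIDATE METADATA sales_db.transactions"   # after tables or metadata changed outside Impala
impala-shell -q "REFRESH sales_db.transactions"               # after new data files landed in an existing table
impala-shell -q "COMPUTE STATS sales_db.transactions"         # recompute table and column statistics for the planner

Targeting a single qualified table this way is far cheaper than a global INVALIDATE METADATA or a full Impala restart.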
06-17-2021
11:25 PM
@ask_bill_brooks Hi, I have a table with a lot of data and I want the data to be sorted at the table level so that I don't have to put ORDER BY in my queries. What I want is to sort the main table in place, instead of transferring the data to another, sorted table, because that is time consuming. The table was created without SORT BY. Is there any way to alter the table's configuration? For example: ALTER TABLE <table_name> ORDER BY <column>
06-17-2021
01:35 PM
Hi @drgenious, I believe this is possible by providing impala-shell with the following parameter:

impala-shell -f /path/ --query_option='mem_limit=3gb'

Let me know if that works. Regards, Alex
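A slightly fuller sketch, in case it helps (the host placeholder and the script path /home/alex/queries.sql are made up for illustration); the same limit can also be set inside the script with a SET statement:

impala-shell -i <coordinator_host> -f /home/alex/queries.sql --query_option='mem_limit=3gb'
# or, at the top of queries.sql itself:
# SET MEM_LIMIT=3gb;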
06-11-2021
08:50 AM
This error indicates that the impalad daemon was not able to secure a processing thread on the node, which could be because the node is struggling under heavy load. What services are running in parallel with Impala on this host? What is the hardware configuration? What is the CPU utilization on the host?
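If it helps, a quick snapshot of load on the host (plain Linux commands, nothing Impala-specific) would already tell us a lot:

uptime                    # load average vs. number of cores
top -b -n 1 | head -20    # top CPU and memory consumers right now
vmstat 5 3                # CPU, run queue, and swap activity over ~15 seconds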
05-25-2021
10:12 AM
Hello, I haven't used Flume myself, but there is some mention of the serializer.delimiter parameter in the Flume documentation. It would be helpful to know what the source of the data is (e.g. a file on HDFS) and what the destination is (e.g. Hive). Also, you should know that in Cloudera Data Platform, Flume is no longer a supported component. If you are just starting to learn it, I would recommend saving yourself some time and exploring NiFi, Kafka, and Flink (good starter blog post). Regards, Alex
04-21-2021
12:12 AM
Hello. One example: https://stackoverflow.com/questions/44235019/delete-files-older-than-10days-on-hdfs
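The usual shape of that approach, sketched here with placeholder values (the /data/landing path and the 10-day cutoff are only examples, and the date parsing assumes GNU date):

#!/bin/bash
# Delete HDFS files older than 10 days under a given directory (adjust path and cutoff).
now=$(date +%s)
hdfs dfs -ls /data/landing | grep '^-' | while read -r _perm _repl _owner _group _size day time path; do
  file_ts=$(date -d "$day $time" +%s)
  age_days=$(( (now - file_ts) / 86400 ))
  if [ "$age_days" -gt 10 ]; then
    echo "Deleting $path (age: ${age_days} days)"
    hdfs dfs -rm "$path"
  fi
done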
03-22-2021
12:10 AM
2 Kudos
Some community posts:
https://data-flair.training/blogs/impala-udf/
https://bigdatalatte.wordpress.com/2015/06/04/writing-java-udfs-in-impala/
https://blog.clairvoyantsoft.com/impala-udf-in-c-cf8a8f4a17c9
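And once you have a Java UDF jar built (the jar path, class name, and table below are invented for illustration; the class is assumed to extend Hive's UDF), registering and testing it from impala-shell looks roughly like this:

hdfs dfs -put my-udfs.jar /user/hive/udfs/my-udfs.jar
impala-shell -q "CREATE FUNCTION my_db.my_upper LOCATION '/user/hive/udfs/my-udfs.jar' SYMBOL='com.example.MyUpperUdf'"
impala-shell -q "SELECT my_db.my_upper(name) FROM my_db.customers LIMIT 10"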
11-04-2020
01:32 AM
@drgenious Could you please connect to impala-shell and submit the same query, just to confirm that the error is not coming from Impala?
04-09-2020
01:56 AM
1 Kudo
Hi @drgenious Are you getting an error similar to the one reported in KUDU-2633? It seems this is an open JIRA reported in the community:

ERROR core.JobRunShell: Job DEFAULT.EventKpisConsumer threw an unhandled Exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Aborting TaskSet 109.0 because task 3 (partition 3) cannot run anywhere due to node and executor blacklist. Blacklisting behavior can be configured via spark.blacklist.*.

If you have the data in HDFS in csv/avro/parquet format, then you can use the command below to import the files into a Kudu table. Prerequisite: a Kudu jar with a compatible version (1.6 or higher).

spark2-submit --master yarn/local --class org.apache.kudu.spark.tools.ImportExportFiles <path of kudu jar>/kudu-spark2-tools_2.11-1.6.0.jar --operation=import --format=<parquet/avro/csv> --master-addrs=<kudu master host>:<port number> --path=<hdfs path for data> --table-name=impala::<table name>

Hope this helps. Please accept the answer and vote up if it did.
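For example, with made-up values just to show the shape of the call (a Kudu master at kudumaster01:7051, CSV data under /user/alex/sales_csv, and an Impala-created Kudu table named default.sales):

spark2-submit --master yarn \
  --class org.apache.kudu.spark.tools.ImportExportFiles \
  /opt/kudu/kudu-spark2-tools_2.11-1.6.0.jar \
  --operation=import --format=csv \
  --master-addrs=kudumaster01:7051 \
  --path=/user/alex/sales_csv \
  --table-name=impala::default.sales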