About Shelton

Shelton · ‎08-29-2021

@npr20202 I am sorry, but you will have to continue being a magician 🙂 If you don't want to, then you have to teach your users the secret sauce or magic wand. We face the same problem with multiple users on Spark/Impala/PySpark. I have had them add the INVALIDATE METADATA and REFRESH (Spark) commands at the start of their queries, and that works perfectly. Otherwise, automatic invalidate/refresh of metadata is available in CDP 7.2.10 and can be enabled. As long as Impala depends on HMS, that issue will exist 🙂 Happy hadooping
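As a rough illustration of "add the command at the start of the query", here is a minimal Python sketch that builds the statement list a user would submit. The function name and the `engine` switch are hypothetical, not part of any Cloudera tool; it only assumes that Impala accepts `INVALIDATE METADATA [db.]table` and that Spark SQL accepts `REFRESH TABLE [db.]table`.

```python
def statements_with_refresh(query, tables, engine="impala"):
    """Return the statements to run, metadata statements first.

    Hypothetical helper: one metadata statement per table, then the
    user's own query as the last statement.
    """
    if engine == "impala":
        prefix = [f"INVALIDATE METADATA {t}" for t in tables]
    elif engine == "spark":
        prefix = [f"REFRESH TABLE {t}" for t in tables]
    else:
        raise ValueError(f"unsupported engine: {engine}")
    return prefix + [query]


if __name__ == "__main__":
    for stmt in statements_with_refresh(
        "SELECT COUNT(*) FROM product.customer", ["product.customer"]
    ):
        print(stmt + ";")
```

Each statement would then be submitted one at a time through whatever client the users already have (impala-shell, Hue, or a Spark session).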

Shelton · ‎08-29-2021

@npr20202 It makes sense that the problem only crops up after the maintenance "reboot" of the Metastore host. Once the server is rebooted, the metadata is purged from memory, which explains the slowness of queries after a cluster restart.

Automatic Invalidation/Refresh of Metadata
This is now an available option in CDP 7.2.10. When automatic invalidate/refresh of metadata is enabled, the Catalog Server polls Hive Metastore (HMS) notification events at a configurable interval and automatically applies the changes to the Impala catalog. The Impala Catalog Server polls and processes the following changes:
- Invalidates the tables when it receives the ALTER TABLE event.
- Refreshes the partition when it receives the ALTER, ADD, or DROP partition events.
- Adds the tables or databases when it receives the CREATE TABLE or CREATE DATABASE events.
- Removes the tables from catalogd when it receives the DROP TABLE or DROP DATABASE events.

The HMS stores metadata for Hive tables (schema, permissions, location, and partitions) in a relational database and gives clients access to this information through the metastore service API. Hive Metastore is the component in Hive that stores the system catalog, which contains the metadata about Hive columns, tables, and partitions.

Impala uses the Hive metastore to read data created in Hive, so the same data can be read and queried from Impala. All you need is to REFRESH the table or trigger INVALIDATE METADATA in Impala to read the data. Hive and Impala are two different query engines. Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking metadata about schema objects such as tables and columns.

Performance
Hive utilizes execution engines (like Tez, Hive on Spark, and LLAP) to improve query performance without low-level tuning approaches. Leveraging parallel execution whenever sequential operations are not needed is also wise. The amount of parallelism that your system can perform depends on the resources available and the overall data structure. Proper Hive tuning allows you to manipulate as little data as possible. One way to do this is through partitioning, where you assign "keys" to subdirectories where your data is segregated.

Impala uses the Hive metastore and can query Hive tables directly. Unlike Hive, Impala does not translate queries into MapReduce jobs but executes them natively, using its daemons running on the data nodes to directly access the files on HDFS. The metadata is stored in the Hive Metastore, which is backed by an RDBMS such as MySQL, Oracle, MSSQL, or MariaDB. Hive and Impala work with the same data tables in HDFS and the same metadata in the Metastore. Metadata for tables created in Hive is stored in the Hive Metastore database.
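The event handling described for automatic invalidate/refresh can be pictured as a simple lookup from event type to catalog action. The sketch below is only a toy model of that description, not Impala's actual implementation; the event-type strings and the dictionary layout are assumptions made for illustration.

```python
# Toy model: which catalog action follows which HMS notification event.
# The keys are assumed event-type names; the values paraphrase the
# behaviour described in the post.
CATALOG_ACTIONS = {
    "ALTER_TABLE": "invalidate table",
    "ALTER_PARTITION": "refresh partition",
    "ADD_PARTITION": "refresh partition",
    "DROP_PARTITION": "refresh partition",
    "CREATE_TABLE": "add table",
    "CREATE_DATABASE": "add database",
    "DROP_TABLE": "remove table",
    "DROP_DATABASE": "remove database",
}


def plan_actions(events):
    """Map a batch of polled events to (object name, action) pairs.

    Events with an unknown type are skipped rather than guessed at.
    """
    plan = []
    for event in events:
        action = CATALOG_ACTIONS.get(event["type"])
        if action is not None:
            plan.append((event["name"], action))
    return plan


if __name__ == "__main__":
    batch = [
        {"type": "CREATE_TABLE", "name": "product.customer"},
        {"type": "ADD_PARTITION", "name": "product.customer"},
        {"type": "SOMETHING_ELSE", "name": "product.orders"},
    ]
    for name, action in plan_actions(batch):
        print(name, "->", action)
```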

Shelton · ‎08-28-2021

@hadoclc Can you share more details? What kind of job is it, Spark or Hive? Can you share some information about your environment and the submitted code that fails? What are the permissions on /user/yarn? Who executed the job in 7.1.4, and how? Is the same user running the job in 7.1.6? Please share the logs.

Shelton · ‎08-27-2021

@vciampa The log clearly shows that the address is already in use:
Caused by: java.net.BindException: Port in use: 0.0.0.0:8042
Caused by: java.net.BindException: Address already in use
Can you proceed by locating the PID:
# lsof -i -P -n | grep LISTEN | grep 8042
Example
# lsof -i:8042
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 9322 yarn 475u IPv4 294790 0t0 TCP *:fs-agent (LISTEN)
Kill using the PID
$ kill -9 9322
Restart the service
Please revert
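If the PID lookup needs to be scripted, the lsof output can be parsed with a few lines of Python. This is a minimal sketch that assumes the column layout shown in the example output (PID in the second column); the function name is hypothetical and the sample text is copied from the post rather than taken from a live system.

```python
def listening_pids(lsof_output):
    """Return the distinct PIDs of LISTEN sockets found in lsof output."""
    pids = []
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and non-listening sockets.
        if len(fields) < 2 or fields[0] == "COMMAND":
            continue
        if "(LISTEN)" not in line:
            continue
        pid = int(fields[1])
        if pid not in pids:
            pids.append(pid)
    return pids


SAMPLE = """COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 9322 yarn 475u IPv4 294790 0t0 TCP *:fs-agent (LISTEN)
"""

if __name__ == "__main__":
    print(listening_pids(SAMPLE))
```

On a real host the input text would come from running `lsof -i:8042` yourself; checking what the process is before killing it is usually worth the extra step.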

Shelton · ‎08-26-2021

@npr20202 To return accurate query results, Impala needs to keep the metadata current for the databases and tables queried. Therefore, if some other entity modifies information used by Impala in the metastore, the information cached by Impala must be updated via INVALIDATE METADATA or REFRESH.

Difference between INVALIDATE METADATA and REFRESH
INVALIDATE METADATA is an asynchronous operation that simply discards the loaded metadata from the catalog and coordinator caches. After that operation, the catalog and all the Impala coordinators only know about the existence of databases and tables and nothing more. Metadata loading for tables is triggered by any subsequent queries.
REFRESH just reloads the metadata synchronously. REFRESH is more lightweight than doing a full metadata load after a table has been invalidated. However, REFRESH cannot detect changes in block locations triggered by operations like the HDFS balancer, which causes remote reads during query execution, with negative performance implications.

Syntax
INVALIDATE METADATA [[db_name.]table_name]
You can run it in Hue or impala-shell, e.g.
INVALIDATE METADATA product.customer
By default, the cached metadata for all tables is flushed. If you specify a table name, only the metadata for that one table is flushed. Even for a single table, INVALIDATE METADATA is more expensive than REFRESH, so prefer REFRESH in the common case where you add new data files for an existing table.
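The rule of thumb above (REFRESH when data files were added to an existing table, INVALIDATE METADATA otherwise) can be written down as a small decision helper. This is only a sketch; the function name and the `change` labels are made up for illustration and are not Impala terminology.

```python
def metadata_statement(change, table=None):
    """Pick the Impala statement for a given kind of external change.

    change: "new_data_files" for files added to an existing table,
            anything else is treated as needing a full invalidation.
    table:  optional [db_name.]table_name.
    """
    if change == "new_data_files":
        if not table:
            raise ValueError("REFRESH needs a table name")
        return f"REFRESH {table}"
    # Covers cases such as block locations moved by the HDFS balancer,
    # which REFRESH cannot detect. Without a table name, the metadata
    # for all tables is flushed.
    if table:
        return f"INVALIDATE METADATA {table}"
    return "INVALIDATE METADATA"


if __name__ == "__main__":
    print(metadata_statement("new_data_files", "product.customer"))
    print(metadata_statement("block_locations_changed", "product.customer"))
    print(metadata_statement("unknown"))
```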

Shelton · ‎08-26-2021

@dv_conan I think your issue should be resolved by this posting: Hive on tez issue. Please let me know whether that resolves your problem.

Shelton · ‎08-24-2021

@Nitin0858 Did you enable Kerberos for your Ambari manually? The parameters you are referring to are set automatically when it is enabled through Ambari. If you did it manually, as I suspect, you need to perform the steps mentioned in Set Up Kerberos for Ambari Server. Please revert

Shelton · ‎08-24-2021

@lyash There are a couple of things members need to know to be able to help with your case. CDH or CDP version? OS? Your Postgres version and the document followed for setup, or the steps executed? Memory allocated to Hive? Can you connect to Postgres locally, i.e. sudo --login --user=postgres? Can you change hive.metastore.schema.verification to false in hive-site.xml? Please revert
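Before changing hive.metastore.schema.verification, it can help to confirm what hive-site.xml currently says. A minimal Python sketch using only the standard library follows; the function name and the sample XML are illustrative, and the only assumption about the file is the usual Hadoop layout of `<property>` elements with `<name>` and `<value>` children.

```python
import xml.etree.ElementTree as ET


def get_property(xml_text, name):
    """Return the value of a Hadoop-style property, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Illustrative snippet only, not taken from a real cluster.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(get_property(SAMPLE_HIVE_SITE, "hive.metastore.schema.verification"))
```

On a managed cluster the setting is normally changed through the management console rather than by editing the file by hand, so the sketch is limited to reading.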

Shelton · ‎08-21-2021

@mike_bronson7 Can you share your capacity scheduler, total memory, and vcores configs?

Shelton · ‎08-21-2021

@Nitin0858 Can you share the 2 contents so we can help with the analysis?

Last Visited	‎06-05-2025 02:03 PM

Member Since	‎01-19-2017 04:35 AM
Posts	3,676
Kudos received	627

Cloudera Community

Re: Apache nifi memory consumption in kubernetes

Re: Nifi toolkit command for GitLabFlowRegistry

Re: Not able to delete the NiFi existing flow usin...

Re: Securing Nifi with SSL and using OIDC provider...

Re: External zookeeper and nifi cluster connection...

Re: Impala does not show data accurately after clu...

Re: Impala does not show data accurately after clu...

Re: Job fails after installing CDP 7.1.6

Re: After restart server: nodemanager not start

Re: Impala does not show data accurately after clu...

Re: Hue Not working with hive on tez

Re: kerberos spnego authentication not working for...

Re: Hiver Server not starting

Re: HDP cluster + resource manager logs with warni...

Re: kerberos spnego authentication not working for...