Member since: 01-07-2020
Posts: 64
Kudos Received: 1
Solutions: 0
10-07-2021
10:45 AM
@dr If it's a managed table, you can get its size from the TABLE_PARAMS table in the Hive metastore, e.g.:

SELECT a.TBL_NAME AS `TABLE`, b.PARAM_VALUE AS `SIZE`
FROM TABLE_PARAMS b
INNER JOIN TBLS a ON a.TBL_ID = b.TBL_ID
WHERE b.PARAM_KEY = 'totalSize';

You can adapt the query as you need. But if there are external tables, or the table stats are not generated regularly, you might not get correct data. You can also get the table size using HDFS file system commands:

hdfs dfs -du -s -h <path to the table location>

This will give you more accurate data.
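As a small sketch of the HDFS approach above (the warehouse path and the sample `du` line are placeholders, not real output from your cluster), the raw byte count from `hdfs dfs -du -s` can be converted to GiB with awk:

```shell
# Sketch: report a table's on-disk size in GiB.
# du_line simulates the output of: hdfs dfs -du -s <table path>
# In real usage you would set:
#   du_line=$(hdfs dfs -du -s /warehouse/tablespace/managed/hive/mydb.db/mytable)
du_line="3221225472  9663676416  /warehouse/tablespace/managed/hive/mydb.db/mytable"

# First column is the unreplicated size in bytes.
bytes=$(echo "$du_line" | awk '{print $1}')

# Convert raw bytes to GiB for readability.
gib=$(echo "$bytes" | awk '{printf "%.2f", $1 / (1024*1024*1024)}')
echo "table size: ${gib} GiB"
```

Note that `-du` reports two numbers: the raw size and the size including replication; pick the column that matches what you want to track.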
09-24-2021
08:15 PM
1 Kudo
Hi @drgenious,
1) Where can I run these kinds of queries? In CM -> Charts -> Chart Builder you can run tsquery. Refer to this link: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_dg_chart_time_series_data.html
2) Where can I find attributes like category and clusterName in Cloudera? In the Chart Builder text bar, write an incomplete query such as:
SELECT get_file_info_rate
Below the text bar there is a Facets section. Click on More and select any facets you want; for example, if you select clusterName, you will see the clusterName shown in each chart's title. Then you can complete your tsquery:
SELECT get_file_info_rate WHERE clusterName=xxxxx
If you want to build Impala-related charts, I suggest first reviewing CM > Impala service > Charts Library; many charts are already there for common monitoring purposes. You can open any of the existing charts to learn how to construct the tsquery and then build your own charts. Another very good place to learn is CM > Charts > Chart Builder: on the right side you will see a "?" button; click on it to see many examples, and you can just click "try it". Regards, Will If the answer helps, please accept as solution and click thumbs up.
09-22-2021
10:54 PM
2 Kudos
1. For the total memory configured, you can check (Impala daemon memory limit * total number of daemons); these values should also be displayed at the top of the Impala admission control page, showing how much memory is allocated to Impala.
2. You can check other memory metrics from the Cluster Utilization Report. Please note that how much memory is consumed per pool is not currently captured in Impala metrics.
a) Max Allocated
Peak Allocation Time – The time when Impala reserved the maximum amount of memory for queries. Click the drop-down list next to the date and time and select View Impala Queries Running at the Time to see details about the queries.
Max Allocated – The maximum memory that was reserved by Impala for executing queries. If the percentage is high, consider increasing the number of hosts in the cluster.
Utilized at the Time – The amount of memory used by Impala for running queries at the time when maximum memory was reserved. Click View Time Series Chart to view a chart of peak memory allocations.
Histogram of Allocated Memory at Peak Allocation Time – Distribution of memory reserved per Impala daemon for executing queries at the time Impala reserved the maximum memory. If some Impala daemons have reserved memory close to the configured limit, consider adding more physical memory to the hosts.
b) Max Utilized
Peak Usage Time – The time when Impala used the maximum amount of memory for queries. Click the drop-down list next to the date and time and select View Impala Queries Running at the Time to see details about the queries.
Max Utilized – The maximum memory that was used by Impala for executing queries. If the percentage is high, consider increasing the number of hosts in the cluster.
Reserved at the Time – The amount of memory reserved by Impala at the time when it was using the maximum memory for executing queries. Click View Time Series Chart to view a chart of peak memory utilization.
Histogram of Utilized Memory at Peak Usage Time – Distribution of memory used per Impala daemon for executing queries at the time Impala used the maximum memory. If some Impala daemons are using memory close to the configured limit, consider adding more physical memory to the hosts.
[1] https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/admin_cluster_util_custom.html#concept_jp4_4bh_hx
[2] https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_metrics_impala_daemon.html
[3] https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_metrics_impala_daemon_resource_pool.html
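The arithmetic in point 1 is simple enough to script. A minimal sketch, where the per-daemon memory limit and daemon count are example values you would read from CM (Impala Daemon Memory Limit and the number of Impala daemon hosts), not real cluster figures:

```shell
# Sketch: total memory available to Impala queries =
#   per-daemon mem_limit * number of daemons.
mem_limit_gib=48      # per-daemon memory limit in GiB (example value)
num_daemons=10        # number of Impala daemon hosts (example value)

total_gib=$((mem_limit_gib * num_daemons))
echo "Total memory available to Impala queries: ${total_gib} GiB"
```

The result should line up with the aggregate shown on the Impala admission control page.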
08-23-2021
04:07 PM
Hi, can you use beeline and run the command below, then recreate the table:
set parquet.column.index.access=false;
This should make Hive map the data in your files by column name instead of by the column index from your CREATE TABLE statement. Hope this works for you. Best Regards
08-09-2021
05:40 PM
Hi, What is the query you are using to read the data from the table? Can you attach its query profile and the coordinator logs so we can have a look? Regards, Chethan YM
07-18-2021
08:37 AM
All the Hive-related tables are stored under the "hive" database in MySQL. You can take a mysqldump of the hive database to prevent this from happening in the future, using a command like:
mysqldump -u root -p hive
Reference: https://www.sqlshack.com/how-to-backup-and-restore-mysql-databases-using-the-mysqldump-command/
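As a hedged sketch of a regular backup (assuming the metastore database is named "hive" as above; the filename convention and gzip step are my additions, not a Cloudera requirement):

```shell
# Sketch: date-stamped, compressed dump of the Hive metastore DB.
backup_file="hive_metastore_$(date +%Y-%m-%d).sql.gz"

# Dump and compress in one pipeline. Uncomment to run for real;
# --single-transaction gives a consistent snapshot for InnoDB tables.
# mysqldump -u root -p --single-transaction hive | gzip > "$backup_file"

echo "would write backup to: $backup_file"
```

Dropping this into cron gives you a restore point per day; test a restore occasionally so the backup is actually usable.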
07-01-2021
11:23 AM
1 Kudo
@drgenious First, Impala shares metadata [data about data] with HMS (Hive Metastore). Impala uses HDFS caching to provide performance and scalability benefits in production environments where Impala queries and other Hadoop jobs operate on quantities of data much larger than the physical RAM on the DataNodes, making it impractical to rely on the Linux OS cache, which only keeps the most recently used data in memory. Data read from the HDFS cache avoids the overhead of checksumming and memory-to-memory copying involved when using data from the Linux OS cache.
Having said that, when you restart Impala you discard all the cached metadata [table locations, permissions, query execution plans, statistics] that makes it efficient. That explains why your queries are so slow after the restart. Impala is very efficient when it reads data that is pinned in memory through HDFS caching. It takes advantage of the HDFS API and reads the data from memory rather than from disk, whether the data files are pinned using Impala DDL statements or using the command-line mechanism where you specify HDFS paths.
There is no better source of Impala information than Cloudera; I urge you to take the time to read the documentation below to pin the option in your memory 🙂
Using HDFS Caching with Impala
Configuring HDFS Caching for Impala
There are 2 other options that you should consider, both less expensive than restarting Impala (and I can't imagine you have more than 70 data nodes):
INVALIDATE METADATA is an asynchronous operation that simply discards the loaded metadata from the catalog and coordinator caches. After that operation, the catalog and all the Impala coordinators only know about the existence of databases and tables and nothing more. Metadata loading for tables is triggered by any subsequent queries.
REFRESH reloads the metadata synchronously. REFRESH is more lightweight than doing a full metadata load after a table has been invalidated.
REFRESH cannot detect changes in block locations triggered by operations like the HDFS balancer, hence causing remote reads during query execution with negative performance implications. The INVALIDATE METADATA statement marks the metadata for one or all tables as stale. The next time the Impala service performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. As this is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, prefer REFRESH over INVALIDATE METADATA when possible. INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive and other Hive clients such as SparkSQL:
Metadata of existing tables changes.
New tables are added, and Impala will use the tables.
The SERVER or DATABASE level Sentry privileges are changed.
Block metadata changes, but the files remain the same (HDFS rebalance).
UDF jars change.
Some tables are no longer queried, and you want to remove their metadata from the catalog and coordinator caches to reduce memory requirements.
No INVALIDATE METADATA is needed when the changes are made by impalad. I hope that explains why, and gives you options to use rather than a warm restart of Impala. If you know which table you want to query, run this beforehand with the fully qualified db_name.table_name. This has saved me time with my data scientists, and encapsulating it in their scripts is a good thing:
INVALIDATE METADATA [[db_name.]table_name]
Recomputing the statistics is another solution:
COMPUTE STATS <table name>;
The COMPUTE STATS statement gathers information about the volume and distribution of data in a table and all associated columns and partitions. The information is stored in the Hive metastore database and used by Impala to help optimize queries. Hope that enlightens you.
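The REFRESH-vs-INVALIDATE guidance above can be captured as a small decision helper. A sketch only: the function name and the change labels ("files-changed", "table-added", etc.) are made-up inputs for illustration; only the emitted statements are real Impala syntax.

```shell
# Sketch: pick the cheaper metadata statement based on what changed.
pick_statement() {
  local change="$1" table="$2"
  case "$change" in
    files-changed)
      # Data files changed within an existing table: REFRESH is cheaper.
      echo "REFRESH ${table};"
      ;;
    table-added|block-moved|udf-changed)
      # Changes made outside Impala (new tables, HDFS rebalance, UDF
      # jars): full invalidation is required.
      echo "INVALIDATE METADATA ${table};"
      ;;
    *)
      echo "unknown change type: ${change}" >&2
      return 1
      ;;
  esac
}

pick_statement files-changed sales.orders
pick_statement table-added sales.new_orders
```

A wrapper like this could feed the chosen statement to impala-shell with -q, keeping the expensive INVALIDATE METADATA for the cases that actually need it.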
06-17-2021
01:35 PM
Hi @drgenious , I believe this is possible by providing impala-shell with the following parameter: impala-shell -f /path/ --query_option='mem_limit=3gb' Let me know if that works. Regards, Alex
05-25-2021
10:12 AM
Hello, I haven't used Flume myself, but there is some mention of the serializer.delimiter parameter in the Flume documentation. It would be helpful to know what the source of the data is (e.g. a file on HDFS) and what the destination is (e.g. Hive). Also, you should know that in Cloudera Data Platform, Flume is no longer a supported component. If you are just starting to learn it, I would recommend saving yourself some time and exploring NiFi, Kafka, and Flink (good starter blog post). Regards, Alex