About PranayMunshi

PranayMunshi · ‎04-17-2019

Thank you very much, Tim. The link you provided has clarified my doubt.

PranayMunshi · ‎04-17-2019

Hi Friends, I have a small doubt about Impala using the fair scheduler for launching jobs. I have been reading about Impala for 3 months, but I have never come across Impala using the fair scheduler; instead, it has its own mechanism for resource allocation. Is there any situation where Impala uses the fair scheduler during query execution? I have one more doubt, about Impala using YARN. I want to know the scenario/condition in which we have to use YARN with Impala, because Impala has its own execution engine. When do we have to use YARN with Impala? I think Llama is meant for that only.

PranayMunshi · ‎04-17-2019

Thanks Tim. How long does Impala keep/cache this metadata? If statistics for the tables participating in a query are not available, will they be available after the first run? If I run the query again after a long interval, will the metadata still be available in the cache? What if my cluster or Impala is restarted? If statistics are not available, does Impala do some work the first time to gather statistics for all the participating tables and keep them in the metastore or somewhere in a DB?

PranayMunshi · ‎04-16-2019

Hi Friends, I am trying to run Impala queries using query options, and I am analyzing the resulting query attributes. I have observed that when I run a query for the first time, it takes longer than when I run it the second time or later. I want to know the reason for this difference in time. Does this happen only when table statistics are not available, or does it happen all the time? Is my observation right?
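
The observation in this post can be checked in a repeatable way by timing each run separately. The following is a minimal sketch, assuming a hypothetical `run_query` callable that executes the statement through whichever client is in use; it is a placeholder, not an Impala API, and the helper names are illustrative only.

```python
import time


def time_runs(run_query, repeats=3):
    """Return the wall-clock seconds taken by each of `repeats` consecutive runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical callable that submits the query and fetches results
        durations.append(time.perf_counter() - start)
    return durations


def first_run_overhead(durations):
    """Seconds by which the first run exceeds the mean of the later runs."""
    later = durations[1:]
    if not later:
        raise ValueError("need at least two runs to compare")
    return durations[0] - sum(later) / len(later)
```

Comparing `first_run_overhead` with and without table statistics in place would show whether the gap depends on statistics or appears on every first run.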

PranayMunshi · ‎04-11-2019

Hi friends, I am working on optimizing Impala query performance, and for that purpose I am using query options in the JDBC URL. After running the query I am trying to analyse the Plan, Text Plan, Summary and Profile (http://host:25000/query_plan?query_id=fsdfsf432a:997fre34ss000000), but I don't know which of these I should consider or how to interpret them. I can see a lot of terminology but am not able to understand any of it, so I could not come to a conclusion about performance. I don't know what I need to modify after observing the Text Plan, Summary or Profile. Which is more important: the Summary, the Text Plan or the Profile? Could anyone please guide me on how to understand/interpret the terminology below?

Query -
*******************************************************
select ss_sold_time_sk, time_dim.t_time_sk, ss_hdemo_sk, household_demographics.hd_demo_sk,
       ss_store_sk, s_store_sk, t_hour, t_minute, household_demographics.hd_dep_count,
       store.s_store_name
from store_sales, household_demographics, time_dim, store
where ss_sold_time_sk = time_dim.t_time_sk
  and ss_hdemo_sk = household_demographics.hd_demo_sk
  and ss_store_sk = s_store_sk
  and time_dim.t_hour = 8
  and time_dim.t_minute >= 30
  and household_demographics.hd_dep_count = 5
  and store.s_store_name = 'ese'
limit 10000;
*******************************************************

Text Plan (Why are there indentation and numbers, like 10:EXCHANGE [UNPARTITIONED] and then 09:EXCHANGE [BROADCAST]?)
----------------
Estimated Per-Host Requirements: Memory=384.15MB VCores=4
WARNING: The following tables are missing relevant table and/or column statistics.
tpcds_bin_partitioned_textfile_40.household_demographics, tpcds_bin_partitioned_textfile_40.time_dim

10:EXCHANGE [UNPARTITIONED]
|  limit: 10000
|  hosts=2 per-host-mem=unavailable
|  tuple-ids=0,3,1,2 row-size=80B cardinality=10000
|
06:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: ss_sold_time_sk = time_dim.t_time_sk
|  runtime filters: RF000 <- time_dim.t_time_sk
|  limit: 10000
|  hosts=2 per-host-mem=148.50KB
|  tuple-ids=0,3,1,2 row-size=80B cardinality=10000
|
|--09:EXCHANGE [BROADCAST]
|  |  hosts=1 per-host-mem=0B
|  |  tuple-ids=2 row-size=16B cardinality=8640
|  |
|  02:SCAN HDFS [tpcds_bin_partitioned_textfile_40.time_dim, RANDOM]
|     partitions=1/1 files=1 size=4.79MB
|     predicates: time_dim.t_hour = 8, time_dim.t_minute >= 30
|     table stats: 86400 rows total
|     column stats: unavailable
|     hosts=1 per-host-mem=32.00MB
|     tuple-ids=2 row-size=16B cardinality=8640
|
05:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: ss_hdemo_sk = household_demographics.hd_demo_sk
|  runtime filters: RF001 <- household_demographics.hd_demo_sk
|  hosts=2 per-host-mem=9.28KB
|  tuple-ids=0,3,1 row-size=64B cardinality=22033737
|
|--08:EXCHANGE [BROADCAST]
|  |  hosts=1 per-host-mem=0B
|  |  tuple-ids=1 row-size=12B cardinality=720
|  |
|  01:SCAN HDFS [tpcds_bin_partitioned_textfile_40.household_demographics, RANDOM]
|     partitions=1/1 files=1 size=141.07KB
|     predicates: household_demographics.hd_dep_count = 5
|     table stats: 7200 rows total
|     column stats: unavailable
|     hosts=1 per-host-mem=32.00MB
|     tuple-ids=1 row-size=12B cardinality=720
|
04:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: ss_store_sk = s_store_sk
|  runtime filters: RF002 <- s_store_sk
|  hosts=2 per-host-mem=339B
|  tuple-ids=0,3 row-size=52B cardinality=22033737
|
|--07:EXCHANGE [BROADCAST]
|  |  hosts=1 per-host-mem=0B
|  |  tuple-ids=3 row-size=28B cardinality=11
|  |
|  03:SCAN HDFS [tpcds_bin_partitioned_textfile_40.store, RANDOM]
|     partitions=1/1 files=1 size=29.34KB
|     predicates: store.s_store_name = 'ese'
|     table stats: 112 rows total
|     column stats: all
|     hosts=1 per-host-mem=32.00MB
|     tuple-ids=3 row-size=28B cardinality=11
|
00:SCAN HDFS [tpcds_bin_partitioned_textfile_40.store_sales, RANDOM]
   partitions=1824/1824 files=1824 size=13.84GB
   runtime filters: RF000 -> ss_sold_time_sk, RF001 -> ss_hdemo_sk, RF002 -> ss_store_sk
   table stats: 115203420 rows total
   column stats: all
   hosts=2 per-host-mem=384.00MB
   tuple-ids=0 row-size=24B cardinality=115203420
----------------
*******************************************************

Exec Summary (What should I read here, and how do I interpret the sheet below?)

Operator         #Hosts  Avg Time  Max Time  #Rows   Est. #Rows  Peak Mem   Est. Peak Mem  Detail
---------------------------------------------------------------------------------------------------------------------------
10:EXCHANGE      1       1.492ms   1.492ms   10.00K  10.00K      0          -1.00 B        UNPARTITIONED
06:HASH JOIN     0       0.000ns   0.000ns   0       10.00K      0          148.50 KB      INNER JOIN, BROADCAST
|--09:EXCHANGE   0       0.000ns   0.000ns   0       8.64K       0          0              BROADCAST
|  02:SCAN HDFS  1       55.911ms  55.911ms  1.80K   8.64K       8.07 MB    32.00 MB       tpcds_bin_partitioned_textf...
05:HASH JOIN     0       0.000ns   0.000ns   0       22.03M      0          9.28 KB        INNER JOIN, BROADCAST
|--08:EXCHANGE   0       0.000ns   0.000ns   0       720         0          0              BROADCAST
|  01:SCAN HDFS  1       32.422ms  32.422ms  720     720         284.00 KB  32.00 MB       tpcds_bin_partitioned_textf...
04:HASH JOIN     0       0.000ns   0.000ns   0       22.03M      0          339.00 B       INNER JOIN, BROADCAST
|--07:EXCHANGE   0       0.000ns   0.000ns   0       11          0          0              BROADCAST
|  03:SCAN HDFS  1       50.174ms  50.174ms  15      11          108.00 KB  32.00 MB       tpcds_bin_partitioned_textf...
00:SCAN HDFS     0       0.000ns   0.000ns   0       115.20M     0          384.00 MB      tpcds_bin_partitioned_textf...
*******************************************************

The Profile is a very big output, therefore I am not posting it here. Please help me understand the above data. So far I have only been reading articles and have not been able to come to a concrete conclusion; everything is just guesswork.
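
One way to start reading the Exec Summary is to compare the #Rows and Est. #Rows columns for each operator. The sketch below is a minimal illustration in Python, not an Impala tool: `parse_count` and `estimate_ratio` are hypothetical helpers, and treating the K and M suffixes as thousands and millions is an assumption based on the values shown in the summary. The sample rows are copied from the scan operators in this post.

```python
# Assumed meaning of the count suffixes seen in the summary (K = thousand, M = million).
SUFFIXES = {"K": 1_000, "M": 1_000_000}


def parse_count(text):
    """Convert a summary count such as '8.64K' or '720' to a float."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def estimate_ratio(actual, estimated):
    """Factor by which actual and estimated rows differ, or None if either is zero."""
    a, e = parse_count(actual), parse_count(estimated)
    if a == 0 or e == 0:
        return None
    return max(a, e) / min(a, e)


# (operator, #Rows, Est. #Rows) taken from the scan rows of the summary
rows = [
    ("02:SCAN HDFS", "1.80K", "8.64K"),
    ("01:SCAN HDFS", "720", "720"),
    ("03:SCAN HDFS", "15", "11"),
    ("00:SCAN HDFS", "0", "115.20M"),
]

for operator, actual, estimated in rows:
    ratio = estimate_ratio(actual, estimated)
    label = "n/a" if ratio is None else f"{ratio:.1f}x"
    print(f"{operator:<14} actual={actual:<7} estimated={estimated:<8} differs by {label}")
```

Operators with a large ratio are candidates for a closer look, for example the scan of time_dim, one of the tables named in the missing-statistics warning.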

PranayMunshi · ‎04-10-2019

Thank you very much Tim for providing this insight. I had assumed that the MEM_LIMIT option requests that amount of memory for the query.

PranayMunshi · ‎04-04-2019

Thanks Tim. This limit (3gb) only works when the IMPALAD's mem_limit is greater than 3gb. I've increased the IMPALAD's mem_limit both by invoking the REST API and by manually changing the Impala configuration, but either way you have to restart the Impala server before the new mem_limit takes effect. What I cannot understand is this: if the IMPALAD's mem_limit is 1gb and I pass a higher mem_limit (query option) in the JDBC URL, it won't work. So what is the point of providing this query option? (1) If my query needs 3gb of memory, the IMPALAD's mem_limit is 1gb and I pass mem_limit=3gb in the JDBC URL, it won't work; I have to change the IMPALAD's mem_limit and restart the server. (2) If my query needs 500mb of memory and the IMPALAD's mem_limit is 1gb, I don't need to pass mem_limit, because it is going to execute in any case. I hope you understand my point. I conclude that this query option can prevent a query from taking the entire memory of the IMPALAD; it is not for allocating the required memory.
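
The conclusion in this post can be written down as a small model. The sketch below illustrates the poster's reasoning only and is not Impala's implementation: `parse_size` and `effective_limit` are hypothetical helpers, and the rule that the smaller of the two limits wins is the assumption being illustrated.

```python
import re

# Size suffixes used in this thread; a bare number is treated as bytes (assumption).
UNITS = {"": 1, "b": 1, "m": 1024**2, "mb": 1024**2, "g": 1024**3, "gb": 1024**3}


def parse_size(text):
    """Convert a size such as '3gb' or '500mb' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([a-z]*)", text.strip().lower())
    if match is None or match.group(2) not in UNITS:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(match.group(1)) * UNITS[match.group(2)])


def effective_limit(process_limit, query_limit=None):
    """Model: a query can use at most the smaller of the daemon and query limits."""
    process_bytes = parse_size(process_limit)
    if query_limit is None:
        return process_bytes
    return min(process_bytes, parse_size(query_limit))


# Case (1) from the post: daemon at 1gb, query option at 3gb, so the cap stays at 1gb.
print(effective_limit("1gb", "3gb") == parse_size("1gb"))
# A lower query option does take effect: it caps the query below the daemon limit.
print(effective_limit("1gb", "500mb") == parse_size("500mb"))
```

Under this model the query option only ever lowers the ceiling for a query, which matches the reading that it protects the daemon's memory and does not reserve memory.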

PranayMunshi · ‎04-04-2019

Hi Team, I am trying to perform some testing on Impala so that I can analyze the performance of Impala queries based on the provided configuration. I am using TPCDS queries and making JDBC calls to fire them. In order to change configuration values for my query at run time for the current session, I am using Impala query options, and I analyse the query attribute values after execution. In one of the JDBC URLs I am using the "mem_limit" query option with its value set to 3gb (mem_limit=3gb), but I cannot see this value applied to the current session. I am getting the error "Memory limit exceeded". This is how I am using the query option: jdbc:impala://host:21050/tpcds_bin_partitioned_textfile_40;AuthMech=1;KrbRealm=test.com;KrbHostFQDN=host;KrbServiceName=impala;mem_limit=3gb; But when I changed the value (mem_limit=3gb) from Cloudera Manager -> Impala -> Configuration, it works fine. What am I doing wrong here?
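
A connection string like the one in this post can be assembled from parts, which keeps the separators consistent and catches accidental whitespace. This is a minimal sketch in Python rather than driver code: `build_impala_jdbc_url` is a hypothetical helper, and whether a given driver version forwards `mem_limit` from the URL to the session is an assumption to verify against the driver documentation.

```python
def build_impala_jdbc_url(host, port, database, properties):
    """Join the JDBC URL parts with ';' and reject embedded whitespace."""
    parts = [f"jdbc:impala://{host}:{port}/{database}"]
    for key, value in properties.items():
        pair = f"{key}={value}"
        if any(ch.isspace() for ch in pair):
            raise ValueError(f"whitespace in JDBC property: {pair!r}")
        parts.append(pair)
    return ";".join(parts)


# Values copied from the post; mem_limit is passed as a session property.
url = build_impala_jdbc_url(
    "host",
    21050,
    "tpcds_bin_partitioned_textfile_40",
    {
        "AuthMech": 1,
        "KrbRealm": "test.com",
        "KrbHostFQDN": "host",
        "KrbServiceName": "impala",
        "mem_limit": "3gb",
    },
)
print(url)
```

Another route worth testing is to run `SET MEM_LIMIT=3gb` as a statement on the open session before the query, so that the option does not depend on URL parsing.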

Last Visited	‎04-19-2019 10:40 AM

Member Since	‎03-19-2019 02:17 AM
Posts	18

Cloudera Community

Re: Does Impala uses fair scheduler? and YARN for ...

Does Impala uses fair scheduler? and YARN for exec...

Re: Does IMPALA cached the query statistics?

Does IMPALA cached the query statistics?

How to understand / analyse Impala Query Text Plan...

Re: Impala mem_limit query option is not working

Re: Impala mem_limit query option is not working

Impala mem_limit query option is not working