Member since
10-16-2013
307
Posts
77
Kudos Received
59
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 10339 | 04-17-2018 04:59 PM
 | 5259 | 04-11-2018 10:07 PM
 | 3045 | 03-02-2018 09:13 AM
 | 20402 | 03-01-2018 09:22 AM
 | 2234 | 02-27-2018 08:06 AM
01-06-2016
12:15 AM
1 Kudo
Correct, you can use jmap to get heapdumps and jhat or other heap dump analysis tools to read the dumps. I'd recommend trying a few heap analysis tools to see which one you like.
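As a rough sketch of that workflow (assuming a HotSpot JDK with `jmap`/`jhat` on the PATH; the process lookup and file paths here are hypothetical):

```shell
# Find the catalogd JVM process id (the process name may vary by deployment)
pid=$(pgrep -f catalogd | head -n1)

# Capture a binary heap dump of live objects
jmap -dump:live,format=b,file=/tmp/catalogd-before.hprof "$pid"

# ... run the workload you want to investigate, then take a second dump ...
jmap -dump:live,format=b,file=/tmp/catalogd-after.hprof "$pid"

# Browse a dump with jhat (serves an HTML view on port 7000 by default)
jhat /tmp/catalogd-before.hprof
```

Other analyzers (e.g. Eclipse MAT) read the same `.hprof` format, so you can compare tools on one dump.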
01-05-2016
04:14 PM
Sorry to hear it did not help to reduce the memory consumption. I'm not really sure why that would be the case. If you want to investigate further, I'd recommend getting heap dumps of the Java process before and after the "invalidate metadata" to see where the memory is going.
01-05-2016
03:00 PM
1 Kudo
That's strange and somewhat unexpected. My suggestion is not exactly a tested scenario and more of a side effect of our implementation of "invalidate metadata", so maybe there are issues I am not thinking of that would prevent the objects from being cleaned up by the Java GC. After doing the "invalidate metadata", are you sure the table is not being accessed? To verify the state of metadata loading, you can go to the catalogd Web UI (default port 25020) and inspect the contents of your table metadata via the /catalog tab. Make sure that you are starting the catalogd with --load_catalog_in_background=false, but I assume that's already the case since it's the default. Yes, upon first access of a table, the catalogd will load and cache the metadata for that table.
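As a quick command-line alternative to opening the Web UI in a browser (the hostname and table name below are placeholders; this assumes the catalogd debug web server is enabled on its default port):

```shell
# Fetch the /catalog page from the catalogd debug web UI (default port 25020)
# and look for your table to see whether its metadata is currently loaded
curl -s http://catalogd-host:25020/catalog | grep -i my_table
```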
01-04-2016
10:43 PM
1 Kudo
Since both Impala's and Hive's Metadata are backed by the Hive Metastore, you cannot completely remove a table from only one or the other. By default, Impala loads the metadata of tables lazily, i.e., only when a table is accessed in Impala. After the initial loading, the table metadata is cached in the catalogd and impalads. If your goal is to reduce the memory burden on the catalogd, then you can call "invalidate metadata <table_name>" in Impala on those tables you want to "remove" from Impala. This will replace the full metadata for that table with a "dummy" table entry which uses an insignificant amount of memory. However, if you access that table again in Impala, then the metadata will be loaded again. So as long as you don't access those tables whose metadata has not been loaded, you are not using much memory.
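For example, from the shell (the host, database, and table names are placeholders):

```shell
# Replace the cached metadata for one table with a lightweight "dummy" entry;
# the full metadata will be re-loaded lazily on the table's next access
impala-shell -i impalad-host -q "INVALIDATE METADATA my_db.my_table"
```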
01-04-2016
06:58 PM
1 Kudo
Thanks for the detailed explanation! See my responses below.

The term "partition" usually means a complete partition specification, i.e., you can construct a path into HDFS that contains data files (and not folders for the next-level partition values).

RE: Question 1
That's correct. It will do a full recomputation. However, it might be worth questioning your partitioning strategy. Typically we do not recommend more than 10k partitions per table. In extreme cases you might go up to 100k, but that is really stretching it, and you need to fully understand the consequences. The 10k recommendation stems from limitations in Impala and the Hive Metastore in scaling up to an extreme number of partitions.

RE: Question 2
Probably yes. If you have an extreme number of partitions, then compute incremental stats is not recommended due to the additional memory requirements.

RE: Question 3
Doing concurrent reads should be fine. Doing concurrent writes with INSERT INTO should also be fine as long as you do not OVERWRITE. However, there is no guarantee as to which partitions/files will be seen by COMPUTE STATS. It may see some, but not all, of the results from your INSERT.
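To illustrate the terminology: for a table partitioned by (year, month, day), a complete partition specification corresponds to one leaf HDFS directory holding data files directly (the table name and paths below are hypothetical):

```shell
# A complete partition spec (year=2015, month=12, day=30) maps to a single
# leaf directory containing data files, with no further partition subfolders
hdfs dfs -ls /user/hive/warehouse/sales/year=2015/month=12/day=30
```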
01-04-2016
04:51 PM
2 Kudos
A few short responses to your questions:

- How to predict how much memory catalogd needs? The catalogd caches tables from the Hive Metastore as well as block location information from HDFS. The memory consumed will depend on the number of HDFS files and blocks as well as the number of databases, tables, and partitions.
- What factors can contribute to the memory usage of catalogd? Incremental stats adds an additional memory requirement.
- Is there any way to force catalogd to release or flush memory? Not directly. If the catalogd is using too much memory, reducing the number of partitions and/or files/blocks should help.
01-04-2016
04:47 PM
See my response in the other thread: http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/why-show-column-stats-lt-table-name-gt-doesn-t-show-statistics/m-p/35701#M1355
01-04-2016
04:46 PM
Compute incremental stats is most suitable for scenarios where data typically changes in a few partitions only, e.g., adding partitions or appending to the latest partition. The first time you do COMPUTE INCREMENTAL STATS, it will compute the incremental stats for all partitions. The next time, it will only compute the stats for partitions that have changed in the meantime. You can manually DROP INCREMENTAL STATS for a particular partition if you want to force re-computing stats for that partition in the next COMPUTE INCREMENTAL STATS. Alternatively, you can also COMPUTE INCREMENTAL STATS for specific partitions only, so you can control how much work should be done for computing stats. Note that incremental stats have additional memory requirements on all daemons, so be sure to follow the guidelines in the Impala cookbook: http://www.slideshare.net/cloudera/the-impala-cookbook-42530186
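A sketch of that workflow from the shell (the database, table, and partition key below are placeholders):

```shell
# First run computes incremental stats for all partitions of the table
impala-shell -q "COMPUTE INCREMENTAL STATS my_db.events"

# Force re-computation of one partition on the next run by dropping its stats
impala-shell -q "DROP INCREMENTAL STATS my_db.events PARTITION (day='2015-12-30')"

# Or limit the work up front by computing stats for a specific partition only
impala-shell -q "COMPUTE INCREMENTAL STATS my_db.events PARTITION (day='2015-12-30')"
```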
12-30-2015
04:19 PM
1 Kudo
Ok, looks like I was wrong in assuming that Hive would compute the column stats on a table level. For a partitioned table, Hive's ANALYZE TABLE command will compute the column stats on a per-partition basis. It's not clear that this approach even makes sense: how would one then aggregate the different distinct-value stats across partitions? Those stats would likely be wildly inaccurate, so maybe this is not a good flow anyway, even if we could make it work. That's why the stats do not show up in Impala. The flow of computing column stats in Hive and then using them in Impala will currently not work for partitioned tables.
12-30-2015
04:09 PM
1 Kudo
Thanks. I can reproduce the problem on a partitioned table (unpartitioned works), give me some time to look into it.