03-13-2019 06:36 AM - last edited on 03-14-2019 02:28 PM by cjervis
I thought it might be useful to post about a problem I've experienced with the Impala catalogd / statestore, and how changing one setting has dramatically improved performance.
18-node cluster, impalad on 13 nodes.
Load catalog in background = false
Average JVM usage for catalogd is around 100 GB
Tens of thousands of tables, with the biggest tables containing over 30K partitions.
Not running any type of load balancer for impalad
REFRESH commands issued every 5 minutes for hundreds of tables due to Kafka imports directly into HDFS
When creating a table using impala-shell (or Hue) on one impalad node, it would take up to 2 minutes for that table to become available on all the other impalad instances (monitored using each impalad's web UI, catalog tab). The same was true for metadata being loaded across the impalads after issuing a REFRESH command.
There was nothing obvious in the logs to explain the delay.
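For anyone wanting to measure the same delay without clicking through each web UI, here is a rough Python sketch (not what I actually ran). It assumes every impalad serves its debug web UI on the default port 25000, that the /catalog page text contains the table name once the table is known to that impalad, and that a plain substring match is good enough. The function names and the hostnames in the usage comment are made up for illustration.

```python
import time
import urllib.request


def fetch_catalog_page(host, port=25000, timeout=5.0):
    """Return the raw text of one impalad's /catalog debug page.

    Assumption: the debug web UI is reachable on `port` without authentication.
    """
    url = f"http://{host}:{port}/catalog"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def measure_propagation(hosts, table, fetch=fetch_catalog_page,
                        deadline_s=180.0, poll_s=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll each host until `table` shows up on its catalog page.

    Returns {host: seconds_until_first_seen, or None if never seen
    before the deadline}. The check is a crude substring match, so a
    table called "t1" would also match "t10"; pick a unique name.
    """
    start = clock()
    seen = {host: None for host in hosts}
    while any(v is None for v in seen.values()) and clock() - start < deadline_s:
        for host in seen:
            if seen[host] is not None:
                continue
            try:
                if table in fetch(host):
                    seen[host] = clock() - start
            except OSError:
                # Host unreachable or timed out; try again on the next pass.
                pass
        if any(v is None for v in seen.values()):
            sleep(poll_s)
    return seen


# Example usage (hypothetical hostnames), run right after CREATE TABLE:
#   delays = measure_propagation(["impalad01", "impalad02"], "my_new_table")
#   for host, secs in sorted(delays.items()):
#       print(host, "not seen" if secs is None else f"{secs:.1f}s")
```

The timings are only as fine-grained as the polling interval, so this is good for telling 2 minutes apart from 3 seconds but not much more.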
After reading Cloudera's documentation on scalability considerations for Impala, I noticed the following text:
-statestore_num_update_threads The number of threads inside the statestore dedicated to sending topic updates. You should not typically need to change this value. Default: 10
After changing this to 50, the 2-minute delay has dropped to under 3 seconds for the majority of impalads.
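For reference, the change amounts to passing one extra startup flag to the statestore daemon. The fragment below is a sketch of that flag with the value I used. Where you put it is an assumption about your setup: on a Cloudera Manager cluster it would normally go into the statestore's command line argument safety valve, followed by a statestore restart. 50 suited this cluster and is not a general recommendation.

```
-statestore_num_update_threads=50
```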