We're a streaming application that's writes parquet files to HDFS to a partitioned ( partitioned by day and one more custom integer customer id) impala folder. We need to run refresh table in order to make Impala aware of the new files. The files are generated every minute and we run refresh table command every 2 minutes.
We're two options 1) Run "refresh table <table name>" or 2) use "refresh table partition <partition spec>" ( available in CDH 5.11 / 5.10 onwards that refresh perticular partition.
In terms of total time taken; "Refresh table <table name>" is very efficient in terms of time taken. It takes ~ 20 seconds or someting vs 5-7 seconds for each partition using "refresht able <table name > <partition spec>.
I'd like to ask community and especially Impala team; what is recommanded to use in use case like ours. Running 30 individual refresh every minute or running one ? Or is there a third option that we don't know about ?
- In general, the suggestion is to run as few 'refresh'es as possible. So I'd suggest running a 'refresh <table>' every 20 mins than running a 'refresh <partition>' every 2 mins if that meets your application SLAs.
- Each refresh triggers a spike in working memory on the Catalog (and the Impalads) due to thrift serialization (and deserialization on the Impalads) and the cost is constant irrespective of whether you run a refresh on the table or the partition since we serialize (and deserialize) the whole table object. This can be costly if the partitioned table is huge or with lots of files and blocks. (https://issues.apache.org/jira/browse/IMPALA-3127)
- FWIW, these load operations are much faster (~2x on secure and ~5x on insecure) starting CDH5.14 due to performance enhancements.
We've requirement fro low latency data availability. So there is a pressure to run this even more frequently not less. Would it help if we allocate more memory to catalog service or statestore service ?
In that case, you could try "refresh <partition>" and see the peak JVM memory usage on the Catalogds and the Impalads and if it is close to hitting OOM, increase the -Xmx .
Also, from our experience, using incremental stats with high refresh load can quickly trigger OOM issues. So, better not to rely on them.
Is there a way we can refresh multiple partitions in a single statement?.
What I am trying to execute is something like "refresh <table_name> partition ( col in ('val1','val2',val3'......))". So that I do not have to fire multiple refresh queries.
If it is not present, are there plans to bring this feature anytime in the near future?.