This affected us as well. Here are some potentially related bugs:
https://issues.apache.org/jira/browse/IMPALA-7330
https://issues.apache.org/jira/browse/IMPALA-7854
My understanding of the internals is that LOAD DATA does a full table refresh. If you are running LOAD DATA with a partition, ideally it would do only an incremental refresh instead of a full one, and load times would stay closer to flat as the table grows. What version of Impala are you using?
We have been looking to improve the performance of loading data into Impala. Our current ETL is in Spark and uses HiveSQL to insert the data initially. It then issues a REFRESH in Impala for each table / partition that was written.
We tried switching over to writing the Parquet files directly and calling LOAD DATA instead. In our initial investigation this seemed faster, but under production load it is even slower than the HiveSQL/REFRESH approach we were using before. We are seeing LOAD DATA take up to 10 min.
Our tables have many partitions (they are set up as day/source where we have about 50 sources right now) and ~2 years of data. Does LOAD DATA just do a REFRESH under the covers or does it do something more intelligent since it should presumably have direct knowledge of the files/blocks added instead of needing to scan? Does anyone have any suggestions for a more performant way to load the data?
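For reference, here is a minimal Python sketch of the two statement forms we are comparing: a partition-scoped REFRESH and a partition-scoped LOAD DATA. It only builds the SQL strings and does not connect to Impala. The table name `etl.events`, the staging path, and the partition column names `day` and `source` are hypothetical placeholders chosen to match the day/source layout described above, not our actual schema.

```python
from datetime import date


def partition_spec(day: date, source: str) -> str:
    """Build a PARTITION (...) clause for a day/source partitioned table.

    The column names `day` and `source` are assumptions for illustration.
    """
    if "'" in source:
        # Keep the sketch simple: reject values that would need escaping.
        raise ValueError("source must not contain single quotes")
    return f"PARTITION (day='{day.isoformat()}', source='{source}')"


def refresh_partition_sql(table: str, day: date, source: str) -> str:
    """REFRESH limited to the one partition that was just written."""
    return f"REFRESH {table} {partition_spec(day, source)}"


def load_data_sql(table: str, staging_path: str, day: date, source: str) -> str:
    """LOAD DATA that moves staged files into one specific partition."""
    return (
        f"LOAD DATA INPATH '{staging_path}' "
        f"INTO TABLE {table} {partition_spec(day, source)}"
    )


if __name__ == "__main__":
    # Hypothetical table, path, and partition values.
    d = date(2019, 1, 15)
    print(refresh_partition_sql("etl.events", d, "web"))
    print(load_data_sql("etl.events", "/staging/events/2019-01-15/web", d, "web"))
```

Timing each form separately for the same partition would show whether the cost grows with the total partition count, which is what a full table refresh would suggest.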