We have been looking to improve the performance of loading data into Impala. Our current ETL runs in Spark and uses HiveQL to insert the data initially. It then issues a REFRESH in Impala for each table/partition that was written.
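For context, after each Spark write we build one partition-scoped REFRESH per (day, source) partition touched, roughly like this sketch (table and partition names are illustrative, not our real schema):

```python
# Minimal sketch of how we generate the per-partition REFRESH statements
# after the HiveQL insert. Table name and partition values are illustrative.

def refresh_statements(table, partitions):
    """Build one partition-scoped REFRESH per (day, source) pair written."""
    return [
        f"REFRESH {table} PARTITION (day='{day}', source='{source}')"
        for day, source in partitions
    ]

stmts = refresh_statements("events", [("2023-01-01", "web")])
print(stmts[0])
# REFRESH events PARTITION (day='2023-01-01', source='web')
```

Each statement is then executed against Impala for the partitions written in that batch.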
We tried switching over to writing the Parquet files directly and calling LOAD DATA instead. In our initial investigation this seemed faster, but under the production load it is even slower than the HiveQL/REFRESH approach we were using before. We are seeing LOAD DATA take up to 10 minutes.
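The LOAD DATA variant works along these lines: we write the Parquet files to a staging directory in HDFS, then issue one LOAD DATA per partition. A sketch of how we build those statements (paths and table names are illustrative):

```python
# Sketch of the LOAD DATA variant: after Spark writes Parquet files to a
# staging directory, we issue one LOAD DATA per partition. The staging
# path layout and table name here are illustrative.

def load_data_statement(table, staging_dir, day, source):
    """Build a LOAD DATA statement targeting a single (day, source) partition."""
    return (
        f"LOAD DATA INPATH '{staging_dir}/day={day}/source={source}' "
        f"INTO TABLE {table} PARTITION (day='{day}', source='{source}')"
    )

stmt = load_data_statement("events", "/staging/events", "2023-01-01", "web")
print(stmt)
# LOAD DATA INPATH '/staging/events/day=2023-01-01/source=web' INTO TABLE events PARTITION (day='2023-01-01', source='web')
```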
Our tables have many partitions (they are partitioned by day/source, with about 50 sources right now) and roughly two years of data. Does LOAD DATA just do a REFRESH under the covers, or does it do something more intelligent, since it presumably has direct knowledge of the files/blocks added rather than needing to scan? Does anyone have suggestions for a more performant way to load the data?