We're using near-real-time streaming to load data into Impala. We're creating Parquet files and would like to load them into an existing Impala table by writing directly into the table's partition data directories. For each incoming file, we're debating between two approaches to make that new file (new data) available in Impala.
e.g. the new file is newfile.parquet, the table is mydb.mytable, and the partition is (year=2016, month=10)
- If the partition for the file doesn't exist:
ALTER TABLE mydb.mytable ADD IF NOT EXISTS PARTITION (year=2016, month=10);
Option #1. The file has already been copied into the partition directory - run a table refresh:
REFRESH mydb.mytable;
Option #2. Run a LOAD DATA command:
LOAD DATA INPATH '/data/tables/mytable/year=2016/month=10/' INTO TABLE mydb.mytable PARTITION (year=2016, month=10);
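For approach #1, the per-file sequence can be sketched as a small helper that builds the statements to run (through impala-shell or a DB-API connection). The database, table, and partition values are the placeholders from the example above; this is a sketch, not a tested production flow:

```python
# Sketch: build the Impala statements for approach #1, assuming the
# Parquet file has already been written into the partition directory.
# Names (mydb, mytable, year/month columns) come from the example above.

def statements_for_new_file(db, table, year, month):
    """Return the SQL to run after dropping a file into year=/month=."""
    partition = f"(year={year}, month={month})"
    return [
        # Create the partition if this is the first file for it.
        f"ALTER TABLE {db}.{table} ADD IF NOT EXISTS PARTITION {partition}",
        # Make the new file visible to Impala.
        f"REFRESH {db}.{table}",
    ]

for stmt in statements_for_new_file("mydb", "mytable", 2016, 10):
    print(stmt)
```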
We're using approach #1 right now: we're already writing data into the data directory via multiple threads, and we run an asynchronous table refresh instead of running a refresh for each file.
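One way to implement that asynchronous, batched refresh is a small coalescing worker: writer threads mark the table dirty after each file, and a background loop issues at most one REFRESH per interval. A minimal sketch, with the actual statement execution stubbed out (plug in impala-shell or a DB-API cursor as the `execute` callable):

```python
import threading
import time

class RefreshCoalescer:
    """Collapse many 'new file' events into one REFRESH per interval."""

    def __init__(self, execute, table, interval=5.0):
        self._execute = execute      # callable that runs a SQL string
        self._table = table
        self._interval = interval
        self._dirty = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def mark_dirty(self):
        # Called by writer threads after copying a file into HDFS.
        self._dirty.set()

    def _run(self):
        while not self._stop.is_set():
            if self._dirty.is_set():
                self._dirty.clear()
                self._execute(f"REFRESH {self._table}")
            self._stop.wait(self._interval)

    def close(self):
        self._stop.set()
        self._thread.join()
        if self._dirty.is_set():     # flush any refresh still pending
            self._execute(f"REFRESH {self._table}")

statements = []
c = RefreshCoalescer(statements.append, "mydb.mytable", interval=0.1)
for _ in range(100):                 # a burst of 100 incoming files...
    c.mark_dirty()
time.sleep(0.3)
c.close()
print(len(statements))               # ...coalesced into a handful of refreshes
```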
Does Impala table refresh performance degrade once the table grows (has more data and partitions)?
How does LOAD DATA perform/behave internally when the file is already in the data directory?
Does LOAD DATA run a refresh command internally?
Is there a way to refresh a single partition in Impala rather than the entire table, to improve the performance of detecting a new partition?
Any other recommendations for achieving what we're trying to achieve?
1. Does Impala table refresh performance degrade once the table grows (has more data and partitions)?
It will affect performance once the table is really big, but running a REFRESH statement beforehand can help.
2. How does LOAD DATA perform/behave internally when the file is already in the data directory?
Does LOAD DATA run a refresh command internally?
I believe the user has to perform the refresh operation after LOAD DATA.
3. Is there a way to refresh a single partition in Impala rather than the entire table, to improve the performance of detecting a new partition?
The current implementation of REFRESH only performs an incremental update, i.e. it reloads just the partitions that have been added/removed since the last loading of the metadata.
4. Any other recommendations for achieving what we're trying to achieve?
Running a COMPUTE STATS statement after loading new data (in Impala or outside of it) will improve performance.
Avoiding compression can improve performance.
We run the SUMMARY command once in a while to check on HDFS block skew.
In Impala 1.2 and higher, the metadata update is automatic (through the catalogd daemon) for DDL and DML statements issued through Impala.
Our table has grown to around 10K partitions now. We're already observing that a refresh takes ~15 seconds, which is high.
Any tricks for optimizing this? We know which partition we've added data to; is there a way to hint to Impala that it should only check metadata for a limited set of partitions?
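For what it's worth, if I remember correctly, newer Impala releases (2.7 / CDH 5.9 and later) added a PARTITION clause to REFRESH, so you can reload metadata for just the partition you wrote into instead of the whole 10K-partition table. A hedged sketch of building that statement (names and values are from the example earlier in the thread):

```python
# Sketch: partition-scoped refresh, assuming Impala 2.7+ where
# REFRESH <table> PARTITION (<spec>) is supported.

def partition_refresh(db, table, **partition):
    """Build a partition-scoped REFRESH statement."""
    spec = ", ".join(f"{k}={v}" for k, v in partition.items())
    return f"REFRESH {db}.{table} PARTITION ({spec})"

print(partition_refresh("mydb", "mytable", year=2016, month=10))
# -> REFRESH mydb.mytable PARTITION (year=2016, month=10)
```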
When you say you're using near-real-time streaming, are you using Spark Streaming for this?
I am trying to handle Impala table refreshes as new Parquet files are created by the Spark Streaming job, but I have no clue how to achieve this. The optimal solution would be the Spark job doing it after writing each new file into the directory.