New Contributor
Posts: 5
Registered: ‎05-04-2016

Impala copy file - refresh table vs loaddata

We're using near-real-time streaming to load data into Impala. We create Parquet files and would like to load them into an existing Impala table. We write directly into the table's partition data directory. For each incoming file, we're debating between two approaches to make the new file (new data) available in Impala.

E.g. the new file is newfile.parquet, the table is mytable, and the partition is (year=2016, month=10).

 

- If the partition for the file doesn't exist yet:

ALTER TABLE mydb.mytable add if not exists partition (year=2016,month=10) 
LOCATION '/data/tables/mytable/year=2016/month=10/';

Option #1. The file has already been copied into the table's partition directory; run the table refresh command.

REFRESH mydb.mytable;

Option #2. Run the LOAD DATA command.

LOAD DATA INPATH '/data/tables/mytable/year=2016/month=10/' INTO TABLE mydb.mytable
  PARTITION (year=2016,month=10);

We're using approach #1 right now, since we already write data into the data directory from multiple threads, and we run an asynchronous table refresh instead of a refresh for each file.

 

Questions:

Does Impala table refresh performance degrade as the table grows (more data and partitions)?

How does LOAD DATA perform and behave internally when the file is already in the data directory?

Does LOAD DATA run the refresh command internally?

Is there a way to refresh a single partition in Impala rather than the entire table, to speed up detection of a new partition?

Any other recommendations for what we're trying to achieve?

 

-Sunil

Champion
Posts: 600
Registered: ‎05-16-2016

Re: Impala copy file - refresh table vs loaddata

[ Edited ]

1. Does Impala table refresh performance degrade as the table grows (more data and partitions)?

It will affect performance when the table is really big, but running a REFRESH statement beforehand can help.
 
2. How does LOAD DATA perform and behave internally when the file is already in the data directory? Does LOAD DATA run the refresh command internally?

I believe the user has to perform the refresh operation after LOAD DATA.
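
For reference, here is a minimal sketch of option #2 with the file staged outside the table directory. LOAD DATA moves the source files into the table's data directory rather than copying them, so the source is normally a staging path. The path `/data/staging/mytable/` is a hypothetical location used for illustration, not one taken from the question.

```sql
-- Hypothetical staging path; LOAD DATA moves (does not copy) the file
-- into the directory of the target partition.
LOAD DATA INPATH '/data/staging/mytable/newfile.parquet'
  INTO TABLE mydb.mytable
  PARTITION (year=2016, month=10);
```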
 
3. Is there a way to refresh a single partition in Impala rather than the entire table, to speed up detection of a new partition?

The current implementation of REFRESH performs only an incremental update, i.e. it reloads just the partitions that have been added or removed since the metadata was last loaded.
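
If the Impala version in use supports it, the refresh can also be limited to the one partition that received new files. The sketch below assumes Impala 2.7 or later, where REFRESH accepts a PARTITION clause as far as I know; the table and partition values are the ones from the question.

```sql
-- Register the partition first if it is new (statement from the question).
ALTER TABLE mydb.mytable ADD IF NOT EXISTS PARTITION (year=2016, month=10)
  LOCATION '/data/tables/mytable/year=2016/month=10/';

-- Assumes Impala 2.7 or later: reload file metadata for this partition only.
REFRESH mydb.mytable PARTITION (year=2016, month=10);
```

The partition spec has to name every partition key column, so a writer that tracks which partitions it touched can issue one such statement per touched partition instead of refreshing the whole table.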
 
4. Any other recommendations for what we're trying to achieve?

Running a COMPUTE STATS statement after loading new data, whether through Impala or outside it, will improve performance.
Avoiding compression can improve performance.
We run the SUMMARY command once in a while to check for HDFS block skew.
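
As a sketch of the stats step for a frequently appended partitioned table, statistics can be computed for just the partition that changed instead of the whole table. This assumes an Impala version with incremental stats (2.1 or later, to my knowledge) and reuses the table and partition from the question.

```sql
-- Assumes Impala 2.1 or later: compute stats only for the changed partition.
COMPUTE INCREMENTAL STATS mydb.mytable PARTITION (year=2016, month=10);

-- Inspect per-partition row counts, file counts, and sizes.
SHOW TABLE STATS mydb.mytable;
```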
 
 
In Impala 1.2 and higher, the metadata update is automatic, through the catalogd daemon, for DDL and DML statements issued through Impala.
 

Explorer
Posts: 25
Registered: ‎09-25-2016

Re: Impala copy file - refresh table vs loaddata

Our table has grown to around 10K partitions now. We're already observing that a refresh takes ~15 seconds, which is high.

Any tricks for optimizing this? We know which partition we've added data to; is there a way to hint to Impala that it should check metadata for only a limited set of partitions?
