Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Limit number of parquet files when doing an insert/create in impala

avatar
Contributor

I am doing something like this 

 

create table test2 stored as parquet as select * from t1;

And I would like to make sure that only 2 parquet files are created lets say is this possible somehow.

 

As know there is no predictable threshold for how many files will be created.

 

Thanks

1 ACCEPTED SOLUTION

avatar
Rising Star

You can do  "set NUM_NODES=1" in your session (before your query), which will cause it to be processed in a single node (just in the coordinator).  It will produce 1 file, up to the default max size of parquet files.

 

You can do "set PARQUET_FILE_SIZE=XX" to fine-tune that max file size up or down until you get it split exactly into 2 files (it will take some trial and error because this is an upper bound - files are actually quite a bit smaller than the limit in my experience).

 

But beware the docs state NUM_NODES is not for production use, especially on big tables, as it can put a lot of pressure on a single host and crash that impalad.

 

https://www.cloudera.com/documentation/enterprise/latest/topics/impala_query_options.html

 

-m

View solution in original post

2 REPLIES 2

avatar
Rising Star

You can do  "set NUM_NODES=1" in your session (before your query), which will cause it to be processed in a single node (just in the coordinator).  It will produce 1 file, up to the default max size of parquet files.

 

You can do "set PARQUET_FILE_SIZE=XX" to fine-tune that max file size up or down until you get it split exactly into 2 files (it will take some trial and error because this is an upper bound - files are actually quite a bit smaller than the limit in my experience).

 

But beware the docs state NUM_NODES is not for production use, especially on big tables, as it can put a lot of pressure on a single host and crash that impalad.

 

https://www.cloudera.com/documentation/enterprise/latest/topics/impala_query_options.html

 

-m

avatar
Rising Star

Another option I forgot to mention: if your table is partitioned, and your insert query uses dynamic partitioning, it will generate 1 file per partition: 

 

insert into table2 partition(par1,par2) select col1, col2 .. colN, par1, par2 from table1;

 

... again up to the max parquet file size currently set, so you can play with that max to achieve 2 files per partition.

 

https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_partitioning.html#partition_st...