Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Limit number of parquet files when doing an insert/create in impala

avatar
Contributor

I am doing something like this 

 

create table test2 stored as parquet as select * from t1;

And I would like to make sure that only 2 parquet files are created lets say is this possible somehow.

 

As know there is no predictable threshold for how many files will be created.

 

Thanks

1 ACCEPTED SOLUTION

avatar
Rising Star

You can do  "set NUM_NODES=1" in your session (before your query), which will cause it to be processed in a single node (just in the coordinator).  It will produce 1 file, up to the default max size of parquet files.

 

You can do "set PARQUET_FILE_SIZE=XX" to fine-tune that max file size up or down until you get it split exactly into 2 files (it will take some trial and error because this is an upper bound - files are actually quite a bit smaller than the limit in my experience).

 

But beware the docs state NUM_NODES is not for production use, especially on big tables, as it can put a lot of pressure on a single host and crash that impalad.

 

https://www.cloudera.com/documentation/enterprise/latest/topics/impala_query_options.html

 

-m

View solution in original post

2 REPLIES 2

avatar
Rising Star

You can do  "set NUM_NODES=1" in your session (before your query), which will cause it to be processed in a single node (just in the coordinator).  It will produce 1 file, up to the default max size of parquet files.

 

You can do "set PARQUET_FILE_SIZE=XX" to fine-tune that max file size up or down until you get it split exactly into 2 files (it will take some trial and error because this is an upper bound - files are actually quite a bit smaller than the limit in my experience).

 

But beware the docs state NUM_NODES is not for production use, especially on big tables, as it can put a lot of pressure on a single host and crash that impalad.

 

https://www.cloudera.com/documentation/enterprise/latest/topics/impala_query_options.html

 

-m

avatar
Rising Star

Another option I forgot to mention: if your table is partitioned, and your insert query uses dynamic partitioning, it will generate 1 file per partition: 

 

insert into table2 partition(par1,par2) select col1, col2 .. colN, par1, par2 from table1;

 

... again up to the max parquet file size currently set, so you can play with that max to achieve 2 files per partition.

 

https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_partitioning.html#partition_st...