10-10-2017 11:01 AM
I am doing something like this
create table test2 stored as parquet as select * from t1;
And I would like to make sure that only 2 parquet files are created lets say is this possible somehow.
As know there is no predictable threshold for how many files will be created.
10-10-2017 08:05 PM
You can do "set NUM_NODES=1" in your session (before your query), which will cause it to be processed in a single node (just in the coordinator). It will produce 1 file, up to the default max size of parquet files.
You can do "set PARQUET_FILE_SIZE=XX" to fine-tune that max file size up or down until you get it split exactly into 2 files (it will take some trial and error because this is an upper bound - files are actually quite a bit smaller than the limit in my experience).
But beware the docs state NUM_NODES is not for production use, especially on big tables, as it can put a lot of pressure on a single host and crash that impalad.
10-10-2017 08:10 PM
Another option I forgot to mention: if your table is partitioned, and your insert query uses dynamic partitioning, it will generate 1 file per partition:
insert into table2 partition(par1,par2) select col1, col2 .. colN, par1, par2 from table1;
... again up to the max parquet file size currently set, so you can play with that max to achieve 2 files per partition.