Reply
Explorer
Posts: 7
Registered: ‎10-28-2015

combine small parquet files

hi all,

 

i have a problem. I have one table in Impala, partition by year, month, day, name of the load file.  Example

 

 

/user/hive/warehouse/table/year=2015/month=01/day=01/nameloadfile1/file.parq

/user/hive/warehouse/table/year=2015/month=01/day=01/nameloadfile1/file2.parq

/user/hive/warehouse/table/year=2015/month=01/day=01/nameloadfile1/file3.parq

 

the size of the file is very smaller (low MB). I would want combine all files in one.

 

/user/hive/warehouse/table/year=2015/month=01/day=01/nameloadfile1/combineFiles.parq

 

Someone have any idea for that.

 

Best regards

 

 

Cloudera Employee
Posts: 212
Registered: ‎07-29-2015

Re: combine small parquet files

It's best to do this within Impala. One option is just to create a new table then insert the data from the old one. E.g.

 

create table table2 like table1;

insert into table2 select * from table1;

 

If you only want to combine the files from a single partition, you can copy the data to a different table, drop the old partition, then insert into the new partition to produce a single compacted partition. E.g.

 

create table table2 like table1;

insert into table2 select * from table1 where partition_key=1;

alter table table1 drop partition (partition_key=1);

insert into table1 partition(partition_key=1) select * from table2;

drop table table2;

Explorer
Posts: 6
Registered: ‎11-18-2015

Re: combine small parquet files

Hi Tim,

 

I have the similar issue, within one single partition, there are multiple small files. I already followed the similar solution you suggested, however the number of files did not reduced, infact increased. Here is the link to my question with profile and other details. The blocksize I kept was 256 MB. 

 

http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Insert-query-with-NUM-NODES-1-generatin...

 

Please advise,

Thanks

Amit

Explorer
Posts: 7
Registered: ‎04-08-2016

Re: combine small parquet files

Similar problem, unsolved !

 

Create an unpartitioned table stored as parquet, then insert overwrite the table with select from an old partitioned table.

 

The new unpartitioned table contained 5 .parq files ( size in 70~80kb).

 

Why?

 

Cloudera Employee
Posts: 212
Registered: ‎07-29-2015

Re: combine small parquet files

If the query runs with multiple fragments e.g. on 5 nodes you can get one file per fragment. If you look at the query plan (i.e. explain <the query text>) it will show the degree of parallelism for the insert.

Explorer
Posts: 7
Registered: ‎04-08-2016

Re: combine small parquet files

You're right about 5 impalad nodes. With 10 nodes, insert operations make 10 parquet files and file sizes reduce to 50% ?
Any method to merge files into one?
Cloudera Employee
Posts: 212
Registered: ‎07-29-2015

Re: combine small parquet files

If you set num_nodes=1 that will force it to run on a single node.

Explorer
Posts: 7
Registered: ‎04-08-2016

Re: combine small parquet files

set num_nodes=1
It works! Thanks, Tim.
Explorer
Posts: 8
Registered: ‎07-09-2014

Re: combine small parquet files

What will be happened if set  num_nodes=1

Cloudera Employee
Posts: 65
Registered: ‎12-07-2015

Re: combine small parquet files

num_nodes=1 forces Impala to execute the query on a single node (machine), which will then only write a single parquet file per partition.

Announcements