New Contributor
Posts: 3
Registered: 07-12-2017

Re: combine small parquet files


I also had this issue, but succeeded by using parquet-tools. There is an option called merge, and you can run it like this:

hadoop jar parquet-tools merge src-folder-with-parquet target-parquet-file

It also lets you give the target file a meaningful name.

Cloudera Employee
Posts: 266
Registered: 07-29-2015

Re: combine small parquet files

@barnoba we strongly recommend *not* using parquet-tools merge unless you really know what you're doing. It is known to cause pretty bad performance problems in some cases. The problem is that it takes the row groups from the existing files and moves them unmodified into a new file - it does *not* merge the row groups from the different files. This can actually give you the worst of both worlds - you lose parallelism because the files are big, but you keep all the performance overhead of processing many small row groups.
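A common alternative, which the later replies in this thread walk through for a single partition, is to have the query engine rewrite the data so the row groups are actually rebuilt rather than just concatenated. A minimal sketch, assuming hypothetical Impala tables small_files (the table with the many small Parquet files) and compacted:

-- hypothetical table names, shown only to illustrate rewriting the data through the engine;
-- the insert writes fresh, larger row groups (the number of output files depends on how many nodes do the write)

create table compacted like small_files;

insert overwrite compacted select * from small_files;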

New Contributor
Posts: 1
Registered: 09-26-2017

Re: combine small parquet files

How would you suggest doing the steps below with zero downtime (i.e. table1 partition_key=1 stays available the whole time)?

 

create table table2 like table1;

insert into table2 select * from table1 where partition_key=1;

alter table table1 drop partition (partition_key=1);

insert into table1 partition(partition_key=1) select * from table2;

drop table table2;

Cloudera Employee
Posts: 287
Registered: 10-16-2013

Re: combine small parquet files

You can do this:

 

insert overwrite table1 partition(partition_key=1) select * from table1 where partition_key=1;

 

This process should mostly work as you'd expect.

 

However, there are a few situations where this may cause problems:

- If you run concurrent "refresh" or "invalidate metadata" commands against that table/partition before the insert is complete, some queries may see missing or duplicate data from that partition (the fix is to run a refresh after the insert finishes; see the sketch at the end of this reply).

- Do not run concurrent "insert overwrite" statements against the same partition. You may end up with missing/duplicate data in that partition.

 

If you can guarantee that the above two situations are not a problem for you, then insert overwrite should work just fine.
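Here is a minimal sketch of the refresh-and-verify step mentioned above, using the table and partition from this thread (the "show files" check is just a suggestion for confirming the result, not part of the original recipe):

-- after the insert overwrite completes, make the rewritten files visible to queries
refresh table1;

-- optional: list the files in the partition to confirm there are now fewer, larger files
show files in table1 partition (partition_key=1);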

 
