New Contributor
Posts: 1
Registered: ‎07-12-2017

Re: combine small parquet files

I also had this issue, but solved it using parquet-tools. It has a command called merge, which you can run like this:

hadoop jar parquet-tools merge src-folder-with-parquet target-parquet-file

As a bonus, you can give the target file a meaningful name :)

Cloudera Employee
Posts: 241
Registered: ‎07-29-2015

Re: combine small parquet files

@barnoba we strongly recommend *not* using parquet-tools merge unless you really know what you're doing. It is known to cause some pretty bad performance problems in some cases. The problem is that it takes the row groups from the existing files and moves them unmodified into a new file - it does *not* merge the row groups from the different files. This can actually give you the worst of both worlds - you lose parallelism because the files are big, but you still have all the performance overhead of processing many small row groups.
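
To make the difference concrete, here is a rough toy model in Python. It is only a sketch and does not touch real Parquet files or parquet-tools: a "file" is just a list of row-group row counts, and the function names, the 200 x 1000-row input and the 100,000-row target are all made up for illustration.

```python
from typing import List

# Toy model (an assumption for illustration): a "file" is a list of
# row groups, and each row group is represented only by its row count.
ToyFile = List[int]


def concat_row_groups(files: List[ToyFile]) -> ToyFile:
    """Model of the behaviour described above: every row group is
    copied into the output unchanged, so small groups stay small."""
    return [row_group for f in files for row_group in f]


def rewrite_row_groups(files: List[ToyFile], target_rows: int) -> ToyFile:
    """Model of a true rewrite: all rows are re-packed into row groups
    of up to target_rows rows each."""
    total_rows = sum(row_group for f in files for row_group in f)
    full_groups, remainder = divmod(total_rows, target_rows)
    return [target_rows] * full_groups + ([remainder] if remainder else [])


# Hypothetical input: 200 small files, each with one 1000-row row group.
small_files = [[1000] for _ in range(200)]

merged = concat_row_groups(small_files)
rewritten = rewrite_row_groups(small_files, target_rows=100_000)

# Same number of rows either way, very different row-group counts.
print(len(merged), len(rewritten))
```

In this model the concatenated output is one big file that still holds 200 tiny row groups, while the rewritten output holds the same rows in 2 large ones. That is the "worst of both worlds" case: one file to parallelize over, but still the per-row-group overhead of the original small files.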