combine small parquet files

Re: combine small parquet files

Explorer

What happens if I set num_nodes=1?

Re: combine small parquet files

Expert Contributor

num_nodes=1 forces Impala to execute the query on a single node (machine), which will then only write a single parquet file per partition.
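To illustrate why the number of writers matters, here is a toy Python model (not Impala code). It assumes rows are spread round-robin across writers and that each writer produces one file per partition it receives rows for. The function name and the distribution scheme are assumptions made for illustration only; real Impala can also split output into more files once a file grows large enough.

```python
from collections import defaultdict


def files_written(rows, num_writers):
    """Count output files per partition in a simplified insert model.

    rows: list of (partition_key, value) tuples.
    num_writers: number of nodes writing in parallel (hypothetical model).
    Assumption: row i goes to writer i % num_writers, and every writer
    emits one file for each partition it sees at least one row for.
    """
    writers_per_partition = defaultdict(set)
    for i, (partition_key, _value) in enumerate(rows):
        writers_per_partition[partition_key].add(i % num_writers)
    return {part: len(writers) for part, writers in writers_per_partition.items()}


# Two partitions with 100 rows each.
rows = [(part, n) for part in ("2019-01", "2019-02") for n in range(100)]

print(files_written(rows, 10))  # {'2019-01': 10, '2019-02': 10}
print(files_written(rows, 1))   # {'2019-01': 1, '2019-02': 1}
```

In this model a single writer yields one file per partition, but that one writer also does all of the work, so the insert itself is no longer parallel.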

Re: combine small parquet files

Explorer

I also had this issue, but solved it using parquet-tools. It has an option called merge, which you can run like this:

    hadoop jar parquet-tools merge src-folder-with-parquet target-parquet-file

You can also give the target file a meaningful name :)

Re: combine small parquet files

Master Collaborator

@barnoba we strongly recommend *not* using parquet-tools merge unless you really know what you're doing. It is known to cause some pretty bad performance problems in some cases. The problem is that it takes the row groups from the existing files and copies them unmodified into a new file; it does *not* merge the row groups from the different files. This can actually give you the worst of both worlds: you lose parallelism because the files are big, but you still have all the performance overhead of processing many small row groups.
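The difference can be sketched with a small Python model in which each file is just a list of row-group sizes (in rows). This is not how parquet-tools or Impala are implemented; the function names and the target row-group size are hypothetical, chosen only to show the effect described above.

```python
def concat_merge(files):
    """Model of a merge that copies row groups unmodified into one file."""
    return [row_group for f in files for row_group in f]


def rewrite_merge(files, target_rows_per_group):
    """Model of rewriting the data, packing rows into full-size row groups."""
    total_rows = sum(row_group for f in files for row_group in f)
    full_groups, remainder = divmod(total_rows, target_rows_per_group)
    return [target_rows_per_group] * full_groups + ([remainder] if remainder else [])


# 50 small files, each holding a single row group of 1,000 rows.
small_files = [[1000] for _ in range(50)]

merged = concat_merge(small_files)
rewritten = rewrite_merge(small_files, target_rows_per_group=1_000_000)

print(len(merged))     # 50 row groups, still 1,000 rows each
print(len(rewritten))  # 1 row group holding all 50,000 rows
```

In the first case the result is one big file that still contains 50 tiny row groups, which is the "worst of both worlds" situation. Rewriting the data through a query is what actually produces larger row groups.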

Re: combine small parquet files

Expert Contributor

@Tim Armstrong

But with this solution we lose parallelism, don't we? That could make insert queries slow.