Created on 10-28-2015 08:42 AM - edited 09-16-2022 02:46 AM
Hi all,
I have a problem. I have a table in Impala that is partitioned by year, month, day, and the name of the load file. For example:
/user/hive/warehouse/table/year=2015/month=01/day=01/nameloadfile1/file.parq
/user/hive/warehouse/table/year=2015/month=01/day=01/nameloadfile1/file2.parq
/user/hive/warehouse/table/year=2015/month=01/day=01/nameloadfile1/file3.parq
The files are very small (a few MB each). I would like to combine all of them into one file:
/user/hive/warehouse/table/year=2015/month=01/day=01/nameloadfile1/combineFiles.parq
Does anyone have an idea how to do that?
Best regards
Created 05-31-2017 03:34 AM
What happens if I set num_nodes=1?
Created 05-31-2017 09:16 AM
num_nodes=1 forces Impala to execute the query on a single node (machine), which will then only write a single parquet file per partition.
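To make that concrete, here is a minimal sketch in Python that only builds the statements you might run in impala-shell to rewrite one day of data into a second table. The table names `events` and `events_compacted`, the partition column name `load_file`, and the integer partition values are all assumptions for illustration; adjust them to your schema. It also assumes the target table has the same schema as the source.

```python
def build_compaction_statements(src, dst, partition_cols, predicate):
    """Return the impala-shell statements that rewrite one slice of `src` into `dst`."""
    cols = ", ".join(partition_cols)
    return [
        # Run the INSERT on a single node so each partition gets one output file.
        "SET NUM_NODES=1;",
        # Dynamic-partition insert; assumes dst has the same schema as src,
        # so SELECT * returns the partition columns last, as the insert expects.
        f"INSERT OVERWRITE TABLE {dst} PARTITION ({cols}) "
        f"SELECT * FROM {src} WHERE {predicate};",
        # Restore the default (0 = use all nodes) for later statements.
        "SET NUM_NODES=0;",
    ]


statements = build_compaction_statements(
    src="events",                    # hypothetical source table
    dst="events_compacted",          # hypothetical target table, same schema
    partition_cols=["year", "month", "day", "load_file"],  # load_file is a made-up name
    predicate="year = 2015 AND month = 1 AND day = 1",     # assumes integer partition keys
)
print("\n".join(statements))
```

Writing into a separate table and resetting NUM_NODES afterwards keeps the single-node setting scoped to the one compaction statement.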
Created on 07-12-2017 01:34 AM - edited 07-12-2017 01:35 AM
I also had this issue, and solved it by using parquet-tools. It has a command called merge, which you can use like this: hadoop jar parquet-tools merge src-folder-with-parquet target-parquet-file
It also lets you give the target file a meaningful name :)
Created 07-12-2017 02:46 PM
@barnoba we strongly recommend *not* to use parquet-tools merge unless you really know what you're doing. It is known to cause some pretty bad performance problems in some cases. The problem is that it takes the row groups from the existing files and moves them unmodified into a new file - it does *not* merge the row groups from the different files. This can actually give you the worst of both worlds - you lose parallelism because the files are big, but you have all the performance overhead of processing many small row groups.
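The difference can be illustrated with a toy model in Python. This is only a sketch: it does not call parquet-tools or any Parquet library, the row counts are made up, and it caps row groups by row count for simplicity, whereas real writers typically size row groups in bytes.

```python
def concat_merge(files):
    """Model of a merge that copies row groups unchanged into one file."""
    return [rows for row_groups in files for rows in row_groups]


def rewrite(files, max_rows_per_group):
    """Model of rewriting the data: rows are repacked into full-size row groups."""
    total = sum(rows for row_groups in files for rows in row_groups)
    groups = [max_rows_per_group] * (total // max_rows_per_group)
    if total % max_rows_per_group:
        groups.append(total % max_rows_per_group)
    return groups


# Three small files, each holding a single small row group (row counts are made up).
small_files = [[1000], [1500], [800]]

merged = concat_merge(small_files)
rewritten = rewrite(small_files, max_rows_per_group=1_000_000)

print(len(merged), "row groups after a concat-style merge")
print(len(rewritten), "row group(s) after a rewrite")
```

In this model both outputs are a single file holding the same rows, but the concat-style merge still carries every small row group from the inputs, while the rewrite packs the rows into one.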
Created 03-06-2018 09:34 AM
@Tim Armstrong
But with this solution we lose parallelism, don't we? That may make insert queries slow.
Created 01-10-2020 02:35 AM
As it says in the documentation:
Because this option results in increased resource utilization on a single host, it could cause problems due to contention with other Impala statements or high resource usage. Symptoms could include queries running slowly, exceeding the memory limit, or appearing to hang. Use it only in a single-user development/test environment; do not use it in a production environment or in a cluster with a high-concurrency or high-volume or performance-critical workload.