Merge small files in pyspark for Hive table

Explorer

I have an ETL flow that transfers data from one Hive table to another through PySpark. The tables are partitioned. However, in each partition's path in HDFS I see that there are small Parquet files. I want to ask:
1) How can I merge these files?
2) Is there a maximum or recommended size for Hive partitions?

1 ACCEPTED SOLUTION

Super Collaborator

@drgenious 

You can run set hive.merge.tezfiles=true; in your Hive session so that Hive merges small files at the end of a Tez job.
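
The setting above is applied by Hive when it runs a job on Tez. Since the ETL in the question writes through PySpark, a complementary option on the Spark side is to reduce the number of in-memory partitions before writing, because Spark typically writes one file per non-empty partition. The following is only a minimal sketch: the helper names are hypothetical, the 128 MiB target is an assumption chosen to match a common HDFS block size (not a recommendation from this thread), and the size estimate has to come from the caller.

```python
import math

# Assumption: aim for files of roughly 128 MiB, a common HDFS block size.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def target_file_count(total_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Return how many output files to aim for, given the data size in bytes."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_file_bytes > 0")
    # Always keep at least one file, even for empty or tiny inputs.
    return max(1, math.ceil(total_bytes / target_file_bytes))


def repartition_for_write(df, total_bytes):
    """Repartition a DataFrame so that a later write emits fewer files.

    `df` is expected to be a pyspark.sql.DataFrame holding the rows of one
    Hive partition; `total_bytes` is the caller's estimate of its size.
    """
    return df.repartition(target_file_count(total_bytes))
```

For example, an estimated 1 GiB of data with the default target gives target_file_count(1024**3) == 8, so the DataFrame would be repartitioned into 8 partitions before the write.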



Cloudera Employee

@drgenious 

1) If those small files are delta files (as written for ACID transactional tables), you can run compaction manually to merge them.

More details on manual compaction are described in the link below:
https://docs.cloudera.com/runtime/7.2.10/managing-hive/topics/hive_initiate_hive_compaction.html
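
For reference, a manual compaction is requested in Hive with an ALTER TABLE ... COMPACT statement. The sketch below only builds that statement as a string so it can be reviewed before being run in a Hive session; the helper name, table name, and partition value are hypothetical, and it assumes the table is transactional (ACID).

```python
def compaction_statement(table, partition=None, kind="major"):
    """Build a Hive ALTER TABLE ... COMPACT statement as a string.

    `partition` is an optional dict of partition column -> value;
    values are emitted as quoted strings.
    """
    if kind not in ("major", "minor"):
        raise ValueError("kind must be 'major' or 'minor'")
    spec = ""
    if partition:
        pairs = ", ".join(f"{col}='{val}'" for col, val in partition.items())
        spec = f" PARTITION ({pairs})"
    return f"ALTER TABLE {table}{spec} COMPACT '{kind}'"


# Hypothetical table and partition, for illustration only.
print(compaction_statement("sales_db.orders", {"ds": "2021-09-01"}))
```

A major compaction rewrites the base and delta files into a new base, while a minor compaction only merges delta files together, so 'major' is used as the default here.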

2) Current Hive versions with an RDBMS metastore backend should be able to handle 10,000+ partitions.

 

Thanks,

Sree
