Merge small files in pyspark for Hive table

Explorer

I have an ETL flow that transfers data from one Hive table to another through PySpark. The tables are partitioned. However, I see that the partition paths in HDFS contain small Parquet files. I want to ask:
1) How can I merge these files?
2) Is there a maximum or recommended size for Hive partitions?

1 ACCEPTED SOLUTION

Expert Contributor

@drgenious 

 

To merge the small files, you can use: set hive.merge.tezfiles=true;
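As far as I know, hive.merge.tezfiles takes effect when Hive itself runs the query on Tez, so it may not change what a PySpark job writes. On the PySpark side, a common alternative is to reduce the number of Spark partitions before writing. Below is a minimal sketch of that idea, not a definitive implementation. The 128 MiB target, the names target_file_count and with_fewer_files, and the total_bytes estimate are all assumptions made for illustration; none of them come from this thread.

```python
import math

# Assumption for illustration: aim for files of about 128 MiB.
# This target is not a value given in this thread; tune it for your cluster.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def target_file_count(total_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Number of files needed so the average file is at most target_file_bytes."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_file_bytes > 0")
    # Always write at least one file, even for an empty or tiny input.
    return max(1, math.ceil(total_bytes / target_file_bytes))


def with_fewer_files(df, total_bytes):
    """Hypothetical helper: repartition df into target_file_count() partitions.

    df is expected to be a pyspark.sql.DataFrame; only its repartition()
    method is used here. total_bytes is your own estimate of the data size.
    """
    return df.repartition(target_file_count(total_bytes))
```

You would then write the returned DataFrame the same way the ETL flow already does. If the DataFrame spans several Hive partitions, each Spark task can still write one file per Hive partition it holds rows for, so repartitioning by the partition columns instead may be closer to what you want.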

2 REPLIES

Cloudera Employee

@drgenious 

1) If those small files are delta files, you can run a manual compaction to merge them.

More details on manual compaction are in the link below:
https://docs.cloudera.com/runtime/7.2.10/managing-hive/topics/hive_initiate_hive_compaction.html
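For reference, manual compaction is requested with an ALTER TABLE ... COMPACT statement, and it applies to transactional (ACID) tables. The sketch below only builds that statement as a string; the helper name compaction_statement and the table and partition names in the usage are hypothetical, and the values are not escaped, so treat it as an illustration and not a hardened utility.

```python
def compaction_statement(table, partition=None, kind="major"):
    """Build a Hive ALTER TABLE ... COMPACT statement as a string.

    table     -- table name (hypothetical in the usage below)
    partition -- optional dict of partition column -> value
    kind      -- "major" or "minor"
    """
    if kind not in ("major", "minor"):
        raise ValueError("kind must be 'major' or 'minor'")
    stmt = f"ALTER TABLE {table}"
    if partition:
        # Values are quoted as strings and NOT escaped (illustration only).
        spec = ", ".join(f"{col}='{val}'" for col, val in partition.items())
        stmt += f" PARTITION ({spec})"
    return stmt + f" COMPACT '{kind}'"


# Hypothetical table and partition names, for illustration only.
print(compaction_statement("sales", {"dt": "2021-01-01"}))
```

The resulting statement would be run through a Hive client such as Beeline, as described in the linked documentation.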

 

2) Current Hive versions with an RDBMS metastore backend should be able to handle 10,000+ partitions.

 

Thanks,

Sree
