Merge small files in pyspark for Hive table
Labels: Apache Hive, Apache Spark
Created 12-13-2021 11:21 PM
I have an ETL flow which transfers data from one Hive table to another through PySpark. The tables are partitioned, but I see that the partition paths in HDFS contain many small Parquet files. I want to ask:
1) How can I merge these files?
2) Is there any maximum or recommended size for Hive partitions?
Created 01-08-2022 04:40 AM
You can use set hive.merge.tezfiles=true; to fix the small-file merge issue.
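Note that hive.merge.tezfiles merges output files only for inserts executed by Hive on Tez; files written directly by a Spark job are not affected by it. Since the original flow writes with PySpark, a common alternative is to reduce the number of output files per partition before the write. A minimal sketch, assuming hypothetical table names db.source_table and db.target_table and a hypothetical partition column part_date:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("merge-small-files")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.table("db.source_table")  # hypothetical source table

# Repartition by the Hive partition column so all rows of a given
# partition value land in the same Spark task, which then writes
# roughly one Parquet file per Hive partition instead of many.
(
    df.repartition("part_date")
      .write
      .mode("overwrite")
      .insertInto("db.target_table")
)

If a single file per partition would be too large, df.repartition(n, "part_date") spreads each partition's data over at most n files instead.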
Created 01-07-2022 05:45 AM
1) If those small files are delta files, you can run compaction manually to merge all of them (see the sketch below this list). More details on manual compaction are described in the link below:
https://docs.cloudera.com/runtime/7.2.10/managing-hive/topics/hive_initiate_hive_compaction.html
2) Current Hive versions with an RDBMS metastore backend should be able to handle 10,000+ partitions.
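For a transactional (ACID) table, manual compaction can also be triggered per partition from Python by submitting the ALTER TABLE ... COMPACT statement through HiveServer2. A minimal sketch, assuming the pyhive client is installed and using hypothetical host, table, and partition names:

from pyhive import hive  # assumption: pyhive client is available

# Hypothetical HiveServer2 endpoint and credentials.
conn = hive.connect(host="hs2-host", port=10000, username="etl_user")
cursor = conn.cursor()

# Queue a major compaction for one partition of an ACID table;
# the compactor merges its delta files into larger base files.
cursor.execute(
    "ALTER TABLE db.target_table PARTITION (part_date='2021-12-13') "
    "COMPACT 'major'"
)

# Check the state of queued and running compactions.
cursor.execute("SHOW COMPACTIONS")
for row in cursor.fetchall():
    print(row)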
Thanks,
Sree