Solved

Contributor

I have an ETL flow that transfers data from one Hive table to another through PySpark. The tables are partitioned. However, I see that each partition's path in HDFS contains many small Parquet files. I want to ask:
1) How can I merge these files?
2) Is there a maximum or recommended size for Hive partitions?

1 ACCEPTED SOLUTION

Guru

You can use set hive.merge.tezfiles=true; to fix the small-files issue. With this setting enabled, Hive merges small output files at the end of a Tez job.
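
As a rough illustration of how this setting could be applied to an existing partition, the sketch below builds the HiveQL statements that enable merging and then rewrite one partition onto itself. It assumes the statements are run through Hive on Tez (for example in Beeline), not through Spark. The helper name build_merge_statements, the table sales_db.orders, its columns, and the dt partition key are all hypothetical, and the two size thresholds are illustrative values rather than recommendations from this thread.

```python
def build_merge_statements(table, columns, partition_spec,
                           avg_size_bytes=128 * 1024 * 1024,
                           task_size_bytes=256 * 1024 * 1024):
    """Return HiveQL statements that rewrite one partition with merging enabled.

    table, columns and partition_spec are hypothetical inputs; the size
    thresholds are assumptions chosen only for illustration.
    """
    # Static partition clause, e.g. dt='2024-01-01'
    part_clause = ", ".join(f"{k}='{v}'" for k, v in partition_spec.items())
    # Filter so that only the partition being rewritten is read
    where_clause = " AND ".join(f"{k}='{v}'" for k, v in partition_spec.items())
    # Non-partition columns are listed explicitly because the partition
    # values are already fixed in the PARTITION clause
    select_cols = ", ".join(columns)
    return [
        # Merge small files at the end of a Tez job
        "SET hive.merge.tezfiles=true",
        # Trigger the merge step when the average output file is smaller than this
        f"SET hive.merge.smallfiles.avgsize={avg_size_bytes}",
        # Target size of the merged files
        f"SET hive.merge.size.per.task={task_size_bytes}",
        f"INSERT OVERWRITE TABLE {table} PARTITION ({part_clause}) "
        f"SELECT {select_cols} FROM {table} WHERE {where_clause}",
    ]


statements = build_merge_statements(
    "sales_db.orders", ["order_id", "amount"], {"dt": "2024-01-01"}
)
for statement in statements:
    print(statement + ";")
```

The printed statements can be pasted into a Hive session. Rewriting a partition in place replaces its data, so it is worth trying this on a copy of the table first.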

2 REPLIES

Contributor

1) If those small files are delta files (that is, the table is a transactional/ACID table), you can run compaction manually to merge all of them.

More details on manual compaction are described at the link below:
https://docs.cloudera.com/runtime/7.2.10/managing-hive/topics/hive_initiate_hive_compaction.html

2) Current Hive versions with an RDBMS metastore backend should be able to handle 10,000+ partitions.
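
For the manual compaction mentioned in point 1, a minimal sketch of building the statement could look like the following. It assumes the table is transactional, since compaction only applies to ACID tables; the helper name build_compaction_statement, the table sales_db.orders, and the dt partition key are hypothetical.

```python
def build_compaction_statement(table, partition_spec=None, compaction_type="major"):
    """Return an ALTER TABLE ... COMPACT statement for a transactional Hive table.

    table and partition_spec are hypothetical inputs used for illustration.
    """
    if compaction_type not in ("major", "minor"):
        raise ValueError("compaction_type must be 'major' or 'minor'")
    statement = f"ALTER TABLE {table}"
    if partition_spec:
        # Compact a single partition, e.g. PARTITION (dt='2024-01-01')
        part_clause = ", ".join(f"{k}='{v}'" for k, v in partition_spec.items())
        statement += f" PARTITION ({part_clause})"
    return statement + f" COMPACT '{compaction_type}'"


compaction_sql = build_compaction_statement("sales_db.orders", {"dt": "2024-01-01"})
print(compaction_sql + ";")
```

The request is queued and runs in the background; SHOW COMPACTIONS; in a Hive session lists its progress.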

Thanks,

Sree

