Cloudera Community

Solved

Contributor

I have an ETL flow that transfers data from one Hive table to another through PySpark. The tables are partitioned. However, I see that the partition paths in HDFS contain small Parquet files. I want to ask:
1) How can I merge these files?
2) Is there a maximum or recommended size for Hive partitions?


1 ACCEPTED SOLUTION

Guru

You can use set hive.merge.tezfiles=true; to fix the small-file issue. This setting tells Hive to merge small output files at the end of a Tez job.
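Because the flow in the question writes through PySpark rather than through a Hive-on-Tez query, it may also help to control the file count on the Spark side. The following is only a minimal sketch of that idea: it estimates how many files a partition should be written as for a given target file size. The 128 MiB target, the function name target_file_count, and the table names in the comments are assumptions for illustration and do not come from this thread.

```python
import math

# Assumed target size per output file. 128 MiB is a common HDFS block size,
# not a value stated anywhere in this thread.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def target_file_count(partition_bytes: int,
                      target_bytes: int = TARGET_FILE_BYTES) -> int:
    """Return how many files a partition of the given size should be written as."""
    if partition_bytes < 0:
        raise ValueError("partition_bytes must be non-negative")
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    # Round up so no file exceeds the target, and always write at least one file.
    return max(1, math.ceil(partition_bytes / target_bytes))


# Hypothetical use inside the PySpark job, for a DataFrame that holds the rows
# of a single Hive partition (table names are placeholders):
#   n = target_file_count(partition_bytes)
#   df = spark.table("source_db.source_table")
#   df.repartition(n).write.insertInto("target_db.target_table", overwrite=True)
```

Repartitioning before the write trades an extra shuffle for fewer, larger files, so whether it pays off depends on the data volume per partition.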


2 REPLIES

Contributor

1) If those small files are delta files, you can run compaction manually to merge all of them.

More details on manual compaction are described in the link below:
https://docs.cloudera.com/runtime/7.2.10/managing-hive/topics/hive_initiate_hive_compaction.html
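As a rough illustration of the manual compaction mentioned above, the sketch below builds the ALTER TABLE ... COMPACT statement that Hive accepts for transactional (ACID) tables. The helper name compaction_statement and the table and partition values in the usage note are hypothetical; the helper only builds the statement text and does not run it.

```python
from typing import Mapping, Optional


def compaction_statement(table: str,
                         partition: Optional[Mapping[str, str]] = None,
                         kind: str = "major") -> str:
    """Build a Hive statement that requests compaction of a table or partition."""
    if kind not in ("major", "minor"):
        raise ValueError("kind must be 'major' or 'minor'")
    # Render the partition spec as key='value' pairs, if one was given.
    spec = ", ".join(f"{key}='{value}'" for key, value in (partition or {}).items())
    part_clause = f" PARTITION ({spec})" if spec else ""
    return f"ALTER TABLE {table}{part_clause} COMPACT '{kind}'"
```

For example, compaction_statement("sales", {"ds": "2021-01-01"}) produces a statement for one partition of a hypothetical sales table, which could then be submitted through Beeline or another Hive client.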

2) Current Hive versions with an RDBMS metastore backend should be able to handle 10,000+ partitions.

Thanks,

Sree


