Created 12-13-2021 11:21 PM
I have an ETL flow that transfers data from one Hive table to another through PySpark. Both tables are partitioned. However, I see that the partition paths in HDFS contain many small Parquet files. I want to ask:
1) How can I merge these files?
2) Is there a maximum or recommended size for Hive partitions?
Created 01-08-2022 04:40 AM
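You can use set hive.merge.tezfiles=true; to fix the small-file issue. This setting merges small files at the end of a Tez job, so it takes effect when the data is written through Hive on Tez.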
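Since the flow in the question writes through PySpark, another option is to reduce the number of output files on the Spark side. The following is only a minimal sketch, not a definitive implementation: the 128 MiB target size, the function names, and the paths passed in are assumptions for illustration, and the caller has to supply the partition's total size (for example from hdfs dfs -du).

```python
import math

# Assumed target of roughly 128 MiB per output file; adjust for your cluster.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def target_file_count(total_bytes, target_bytes=TARGET_FILE_BYTES):
    """Return how many files are needed so that each is about target_bytes."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_bytes))


def compact_partition(spark, src_path, dst_path, total_bytes):
    """Rewrite the Parquet files under src_path as fewer files under dst_path.

    spark is an existing SparkSession; src_path and dst_path are hypothetical
    HDFS paths chosen by the caller. Writing to a separate dst_path avoids
    overwriting the data that is still being read.
    """
    num_files = target_file_count(total_bytes)
    df = spark.read.parquet(src_path)
    # repartition(n) shuffles the rows into n partitions, so n files are written.
    df.repartition(num_files).write.mode("overwrite").parquet(dst_path)
    return num_files
```

The same repartition (or coalesce) call can also be applied in the ETL job itself just before the write, so the small files are not produced in the first place.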
Created 01-07-2022 05:45 AM
1) If those small files are delta files, you can run a manual compaction to merge them.
The manual compaction procedure is described at the link below:
https://docs.cloudera.com/runtime/7.2.10/managing-hive/topics/hive_initiate_hive_compaction.html
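As a rough illustration of what a manual compaction request looks like, the sketch below builds the Hive ALTER TABLE ... COMPACT statement as a string. The helper name, the table name sales, and the partition column dt are hypothetical; compaction applies to transactional (ACID) tables, and the statement would typically be run from a Hive client such as Beeline.

```python
def compaction_statement(table, partition=None, kind="major"):
    """Build a Hive statement that requests compaction of a table or partition.

    partition is an optional dict of partition column -> value.
    kind must be 'major' or 'minor'.
    """
    if kind not in ("major", "minor"):
        raise ValueError("kind must be 'major' or 'minor'")
    sql = f"ALTER TABLE {table}"
    if partition:
        # Render the partition spec, e.g. dt='2021-12-13'
        spec = ", ".join(f"{col}='{val}'" for col, val in partition.items())
        sql += f" PARTITION ({spec})"
    return sql + f" COMPACT '{kind}'"


# Hypothetical table and partition, for illustration only.
print(compaction_statement("sales", {"dt": "2021-12-13"}))
```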
2) Current Hive versions with an RDBMS metastore backend should be able to handle 10,000+ partitions.
Thanks,
Sree