Merge small files in pyspark for Hive table
Labels: Apache Hive, Apache Spark
Created 12-13-2021 11:21 PM
I have an ETL flow which transfers data from one Hive table to another through PySpark. The tables are partitioned, but I see that the partition paths in HDFS contain many small Parquet files. I want to ask:
1) How can I merge these files?
2) Is there any maximum or recommended size for Hive partitions?
Created 01-08-2022 04:40 AM
You can use set hive.merge.tezfiles=true; to fix the small-file merge issue.
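Note that hive.merge.tezfiles merges output files only for inserts executed by Hive on Tez; files written directly by a Spark job are not affected by it. Since the original flow writes with PySpark, a common alternative is to reduce the number of output files per partition before the write. A minimal sketch, assuming hypothetical table names db.source_table and db.target_table and a hypothetical partition column part_date:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("merge-small-files")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.table("db.source_table")  # hypothetical source table

# Repartition by the Hive partition column so all rows of a given
# partition value land in the same Spark task, which then writes
# roughly one Parquet file per Hive partition instead of many.
(
    df.repartition("part_date")
      .write
      .mode("overwrite")
      .insertInto("db.target_table")
)

If a single file per partition would be too large, df.repartition(n, "part_date") spreads each partition's data over at most n files instead.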
Created 01-07-2022 05:45 AM
1) If those small files are delta files, you can run compaction manually to merge all of them (see the sketch below this list). More details on manual compaction are described in the link below:
https://docs.cloudera.com/runtime/7.2.10/managing-hive/topics/hive_initiate_hive_compaction.html
2) Current Hive versions with an RDBMS metastore backend should be able to handle 10,000+ partitions.
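For a transactional (ACID) table, manual compaction can also be triggered per partition from Python by submitting the ALTER TABLE ... COMPACT statement through HiveServer2. A minimal sketch, assuming the pyhive client is installed and using hypothetical host, table, and partition names:

from pyhive import hive  # assumption: pyhive client is available

# Hypothetical HiveServer2 endpoint and credentials.
conn = hive.connect(host="hs2-host", port=10000, username="etl_user")
cursor = conn.cursor()

# Queue a major compaction for one partition of an ACID table;
# the compactor merges its delta files into larger base files.
cursor.execute(
    "ALTER TABLE db.target_table PARTITION (part_date='2021-12-13') "
    "COMPACT 'major'"
)

# Check the state of queued and running compactions.
cursor.execute("SHOW COMPACTIONS")
for row in cursor.fetchall():
    print(row)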
Thanks,
Sree