Created on 06-13-2018 12:41 PM - edited 09-16-2022 06:20 AM
I did some experiments on Hive. No matter what I set the block size to, Hive always produces Parquet files of the same size, and there are a lot of small files. Here are the table properties. Can anyone help me? Thanks in advance!
SET hive.exec.dynamic.partition.mode=nonstrict;
SET parquet.column.index.access=true;
SET hive.merge.mapredfiles=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
SET parquet.compression=SNAPPY;
SET dfs.block.size=445644800;
SET parquet.block.size=445644800;
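For reference, the file count and total size the job actually wrote can be checked from Hive itself; mytable below is a placeholder for the real table name:

-- numFiles and totalSize show up under Table Parameters when statistics are up to date
DESCRIBE FORMATTED mytable;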
Created 06-14-2018 05:06 AM
You mentioned there are a lot of small files, and you set block.size to 445644800 (roughly 445 MB).
If your block.size is larger than the small files, you will not see any difference.
For example, all of the comparisons below give the same result:
445 MB > 1 MB
400 MB > 1 MB
300 MB > 1 MB
200 MB > 1 MB
100 MB > 1 MB
10 MB > 1 MB
2 MB > 1 MB
You may only see a difference in file size once you set block.size smaller than the small files.
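Following that reasoning, an illustrative experiment would be to drop the block size below the current file size and re-run the insert. The 32 MB value here is arbitrary and assumes the existing files are larger than that:

-- Per the point above, only a block size below the actual file size can change the outcome
SET dfs.block.size=33554432;     -- 32 MB
SET parquet.block.size=33554432; -- keep the Parquet row-group size in step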
Created 05-13-2021 07:15 PM
How exactly do you increase the size of the files created by the Hive job, then?
Created 04-20-2022 01:14 AM
You can use one of these approaches to reduce the number of files produced by an INSERT query, which in turn increases the file sizes:
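A sketch of the usual options; the thresholds and table names are placeholders, not values from this thread:

-- Option 1: let Hive merge small output files at the end of the job
SET hive.merge.mapfiles=true;                 -- merge outputs of map-only jobs
SET hive.merge.mapredfiles=true;              -- merge outputs of map-reduce jobs
SET hive.merge.tezfiles=true;                 -- merge outputs when running on Tez
SET hive.merge.smallfiles.avgsize=256000000;  -- trigger a merge when the average output file is below ~256 MB
SET hive.merge.size.per.task=256000000;       -- target size for the merged files

-- Option 2: route each partition's rows through a single reducer,
-- assuming the partition column dt is the last column of source
INSERT OVERWRITE TABLE target PARTITION (dt)
SELECT * FROM source
DISTRIBUTE BY dt;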