I have large CSV files (about 10 GB each) that arrive in Hadoop daily, one file per day. I have a Hive external table, Table1, that points to the files (no partitions, no ORC). I have another table, Table2 (external table, ORC with ZLIB), partitioned by date (yyyy-mm-dd) and loaded from Table1 using insert into Table2 partition(columnname) select * from Table1 with hive.exec.dynamic.partition = true. Once compressed as ORC, each daily file comes to <10MB (the compression ratio was a surprise to me). I have read about the small files problem in Hadoop on the HW community.
Are there any additional settings in Hive, or other considerations, to put in place so that we don't run into performance issues caused by many small files?
To control the number of files written to a Hive table, you can set the number of mappers/reducers to 1, depending on the need, so that the final output is always a single file. Otherwise, enable one of Hive's merge settings so that reducer output smaller than the block size gets merged.
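The specific settings this reply pointed to did not survive in the thread. As an assumption about what was meant, the merge flags for MapReduce jobs are usually the two below. This is a minimal sketch, not necessarily the exact list the author gave:

```sql
-- Assumed settings (the original list was not preserved in the post).
-- Merge small output files at the end of a map-only job.
SET hive.merge.mapfiles=true;
-- Merge small output files at the end of a map-reduce job.
SET hive.merge.mapredfiles=true;
```

These only take effect for the session in which they are set, so they would need to run before the insert into the partitioned table.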
Hi @Bala Vignesh N V, I have a similar issue. I have applied the above settings, but they do not help. I have posted a question on HCC: https://community.hortonworks.com/questions/109365/controlling-number-of-small-files-while-inserting....
Thanks for your response. If I partition the data by the yyyy-mm-dd field and I receive only one file per day, I assume I will always have one file per partition irrespective of this setting?
If that assumption is correct (please correct me if it is wrong), will my select queries run slower if I store files for, say, 6 years?
That is, I will have 6 * 365 = 2,190 files, each around 8-9 MB in size (which is smaller than the default block size).
I was hoping to consolidate the files on a weekly basis, but I need the data to be available to users daily, so I don't think I can do that.
Let me know your suggestions.
If I partition the data by yyyy-mm-dd field and I receive only one file per day. I assume , I will always have one file per partition irrespective of this setting? --> It's not that simple, because it depends on the size of your input file, the block size, the mapper/reducer sizing, and other variables. If your input file is smaller than the block size, then it should create only one file.
If you partition the table daily and each partition is small, it will cause performance issues as the table grows over time, and partitioning alone cannot fix that. In that situation I would suggest partitioning the table on a yearly basis, with buckets on a frequently used filter column. In your case the partitioning can be daily, weekly, or yearly. Even so, each file in a bucketed folder will still be small if the data volume is small.
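As a rough sketch of the yearly-partition-plus-buckets layout suggested above: the table name, column names, and bucket count here are all hypothetical placeholders, since the thread does not give the real schema.

```sql
-- Hypothetical schema: table2_bucketed, event_date, customer_id, amount
-- and load_year are illustrative names, not taken from the original post.
CREATE EXTERNAL TABLE table2_bucketed (
  event_date  STRING,   -- the original yyyy-mm-dd value, kept as a column
  customer_id BIGINT,   -- stand-in for a frequently used filter column
  amount      DOUBLE
)
PARTITIONED BY (load_year INT)                   -- one partition per year
CLUSTERED BY (customer_id) INTO 8 BUCKETS        -- bucket count is arbitrary
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');
```

With roughly 10 MB of ORC data per day, a year is only a few GB, so a small bucket count keeps each bucket file closer to the block size.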
I found out that when using Tez (an execution engine for Hive) you need another parameter to get only one file as output:
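The parameter itself is missing from the post. As an assumption, the Tez counterpart of the MapReduce merge flags is most likely what was meant:

```sql
-- Assumed parameter (not preserved in the original post).
-- Merge small output files at the end of a Tez DAG.
SET hive.merge.tezfiles=true;
```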
Tez is in many cases faster than the MR2 engine. To verify which execution engine you are using, run in Hive:
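The command is missing from the post. Issuing SET with only the property name prints its current value, so a check along these lines should work:

```sql
-- Prints the current setting, in the form hive.execution.engine=<engine>.
SET hive.execution.engine;
```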
If you want to switch to Tez, set it either in hive-site.xml or for each Hive session:
SET hive.execution.engine=tez;
Best regards,
I have 10 files, each 64 GB in size, and a Hive table created on top of them.
Data read operations are slow, and I suspect that reading 64 GB files is the culprit. The file format is text.
Can someone look at this scenario?
I can see you have already partitioned your tables. One important thing when using Hive is to make sure the tables are partitioned on the column by which most of your data is grouped. So the date column is, in my view, the best option here.
Also, for those who are new to Hive/HDFS: multiple files are created when you load data into Hive tables. So don't get confused; just pick the right partition column and the number of files will be reduced in the HDFS location where the Hive data is stored.
@nikkie_thomas You can set the following if you are using Tez:
set hive.merge.smallfiles.avgsize=<some value>;
set hive.merge.size.per.task=<some value>;
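To make the placeholders concrete, here is a sketch with illustrative values only; they are assumptions for the example, not recommendations from the thread. A merge step is triggered when the average output file size of a job falls below hive.merge.smallfiles.avgsize, and hive.merge.size.per.task is the target size of the merged files.

```sql
-- Illustrative values (assumptions); tune them for your cluster and block size.
-- Trigger a merge when the average output file is smaller than 128 MB.
SET hive.merge.smallfiles.avgsize=134217728;
-- Aim for merged files of about 256 MB.
SET hive.merge.size.per.task=268435456;
```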