hdfs too many small files ORC VS PARQUET

New Contributor
I am using Hive 1.2 on a 1200-node Hortonworks cluster. I have to create a day-level partitioned table, about 30 TB from source. I opted for ORC, which ends up creating too many part files in each day-level partition. The data is spread evenly across the files, but none of them comes close to the 256 MB HDFS block size, even after setting these Hive parameters (set hive.merge.smallfiles.avgsize=256000000; set hive.merge.size.per.task=256000000; set hive.merge.mapredfiles=true;). Parquet works fine: I see only 10 files in a day-level partition, compared to 521 with ORC.

Need your input: is this a bug in Hive 1.2 on Hortonworks? Why are the ORC files not getting merged?
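
A point worth checking for the settings quoted above: hive.merge.mapredfiles only applies to jobs run on the MapReduce engine, and map-only jobs and Tez jobs have their own switches. The sketch below assumes the insert runs on Tez, which the post does not state; the size values are the ones from the question.

```sql
-- Assumption: the insert runs on Tez (not stated in the post).
SET hive.merge.tezfiles=true;       -- merge small files at the end of a Tez DAG
SET hive.merge.mapfiles=true;       -- merge at the end of a map-only job
SET hive.merge.mapredfiles=true;    -- merge at the end of a map-reduce job

-- A merge is triggered when the average output file size is below this value.
SET hive.merge.smallfiles.avgsize=256000000;
-- Target size of the merged files.
SET hive.merge.size.per.task=256000000;

-- For ORC output, merge at stripe level instead of rewriting rows.
SET hive.merge.orcfile.stripe.level=true;
```

Running `SET hive.execution.engine;` in the same session prints the engine actually in use, which shows which of the three merge switches is relevant.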

5 REPLIES

Re: hdfs too many small files ORC VS PARQUET

Super Guru

@Vamsi Jonnadula I am not sure if it is just me, but I am unable to see part of your question; it shows up inside a table structure. Could you edit your question and remove the table structure, or is it just me?

Re: hdfs too many small files ORC VS PARQUET

Explorer

@Vamsi Jonnadula, Did this get resolved? We're seeing issues with small ORC files within a partition as well.

Re: hdfs too many small files ORC VS PARQUET

New Contributor

I am also seeing the same issue. After setting some config parameters the issue was solved, but when I rerun the same HQL on the same data twice, one run does not generate small files and the other does. It is too inconsistent to reach a conclusion; it would be better if Hortonworks looked into it.

Re: hdfs too many small files ORC VS PARQUET

@Vamsi Jonnadula and @Vincent Romeo

The ORC file size is also controlled by the stripe size and the ORC block size. Refer to links 1 and 2.
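
As a rough sketch of where those two knobs live in Hive: both can be set per session, and the stripe size can also be fixed per table. The table name and columns below are hypothetical, and the values shown are the usual defaults rather than a recommendation for this cluster.

```sql
-- Session-level defaults used when writing ORC files (values in bytes).
SET hive.exec.orc.default.stripe.size=67108864;   -- 64 MB stripes
SET hive.exec.orc.default.block.size=268435456;   -- 256 MB ORC block size

-- Per-table alternative; daily_events and its columns are hypothetical names.
CREATE TABLE daily_events (
  event_id BIGINT,
  payload  STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC
TBLPROPERTIES ('orc.stripe.size'='67108864');
```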

Re: hdfs too many small files ORC VS PARQUET

New Contributor

Thank you for the reply, but I already have these in place. The weird part is that when I run it the first time it works fine and I don't see small files; when I execute the same job again (no changes, even to the data set), small files are created.
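
For partitions that have already been written with many small ORC files, one possible workaround, independent of why the merge step behaves inconsistently, is to concatenate the files after the load. This is only a sketch: the table name, partition column, and date value are hypothetical.

```sql
-- Merges the small ORC files of one partition at stripe level,
-- without decompressing and re-encoding the data.
-- daily_events and event_date are hypothetical names.
ALTER TABLE daily_events PARTITION (event_date='2017-01-01') CONCATENATE;
```

Listing the partition directory before and after (for example with `hdfs dfs -ls`) shows whether the file count actually dropped.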