
Storage strategy for ORC / Parquet files

Rising Star

Let's assume that my HDFS block size is 256 MB and that I need to store 20 GB of data in ORC/Parquet file(s). Is it better to store all the data in one ORC/Parquet file, or in many ORC/Parquet files of 256 MB each (the HDFS block size)?

tazimehdi.com
1 ACCEPTED SOLUTION

Master Guru

By and large, larger ORC files are better. HDFS has a sweet spot for files that are 1-10 times the block size, but 20 GB should also be fine. There will be one map task for each block of the ORC file anyway, so the difference should not be big as long as your files are at least as big as a block. For example, a single 20 GB file with 256 MB blocks is read by roughly 80 map tasks, about the same as 80 separate 256 MB files.

Files significantly smaller than a block would be bad though.

If you create a very big file, just keep an eye on the stripe sizes in the ORC file if you see any performance problems. I have sometimes seen very small stripes due to memory restrictions in the writer.
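As an illustration (not from the original answer), the stripe size can be requested at table-creation time through the standard orc.stripe.size table property; the table and column names below are placeholders, and you can inspect the stripes that were actually written with hive --orcfiledump on the file.

-- Request roughly 256 MB stripes when writing (value is in bytes; table name is hypothetical)
CREATE TABLE sales_orc (id INT, amount DOUBLE)
STORED AS ORC
TBLPROPERTIES ("orc.stripe.size"="268435456");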

So if you want to aggregate a large amount of data as fast as possible, having a single big file would be good. However, having one 20 GB ORC file also means you loaded it with one task, so the load will normally be too slow. You may want to use a couple of reducers to increase load speed, as sketched below. Alternatively, you can use ALTER TABLE ... CONCATENATE to merge small ORC files together.
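A minimal sketch of the reducer approach, assuming a hypothetical staging table (the table names, column names, and reducer count are placeholders, not from the original thread): forcing a reduce stage writes several ORC files in parallel, and CONCATENATE can later merge files that came out too small.

-- Write the target table with several reducers instead of one map-only task
SET mapreduce.job.reduces=8;
INSERT OVERWRITE TABLE sales_orc
SELECT id, amount
FROM sales_staging
DISTRIBUTE BY id;   -- DISTRIBUTE BY forces the rows through the reduce stage

-- Later, merge small ORC files of the table into bigger ones
ALTER TABLE sales_orc CONCATENATE;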

More details on how to influence the load can be found below.

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

