Created 10-13-2016 06:35 PM
Hi,
For very large datasets in the PB range, does it help to create large ORC files?
I understand they should be larger than the HDFS block size.
So let's say I have a block size of 256 MB and am creating 1 GB ORC files for a Hive table with a total size of 3 TB.
Would it help to create even bigger files, say 2 GB?
Keep in mind that I will be using the ORC index to query only one file per partition, and that the query output will be in the KB range.
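For context, a rough sketch of the kind of table and settings assumed here (table and column names are placeholders, not my actual schema):

CREATE TABLE events (
  event_id BIGINT,
  user_id  BIGINT,
  payload  STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES (
  'orc.create.index'='true',              -- write min/max row-group indexes
  'orc.row.index.stride'='10000',         -- one index entry per 10,000 rows
  'orc.bloom.filter.columns'='event_id'   -- bloom filter on the lookup column
);

-- let the ORC indexes actually prune stripes/row groups at query time
SET hive.optimize.index.filter=true;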
Thanks
Created 10-24-2016 08:20 PM
As a general rule, you should be creating the largest files you can within a partition.
Check out @David Streever's excellent answer to this question for more details.
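If your load jobs leave many small files per partition, one way to consolidate them toward large files is sketched below (assuming the hypothetical events table partitioned by dt from above; the ~1 GB merge thresholds are illustrative):

-- Merge small ORC files in an existing partition at the stripe level
ALTER TABLE events PARTITION (dt='2016-10-01') CONCATENATE;

-- Or have Hive merge small output files automatically at write time
SET hive.merge.tezfiles=true;                  -- merge outputs of Tez jobs
SET hive.merge.mapredfiles=true;               -- merge outputs of MapReduce jobs
SET hive.merge.smallfiles.avgsize=1073741824;  -- start a merge if average file size < 1 GB
SET hive.merge.size.per.task=1073741824;       -- target roughly 1 GB merged files
SET hive.merge.orcfile.stripe.level=true;      -- concatenate ORC stripes instead of rewriting rows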
Created 10-24-2016 09:04 PM
Large ORC files with large stripes should give the best performance.
Take a look at this Yahoo presentation on Hive and ORC at scale:
http://www.slideshare.net/Hadoop_Summit/hive-at-yahoo-letters-from-the-trenches
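For example, the stripe size can be set per table when the data is written. A minimal sketch, where the table, columns, and the 256 MB stripe size are illustrative assumptions rather than values from this thread:

CREATE TABLE events (
  event_id BIGINT,
  user_id  BIGINT,
  payload  STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='ZLIB',           -- compress stripe data
  'orc.stripe.size'='268435456'    -- 256 MB stripes
);

Larger stripes mean longer sequential reads per column and fewer stripe footers to process, which is generally what you want for big scans; the ORC row-group indexes still let selective queries skip most of each stripe.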