For very large datasets in the PB range, does it help to create large ORC files?
I understand they should be larger than the block size.
So let's say I have a block size of 256 MB and am creating 1 GB ORC files for a Hive table with a total size of 3 TB.
Would it help to create bigger files, say 2 GB each?
Keep in mind I will be using the ORC index to query only 1 file per partition, and the query output would be in the KB range.
Large ORC files with large stripes should give the best performance.
Look at this Yahoo article on Hive and ORC at scale
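
To put the numbers from the question in perspective, here is a rough back-of-the-envelope sketch of how many files and blocks each choice produces. It assumes binary units (1 GB = 1024 MB) and files that are all exactly the target size; the helper name `layout` is made up for illustration and is not part of Hive or ORC.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

def layout(table_bytes, file_bytes, block_bytes):
    """Return (number of files, blocks per file) for a table split into
    equally sized files. Hypothetical helper, for illustration only."""
    files = math.ceil(table_bytes / file_bytes)
    blocks_per_file = math.ceil(file_bytes / block_bytes)
    return files, blocks_per_file

# Values taken from the question: 3 TB table, 256 MB blocks, 1 GB vs 2 GB files.
for size_gb in (1, 2):
    files, blocks = layout(3 * TB, size_gb * GB, 256 * MB)
    print(f"{size_gb} GB files: {files} files, {blocks} blocks per file, "
          f"{files * blocks} blocks total")
```

Under these assumptions, going from 1 GB to 2 GB files halves the file count (3072 to 1536) while the total number of blocks stays the same (12288), so the difference is in how many files there are to manage, not in how much data is stored.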