Effect of the number of ORC files on the NameNode?
- Labels: Apache Hadoop, Apache Hive
Created ‎10-13-2016 06:35 PM
Hi,
For very large datasets in the PB range, does it help to create large ORC files?
I understand they should be larger than the block size.
Let's say I have a block size of 256 MB and am creating 1 GB ORC files for a Hive table with a total size of 3 TB.
Would it help to create bigger files, say 2 GB?
Keep in mind that I will be using the ORC index to query only 1 file per partition, and the output data would be in the KB range.
Thanks
Created ‎10-24-2016 08:20 PM
As a general rule, you should be creating the largest files you can within a partition.
Check out @David Streever's excellent answer to this question for more details.
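To put rough numbers on the example from the question (3 TB table, 256 MB blocks, 1 GB vs 2 GB files), here is a minimal sketch of the arithmetic. It assumes every file is exactly the target size and counts one NameNode object per file plus one per block; directories and per-object memory cost are left out. The function name `namenode_objects` is made up for this illustration.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4


def namenode_objects(table_bytes, file_bytes, block_bytes):
    """Return (files, blocks, objects) for a table split into equal-size files.

    Assumption: objects = files + blocks; directories are ignored.
    """
    files = math.ceil(table_bytes / file_bytes)
    blocks_per_file = math.ceil(file_bytes / block_bytes)
    blocks = files * blocks_per_file
    return files, blocks, files + blocks


for size_gb in (1, 2):
    files, blocks, objects = namenode_objects(3 * TB, size_gb * GB, 256 * MB)
    print(f"{size_gb} GB files: {files} files, {blocks} blocks, {objects} objects")
```

Under these assumptions, going from 1 GB to 2 GB files halves the file count (3072 to 1536) while the block count stays at 12288, so the NameNode saving is modest once files are already several blocks long. The bigger win comes from avoiding files smaller than a block.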
Created ‎10-24-2016 09:04 PM
Large ORC files with large stripes should give the best performance.
Look at this Yahoo presentation on Hive and ORC at scale:
http://www.slideshare.net/Hadoop_Summit/hive-at-yahoo-letters-from-the-trenches
