Storage strategy for ORC / Parquet files
Labels: Apache Hadoop
Created 01-29-2016 10:13 AM
Let's assume that my HDFS block size is 256 MB and that I need to store 20 GB of data in ORC/Parquet files. Is it better to store all the data in a single ORC/Parquet file, or in many ORC/Parquet files of 256 MB each (the HDFS block size)?
Created 01-29-2016 10:34 AM
By and large, large ORC files are better. HDFS has a sweet spot for files that are 1-10 times the block size, but 20 GB should also be fine. There will be one map task for each block of the ORC file anyway, so the difference should not be big as long as your files are at least as big as a block.
Files significantly smaller than a block would be bad, though.
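For example, with a 256 MB block size a single 20 GB file spans roughly 20480 MB / 256 MB = 80 blocks, so a scan launches on the order of 80 map tasks whether the data sits in one 20 GB file or in eighty 256 MB files.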
If you create a very big file, just keep an eye on the stripe sizes in the ORC file if you see any performance problems. I have sometimes seen very small stripes caused by memory restrictions in the writer.
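If you do run into that, a minimal sketch for inspecting and tuning stripes follows. The table name and HDFS path are placeholders, and the exact property names can differ between Hive/ORC versions, so treat them as assumptions to verify against your distribution.

```sql
-- Inspect the stripe layout of an existing ORC file with the ORC file dump
-- utility (run from the shell, path is a placeholder):
--   hive --orcfiledump /apps/hive/warehouse/mydb.db/mytable/000000_0

-- Hypothetical table that requests ~256 MB stripes at write time.
CREATE TABLE mytable_orc (
  id    BIGINT,
  value STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.stripe.size" = "268435456");  -- 256 MB, in bytes
```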
So if you want to aggregate a large amount of data as fast as possible, a single big file would be good. However, one 20 GB ORC file also means you loaded it with a single task, so the load will normally be too slow. You may want to use a couple of reducers to increase load speed (see the sketch after the link below). Alternatively, you can use ALTER TABLE ... CONCATENATE to merge small ORC files together.
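As a sketch (table and partition names are hypothetical), the concatenation step looks like this; CONCATENATE merges the small ORC files of a table or partition at the stripe level, without fully rewriting the data.

```sql
-- Merge small ORC files in a single partition of a hypothetical table.
ALTER TABLE sales_orc PARTITION (ds = '2016-01-29') CONCATENATE;

-- Or for an unpartitioned ORC table:
ALTER TABLE events_orc CONCATENATE;
```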
More details on how to influence the load can be found below.
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
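One common way to influence the load, sketched below with hypothetical table and column names: adding a DISTRIBUTE BY clause forces a reduce stage, and each reducer then writes one output file, so the reducer count controls both the load parallelism and the number and size of the resulting ORC files.

```sql
-- Force a reduce stage so several reducers write the ORC files in parallel.
-- Table and column names are placeholders.
SET mapreduce.job.reduces = 8;   -- older Hadoop versions use mapred.reduce.tasks

INSERT OVERWRITE TABLE target_orc
SELECT *
FROM staging_text
DISTRIBUTE BY load_key;          -- each reducer writes one output file
```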
