Created 09-08-2016 03:05 AM
S3 is slower than HDFS for MapReduce jobs. Beyond that, are there any special considerations or optimizations for ORC files on S3 compared to HDFS? Do they deliver all the benefits on S3 that they do on HDFS? If not, why not?
Created 09-20-2016 06:39 AM
Optimizations applied for S3 generally apply to HDFS as well; it is just that their effect is more visible on S3 (e.g., getFileStatus/listFiles operations are a lot cheaper on HDFS than on S3). If you are using ORC on S3, it is worth using the latest S3A connector available in HDP 2.4/2.5 (or, even easier, try HDP Cloud, which has the latest patches for S3A).
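As an illustrative sketch of the kind of S3A connector tuning meant here (property names are from the Hadoop `hadoop-aws` documentation; the values are assumptions to tune per workload, and the fadvise setting requires the newer connector shipped around Hadoop 2.8 / HDP 2.5):

```xml
<!-- core-site.xml: illustrative S3A tuning; values are assumptions -->
<property>
  <!-- allow more concurrent connections for parallel column reads -->
  <name>fs.s3a.connection.maximum</name>
  <value>100</value>
</property>
<property>
  <!-- random-access read pattern suits ORC's seek-heavy reads of
       individual column stripes, instead of sequential whole-object reads -->
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
```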
A couple of things are specific to ORC.
Some additional changes are not specific to ORC but can have an impact in AWS/S3 environments.
If it is very specific to MR, it would be good to set "mapreduce.fileoutputcommitter.algorithm.version=2", which helps reduce the amount of data movement at the end of the job.
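As a sketch, this could be set cluster-wide in mapred-site.xml (or per job with a `-D` flag on the command line); the value 2 selects the commit algorithm that renames task output directly into the final destination, skipping the second job-level rename pass, which matters on S3 because rename there is a copy-and-delete:

```xml
<!-- mapred-site.xml: use the v2 output committer algorithm, which moves
     task output straight to the destination and skips the extra
     job-level rename pass (rename is a costly copy on S3) -->
<property>
  <name>mapreduce.fileoutputcommitter.algorithm.version</name>
  <value>2</value>
</property>
```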
From the connector perspective, there is a lot of work happening in HADOOP-11694, HADOOP-13204, and HADOOP-13345.
Created 09-12-2016 05:48 PM
@Saumitra Buragohain could you help out here? Could use your expertise 🙂
Created 09-12-2016 05:59 PM
Before discussing ORC, let's keep in mind that S3 is a very good storage space when it comes to cold storage. S3 data moves at about 50 Mbps (could be more or less, but much slower than HDFS), so it is a choice for you between speed and cost. Optimizations will only alleviate some of the performance difference between ORC on HDFS and ORC on S3; the data-movement limitations will still prevail.
Created 09-20-2016 04:14 PM
Do you mean 50Mbps per mapper or for the cluster as a whole? (I assume you mean the former, as the latter would imply almost two days to read a TB of S3 data.) Assuming you do mean 50Mbps per mapper, what is the limit on S3 throughput to the whole cluster—that’s the key information. Do you have a ballpark number for this?
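The "almost two days" figure is easy to sanity-check with back-of-the-envelope arithmetic (assuming a decimal terabyte and the 50 Mbps figure quoted above):

```python
# Time to read 1 TB at 50 Mbps, if 50 Mbps were the whole-cluster limit.
TB_BITS = 1e12 * 8   # 1 terabyte (decimal) in bits
RATE_BPS = 50e6      # 50 megabits per second

seconds = TB_BITS / RATE_BPS
days = seconds / 86400
print(f"{seconds:.0f} s ~ {days:.2f} days")
```

which comes out just under two days, so the limit must be per mapper for S3-backed clusters to be usable at all.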
Created on 09-20-2016 06:41 AM - edited 08-19-2019 01:10 AM
Hi Greg, I will provide a high-level response, assuming you are referring to a cluster hosted in AWS (the intro was covered in the following blog, and more detailed posts will follow).
http://hortonworks.com/blog/making-elephant-fly-cloud/
The following is the high-level deployment scenario for a cluster in AWS. The Hortonworks cloud engineering team has made improvements in the S3 connector, in the ORC layer, and in Hive, available in our cloud distro HDC. During a Hive benchmark test, we saw about 2.5x performance improvement on average, and 2-14x across individual queries (vs. a vanilla HDP on AWS). HDFS on EBS is used for intermediate data, while S3 is used for persistent storage. We are also enabling LLAP to cache columnar data sitting on S3 in order to further improve query performance on S3. Please stay tuned for the rest of the blog series (and do remind us if you don't see them posted soon).
Created 09-20-2016 02:00 PM
@Saumitra Buragohain Thank you for putting this into proper perspective before parachuting into the weeds!