
Are there any special considerations or optimizations for ORC files on S3, compared to HDFS?

Guru

S3 is slower than HDFS for MapReduce jobs. Besides that, are there any special considerations or optimizations for ORC files on S3 compared to HDFS? Do they achieve all the benefits on S3 that they do on HDFS? If not, why?


6 REPLIES

Guru

@Saumitra Buragohain could you help out here? Could use your expertise 🙂

Super Guru

@gkeys

Before discussing ORC, let's keep in mind that S3 is a very good storage option for cold data. S3 data moves at about 50Mbps (it could be more or less, but it is much slower than HDFS), so it is a choice between speed and cost. Optimizations will only alleviate some of the performance difference between ORC on HDFS and ORC on S3; the data-movement limitation will still prevail.

Contributor

Do you mean 50Mbps per mapper or for the cluster as a whole? (I assume you mean the former, as the latter would imply almost two days to read a TB of S3 data.) Assuming you do mean 50Mbps per mapper, what is the limit on S3 throughput to the whole cluster—that’s the key information. Do you have a ballpark number for this?
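
(As a rough check of that figure: 1 TB is about 8×10^12 bits, and at 50×10^6 bits/s that is roughly 160,000 seconds, or a little under two days.)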

Rising Star (accepted solution)

Optimizations applied for S3 are generally applicable to HDFS as well; they are simply more visible on S3, because metadata operations such as getFileStatus/listFiles are much cheaper on HDFS than on S3. If you are using ORC on S3, it is worth using the latest S3A connector available in HDP 2.4/2.5 (or, even easier, try HDP Cloud, which carries the latest S3A patches).

A couple of things are specific to ORC (a configuration sketch follows this list):

  1. ORC is a columnar format and can incur random reads (for example, it reads the end of the file before it starts reading the data blocks). With the recent S3A connector, you can set "fs.s3a.experimental.input.fadvise=random", which helps with random reads; without it, the HTTPS connection to S3 is broken and re-established every time a backwards seek is performed. Other internal optimizations also help reduce connection aborts.
  2. Split computation can take much longer on S3 than on HDFS. Fortunately, ORC has an internal thread pool to compute splits in parallel; "hive.orc.compute.splits.num.threads" (default 10) can be tuned based on the amount of data processed. Again, this is not specific to S3, but tuning these parameters can make a significant performance difference on S3.
  3. If the ORC ETL split strategy is chosen ("hive.exec.orc.split.strategy"; the default is HYBRID), the footer read in each task can be avoided by enabling "hive.orc.splits.include.file.footer=true". In earlier Hive versions this caused memory pressure on the AM side, but that has been fixed in recent versions. This piggybacks the ORC metadata onto the split payload, so the task does not need to read the metadata again, which reduces the number of calls to S3.
  4. There are a couple of fixes that reduce the number of times the footer information is read.
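
To make that concrete, here is a minimal sketch of how those properties might be set, assuming the S3A key goes in core-site.xml and the Hive keys in hive-site.xml (or per session, where your Hive version allows it). The values shown are the defaults or examples mentioned above, not tuned recommendations:

    <!-- core-site.xml: random-read friendly S3A input policy for columnar formats like ORC -->
    <property>
      <name>fs.s3a.experimental.input.fadvise</name>
      <value>random</value>
    </property>

    <!-- hive-site.xml: threads ORC uses to compute splits in parallel (default 10) -->
    <property>
      <name>hive.orc.compute.splits.num.threads</name>
      <value>10</value>
    </property>

    <!-- hive-site.xml: split strategy; default is HYBRID, ETL reads footers at split-generation time -->
    <property>
      <name>hive.exec.orc.split.strategy</name>
      <value>ETL</value>
    </property>

    <!-- hive-site.xml: ship the ORC footer with the split payload so tasks skip re-reading it -->
    <property>
      <name>hive.orc.splits.include.file.footer</name>
      <value>true</value>
    </property>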

Some additional changes that are not specific to ORC but can have an impact in AWS/S3 environments (again, a configuration sketch follows the list):

  1. There have been a couple of fixes on the Tez side that improve the split-grouping logic. S3 always reports "localhost" as its locality information, which could have an adverse impact with Tez because of how earlier versions grouped splits. This has been fixed in recent versions: Tez no longer groups aggressively when it does not have enough information about data locality. This is not specific to ORC and helps other formats as well.
  2. There is no concept of racks in AWS. When the Capacity Scheduler is used, it is good to set "yarn.scheduler.capacity.node-locality-delay=0" to avoid container-launch delays.
  3. When using Hive, it is good to set "hive.metastore.pre.event.listeners" to an empty value, as there is no concept of user/group permissions in S3.
  4. For ETL operations in Hive, "hive.metastore.fshandler.threads" and "hive.mv.files.thread" can be tuned to improve the performance of file moves and metastore-related activity.
  5. Set "hive.warehouse.subdir.inherit.perms=false" when using S3 data with Hive.
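
A similar sketch for these cluster-level settings, assuming the YARN property lives in capacity-scheduler.xml and the Hive properties in hive-site.xml; the thread counts are placeholder values to be tuned, not recommendations:

    <!-- capacity-scheduler.xml: AWS has no racks, so do not delay containers waiting for locality -->
    <property>
      <name>yarn.scheduler.capacity.node-locality-delay</name>
      <value>0</value>
    </property>

    <!-- hive-site.xml: clear pre-event listeners; S3 has no user/group permission checks -->
    <property>
      <name>hive.metastore.pre.event.listeners</name>
      <value></value>
    </property>

    <!-- hive-site.xml: thread pools for metastore filesystem work and file moves (placeholder values) -->
    <property>
      <name>hive.metastore.fshandler.threads</name>
      <value>15</value>
    </property>
    <property>
      <name>hive.mv.files.thread</name>
      <value>15</value>
    </property>

    <!-- hive-site.xml: do not attempt to inherit warehouse directory permissions on S3 -->
    <property>
      <name>hive.warehouse.subdir.inherit.perms</name>
      <value>false</value>
    </property>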

If the workload is specifically MapReduce, it is worth setting "mapreduce.fileoutputcommitter.algorithm.version=2", which reduces the amount of data movement at the end of the job.
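
As a sketch, that committer setting could go in mapred-site.xml (or be passed per job); version 2 commits task output directly to the final location, avoiding a second rename pass at job completion:

    <!-- mapred-site.xml: v2 file output committer moves task output to its final location at task commit -->
    <property>
      <name>mapreduce.fileoutputcommitter.algorithm.version</name>
      <value>2</value>
    </property>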

From the connector perspective, there is a lot of work happening in HADOOP-11694, HADOOP-13204, and HADOOP-13345.

Contributor

Hi Greg, I will provide a high-level response, assuming you are referring to a cluster hosted in AWS (the introduction is covered in the blog below, and more detailed posts will follow).

http://hortonworks.com/blog/making-elephant-fly-cloud/

The following is the high-level deployment scenario for a cluster in AWS. The Hortonworks cloud engineering team has made improvements in the S3 connector, the ORC layer, and Hive, all available in our cloud distribution, HDC. In a Hive benchmark test we saw about 2.5x performance improvement on average, and 2-14x across individual queries, versus vanilla HDP on AWS. HDFS on EBS is used for intermediate data, while S3 is used for persistent storage. We are also enabling LLAP to cache columnar data sitting on S3, to further improve query performance on S3. Please stay tuned for the rest of the blog series (and do remind us if you don't see the posts soon).

(Attached diagrams: high-level deployment architecture of the cluster in AWS.)

Guru

@Saumitra Buragohain Thank you for putting this into proper perspective before parachuting into the weeds!