We have created a cluster on EC2 instances in AWS. We are being advised to span it across multiple availability zones to give our platform higher availability. Is there any Hadoop architecture design that supports this? I know you answered this question 3 years ago and said there wasn't, but has there been any update since then?
While it is possible to span a cluster across multiple Availability Zones (AZs), we don’t generally recommend it, for the following reasons:
1. Latency: In AWS, data moving between AZs incurs additional network latency, which can lead to performance problems or other issues in the cluster. These issues become more pronounced as the cluster size and workload increase.
2. Double billing: Data transfer between nodes is a natural part of cluster operation, and in a multi-AZ cluster some of that traffic crosses AZ boundaries. According to the AWS FAQs: Each instance is charged for its data in and data out at corresponding Data Transfer rates. Therefore, if data is transferred between these two instances, it is charged at "Data Transfer Out from EC2 to Another AWS Region" for the first instance and at "Data Transfer In from Another AWS Region" for the second instance. Please refer to this page for data transfer details: https://aws.amazon.com/ec2/faqs/
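To make the double-billing point concrete, here is a minimal sketch of the arithmetic in Python. The function name and the per-GB rates are hypothetical placeholders, not actual AWS prices; check the AWS pricing pages for the rates that apply to your account and region.

```python
def cross_az_transfer_cost(gb_transferred, rate_out_per_gb, rate_in_per_gb):
    """Estimate the charge for traffic that crosses an AZ boundary.

    The same bytes are billed twice: once as "data out" on the sending
    instance and once as "data in" on the receiving instance.
    All rates are caller-supplied assumptions, not real AWS prices.
    """
    if gb_transferred < 0:
        raise ValueError("gb_transferred must be non-negative")
    return gb_transferred * (rate_out_per_gb + rate_in_per_gb)


# Placeholder rates (assumed for illustration only): 0.01 per GB each way.
monthly_cross_az_gb = 1000
estimate = cross_az_transfer_cost(monthly_cross_az_gb, 0.01, 0.01)
print(f"Estimated cross-AZ transfer charge: {estimate:.2f}")
```

With these placeholder rates, every GB that crosses AZs costs twice the one-way rate, so replication and shuffle traffic in a busy cluster can add up quickly.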
However, mission requirements and/or regulations may require spanning multiple availability zones, and that configuration is possible and supported.
I would also like to add that we support multiple AZs, but not multiple Regions within a single cluster.