07-11-2018 07:31 AM - edited 07-11-2018 09:22 AM
07-11-2018 08:52 AM
This sounds like a result of the drastically increased link latency between your two "racks". While within a single rack you should see latencies less than a millisecond, US-EU latencies will be around 150ms, depending on where in the US and EU your machines are located. Bandwidth between your locations is likely also much lower than between the racks.
Impala currently does not do any rack-aware scheduling of I/O and data exchanges. In addition it is not optimized for high variance in link latencies and throughput. HDFS itself to my knowledge also makes no optimizations for such a case.
Frankly, I don't think you will see good performance in such a scenario. If you want to increase data availability, you could explore replicating the data between your locations while running queries in only one at a time. If you want to increase service availability, you can look into using a load balancer and switching from one cluster to the other in case of failure.
07-11-2018 10:55 AM - edited 07-11-2018 11:14 AM
07-11-2018 02:42 PM
Yes, creating two clusters is what you could try. I'm no expert in setting this up and unfortunately I also don't have good advice on which tooling to use. distcp certainly could be worth a try.
Within a country your experience will depend on where your machines are, and you'll likely also be affected by reduced bandwidth between data centers.
I'm not sure about other services' behavior when running across racks. Impala is not (yet) rack-aware in its scheduling and exchanges. However, even once we get to adding support for rack-awareness, we might assume that the racks are within a single data-center.
07-12-2018 06:16 AM
Syncronizing data between clusters can be accomplished via distcp, BDR, or ingesting data into both clusters simulatenously using 3rd party tools. The best tool depends on your use case, risk tolerance, and budget.
We don't recommend spanning clusters across large geographic regions (e.g. US to EU); network latency and bandwidth are usually not suitable and could easily result in the slow query times you're experiencing.
We DO support spanning clusters across AWS Availability Zones if certain conditions are met; see Appendix A of Cloudera Enterprise Reference Architecture for AWS Deployments (PDF) details. For comparison, the latency between AWS AZs is typically sub-millisecond.
Spanning bare metal clusters across multiple data centers will be addressed in the next release of Cloudera Enterprise Reference Architecture for Bare Metal Deployments (PDF), to coincide with C6. It will look similar to the AWS guidance, but with the additional caveat that network latency between sides should not exceed 10ms.
Not all services provide HA.