When I add a new rack, some Impala queries become extremely slow!
Labels: Apache Impala, Cloudera Manager, HDFS
Created on ‎07-11-2018 07:31 AM - edited ‎09-16-2022 06:26 AM
The problem is that when I start the Impala daemons on the 10 new nodes (rack2), the execution time of the same queries becomes extremely long (e.g. 4.5 min).
NB: I realize that a single rack should be faster than two separate racks, but in my case the difference is huge (about x100)!! And what about rack awareness in Hadoop?
Here are the profile files of the query in the two cases:
- 1 rack (15 nodes): query profile - 2.8 sec
- 2 racks (15 + 10 nodes): query profile - 4.5 min
Thanks in advance.
Created ‎07-11-2018 08:52 AM
This sounds like a result of the drastically increased link latency between your two "racks". While within a single rack you should see latencies of less than a millisecond, US-EU latencies will be around 150 ms, depending on where in the US and EU your machines are located. Bandwidth between your locations is likely also much lower than it would be between racks in a single data center.
Impala currently does not do any rack-aware scheduling of I/O and data exchanges. In addition, it is not optimized for high variance in link latencies and throughput. HDFS itself, to my knowledge, also makes no optimizations for such a case.
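One quick sanity check is to compare round-trip latency and rough bandwidth within the original rack versus across the two locations. The hostnames below are placeholders for your own nodes, and iperf3 is just one of several tools you could use for the bandwidth test:
```
# Round-trip latency: within the original rack vs. across to the new one.
# Hostnames are placeholders for your own nodes.
ping -c 10 node05.rack1.example.com   # within a rack: typically well under 1 ms
ping -c 10 node03.rack2.example.com   # US-EU link: expect on the order of 150 ms

# Rough bandwidth check across the link (requires iperf3 on both ends;
# run "iperf3 -s" on the remote node first).
iperf3 -c node03.rack2.example.com -t 30
```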
Frankly, I don't think you will see good performance in such a scenario. If you want to increase data availability, you could explore replicating the data between your locations while running queries in only one location at a time. If you want to increase service availability, you can look into using a load balancer and switching from one cluster to the other in case of failure.
Created on ‎07-11-2018 10:55 AM - edited ‎07-11-2018 11:14 AM
Did you mean that I have to create two clusters and synchronize the data between them? If yes, what is the best tool to do this? Peers, the HDFS DistCp command, or another technique?
Thanks again.
Created ‎07-11-2018 02:42 PM
Yes, creating two clusters is what you could try. I'm no expert in setting this up and unfortunately I also don't have good advice on which tooling to use. distcp certainly could be worth a try.
Within a country, your experience will depend on where your machines are, and you'll likely also be affected by reduced bandwidth between data centers.
I'm not sure about other services' behavior when running across racks. Impala is not (yet) rack-aware in its scheduling and exchanges. However, even once we add support for rack awareness, we might assume that the racks are within a single data center.
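For context, HDFS rack awareness itself is driven by a topology script referenced from the net.topology.script.file.name property in core-site.xml. The sketch below uses made-up subnets just to illustrate the mapping, and, as noted above, Impala's scheduler does not currently take this mapping into account:
```
#!/bin/bash
# Minimal sketch of an HDFS topology script (pointed to by
# net.topology.script.file.name in core-site.xml). HDFS passes one or
# more IP addresses/hostnames as arguments and expects one rack path
# per argument on stdout. The subnets below are made-up examples.
for host in "$@"; do
  case "$host" in
    10.1.*) echo "/rack1" ;;        # original 15 nodes
    10.2.*) echo "/rack2" ;;        # 10 new nodes in the second location
    *)      echo "/default-rack" ;;
  esac
done
```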
Created ‎07-12-2018 06:16 AM
Synchronizing data between clusters can be accomplished via DistCp, BDR (Cloudera Backup and Disaster Recovery), or by ingesting data into both clusters simultaneously using third-party tools. The best tool depends on your use case, risk tolerance, and budget.
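As a rough illustration of the DistCp option, a periodic one-way sync from the active cluster to a standby could look like the sketch below. The NameNode addresses and paths are placeholders, and the -update/-delete behavior is worth testing on a small directory first:
```
# Hedged sketch: one-way sync of a warehouse directory from the active
# cluster to a standby cluster. NameNode URIs and paths are placeholders.
# -update copies only files that have changed, -delete removes files on
# the target that no longer exist on the source, and -m caps the number
# of map tasks to limit bandwidth use on the WAN link.
hadoop distcp -update -delete -m 20 \
  hdfs://active-nn:8020/user/hive/warehouse \
  hdfs://standby-nn:8020/user/hive/warehouse
```
If Impala on the standby cluster queries those paths through existing tables, a REFRESH (or INVALIDATE METADATA) would typically be needed after each sync so it sees the newly copied files.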
We don't recommend spanning clusters across large geographic regions (e.g. US to EU); network latency and bandwidth are usually not suitable and could easily result in the slow query times you're experiencing.
We DO support spanning clusters across AWS Availability Zones if certain conditions are met; see Appendix A of the Cloudera Enterprise Reference Architecture for AWS Deployments (PDF) for details. For comparison, the latency between AWS AZs is typically sub-millisecond.
Spanning bare-metal clusters across multiple data centers will be addressed in the next release of the Cloudera Enterprise Reference Architecture for Bare Metal Deployments (PDF), to coincide with C6. It will look similar to the AWS guidance, but with the additional caveat that network latency between sites should not exceed 10 ms.
Kudu does not support rack awareness.
Not all services provide HA.
