11-28-2017 02:41 AM
For some performances measurement, I would like to control the number of fragment of a query. Is it possible?
Also, is it possible to have more than 1 fragment by impala daemon?
11-28-2017 12:22 PM
I wanted to clarify the question a bit to understand what you're trying to achieve. Impala really has two different concepts.
"Fragments" are a way of breaking down the query plan into units that can be executed in a distributed manner. You can see these in query plans with explain_level >= 2. They show up as sections of the plan with a heading like "F00: PLAN FRAGMENT". There are only two modes here. The default is to produce a distributed plan, which is broken up into fragments. The alternative, when the option num_nodes is set to 1, is to produce a single-node plan with only a single fragment.
The other concept is "fragment instances", which is the number of instances of each plan fragment that are run by the query. By default you generally get 0 or 1 fragments per impala daemon, depending on whether there is any data to scan, but we will do the scanning of data in a multi-threaded way. We have a new mode, under development, where you get multiple fragments per Impala daemon, controlled by the mt_dop query option. This only works for some queries, without inserts or joins and can sometimes consume a lot more resources. mt_dop can increase throughput of queries if the bottleneck is outside of the scan, e.g. in an aggregation.
11-29-2017 01:04 AM
Many thanks for the answer! The mt_dop is exactly what we need.
I hope this development will be available with impala 2_8.
The usecase is we are migrating from a "many small servers" cluster to a "fewer bigger servers" cluster, with a 6 time factor reduction. Even with the same hardware performances, we end up having too few fragment instances to exploit all cpu.