I am looking to configure a cluster to support 1,000 concurrent Hive queries on a subset of 100 TB of data. I can partition and store the data anyway I want and use TEZ, ORC and whatever configuration will improvement performance.
I can support Spark SQL or any open source data accelerators that are available. I am looking for basic SQL with joins, nothing too fancy.
A lot of data will be small and this system is very easy to partition by customer and by store.
There are not a ton of columns and the data is very structured.
Would running 4 or more HiveServer2s help?
There's nothing else running the cluster, just Hive queries against the data store. And an HDF ingest adding more data.
This will be read only, controlled inserts.
Yahoo japan has published their scale graph till 1024 concurrent queries, with HDP-2.4.
Including a history of tuning options and fixes that went into their hot-fix branch.
Just to be clear, 1000 concurrent BI users is a different problem from 1000 concurrent queries, which is easier to tackle in LLAP due to the cluster sharing models.The 1000 users will get balanced up internally so that the fact that they have left Tableau open & gone for lunch won't cost server side resources.
Even before LLAP, HS2 can handle 1000 concurrent queries, because that was the exact question discussed in the YJP benchmark (not just concurrency, but throughput/utilization goals as well). With LLAP, we realized that a BI user session which runs a 3s query every 15s, is wasting capacity & terribly common with BI tools where someone is actively reading the screen & navigating their view.
So "open sessions" are different from "queries" now, which in the case of the 3s (every 15s) is a significant improvement in session/user scalability. Much fewer nodes and far easier to configure that.
There is a limit of connections per queue, 10. For example, a Tableau Server connected to Hive would use 1 connection. However, 10 Tableau Desktop users will use 10 connections. Each connection can support query concurrency, let's say 1000, if you have enough resources. Now, with some queries, not even infinite resources will make them scale.
Simple queries and smart servers that could cache and pool queries could work with enough resources. So we'll set expecations based on queries, data size and cluster size. Thanks for the tips.