Created 06-17-2016 06:41 AM
Guys,
- How to configure Hadoop/Hive for scale of queries hitting it from the API layer ? How many concurrent connnections to WebHDFS can be supported?
- How many concurrent queries can be executed on Hive?
- How to configure Hadoop/Hive for scale of queries hitting it from the API layer
- what data access layer then does one use with Other Platforms from the service layer into Hive/Hadoop? Is is just a JDBC connection at that point or something else?
Thanks.
Created 06-17-2016 08:14 AM
"How many concurrent queries can be executed on Hive"
There is no hard limit. You have two potential bottlenecks
a) HiveServer ( I could reach around 20-30 q/s in the newest Hiveserver version ) but you could add more Hiveservers. You should also increase heap to 8GB and you can increase handler threads a bit.
b) Tez on Yarn. I reached up to 200 concurrent Hive Sessions (Application Masters ) once on a 50 node cluster
But it depends much much more on the complexity of queries you are running. If your queries need 200 containers each on a cluster with 1000 slots you can only run 5 queries in parallel
"How to configure Hadoop/Hive for scale of queries hitting it from the API layer"
a) The most important part is to configure your queries to use as few resources ( containers ) as possible ( partitioning, predicate pushdown ... ).
b) Container reuse
Tez is already configured for Session and container reuse but make sure this works properly. You can also tweak the container release settings in tez. (
tez.am.session.min.held-containers
tez.am.container.idle.release-timeout-max.millis
tez.am.container.idle.release-timeout-min.millis )
Essentially each JDBC connection is one Session. Or you can initialize sessions and Hiveserver will distribute queries among them ( hive.server2.tez.sessions. per.default.queue, hive.server2.tez.initialize.default. sessions) . I prefer to keep the default but then you need to make sure your number of jdbc connections fits to the concurrency of your systemc) Number Sessions
Always keep an eye out on the cluster.
Also use the newest version of HDP or try disabling the timeline server ( in the hive settings not just switch it off )
"what data access layer then does one use with Other Platforms from the service layer into Hive/Hadoop? Is is just a JDBC connection at that point or something else?"
A JDBC pool with x number of open connections that take queries from a pool is best. As said in default mode each jdbc connection will be one Application Master. Also jdbc is faster than beeline/hive shell when it comes to data transfer.
"How to configure Hadoop/Hive for scale of queries hitting it from the API layer ? How many concurrent connnections to WebHDFS can be supported?"
No ide what you mean. Webhdfs? Tez reads data directly from HDFS and it will not be the limit. Your limit will be either IO or much more likely CPU power in your cluster. Or some configuration bottleneck.
Finally it all depends on you queries and your workload so nobody will give you a magical recommendation.
Created 06-17-2016 08:14 AM
"How many concurrent queries can be executed on Hive"
There is no hard limit. You have two potential bottlenecks
a) HiveServer ( I could reach around 20-30 q/s in the newest Hiveserver version ) but you could add more Hiveservers. You should also increase heap to 8GB and you can increase handler threads a bit.
b) Tez on Yarn. I reached up to 200 concurrent Hive Sessions (Application Masters ) once on a 50 node cluster
But it depends much much more on the complexity of queries you are running. If your queries need 200 containers each on a cluster with 1000 slots you can only run 5 queries in parallel
"How to configure Hadoop/Hive for scale of queries hitting it from the API layer"
a) The most important part is to configure your queries to use as few resources ( containers ) as possible ( partitioning, predicate pushdown ... ).
b) Container reuse
Tez is already configured for Session and container reuse but make sure this works properly. You can also tweak the container release settings in tez. (
tez.am.session.min.held-containers
tez.am.container.idle.release-timeout-max.millis
tez.am.container.idle.release-timeout-min.millis )
Essentially each JDBC connection is one Session. Or you can initialize sessions and Hiveserver will distribute queries among them ( hive.server2.tez.sessions. per.default.queue, hive.server2.tez.initialize.default. sessions) . I prefer to keep the default but then you need to make sure your number of jdbc connections fits to the concurrency of your systemc) Number Sessions
Always keep an eye out on the cluster.
Also use the newest version of HDP or try disabling the timeline server ( in the hive settings not just switch it off )
"what data access layer then does one use with Other Platforms from the service layer into Hive/Hadoop? Is is just a JDBC connection at that point or something else?"
A JDBC pool with x number of open connections that take queries from a pool is best. As said in default mode each jdbc connection will be one Application Master. Also jdbc is faster than beeline/hive shell when it comes to data transfer.
"How to configure Hadoop/Hive for scale of queries hitting it from the API layer ? How many concurrent connnections to WebHDFS can be supported?"
No ide what you mean. Webhdfs? Tez reads data directly from HDFS and it will not be the limit. Your limit will be either IO or much more likely CPU power in your cluster. Or some configuration bottleneck.
Finally it all depends on you queries and your workload so nobody will give you a magical recommendation.
Created 07-28-2016 08:14 PM
I'm curious what the advantage of suppressing Hive ATS reporting might be. From an esthetic standpoint we really don't want the UI filled up with myriads of successful, short-running queries, but it would be nice to switch it on in a case by case basis for debugging purposes. Beyond that, would turning it off improve query latency?
Created 07-29-2016 10:45 AM
the big issue is that ATS 1.0 often couldn't keep up with 10s of queries per second on large clusters. And in some situations this limited the number of queries running in the cluster. Like really bad. Like cluster being empty because it would wait for ATS bad. There were some tuning options to make that better but by and large the single ATS server and single leveldb backend had limitations. So less aesthetic and more performance. In ATS 1.5 they made it better ATS 2.0 hopefully fixes that problem once and for all.
Created 08-01-2016 12:53 PM
Thanks for the explanation. We're having significant scaling issues with our 32-host Hortonworks HDP-2.3.2.0-2950 installation. How do I determine which version of ATS is installed? I do not see it listed in the 'Stacks and Versions' page in Ambari. Assuming we're running one of the troublesome versions, what's the most expedient way to disable reporting to ATS from Hive? Since it is useful for debugging, I'm hoping there's a session parameter we can set at query time to suppress reporting when performance is an issue.
Created 08-01-2016 01:31 PM
I think you can try it for one application, one possibility is to simply switch off ATS for a bit that helped me once but not a second time ( Tez still tries to log to it ) So if you really want to switch it off completely you can add the following settings:
tez.history.logging.service.class = org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService and to see the logs: tez.simple.history.logging.dir=/mylogdir
Also removing the following ATSHooks
hive.exec.pre.hooks= hive.exec.post.hooks=
Also potentially reduce log levels
hive.tez.log.level=ERROR
And see if it makes things faster. Again if you don't see a difference you may have other issues. But its worth to rule out. ATS 1.5 has been enabled in HDP2.4. Also ATS1.0 has some tuning options. If that is really your bottleneck Hortonworks support may be able to help.