Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How to determine the number of requests/connections going to Hive Metastore Database from HMS?

avatar
Explorer

I know we have Hive properties hive.metastore.server.max.threads and hive.metastore.server.min.threads. But how to find the Active threads of HMS. And what (job or user) causing the more active threads? 

1 ACCEPTED SOLUTION

avatar
Expert Contributor

Hi @BTKR/Tej,

 

Your subject and your actual question differ a lot and both are completely different things.

If you are concerned about the number of connections going to Metastore database from the Hive Metastore process, you can use the below way:

 

1. Find out the PID of HMS process on the server using the below command:

ps -ef | grep -i hivemetastore

 

2. Once you have the PID, get the put of below command:

lsof -p PID | grep ESTABLISHED

This will give you list of all the connections being made to and by the Hivemetastore process. This will also include the connections made "TO" Hivemetastore process from the clients i.e. from the Hive CLI shells.

Please look for the database type in the output to confirm the connections FROM Hivemetastore to HMS DB.

For example on my side, I have the below output:

Screenshot 2019-10-25 at 10.49.23 AM.png

In the above photo, all the outputs that have "mysql" in it are being made from HiveMetastore Process to HMS DB.

 

As per your question in the description, if you want to find out how many threads are try to connect to the database at any time, you can collect the jstack of the HMS process and then look for the threads referring the mysql calls(which is the database type in my case, you can look for oracle or postgres if any of those are your database types).

 

Also, I get a feeling that you are concerned by the number of connections being made to the database.

You can check the below property via Hive CLI and beeline(this property will not listed in the Ambari as it is built in):

 

set datanucleus.connectionPool.maxPoolSize;

--This will give to the connection pool size. 10 is the default value, if set to something else, please let me know.

Also, share the output of below query:
set datanucleus.connectionPoolingType;

 

Do confirm the exact HDP version you are on!!!

 

Please note the connection pool set to 10 does not mean there will only 10 connections to HMS DB, there can be more connections, but if this value is increased, the number of connections also increases exponentially to HMS DB.

 

Sometimes, it is suggested to increase the connection pool size to accommodate the huge load of queries on Hive.

So, if you are using your Hive services extensively, and the connectionpoolsize is set to a higher value, I would suggest to fix the issue on the HMS DB side to to allow more number of connections.

For example, on MySQL, there is max_connections, you can increase it to 1000 or more.

 

Let me know if the above information was helpful!!

 

Thanks,

Rohit Rai Malhotra

View solution in original post

4 REPLIES 4

avatar
Explorer

Let me explain my question very clearly, if Hive Meta Store getting connection failure alerts because of the more load (more connections) , how could we find what process causing the more connections to HMS.?

avatar
Expert Contributor

Hi @BTKR/Tej,

 

Your subject and your actual question differ a lot and both are completely different things.

If you are concerned about the number of connections going to Metastore database from the Hive Metastore process, you can use the below way:

 

1. Find out the PID of HMS process on the server using the below command:

ps -ef | grep -i hivemetastore

 

2. Once you have the PID, get the put of below command:

lsof -p PID | grep ESTABLISHED

This will give you list of all the connections being made to and by the Hivemetastore process. This will also include the connections made "TO" Hivemetastore process from the clients i.e. from the Hive CLI shells.

Please look for the database type in the output to confirm the connections FROM Hivemetastore to HMS DB.

For example on my side, I have the below output:

Screenshot 2019-10-25 at 10.49.23 AM.png

In the above photo, all the outputs that have "mysql" in it are being made from HiveMetastore Process to HMS DB.

 

As per your question in the description, if you want to find out how many threads are try to connect to the database at any time, you can collect the jstack of the HMS process and then look for the threads referring the mysql calls(which is the database type in my case, you can look for oracle or postgres if any of those are your database types).

 

Also, I get a feeling that you are concerned by the number of connections being made to the database.

You can check the below property via Hive CLI and beeline(this property will not listed in the Ambari as it is built in):

 

set datanucleus.connectionPool.maxPoolSize;

--This will give to the connection pool size. 10 is the default value, if set to something else, please let me know.

Also, share the output of below query:
set datanucleus.connectionPoolingType;

 

Do confirm the exact HDP version you are on!!!

 

Please note the connection pool set to 10 does not mean there will only 10 connections to HMS DB, there can be more connections, but if this value is increased, the number of connections also increases exponentially to HMS DB.

 

Sometimes, it is suggested to increase the connection pool size to accommodate the huge load of queries on Hive.

So, if you are using your Hive services extensively, and the connectionpoolsize is set to a higher value, I would suggest to fix the issue on the HMS DB side to to allow more number of connections.

For example, on MySQL, there is max_connections, you can increase it to 1000 or more.

 

Let me know if the above information was helpful!!

 

Thanks,

Rohit Rai Malhotra

avatar
Explorer

Rohit,

 

  Thanks for your response. 

output of 

set datanucleus.connectionPool.maxPoolSize;  is 80 ;

HDP version 2.6.5.4

As you suggested lsof -p PID | grep ESTABLISHED    will provide the connections made to HMS.

My question is if your cluster has 1000 nodes. If you run a any hive or spark job it will get the resources from many nodes and run on many nodes. 

If you want to find out what process causing the more Threads to the HMS. How could i find out?

I believe if we run a job it may cause many number of threads depends on the job.

is there any way to find out what process which is causing more threads to HMS?

 

Regards,

Tej

avatar
Explorer

Rohit,

 

   Sorry for not being so specific in my post. is there any way to find out the process(job) which is causing most threads to Hive Meta Store in a 1000 node cluster?

 

Regards,

Tej