Community Articles

Find and share helpful community-sourced technical articles.
Announcements
Celebrating as our community reaches 100,000 members! Thank you!
Labels (1)
avatar

The general perception that one needs to thoroughly know the code base for Understanding and Debugging a HDP service. Do appreciate the fact that HDP services including the execution engine (MR, TEZ, SPARK) are all JVM based processes. JVM along with the Operating Systems provides various knobs to know what the process is doing and performing at run-time.

The steps mentioned below can be applied for understanding and debugging any JVM process in general.

Lets take an example of how HiveServer2 works assuming one is not much acquainted or has a deep understanding of the code base. We are assuming one knows what the service does, but how it works internally the user is not aware of

1. Process Resource Usage

Gives a complete overview of the usage pattern of CPU, memory by the process providing a quick insight of the health of the process

85439-process-resource-usage.jpg



2. How to figure out which all service the process interacts with

ADD JMX parameters to the service allowing you to visualize whats happening within the JVM at run-time using jconsole or jvisualm


85429-jmx.jpg

What this ensures is the JVM is broadcasting the metrics on port 7001 and can be connected using jconsole or Jvisualm. There is no security enabled on can add SSL certificates too for authentication



What can we infer about the kind of threads we see from jconsole


Number of threads have peaked to 461, currently only 170 are active

85430-threads-info.jpg

Abandoned connection cleaner is cleaning the lost connections.

85431-kerberos.jpg

Kafka- kerberos-refres-thread : As HS2 supports kerberos, TGT needs to be refreshed after the renewal period. It means we dont have to manually refresh the kerberos ticket.

HS2 is interacting with Kafka and Atlas as can be seen by the threads above

85433-curator-framework.jpg

CuratorFramework is the class used for talking to zookeeper means HS2 is interacting with Zookeeper.

85434-solrclient.jpg

HS2 is interacting with SOLR and Hadoop services (RMI)

85435-ranger.jpg

HS2 is sending audits to ranger, means HS2 is interacting with Ranger

85436-handler.jpg

HS2 has some HiveServer2-Handler thread which are used for doing some reading from thrift Socket
(this is the thread used for responding to connection)

85437-gc.jpg

This view provides overall view of whats happening with the JVM.
GC : Old gen ParNew GC has been happening . and 10 mins have been spend on minor GC
ConcurrentMarkSweep is used for Tenured Gen and 1 min has been spend.

If you look VM summary (Total up-time of 3 days 3 hours) you can find when was the VM started and since then 10 mins has been spend on minor and 1
min on Major GC providing a good overview of how the process is tuned to handle GC and is the heap space sufficient.

85438-vmsummary.jpg

Hence the above information provides the HS2 interaction with
1. Ranger
2. Solr
3. Kafka
4. Hadoop Components (NameNode, DataNode)
5. Kerberos
6. As its running Tez it should be interacting with Resource Manager and Node Manager
7. Heap and GC performance



Lets take a further deep dive of how exactly JVM is performing over time.

use jvisualm to find real time info of Threads and sampling of CPU and RAM

85440-jvisualm.jpg

all the above mentioned information can be derived from command line too


1. find the pid of the process
ps -eaf | grep hiveserver2 (process name ) will fetch you the pid of the process

85443-ps.jpg

2. finding memory usage and gc at real time

85444-jstat.jpg

Total Heap = EDEN SPACE (S0C + S1C + EC ) + Tenured Gen ( OC)

Currently Used Heap = EDEN SPACE (S0U + S1U + EU ) + Tenured Gen (OU)

YGC = 15769 says till now how many time young GC has been done
FGC = 83 How many times the FGC has been done

If you see the count increasing too frequently its time to optimize it .




JMAP to find the instances available inside the class along with their count and memory footprint

85445-jmap.jpg

Know state of threads running use jstack

85450-jstack.jpg



top -p pid

85446-top.jpg


RES = memory it consumes in RAM which should be equal to the heap consumed



Find ports on which the process is listening to and which all clients its is connected

85448-netstat.jpg




To find any errors being reported by the service to client scan the network packets

tcpdump -D to find the interfaces the machine has

tcpdump -i network interface port 10000 -A (tcpdump -i eth0 port 10000 -A )

Scan the port 10000 on which HS2 is listening and look at the packets exchanged to find which client is having the exceptions

85447-tcpdump.jpg

Any GSS exception or any other exception reported by Server to client can be seen in the packet.




To know the cpu consumption and memory consumed by the process top -p pid

85451-top.jpg




To know the the disk I/O read write done by the process use iotop -p pid

85452-iotop.jpg





Some OS commands to know about the how the host is doing running the process
1. iostat -c -x 3 dispaly the CPU and Disk IO

2. mpstat -A ALL get utilization of individual CPU

3. iostat -d -x 5 disk utilization

4. ifstat -t -i interface network utilization






Take away

1. Any exception you see in the process logs for clients can be tracked in the network packet, hence you dont need to enable debug logs ( use tcpdump )
2. A process consumes exceptionally high memory under High GC specially frequent FULL and minor GC
3. In hadoop every service client has a retry mechanism (no client fails after one retry) hence search for retries in the log and try to optimize for that
4. jconsole and jvisualm reveals all important info from threads, memory, cpu utilization, GC
5. keep a check on cpu, network, disk and memory utilization of the process to get holistic overview of the process.
6. In case of failure take heap dump and analyse using jhat for deeper debugging. Jhat will generally consume 6X the memory of the size of the dump (20 GB heap dump will need 120 GB of heap for jhat to run, jhat -J-mx20g hive_dump.hprof)

7. Always refer to the code to have a correlation of th process behavior to memory foot print.
8. jmap output will help you understand what data structure is consuming most of heap and you can point to where in the code the data Structure is being used. Jhat will help you get the tree of the data Structure.


curator-framework.jpgjstack.jpg
1,361 Views