There is a general perception that one needs to know the code base thoroughly in order to understand and debug an HDP service. Do appreciate the fact that HDP services, including the execution engines (MR, Tez, Spark), are all JVM-based processes. The JVM, together with the operating system, provides various knobs to see what a process is doing and how it is performing at run time.
The steps mentioned below can be applied to understand and debug any JVM process in general.
Let's take the example of how HiveServer2 (HS2) works, assuming one is not deeply acquainted with the code base: one knows what the service does, but not how it works internally.
1. Process Resource Usage
This gives a complete overview of the process's CPU and memory usage pattern, providing quick insight into the health of the process.
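For a quick first look the standard OS tools are enough; a minimal sketch, assuming a single HiveServer2 JVM on the host and that its command line contains the string hiveserver2:
top -p $(pgrep -f hiveserver2)   # live CPU, memory (RES) and thread count for the HS2 process
ps -o pid,etime,%cpu,%mem,rss -p $(pgrep -f hiveserver2)   # one-shot snapshot: uptime, CPU, memory and resident set size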
2. How to figure out which services the process interacts with
Add JMX parameters to the service, which allows you to visualize what is happening within the JVM at run time using jconsole or jvisualvm.
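A typical set of JMX flags looks like the following (a sketch: port 7001 matches the description below, and how the flags are appended to the HiveServer2 JVM options, for example through hive-env.sh, depends on your HDP version):
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=7001
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false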
What these parameters ensure is that the JVM broadcasts its metrics on port 7001, which can then be connected to with jconsole or jvisualvm. No security is enabled in this example; SSL certificates can be added for authentication as well.
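Once the service is restarted with these flags, you can attach from any workstation that can reach the host, for example (the hostname is illustrative):
jconsole hs2-host.example.com:7001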
What can we infer from the kinds of threads we see in jconsole?
The number of threads has peaked at 461; currently only 170 are active.
The abandoned-connection cleanup thread is cleaning up lost connections.
kafka-kerberos-refresh-thread: as HS2 supports Kerberos, the TGT needs to be refreshed when the renewal period comes around, so we do not have to refresh the Kerberos ticket manually.
HS2 is interacting with Kafka and Atlas, as can be seen from the threads above.
CuratorFramework is the class used for talking to ZooKeeper, which means HS2 is interacting with ZooKeeper.
HS2 is interacting with Solr and Hadoop services (RMI).
HS2 is sending audits to Ranger, which means HS2 is interacting with Ranger.
HS2 has HiveServer2-Handler threads, which read from the Thrift socket (these are the threads that respond to client connections).
This view provides an overall picture of what is happening inside the JVM.
GC: ParNew GC (young generation) has been happening, and 10 minutes have been spent on minor GC in total.
ConcurrentMarkSweep is used for the tenured generation, and 1 minute has been spent on major GC.
If you look at the VM summary (total uptime of 3 days 3 hours) you can find when the VM was started; since then 10 minutes have been spent on minor GC and 1 minute on major GC, which gives a good overview of how the process is tuned to handle GC and whether the heap space is sufficient.
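If you cannot attach a GUI, the same GC history can be written to a log file. A minimal sketch of the flags, assuming a Java 8 era JVM like the ParNew/CMS setup described above (Java 9+ uses -Xlog:gc* instead) and an illustrative log path:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hive/hs2-gc.log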
Hence the above information shows HS2 interacting with:
1. Ranger
2. Solr
3. Kafka
4. Hadoop components (NameNode, DataNode)
5. Kerberos
6. Since it runs Tez, it should also be interacting with the ResourceManager and NodeManagers
It also gives an overview of heap and GC performance.
Let's take a deeper dive into how exactly the JVM is performing over time.
Use jvisualvm to get real-time information on threads and to sample CPU and memory.
All the above-mentioned information can also be derived from the command line.
1. Find the PID of the process
ps -eaf | grep hiveserver2 (the process name) will fetch the PID of the process
2. Find the memory usage and GC activity in real time (see the jstat sketch below)
Total heap = Young Gen (S0C + S1C + EC, i.e. the two survivor spaces plus Eden) + Tenured Gen (OC)
Currently used heap = Young Gen (S0U + S1U + EU) + Tenured Gen (OU)
YGC = 15769 shows how many young GCs have been done so far
FGC = 83 shows how many full GCs have been done so far
If you see these counts increasing too frequently, it is time to tune the heap.
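The S0C/EC/OC/YGC/FGC columns above come from jstat; a minimal sketch (5000 is a 5-second sampling interval, and the pgrep alternative for finding the PID assumes a single HS2 JVM on the host):
pgrep -f hiveserver2   # alternative way to get the HS2 PID
jstat -gc <pid> 5000   # prints S0C, S1C, EC, EU, OC, OU, YGC, FGC (and more) every 5 seconds
jstat -gcutil <pid> 5000   # the same data as percentages, handy for spotting a filling old gen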
Use jmap to find the instances of each class along with their count and memory footprint.
To know the state of the running threads, use jstack.
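A minimal sketch of both commands (redirect the output to files, since it can be large; the file names are illustrative):
jmap -histo:live <pid> > hs2_histo.txt   # per-class live instance counts and bytes, sorted by footprint
jstack <pid> > hs2_threads.txt   # one stack trace per thread, plus any detected deadlocks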
top -p <pid>
RES = the resident memory the process occupies in RAM, which should be close to the consumed heap plus some JVM overhead (metaspace, thread stacks).
Find the ports the process is listening on and the clients it is connected to.
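One way to list them (assuming netstat or ss is installed and you have the privileges to see the owning process):
netstat -anp | grep <pid>   # listening ports and established connections owned by the process
ss -antp | grep <pid>   # the same information via ss on newer hosts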
To find any errors being reported by the service to a client, scan the network packets.
tcpdump -D lists the interfaces the machine has.
tcpdump -i <network interface> port 10000 -A (for example, tcpdump -i eth0 port 10000 -A)
Scan port 10000, on which HS2 is listening, and look at the packets exchanged to find which client is hitting exceptions.
Any GSS exception, or any other exception the server reports to a client, can be seen in the packets.
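For offline analysis the same capture can be written to a file first; a small sketch (the file name is illustrative):
tcpdump -i eth0 port 10000 -w hs2_port10000.pcap   # capture the raw packets to a file
tcpdump -r hs2_port10000.pcap -A | grep -i exception   # replay the capture and grep the ASCII payload for errors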
To know the CPU and memory consumed by the process: top -p <pid>
To know the disk I/O reads and writes done by the process: iotop -p <pid>
Some OS commands to know how the host running the process is doing:
1. iostat -c -x 3 displays CPU and disk I/O statistics every 3 seconds
2. mpstat -P ALL gets the utilization of each individual CPU
3. iostat -d -x 5 shows extended disk utilization every 5 seconds
4. ifstat -t -i <interface> shows network utilization for that interface
Takeaways
1. Any exception the process reports to its clients can be tracked in the network packets, hence you don't need to enable debug logs (use tcpdump).
2. A process that consumes exceptionally high memory usually shows heavy GC activity, especially frequent full and minor GCs.
3. In Hadoop, every service client has a retry mechanism (no client fails after a single attempt), so search for retries in the logs and try to optimize for them.
4. jconsole and jvisualvm reveal all the important information about threads, memory, CPU utilization and GC.
5. Keep a check on the CPU, network, disk and memory utilization of the process to get a holistic overview of it.
6. In case of failure, take a heap dump and analyse it with jhat for deeper debugging (see the sketch after this list). jhat generally needs about 6x the size of the dump as heap: a 20 GB heap dump will need around 120 GB of heap for jhat to run, e.g. jhat -J-mx120g hive_dump.hprof.
7. Always refer to the code to correlate process behavior with its memory footprint.
8. The jmap output helps you understand which data structures consume most of the heap, so you can point to where in the code those data structures are used. jhat helps you get the reference tree of the data structure.
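A minimal sketch of the dump-and-analyse step from takeaway 6 (the PID and file name are illustrative; jhat ships with JDK 8 and earlier, on newer JDKs use VisualVM or Eclipse MAT instead):
jmap -dump:live,format=b,file=hive_dump.hprof <pid>   # write a heap dump of the live objects
jhat -J-mx120g hive_dump.hprof   # analyse it, sizing jhat's heap to roughly 6x the dump
The object histogram and reference tree are then browsable at http://localhost:7000 (jhat's default port).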