Community Articles

jknulst · ‎10-09-2017

How you can easily create a Kibana dashboard for monitoring your Metron deployment

Suppose you have all your Metron topologies up-and-running and error free. Your work is not finished then. Metron is meant to be a (near) real time alerting system, so how do you guarantee that events flow through the Metron chain of Kafka topics and Storm topologies in a timely manner? What if some events belonging to some threat are hung up in one of the Kafka topics? How to keep track of the topology performance when the throughput of Metron increases over time?

The way to check this is to constantly monitor the consumer lags of the Kafka topics, like this:

**
/usr/hdp/2.5.3.0-37/kafka/bin/kafka-consumer-offset-checker.sh --zookeeper $ZOOKEEPER_HOST:2181 --security-protocol SASL_PLAINTEXT --topic indexing --group indexing


[2017-10-09 21:10:20,278] WARN WARNING: ConsumerOffsetChecker is deprecated and will be dropped in releases following 0.9.0. Use ConsumerGroupCommand instead. (kafka.tools.ConsumerOffsetChecker$)
Group           Topic                          Pid Offset          logSize         Lag             Owner
indexing        indexing                       0   511907295       511907378       83              none
indexing        indexing                       1   511907258       511907387       129             none
indexing        indexing                       2   511907294       511907381       87              none
indexing        indexing                       3   511907138       511907384       246             none
indexing        indexing                       4   511907226       511907378       152             none
indexing        indexing                       5   511907271       511907376       105             none
indexing        indexing                       6   511907250       511907381       131             none
indexing        indexing                       7   511907249       511907379       130             none
indexing        indexing                       8   511907248       511907383       135             none
indexing        indexing                       9   511907283       511907381       98              none
indexing        indexing                       10  511907260       511907383       123             none
indexing        indexing                       11  511907127       511907379       252             none

**We'll assume the cluster is Kerberised.

Or alternatively:

echo "security.protocol=SASL_PLAINTEXT" > /tmp/consgroupcmd.prop
/usr/hdp/2.5.3.0-37/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $BROKER_LIST --new-consumer --describe --command-config /tmp/consgroupcmd.prop --group enrichments

GROUP                          TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
enrichments                    enrichments                    8          260926936       260927064       128
enrichments                    enrichments                    3          260926857       260927060       203
enrichments                    enrichments                    0          260927011       260927062       51
enrichments                    enrichments                    1          260926916       260927068       152
enrichments                    enrichments                    6          260926971       260927059       88 
enrichments                    enrichments                    10         260926914       260927066       152
enrichments                    enrichments                    9          260926970       260927059       89 
enrichments                    enrichments                    11         260926978       260927066       88 
enrichments                    enrichments                    4          260926985       260927066       81
enrichments                    enrichments                    5          260926945       260927067       122
enrichments                    enrichments                    7          260926979       260927066       87 
enrichments                    enrichments                    2          260926984       260927065       81              
[2017-10-09 21:32:28,060] WARN TGT renewal thread has been interrupted and will exit. (org.apache.kafka.common.security.kerberos.KerberosLogin)

The downside of this last command that uses Kafka's ConsumerGroupCommand is that will only return output when the consumer group is active.

In this case both the indexing and the enrichments consumer groups both look healthy as the lag is only around 100-200 per partition. Low lags are good while high consumer lags may signal some kind of problem:

-The topology might not have enough resources (anymore) to cope with the high input (Kafka production) into its source topic

-The topology might be deactivated, crashed or have a problem with the KafkaSpout configuration

When you consumer lags keeps increasing you might run into a state where the topic offset where it was consuming from is no longer on disk because of the topics aging out retention settings (either 'retention.ms' or 'retention.bytes' is exceeded). You should always config the topic to prevent this from happening. If it happens nonetheless manual intervention is required to set things in motion again.

But the most compelling argument to start monitoring Metron consumer lags is that you don't want the events signaling a threat being queued in some Kafka topic after millions of other events. By the time the alert eventually reaches the alerts UI it could already be too late to contain the threat.

It makes sense to schedule the kafka-consumer-offset-checker command and push the results into Elasticsearch (ES) to populate a dashboard. In this how-to I am using a bash script to collect the metrics, leverage logstash (also part of the ES stack) to send the results to ES and use Kibana to visualize.

The script will be like this:

#!/bin/bash

BROKER_LIST=broker-1:6667,broker-2:6667,broker-3:6667
ZOOKEEPER_HOSTS=zk-host1:2181,zk-host2:2181,zk-host3:2181
KAFKA_HOME=/usr/hdp/2.5.3.0-37/kafka

kinit -kt /etc/security/keytabs/metron.headless.keytab metron@<YOUR-REALM>

get_consumer_lag () {

    TOPIC=`echo $1 | cut -d':' -f1`
    GROUP=`echo $1 | cut -d':' -f2`

    $KAFKA_HOME/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --security-protocol SASL_PLAINTEXT --topic $TOPIC --broker-list $BROKER_LIST --time -2 | sed -n "/$TOPIC/p" | sed -e "s/:/\t/g" > /tmp/kafka_start_offset

    TOPIC_MIN_OFFSET=`cat /tmp/kafka_min_offset | cut -f3 | paste -sd+ | bc`

    $KAFKA_HOME/bin/kafka-consumer-offset-checker.sh --zookeeper $ZOOKEEPER_HOSTS --security-protocol SASL_PLAINTEXT --topic $TOPIC --group $GROUP | tr -s " " | cut -d' ' --output-delimiter=' ' -f 1,3,4,5,6 | sed -e '1d' > /tmp/kafka_lag_summary

    TOPIC_MAX_OFFSET=`cat /tmp/kafka_lag_summary | cut -f4 | paste -sd+ | bc`
    CONSUMER_OFFSET=`cat /tmp/kafka_lag_summary | cut -f3 | paste -sd+ | bc`
    CONSUMER_LAG=`cat /tmp/kafka_lag_summary | cut -f5 | paste -sd+ | bc`
    TOPIC_SIZE=$(( $TOPIC_MAX_OFFSET-$TOPIC_START_OFFSET ))
    CONSUMER_REL_MIN=`echo "scale=2; ( $CONSUMER_OFFSET-$TOPIC_MIN_OFFSET ) / $TOPIC_SIZE" | bc -l`

    #CONSUMER_GROUP  TOPIC  TOPIC_MIN_OFFSET  TOPIC_MAX_OFFSET  TOPIC_SIZE   CONSUMER_OFFSET CONSUMER_REL_MIN    CONSUMER_LAG
    echo "$GROUP  $TOPIC  $TOPIC_MIN_OFFSET $TOPIC_MAX_OFFSET   $TOPIC_SIZE   $CONSUMER_OFFSET  $CONSUMER_REL_MIN    $CONSUMER_LAG"
}  

for CONSUMER_TOPIC_AND_GROUP in "suricata:suricata_parser" "checkpoint_lea:checkpoint_lea_parser" "enrichments:enrichments" "indexing:profiler" "indexing:indexing" "indexing:metron"
do
    get_consumer_lag $CONSUMER_TOPIC_AND_GROUP
done

1. Copy the script, customize the BROKER_LIST and ZOOKEEPER_HOSTS variables and the list of " <topic>:<consumer_group> " combinations to loop over. Save it at a location, I will save it at /home/ubuntu/kafka_summary.sh

2. Don't forget to make the new script executable

Some explanation:

The script fires 2 Kafka CLI commands per <topic>:<consumer_group>. Moreover it strips the stdout, makes the data columnar, aggregates over topic partitions (if any) and performs some calculations to come to 1 summary line.

3. Run the script once and check the output. For each <topic>:<cons_group> combination, it should look like this:

[2017-10-09 22:47:22,432] WARN TGT renewal thread has been interrupted and will exit. (org.apache.kafka.common.security.kerberos.KerberosLogin)
[2017-10-09 22:47:23,328] WARN WARNING: ConsumerOffsetChecker is deprecated and will be dropped in releases following 0.9.0. Use ConsumerGroupCommand instead. (kafka.tools.ConsumerOffsetChecker$)
indexing  indexing  5129160011 6143890404   1014730393   6143708348  .99    182056

Lets focus on the line with the numbers. The columns should be read as :

CONSUMER_GROUP TOPIC TOPIC_MIN_OFFSET TOPIC_MAX_OFFSET TOPIC_SIZE CONSUMER_OFFSET CONSUMER_REL_MIN CONSUMER_LAG

The column that need some explanation:

TOPIC_MIN_OFFSET	The lowest available topic offset in Kafka. The data is from the GetOffsetShell tool. This offset will increase as Kafka ages out log segments according to the retention settings. The start offsets of all partitions are aggregated***.
TOPIC_MAX_OFFSET	The ever increasing offset at the most recent end of the topic, as new events are produced. The offsets of all partitions are aggregated
TOPIC_SIZE	TOPIC_MAX_OFFSET - TOPIC_MIN_OFFSET
CONSUMER_OFFSET	The offset where the consumer group is currently at, aggregated over all partitions
CONSUMER_REL_MIN	This indicates how many percentage points the consumer has progressed relative to the TOPIC_MIN_OFFSET. The formula is ( $CONSUMER_OFFSET-$TOPIC_MIN_OFFSET ) / $TOPIC_SIZE. This is a measure of how far the consumer is away from 'the edge' where the data is deleted from the 'old' end of the topic. When this happens event data is lost. High values (0.99) signal that the consumer is as far from the edge as it can be, while low values (< 0.10) call for immediate action.
CONSUMER_LAG	How far the consumer group is behind the TOPIC_MAX_OFFSET, aggregated over all partitions

All these metrics could also be evaluated at the partition level, but I chose to NOT have this level of detail anymore as it complicates matters dramatically later on in Kibana. As a first defense against consumer problems, aggregating over partitions is good enough. Just beware that sometimes lag can be high for one partition while the consumer only has a small lag for another. Usually this is the advent for greater problems shortly after.

The line chart below shows just the time series for the CONSUMER_LAG metric over 36 hours in Kibana. The metrics of the kafka_summary.sh script were run every 5 minutes.

Now we can get some interesting insights about our Metron deployment. The y-axis (consumer lag) has a 'square root' scale type so values in the upper part of the graph are actually worse than it appears.

The consumer groups 'suricata_parser' and 'checkpoint_lea_parser' make up the parser topologies. Both have negligible consumer offsets over time, so they seem more than capable to cope with the incoming volume of sensor events. The same is true for the indexing topology.

In this instance the enrichment topology was restarted and configured to reprocess all of the enrichments topic from the earliest available offset in Kafka (start all over). Although the lag was very high in the beginning at 800 million events, the topology was able to catch up again after 19 hours of processing at roughly 6:30 in the morning. This actually tells us something interesting about the maximum capacity of the enrichment topology. Something you would want to know anyway before taking Metron to production. After catching up it was able to keep up, alas a few non-serious spikes between 10:00 and 14:00.

The blue consumer named 'metron' is exemplary for how things can go wrong if you don't monitor. This group consumes from the indexing topic and is actually not a Metron related consumer but a NiFi consumer that takes the data elsewhere. We can see that the the lag for 'metron' is increasing at roughly the same rate as 'enrichments' decreases because they share the indexing topic as source and sink. Worrying is that 'metron' can not keep the pace, by far, at least not until 14:00 on the second day. So what happened at around 16:00 the first day when the lag was at a whopping 1 billion (!!) events? It seems as though the NiFi process suddenly came alive and consumed and committed a huge volume within 15 minutes. But further investigation showed that this was not what happened. Actually 'old' topic log segments were deleted by Kafka retention policies at the point where the slow NiFi process was consuming from. The consumer offset was lost and stuck but the auto.offset.reset property in the NiFi Kafka consumer made it revert toward the latest available offset in the indexing topic. This stresses the point for monitoring consumer lags closely; data and possible threats could have been lost in this sudden leap through the topic. This should be prevented at all costs.

After the big leap the consumer still was not fast enough to keep up with the indexing topic and again had a lag of nearly 500 million 17 hours later. Then the NiFi process got tuned which increased its capacity. Moreover, the influx into the indexing topic diminished sharply after the enrichment topology cought up 07:00, so the NiFi process was under far less stress and should still be under scrutiny and may need further tuning.

So we have seen that monitoring the consumer lags can be very telling about the Metron topologies and other processing that have a relation to the Metron Kafka topics.

In the next part of this series I will show how to embed the script kafka_summary.sh in a Logstash pipeline to push the metrics into ES and from there into a Kibana dashboard.

Cloudera Community

Community Articles

Metron monitoring with Kafka, Logstash and Kibana [PART-1]

Apache Metron