Created on 05-02-2016 05:22 PM - edited on 02-26-2020 06:23 AM by SumitraMenon
When adding a net new data source to Metron, the first step is to decide how to push the events from the new telemetry data source into Metron. You can use a number of data collection tools, and that decision is decoupled from Metron; however, we recommend evaluating Apache NiFi because it is an excellent tool for exactly that (this article uses NiFi to push data into Metron). The second step is to configure Metron to parse the telemetry data source so that downstream processing can be done on it. In this article we walk you through how to perform both of these steps.
In the previous article of this blog series, we described the following set of requirements for Customer Foo, who wanted to add the Squid telemetry data source into Metron.
In this article, we will walk you through how to perform steps 1, 2, and 6.
The following steps guide you through how to add this new telemetry.
cd /usr/hdp/current/kafka-broker/bin/
./kafka-topics.sh --zookeeper localhost:2181 --create --topic squid --partitions 1 --replication-factor 1
./kafka-topics.sh --zookeeper localhost:2181 --list
You should see the following list of Kafka topics:
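The exact set of topics depends on which Metron sensors your deployment includes, so the listing below is only a sketch; the important thing is that the squid topic you just created appears in the output:
bro
enrichments
snort
squid
yaf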
sudo yum install squid
sudo service squid start
sudo su -
cd /var/log/squid
ls
You can see that there are three types of logs available: access.log, cache.log, and squid.out. We are interested in access.log because that is the log that records proxy usage.
squidclient -h 127.0.0.1 http://www.hostsite.com
squidclient -h 127.0.0.1 http://www.hostsite.com
cat /var/log/squid/access.log
In production environments you would configure your users' web browsers to point to the proxy server, but to keep this tutorial simple we will use the client that is packaged with the Squid installation. After we use the client to simulate proxy requests, the Squid log entries should look as follows:
1461576382.642    161 127.0.0.1 TCP_MISS/200 103701 GET http://www.hostsite.com/ - DIRECT/199.27.79.73 text/html
1461576442.228    159 127.0.0.1 TCP_MISS/200 137183 GET http://www.hostsite.com/ - DIRECT/66.210.41.9 text/html
timestamp | time elapsed | remotehost | code/status | bytes | method | URL | rfc931 | peerstatus/peerhost | type
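For example, the first sample entry above breaks down into those fields as follows (the rfc931 field is "-" because no client identity was recorded):
timestamp            1461576382.642
time elapsed         161
remotehost           127.0.0.1
code/status          TCP_MISS/200
bytes                103701
method               GET
URL                  http://www.hostsite.com/
rfc931               -
peerstatus/peerhost  DIRECT/199.27.79.73
type                 text/html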
Now we are ready to tackle the Metron parsing topology setup.
WDOM [^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)
SQUID_DELIMITED %{NUMBER:timestamp} %{SPACE:UNWANTED} %{INT:elapsed} %{IPV4:ip_src_addr} %{WORD:action}/%{NUMBER:code} %{NUMBER:bytes} %{WORD:method} http:\/\/\www.%{WDOM:url}\/ - %{WORD:UNWANTED}\/%{IPV4:ip_dst_addr} %{WORD:UNWANTED}\/%{WORD:UNWANTED}
Notice that we define a WDOM pattern (tailored to Squid rather than using the generic Grok URL pattern) before defining the Squid log pattern. This is optional and is done for ease of use. Also notice that we apply the UNWANTED tag to any part of the message that we do not want included in the resulting JSON structure. Finally, notice that we named the IPV4 fields according to Metron's list of field naming conventions.
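To make that concrete, running the first sample log entry through the SQUID_DELIMITED pattern should produce a JSON message roughly like the sketch below. This is only a sketch: the field names come from the pattern above, the UNWANTED captures are dropped as described, and exact value types, timestamp normalization, and any extra metadata fields Metron adds at ingest depend on the parser version.
{
  "timestamp": "1461576382.642",
  "elapsed": "161",
  "ip_src_addr": "127.0.0.1",
  "action": "TCP_MISS",
  "code": "200",
  "bytes": "103701",
  "method": "GET",
  "url": "hostsite.com",
  "ip_dst_addr": "199.27.79.73"
}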
touch /tmp/squid
vi /tmp/squid
# copy the grok pattern above into the squid file
su - hdfs
hdfs dfs -put /tmp/squid /apps/metron/patterns/
exit
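As an optional sanity check, confirm that the pattern file is where the parser topology expects to find it:
hdfs dfs -ls /apps/metron/patterns/
hdfs dfs -cat /apps/metron/patterns/squid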
mkdir /usr/metron/0.1BETA/flux/squid
cp /usr/metron/0.1BETA/flux/yaf/remote.yaml /usr/metron/0.1BETA/flux/squid/remote.yaml
vi /usr/metron/0.1BETA/flux/squid/remote.yaml
name: "squid" config: topology.workers: 1 components: - id: "parser" className: "org.apache.metron.parsers.GrokParser" constructorArgs: - "/apps/metron/patterns/squid" - "SQUID_DELIMITED" configMethods: - name: "withTimestampField" args: - "timestamp" - id: "writer" className: "org.apache.metron.parsers.writer.KafkaWriter" constructorArgs: - "${kafka.broker}" - id: "zkHosts" className: "storm.kafka.ZkHosts" constructorArgs: - "${kafka.zk}" - id: "kafkaConfig" className: "storm.kafka.SpoutConfig" constructorArgs: # zookeeper hosts - ref: "zkHosts" # topic name - "squid" # zk root - "" # id - "squid" properties: - name: "ignoreZkOffsets" value: true - name: "startOffsetTime" value: -1 - name: "socketTimeoutMs" value: 1000000 spouts: - id: "kafkaSpout" className: "storm.kafka.KafkaSpout" constructorArgs: - ref: "kafkaConfig" bolts: - id: "parserBolt" className: "org.apache.metron.parsers.bolt.ParserBolt" constructorArgs: - "${kafka.zk}" - "squid" - ref: "parser" - ref: "writer" streams: - name: "spout -> bolt" from: "kafkaSpout" to: "parserBolt" grouping: type: SHUFFLE
sudo storm jar /usr/metron/0.1BETA/lib/metron-parsers-0.1BETA.jar \
  org.apache.storm.flux.Flux \
  --filter /usr/metron/0.1BETA/config/elasticsearch.properties \
  --remote /usr/metron/0.1BETA/flux/squid/remote.yaml
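Once the command returns, it is worth confirming that the squid parser topology was submitted and is ACTIVE, either in the Storm UI or from the command line:
storm list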
Put simply, NiFi was built to automate the flow of data between systems, which makes it a fantastic tool to collect, ingest, and push data to Metron. The instructions below show how to install NiFi, configure it, and create a flow that pushes Squid events into Metron.
First, install NiFi on the VM. Do the following as root:
cd /usr/lib
wget http://public-repo-1.hortonworks.com/HDF/centos6/1.x/updates/1.2.0.0/HDF-1.2.0.0-91.tar.gz
tar -zxvf HDF-1.2.0.0-91.tar.gz
cd HDF-1.2.0.0/nifi
vi conf/nifi.properties
# update nifi.web.http.port to 8089
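If you prefer not to edit the file by hand, the same change can be made with sed (it rewrites the nifi.web.http.port property regardless of its current value):
sed -i 's/^nifi.web.http.port=.*/nifi.web.http.port=8089/' conf/nifi.properties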
bin/nifi.sh install nifi
service nifi start
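Give NiFi a minute or two to come up, then confirm it is reachable. The check below is a sketch run from the VM itself; from your host browser, use http://node1:8089/nifi/ as elsewhere in this article:
service nifi status
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8089/nifi/
# an HTTP 200 (or a 3xx redirect) means the UI is up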
Now we will create a flow to capture events from Squid and push them into Metron. A minimal sketch of the flow is outlined below.
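The flow only needs two processors. The sketch below assumes the HDF 1.2 / NiFi 0.x processor names and properties and the default HDP Kafka broker at node1:6667; build it in the NiFi UI at http://node1:8089/nifi/ and adjust if your versions differ:
# TailFile processor
#   File to Tail: /var/log/squid/access.log
# PutKafka processor
#   Known Brokers: node1:6667
#   Topic Name: squid
#   Message Delimiter: \n   (so each access.log line becomes one Kafka message)
# Connect TailFile to PutKafka on the success relationship, then start both processors.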
squidclient http://www.hostsite.com
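With the flow running and some events generated, you can watch the raw Squid events arrive on the squid Kafka topic (press Ctrl+C to stop the consumer):
cd /usr/hdp/current/kafka-broker/bin/
./kafka-console-consumer.sh --zookeeper localhost:2181 --topic squid --from-beginning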
By convention, the index where the new messages will be indexed is called squid_index_[timestamp] and the document type is squid_doc.
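A quick way to confirm that such an index has been created is the _cat API (assuming Elasticsearch is listening on node1:9200, as it is in the Head plugin steps below):
curl 'http://node1:9200/_cat/indices?v' | grep squid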
To verify that the messages were indexed correctly, we can use the Elasticsearch Head plugin.
1. Install the Head plugin:
/usr/share/elasticsearch/bin/plugin -install mobz/elasticsearch-head/1.x
You should see the message: Installed mobz/elasticsearch-head/1.x into /usr/share/elasticsearch/plugins/head
2. Navigate to the Elasticsearch Head UI: http://node1:9200/_plugin/head/
3. Click on the Browser tab, select the squid doc type in the left panel, and then select one of the sample docs to view its parsed fields.
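If you prefer the command line to the Head UI, a roughly equivalent check is to pull a single document back with the search API (again assuming Elasticsearch on node1:9200):
curl 'http://node1:9200/squid_index*/_search?pretty&size=1'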
Now that we have Metron configured to parse, index, and persist telemetry events, and NiFi pushing data to Metron, let's visualize this streaming telemetry data in the Metron UI.
Created on 05-10-2016 03:08 PM
In sub-step 2 of Step 1 below the example does not match the tarball.
Should be
tar -zxvf codelab-v1.0.tar.gz
Created on 05-10-2016 05:03 PM
@apsaltis When I downloaded the tar file it was named "incubator-metron-codelab-v1.0.tar.gz" which means the example to untar the file should be correct. Would you please check your download again to confirm that it is named "codelab-v1.0.tar.gz". Thanks!
Created on 05-10-2016 05:45 PM
Yeah worked just fine this time. Did wget against GH url directly, as my virus software kept blocking the GH url as it believes it is infected. Sorry for the noise.
Created on 05-10-2016 07:57 PM
After performing all of the steps in "Step 1: Spin Up Single Node Vagrant VM", Storm is up and running with 4 slots and 4 topologies running. The user is then left with the issue that was described here: no-workers-in-storm-for-squid-topology
Would it make sense to add another sub-step before sub-step 5 that instructs the user to add a port to "supervisor.slots.ports: [6700, 6701, 6702, 6703]" found in: metron-deployment/roles/ambari_config/vars/single_node_vm.yml ?
Created on 05-11-2016 11:05 PM
For: Install the head plugin
I think this should be:
1. sudo /usr/share/elasticsearch/bin/plugin -install mobz/elasticsearch-head/1.x
Created on 05-13-2016 09:41 PM
@apsaltis I'm modifying my previous reply. To deploy the new squid parser topology, you do not need to use "sudo" anywhere except at the very beginning of the command string. I tried it twice as is and it worked perfectly both times. Thank you for your feedback though. We really appreciate you taking the time to comment on the instructions so we can improve them.
Created on 05-13-2016 09:56 PM
Thanks for your comment. I believe you're correct and we need an additional step to add a port. I'm researching the best way to add that port and I will modify that step when I have my results. Thanks for your help.
Created on 06-03-2016 07:56 PM
@apsaltis After some research I discovered the easiest way to ensure that Squid is assigned a worker is to kill one or more of the existing topologies. The Storm Supervisor will then assign one of the free workers to Squid. You can kill a topology either in the Storm UI or in the CLI. I will add a step to cover this in the article. Thanks again for your comments.
Created on 06-03-2016 08:03 PM
That certainly works. It has the same effect as adding an available port to Storm so that the new topology can be run. Not sure which is cleaner -- have a user kill another topology before deploying the squid one, update the canned storm config to have a port available, or have a user update the config and restart Storm and related services.