Member since: 02-12-2016
Posts: 33
Kudos Received: 44
Solutions: 3

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5842 | 11-10-2018 12:59 AM
 | 2177 | 10-30-2018 12:47 AM
 | 3513 | 04-25-2016 08:06 PM
05-08-2020
09:30 AM
In Part 1, I shared an alternative architecture for Prometheus that increases the scalability and flexibility of time series metric monitoring. In Part 2, I will walk through an extension of that architecture that brings metric and log data together in a single, scalable data pipeline.
Part 1 Architecture
Part 2 Architecture
By adding log collection agents (e.g. MiNiFi) and a search cluster (e.g. Solr), the solution architecture can be extended to support log data in addition to time series metrics. This has the advantage of reducing duplicated infrastructure components for a more efficient and supportable solution.
Additional NiFi processors can be added to the flow for pre-processing (e.g. filtering, routing, scoring) the incoming data (e.g. ERROR vs INFO messages). Rulesets (e.g. Drools) from expert systems can be embedded directly into the flow, while ML models can be either directly embedded or hosted as a service that NiFi calls. Further downstream, Flink can be used to apply stateful stream processing (e.g. joins, windowing).
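As a purely illustrative sketch, a model hosted as a service could be invoked from the flow (for example via NiFi's InvokeHTTP processor) with a plain HTTP call; the endpoint, port, and payload fields below are hypothetical:

# Hypothetical scoring endpoint; NiFi would typically attach the returned score
# to the FlowFile as an attribute for downstream routing
curl -X POST -H "Content-Type: application/json" \
  http://model-service:8000/score \
  -d '{"metric": "node_cpu_seconds_total", "host": "worker-1", "value": 0.93}'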
By applying these advanced analytics to metrics and logs in-stream, before the data lands, operations teams can shift from digging through charts and graphs to acting on intelligent, targeted alerts with the full context necessary to resolve any issue that may arise.
The journey to streaming analytics with ML and expert systems requires rethinking architectures, but the payoff is worth the upfront refactoring: timely insights that would otherwise not be possible, and a much more stable and efficient system in the long run.
11-27-2019
08:58 AM
2 Kudos
Traditional Prometheus Architecture (Image Courtesy: https://prometheus.io/docs/introduction/overview/)

Prometheus is great. It has a huge number of integrations and provides a great metric monitoring platform, especially when working with Kubernetes. However, it does have a few shortcomings. The aim of this alternative architecture is to preserve the best parts of Prometheus while augmenting its weaker points with more powerful technologies.

The service discovery and metric scraping framework in Prometheus is its greatest strength, but it is greatly limited by its tight coupling to the TSDB system inside Prometheus. While it is possible to replace the TSDB inside Prometheus with an external database, the data retrieval process only supports writing into this one database.

Maui Architecture

The greatest strength of Prometheus, the service discovery and metric scraping framework, can now be used within Apache NiFi with the introduction of the GetPrometheusMetrics processor. This processor uses cgo and JNI to leverage the actual Prometheus libraries for service discovery and metric scraping. The standard Prometheus YML configurations are provided to the processor, and JSON data is output as it scrapes metrics from configured and/or discovered endpoints. When combined with NiFi's HTTP listener processors, the entire data ingestion portion of Prometheus can be embedded within NiFi.

The advantage of NiFi for data ingestion is that it comes with a rich set of processors for transforming, filtering, routing, and publishing data, potentially to many different places. The ability to load data into the data store (or data stores) of choice increases extensibility and enables more advanced analytics.

One good option for the data store is Apache Druid. Druid was built for both real-time and historical analytics at scale (ingest of millions of events per second plus petabytes of history). It is supported natively by many dashboarding tools (such as Grafana and Superset), and it supports SQL through JDBC, making it accessible from a wide array of tools (such as Tableau). Druid addresses the scalability issues of the built-in TSDB while still providing a similar user experience and increasing extensibility to more user interfaces.

The option of sending scraped data to many locations provides an easy way to integrate with other monitoring frameworks, or to perform advanced analytics and machine learning. For example, loading metrics into Kafka makes them accessible in real time to stream processing engines (like Apache Flink), function-as-a-service engines (like OpenWhisk), and custom microservices. With this architecture, it is now possible to apply ML to Prometheus-scraped metrics in real time and to activate functions when anomalies are found.

Part 2 of this article can be found here.

Artifacts

The GetPrometheusMetrics processor can be found in this repository: https://github.com/SamHjelmfelt/nifi-prometheus-metrics
A sample NiFi template using GetPrometheusMetrics to write into both Druid and Kafka can be found here: https://gist.github.com/SamHjelmfelt/f04aae5489fa88bdedd4bba211d083e0
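For reference, the scrape configuration given to GetPrometheusMetrics is standard Prometheus YML. Below is a minimal sketch with a single static target; the job name, target address, and file name are placeholders, and the processor's exact property names may differ.

cat > prometheus-scrape.yml <<'EOF'
# Minimal Prometheus scrape configuration (standard Prometheus format)
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['worker-1:9100']   # e.g. a node_exporter endpoint
EOF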
01-09-2019
02:31 AM
2 Kudos
Docker on YARN is relatively easy to set up on an existing cluster, but full clusters are not always available. My Ember project was created to simplify dev/test for technologies managed by Ambari and Cloudera Manager. It provides utilities for creating dockerized clusters that use far fewer resources than a full bare metal or VM-based cluster. Additionally, by using pre-built images, the time it takes to get a cluster up and running can be reduced to less than 10 minutes.

The following four commands are all that is necessary to download and run a ~5GB image that comes preinstalled with Ambari, Zookeeper, HDFS, and YARN, with Docker on YARN pre-configured. Docker containers spawned by YARN will be created on the host machine as peers to the container with YARN inside. All containers are launched into the "ember" docker network by default. Once the container is downloaded, it takes less than 5 minutes to start all services.

curl -L https://github.com/SamHjelmfelt/Ember/archive/v1.1.zip -o Ember_1.1.zip
unzip Ember_1.1.zip
cd Ember-1.1/
./ember.sh createFromPrebuiltSample samples/yarnquickstart/yarnquickstart-sample.ini

The Ambari UI can be found at http://localhost:8080
The YARN Resource Manager UI can be found at http://localhost:8088

Usage

The YARN service REST API documentation can be found here: https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html
The YARN app CLI documentation can be found here: https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#application_or_app

Testing

Place the following service definition into a file (e.g. redis.json):

{
"name": "redis-service",
"version": "1.0.0",
"description": "redis example",
"components" :
[
{
"name": "redis",
"number_of_containers": 1,
"artifact": {
"id": "library/redis",
"type": "DOCKER"
},
"launch_command": "",
"resource": {
"cpus": 1,
"memory": "256"
},
"configuration": {
"env": {
"YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "true"
}
}
}
]
}

Submit the service with the following curl command. YARN should respond with the application ID.

curl -X POST -H "Content-Type: application/json" http://localhost:8088/app/v1/services?user.name=ambari-qa -d @redis.json

The service status can be viewed on the YARN UI or through the REST API (piping through python makes it easier to read):

curl http://localhost:8088/app/v1/services/redis-service?user.name=ambari-qa | python -m json.tool

The service name must be unique in the cluster. If you need to delete your service, the following command can be used:

curl -X DELETE http://localhost:8088/app/v1/services/redis-service?user.name=ambari-qa
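The same checks can also be done with the YARN app CLI referenced above. A brief sketch, run from inside the cluster container as the submitting user (the sudo -u prefix is an assumption about how users are mapped in the quickstart image):

# Check the status of the service
sudo -u ambari-qa yarn app -status redis-service
# Stop and remove the service when finished
sudo -u ambari-qa yarn app -stop redis-service
sudo -u ambari-qa yarn app -destroy redis-service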
11-10-2018
01:51 AM
5 Kudos
Update: See here for a Docker on YARN sandbox solution: https://community.hortonworks.com/articles/232540/docker-on-yarn-sandbox.html

Overview

This guide has been tested with and without Kerberos on HDP 3.0.1. YARN offers a DNS service backed by Zookeeper for service discovery, but that can be challenging to set up. For a quickstart scenario, I will use docker swarm and an overlay network instead. If your environment is a single host, the networking is even simpler. This configuration is not recommended for production.

I will use pssh to run commands in parallel across the cluster based on a hostlist file and a workerlist file. The hostlist file should contain every host in the cluster, and the workerlist file should include every node except the one chosen to be the docker swarm master node.

Prerequisites

Install HDP 3.0.1 with or without Kerberos.

Install Docker on every host in the cluster:

#pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem -o 'StrictHostKeyChecking no'" "echo hostname"
pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo yum install -y yum-utils device-mapper-persistent-data lvm2"
pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo";
pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo yum install -y docker-ce"
pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo systemctl start docker"
pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo systemctl enable docker"

Configure docker swarm and create an overlay network:

ssh -i ~/cloudbreak.pem cloudbreak@<masternode> "sudo docker swarm init"
pssh -i -h workerlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo <output from last command: docker swarm join ...>"
ssh -i ~/cloudbreak.pem cloudbreak@<masternode> "sudo docker network create -d overlay --attachable yarnnetwork"

If Kerberos is not enabled, create a default user for containers:

pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo useradd dockeruser"

Ambari

In the YARN general settings tab, toggle the Docker Runtime button to "Enabled". This should change the following setting in Advanced YARN-Site:

yarn.nodemanager.container-executor.class=org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor

In Advanced YARN-Site, change the following so all YARN docker containers use the overlay network we created by default:

yarn.nodemanager.runtime.linux.docker.default-container-network=yarnnetwork
yarn.nodemanager.runtime.linux.docker.allowed-container-networks=host,none,bridge,yarnnetwork

In Custom YARN-Site, add the following if Kerberos is not enabled:

yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user=dockeruser

In Advanced Container Executor, set:

docker_trusted_registries=*

Note that this allows any image from Docker Hub to be run. To limit the docker images that can be run, set this property to a comma-separated list of trusted registries. Docker images have the form <registry>/<imageName>:<tag>.

Alternatively, the following Ambari blueprint encapsulates these configurations:

{
"configurations" : [
{
"yarn-site" : {
"properties" : {
"yarn.nodemanager.container-executor.class" : "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor",
"yarn.nodemanager.runtime.linux.docker.default-container-network" : "yarnnetwork",
"yarn.nodemanager.runtime.linux.docker.allowed-container-networks" : "host,none,bridge,yarnnetwork",
"yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user" : "dockeruser"
}
}
},
{
"container-executor" : {
"properties" : {
"docker_trusted_registries" : "library",
"docker_module_enabled" : "true"
}
}
}],
"host_groups" : [
{
"name" : "all",
"components" : [
{"name" : "HISTORYSERVER"},
{"name" : "NAMENODE"},
{"name" : "APP_TIMELINE_SERVER"},
{"name" : "NODEMANAGER"},
{"name" : "DATANODE"},
{"name" : "RESOURCEMANAGER"},
{"name" : "ZOOKEEPER_SERVER"},
{"name" : "SECONDARY_NAMENODE"},
{"name" : "HDFS_CLIENT"},
{"name" : "ZOOKEEPER_CLIENT"},
{"name" : "YARN_CLIENT"},
{"name" : "MAPREDUCE2_CLIENT"}
],
"cardinality" : "1"
}
],
"Blueprints" : {
"blueprint_name" : "yarn sample",
"stack_name" : "HDP",
"stack_version" : "3.0"
}
}

Save the configurations and restart YARN.

Usage

The YARN service REST API documentation can be found here: https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html
The YARN app CLI documentation can be found here: https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#application_or_app

Testing without Kerberos

Place the following service definition into a file (e.g. yarnservice.json):

{
"name": "redis-service",
"version": "1.0.0",
"description": "redis example",
"components" :
[
{
"name": "redis",
"number_of_containers": 1,
"artifact": {
"id": "library/redis",
"type": "DOCKER"
},
"launch_command": "",
"resource": {
"cpus": 1,
"memory": "256"
},
"configuration": {
"env": {
"YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "true"
}
}
}
]
}

Submit the service with the following curl command. YARN should respond with the application ID. The user will need write permission on their HDFS home directory (e.g. hdfs:/user/user1); ambari-qa has it by default.

curl -X POST -H "Content-Type: application/json" http://<resource manager>:8088/app/v1/services?user.name=ambari-qa -d @yarnservice.json

The service status can be viewed on the YARN UI or through the REST API (piping through python makes it easier to read):

curl http://<resource manager>:8088/app/v1/services/redis-service?user.name=ambari-qa | python -m json.tool

The service name must be unique in the cluster. If you need to delete your service, the following command can be used:

curl -X DELETE http://<resource manager>:8088/app/v1/services/redis-service?user.name=ambari-qa

Testing with Kerberos

Create a Kerberos principal of the format <username>/<hostname>@<realm>. The hostname portion of the principal is required.

Create a keytab for the principal and upload it to HDFS:

kadmin.local
>addprinc user1/host1.example.com@EXAMPLE.COM
...
>xst -k user1_host1.keytab user1/host1.example.com@EXAMPLE.COM
...
>exit
hadoop fs -put user1_host1.keytab hdfs:/user/user1/
hadoop fs -chown user1 hdfs:/user/user1/

Place the following service definition into a file (e.g. yarnservice.json):

{
"name": "redis-service",
"version": "1.0.0",
"description": "redis example",
"components" :
[
{
"name": "redis",
"number_of_containers": 1,
"artifact": {
"id": "library/redis",
"type": "DOCKER"
},
"launch_command": "",
"resource": {
"cpus": 1,
"memory": "256"
},
"configuration": {
"env": {
"YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "true"
}
}
}
],
"kerberos_principal": {
"principal_name": "user1/host1.example.com@EXAMPLE.COM",
"keytab": "hdfs:/user/user1/user1_host1.keytab"
}
}

Submit the service with the following curl command. YARN should respond with the application ID. User1 will need permission to write into their HDFS home directory (hdfs:/user/user1).

curl --negotiate -u : -X POST -H "Content-Type: application/json" http://<resource manager>:8088/app/v1/services -d @yarnservice.json

The service status can be viewed on the YARN UI or through the REST API (piping through python makes it easier to read):

curl --negotiate -u : http://<resource manager>:8088/app/v1/services/redis-service | python -m json.tool

The service name must be unique in the cluster. If you need to delete your service, the following command can be used:

curl --negotiate -u : -X DELETE http://<resource manager>:8088/app/v1/services/redis-service

Adding a local docker registry

Each node in the cluster needs a way of downloading docker images when a service is run. It is possible to just use the public Docker Hub, but that is not always an option. Similar to creating a local repo for yum, a local registry can be created for Docker. Here is a quickstart that skips the security steps; in production, security best practices should be followed.

On a master node, create an instance of the docker registry container. This will bind the registry to port 5000 on the host machine.

docker run -d -p 5000:5000 --restart=always --name registry -v /mnt/registry:/var/lib/registry registry:2

Configure each machine to skip HTTPS checks (https://docs.docker.com/registry/insecure/). Here are commands for CentOS 7:

pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo echo '{\"insecure-registries\": [\"<registryHost>:5000\"]}' | sudo tee --append /etc/docker/daemon.json"
pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo systemctl restart docker"

The YARN service configuration docker_trusted_registries needs to be set to star (*) or needs to have this local registry in its list (e.g. library,<registryHost>:5000). Restart YARN.

Testing Local Docker Registry

Build, tag, and push an image to the registry:

docker build -t myImage:1 .
docker tag myImage:1 <registryHost>:5000/myImage:1
docker push <registryHost>:5000/myImage:1

View the image via the registry REST API:

curl <registryHost>:5000/v2/_catalog
curl <registryHost>:5000/v2/myImage/tags/list

Download the image to all hosts in the cluster (only necessary to demonstrate connectivity; Docker and YARN do this automatically):

pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo docker pull <registryHost>:5000/myImage:1"

Now, when an image with this registry prefix (e.g. <registryHost>:5000/myImage:1) is used in a YARN service definition, YARN will pull the image from this local registry instead of the default public location.
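As an illustrative sketch, a service definition can reference the locally hosted image by using the registry prefix in the artifact id. The service and component names below are made up; everything else mirrors the redis example above.

cat > localimage.json <<'EOF'
{
  "name": "localimage-service",
  "version": "1.0.0",
  "description": "example service using an image from the local registry",
  "components": [
    {
      "name": "app",
      "number_of_containers": 1,
      "artifact": {
        "id": "<registryHost>:5000/myImage:1",
        "type": "DOCKER"
      },
      "launch_command": "",
      "resource": { "cpus": 1, "memory": "256" },
      "configuration": {
        "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "true" }
      }
    }
  ]
}
EOF
curl -X POST -H "Content-Type: application/json" http://<resource manager>:8088/app/v1/services?user.name=ambari-qa -d @localimage.json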
11-12-2018
03:53 PM
YARN-7787 is open to discuss the issue, but there is no clear solution.
11-10-2018
12:59 AM
It turns out the guide I was following was outdated. I tried this again on a different cluster, and it worked perfectly with the Ambari default for yarn.nodemanager.linux-container-executor.cgroups.mount-path ("/cgroup").
10-30-2018
12:47 AM
This error was resolved by explicitly specifying the content type as JSON: curl ... -H "Content-Type: application/json"
10-04-2018
08:11 PM
2 Kudos
Most data movement use cases do not require a "shuffle phase" for redistributing FlowFiles across a NiFi cluster, but there are a few cases where it is useful. For example:

ListFile -> FetchFile
ListHDFS -> FetchHDFS
ListFTP -> FetchFTP
GenerateTableFetch -> ExecuteSQL
GetSQS -> FetchS3Object

In each case, the flow starts with a processor that generates tasks to run (e.g. filenames), followed by the actual execution of those tasks. To scale, tasks need to run on each node in the NiFi cluster, but for consistency, the task generation should only run on the primary node. The solution is to introduce a shuffle (aka load balancing) step between task generation and task execution.

Processors can be configured to run on the primary node by going to "View Configuration" -> "Scheduling" and selecting "Primary node only" under "Execution".

The shuffle step is not an explicit component on the NiFi canvas, but rather the combination of a Remote Input Port and a Remote Process Group pointing at the local cluster. FlowFiles that are sent to the Remote Process Group will be load balanced over Site-to-Site and come back into the flow via the Remote Input Port. Under "Manage Remote Ports" on the Remote Process Group, there are batch settings that help control the load balancing.

Here are two example flows that use this design pattern:
05-14-2018
08:51 PM
7 Kudos
A quick glance at NiFi's 252+ processors shows that it can solve a wide array of use cases out of the box. What is not immediately obvious is the flexibility that its attributes and expression language provide. This allows it to quickly, easily, and efficiently solve complex use cases that would require significant customization in other solutions. For example, sending all of the incoming data to both Kafka and HDFS while sending 10% to a dev environment and a portion to a partner system based on the content of the data (e.g. CustomerName=ABC). These more complex routing scenarios are easily accommodated using UpdateAttribute, RouteOnAttribute, and RouteOnContent.

Another example of NiFi's flexibility is the ability to multiplex data flows. In traditional ETL systems, the schema is tightly coupled to the data as it moves between systems, because transformations occur in transit. In more modern ELT scenarios, the data is often loaded into the destination with minimal transformations before the complex transformation step is kicked off. This has many advantages and allows NiFi to focus on the EL portion of the flow. When focused on EL, there is far less need for the movement engine to be schema aware, since it is generally focused on simple routing, filtering, format translation, and concatenation. One common scenario is loading data from many Kafka topics into their respective HDFS directories and/or Hive tables with only simple transformations. In traditional systems, this would require one flow per topic, but by parameterizing flows, one flow can be used for all topics.

In the image below, you can see the configurations and attributes that make this possible. The ConsumeKafka processor can use a list of topics or a regular expression to consume from many topics at once. Each FlowFile (e.g. batch of Kafka messages) has an attribute added called "kafka.topic" to identify its source topic.

Next, in order to load streaming data into HDFS or Hive, it is recommended to use MergeContent to combine records into large files (e.g. every 1GB or every 15 minutes). In MergeContent, setting the "correlation attribute" configuration to "kafka.topic" ensures that only records from the same Kafka topic are combined (similar to a group-by clause). After the files are merged, the "directory" configuration of the HDFS processor can be parameterized (e.g. /myDir/${kafka.topic}) in order to load the data into the correct directory based on the Kafka topic name.

Note that this diagram includes a retry and notify on failure process group. This type of solution is highly recommended for production flows. More information can be found here.

This example could easily be extended to include file format translation (e.g. ConvertAvroToORC), filtering (e.g. RouteOnContent), or Kafka-topic to HDFS-directory mapping (e.g. UpdateAttribute). It can even trigger downstream processing (e.g. ExecuteSparkInteractive, PutHiveQL, ExecuteStreamCommand, etc.) or periodically update metrics and logging solutions such as Graphite, Druid, or Solr. Of course, this solution also applies to many more data stores than just Kafka and HDFS.

Overall, parameterizing flows in NiFi for multiplexing can reduce complexity for EL use cases and simplify administration. This design is straightforward to implement, uses core NiFi features, and is easily extended to a variety of use cases.
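Since the referenced image may not be visible here, the following is a rough sketch of the key property values described above. Processor property names may vary slightly by NiFi version, and the topic names and /myDir path are placeholders.

ConsumeKafka
  Topic Name(s): topicA, topicB (or a regex, with Topic Name Format set to pattern)
  -> each FlowFile carries a kafka.topic attribute identifying its source topic
MergeContent
  Correlation Attribute Name: kafka.topic
  Maximum Group Size / Max Bin Age: e.g. 1 GB / 15 min
PutHDFS
  Directory: /myDir/${kafka.topic}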