Member since: 02-12-2016
Posts: 33
Kudos Received: 42
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4196 | 11-10-2018 12:59 AM
 | 1376 | 10-30-2018 12:47 AM
 | 2390 | 04-25-2016 08:06 PM
05-08-2020
09:30 AM
In Part 1, I shared an alternative architecture for Prometheus that increases the scalability and flexibility of time series metric monitoring. In Part 2, I will walk through an extension of the architecture that unites metric and log data in a single, scalable data pipeline.
Part 1 Architecture
Part 2 Architecture
By adding log collection agents (e.g. MiNiFi) and a search cluster (e.g. Solr), the solution architecture can be extended to support log data in addition to time series metrics. This has the advantage of reducing duplicated infrastructure components for a more efficient and supportable solution.
Additional NiFi processors can be added to the flow for pre-processing (e.g. filtering, routing, scoring) the incoming data (e.g. ERROR vs INFO messages). Rulesets (e.g. Drools) from expert systems can be embedded directly into the flow, while ML models can be either directly embedded or hosted as a service that NiFi calls. Further downstream, Flink can be used to apply stateful stream processing (e.g. joins, windowing).
By applying these advanced analytics to metrics and logs in-stream, before the data lands, operations teams can shift from digging through charts and graphs to acting on intelligent, targeted alerts with the full context necessary to resolve any issue that may arise.
The journey to streaming analytics with ML and expert systems requires rethinking architectures, but the value gained from the timely insights that otherwise would not be possible is well worth the upfront refactoring and results in a much more stable and efficient system in the long run.
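As a rough illustration of how an operations team might consume this unified pipeline, the sketch below queries the search cluster for recent ERROR-level log events. The Solr host, the collection name ("logs"), and the field names ("level", "timestamp") are assumptions for illustration, not values defined by the architecture above.

# Hypothetical Solr query against the log collection fed by MiNiFi -> NiFi
curl -s "http://<solr-host>:8983/solr/logs/select" \
  --data-urlencode "q=level:ERROR" \
  --data-urlencode "fq=timestamp:[NOW-15MINUTES TO NOW]" \
  --data-urlencode "sort=timestamp desc" \
  --data-urlencode "rows=20" | python -m json.tool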
11-27-2019
08:58 AM
2 Kudos
Traditional Prometheus Architecture (Image Courtesy: https://prometheus.io/docs/introduction/overview/)

Prometheus is great. It has a huge number of integrations and provides a great metric monitoring platform, especially when working with Kubernetes. However, it does have a few shortcomings. The aim of this alternative architecture is to preserve the best parts of Prometheus while augmenting its weaker points with more powerful technologies.

The service discovery and metric scraping framework in Prometheus is its greatest strength, but it is greatly limited by its tight coupling to the TSDB system inside Prometheus. While it is possible to replace the TSDB inside Prometheus with an external database, the data retrieval process only supports writing into this one database.

Maui Architecture
The greatest strength of Prometheus, the service discovery and metric scraping framework, can now be used within Apache NiFi with the introduction of the GetPrometheusMetrics processor. This processor uses cgo and JNI to leverage the actual Prometheus libraries for service discovery and metric scraping. The standard Prometheus YML configurations are provided to the processor, and JSON data is output as it scrapes metrics from configured and/or discovered endpoints. When combined with NiFi’s HTTP listener processors, the entire data ingestion portion of Prometheus can be embedded within NiFi.

The advantage of NiFi for data ingestion is that it comes with a rich set of processors for transforming, filtering, routing, and publishing data, potentially to many different places. The ability to load data into the data store (or data stores) of choice increases extensibility and enables more advanced analytics.

One good option for the datastore is Apache Druid. Druid was built for both real-time and historical analytics at scale (ingest of millions of events per second plus petabytes of history). It is supported by many dashboarding tools natively (such as Grafana or Superset), and it supports SQL through JDBC, making it accessible from a wide array of tools (such as Tableau). Druid addresses the scalability issues of the built-in TSDB while still providing a similar user experience and increasing extensibility to more user interfaces.

The option of sending scraped data to many locations provides an easy way to integrate with other monitoring frameworks, or to perform advanced analytics and machine learning. For example, loading metrics into Kafka makes them accessible in real-time to stream processing engines (like Apache Flink), function-as-a-service engines (like OpenWhisk), and custom microservices. With this architecture, it is now possible to apply ML to Prometheus-scraped metrics in real-time and to activate functions when anomalies are found.

Part 2 of this article can be found here.

Artifacts
The GetPrometheusMetrics processor can be found in this repository: https://github.com/SamHjelmfelt/nifi-prometheus-metrics
A sample NiFi template using GetPrometheusMetrics to write into both Druid and Kafka can be found here: https://gist.github.com/SamHjelmfelt/f04aae5489fa88bdedd4bba211d083e0
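To make the downstream options concrete, the sketch below shows two ways the scraped metrics could be inspected once NiFi has published them: one via Kafka and one via Druid SQL. The broker address, the topic name, the Druid datasource name, and the column names are illustrative assumptions, not values defined by the article or the sample template.

# Assumption: NiFi publishes the scraped metrics to a Kafka topic named "prometheus-metrics"
kafka-console-consumer.sh --bootstrap-server <broker>:9092 --topic prometheus-metrics --from-beginning --max-messages 10

# Assumption: the same metrics are loaded into a Druid datasource named "prometheus_metrics" with "metric" and "value" columns
curl -s -X POST http://<druid-broker>:8082/druid/v2/sql -H "Content-Type: application/json" -d '{"query":"SELECT __time, metric, value FROM prometheus_metrics ORDER BY __time DESC LIMIT 10"}' | python -m json.tool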
01-24-2019
08:01 PM
1 Kudo
Repo Description
Ember provides a solution for running Ambari and Cloudera Manager clusters in Docker (HDP, CDH, and HDF). It was designed to streamline training, testing, and development by enabling multi-node dev/test clusters to be installed on a single machine with minimal resource requirements.

Repo Info
Github Repo URL: https://github.com/SamHjelmfelt/Ember
Github account name: SamHjelmfelt
Repo name: Ember
01-09-2019
02:31 AM
2 Kudos
Docker on YARN is relatively easy to set up on an existing cluster, but full clusters are not always available. My Ember Project was created to simplify dev/test for technologies managed by Ambari and Cloudera Manager. It provides utilities for creating dockerized clusters that use far fewer resources than a full bare metal or VM-based cluster. Additionally, by using pre-built images, the time it takes to get a cluster up and running can be reduced to less than 10 minutes.

The following four commands are all that is necessary to download and run a ~5GB image that comes preinstalled with Ambari, Zookeeper, HDFS, and YARN with Docker on YARN pre-configured. Docker containers spawned by YARN will be created on the host machine as peers to the container with YARN inside. All containers are launched into the "ember" docker network by default. Once the container is downloaded, it takes less than 5 minutes to start all services.

curl -L https://github.com/SamHjelmfelt/Ember/archive/v1.1.zip -o Ember_1.1.zip
unzip Ember_1.1.zip
cd Ember-1.1/
./ember.sh createFromPrebuiltSample samples/yarnquickstart/yarnquickstart-sample.ini

The Ambari UI can be found at http://localhost:8080
The YARN Resource Manager UI can be found at http://localhost:8088

Usage
The YARN service REST API documentation can be found here: https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html
The YARN app CLI documentation can be found here: https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#application_or_app

Testing
Place the following service definition into a file (e.g. redis.json):
{
"name": "redis-service",
"version": "1.0.0",
"description": "redis example",
"components" :
[
{
"name": "redis",
"number_of_containers": 1,
"artifact": {
"id": "library/redis",
"type": "DOCKER"
},
"launch_command": "",
"resource": {
"cpus": 1,
"memory": "256"
},
"configuration": {
"env": {
"YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "true"
}
}
}
]
}

Submit the service with the following curl command. YARN should respond with the applicationId.

curl -X POST -H "Content-Type: application/json" http://localhost:8088/app/v1/services?user.name=ambari-qa -d @redis.json

The service status can be viewed on the YARN UI or through the REST API (python makes it easier to read):

curl http://localhost:8088/app/v1/services/redis-service?user.name=ambari-qa | python -m json.tool

The service name must be unique in the cluster. If you need to delete your service, the following command can be used:

curl -X DELETE http://localhost:8088/app/v1/services/redis-service?user.name=ambari-qa
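The YARN app CLI linked above provides equivalent service operations. A minimal sketch, assuming the redis-service definition from this example and a user with access to the service:

# Check the status of the service (equivalent to the REST status call above)
yarn app -status redis-service

# Scale the redis component to two containers
yarn app -flex redis-service -component redis 2

# Stop and remove the service when finished
yarn app -stop redis-service
yarn app -destroy redis-service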
12-13-2018
04:18 PM
When I launch a dockerized yarn service, the containers are being removed and restarted after ~13 seconds. This repeats 20+ times before a container eventually is able to stay up. Here are entries from the RM log where it seems to be unable to find the container.

2018-12-13 15:46:34,426 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e04_1544715810515_0001_01_000005 Container Transitioned from ALLOCATED to ACQUIRED
2018-12-13 15:46:34,449 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e04_1544715810515_0001_01_000005 Container Transitioned from ACQUIRED to RUNNING
2018-12-13 15:46:35,431 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(367)) - checking for deactivate of application :application_1544715810515_0001
2018-12-13 15:46:48,522 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e04_1544715810515_0001_01_000005 Container Transitioned from RUNNING to COMPLETED
2018-12-13 15:46:50,409 INFO zookeeper.ReadOnlyZKClient (ReadOnlyZKClient.java:run(315)) - 0x0cc62a3b no activities for 60000 ms, close active connection. Will reconnect next time when there are new requests.
2018-12-13 15:46:50,479 INFO scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:releaseContainers(742)) - container_e04_1544715810515_0001_01_000005 doesn't exist. Add the container to the release request cache as it maybe on recovery.
2018-12-13 15:46:50,479 INFO scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:completedContainer(669)) - Container container_e04_1544715810515_0001_01_000005 completed with event RELEASED, but corresponding RMContainer doesn't exist.
11-10-2018
01:51 AM
5 Kudos
Update: See here for a Docker on YARN sandbox solution: https://community.hortonworks.com/articles/232540/docker-on-yarn-sandbox.html

Overview
This guide has been tested with and without Kerberos on HDP 3.0.1. YARN offers a DNS service backed by Zookeeper for service discovery, but that can be challenging to set up. For a quickstart scenario, I will use docker swarm and an overlay network instead. If your environment is a single host, the networking is even simpler. This configuration is not recommended for production.

I will use pssh to run commands in parallel across the cluster based on a hostlist file and a workerlist file. The hostlist file should contain every host in the cluster, and the workerlist file should include every node except for the one chosen to be the docker swarm master node.

Prerequisites
Install HDP 3.0.1 with or without Kerberos
Install Docker on every host in the cluster:

#pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem -o 'StrictHostKeyChecking no'" "echo hostname"
pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo yum install -y yum-utils device-mapper-persistent-data lvm2"
pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo";
pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo yum install -y docker-ce"
pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo systemctl start docker"
pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo systemctl enable docker"

Configure docker swarm and create an overlay network:

ssh -i ~/cloudbreak.pem cloudbreak@<masternode> "sudo docker swarm init"
pssh -i -h workerlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo <output from last command: docker swarm join ...>"
ssh -i ~/cloudbreak.pem cloudbreak@<masternode> "sudo docker network create -d overlay --attachable yarnnetwork"

If Kerberos is not enabled, create a default user for containers:

pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo useradd dockeruser"

Ambari
In the YARN general settings tab, toggle the Docker Runtime button to "Enabled". This should change the following setting in Advanced YARN-Site:

yarn.nodemanager.container-executor.class=org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor

In Advanced YARN-Site, change the following, so all YARN docker containers use the overlay network we created by default:

yarn.nodemanager.runtime.linux.docker.default-container-network=yarnnetwork
yarn.nodemanager.runtime.linux.docker.allowed-container-networks=host,none,bridge,yarnnetwork

In Custom YARN-Site, add the following if Kerberos is not enabled:

yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user=dockeruser

In Advanced Container Executor, set:

docker_trusted_registries=*

Note that this allows any image from Docker Hub to be run. To limit the Docker images that can be run, set this property to a comma-separated list of trusted registries. Docker images have the form <registry>/<imageName>:<tag>.

Alternatively, the following Ambari blueprint encapsulates these configurations:
{
"configurations" : [
{
"yarn-site" : {
"properties" : {
"yarn.nodemanager.container-executor.class" : "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor",
"yarn.nodemanager.runtime.linux.docker.default-container-network" : "yarnnetwork",
"yarn.nodemanager.runtime.linux.docker.allowed-container-networks" : "host,none,bridge,yarnnetwork",
"yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user" : "dockeruser"
}
}
},
{
"container-executor" : {
"properties" : {
"docker_trusted_registries" : "library",
"docker_module_enabled" : "true"
}
}
}],
"host_groups" : [
{
"name" : "all",
"components" : [
{"name" : "HISTORYSERVER"},
{"name" : "NAMENODE"},
{"name" : "APP_TIMELINE_SERVER"},
{"name" : "NODEMANAGER"},
{"name" : "DATANODE"},
{"name" : "RESOURCEMANAGER"},
{"name" : "ZOOKEEPER_SERVER"},
{"name" : "SECONDARY_NAMENODE"},
{"name" : "HDFS_CLIENT"},
{"name" : "ZOOKEEPER_CLIENT"},
{"name" : "YARN_CLIENT"},
{"name" : "MAPREDUCE2_CLIENT"}
],
"cardinality" : "1"
}
],
"Blueprints" : {
"blueprint_name" : "yarn sample",
"stack_name" : "HDP",
"stack_version" : "3.0"
}
}

Save the configurations and restart YARN.

Usage
The YARN service REST API documentation can be found here: https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html
The YARN app CLI documentation can be found here: https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#application_or_app

Testing without Kerberos
Place the following service definition into a file (e.g. yarnservice.json):
{
"name": "redis-service",
"version": "1.0.0",
"description": "redis example",
"components" :
[
{
"name": "redis",
"number_of_containers": 1,
"artifact": {
"id": "library/redis",
"type": "DOCKER"
},
"launch_command": "",
"resource": {
"cpus": 1,
"memory": "256"
},
"configuration": {
"env": {
"YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "true"
}
}
}
]
}

Submit the service with the following curl command. YARN should respond with the applicationId. The user will need write permission on their HDFS home directory (e.g. hdfs:/user/user1); ambari-qa has it by default.

curl -X POST -H "Content-Type: application/json" http://<resource manager>:8088/app/v1/services?user.name=ambari-qa -d @yarnservice.json

The service status can be viewed on the YARN UI, or through the REST APIs (python makes it easier to read):

curl http://<resource manager>:8088/app/v1/services/redis-service?user.name=ambari-qa | python -m json.tool

The service name must be unique in the cluster. If you need to delete your service, the following command can be used:

curl -X DELETE http://<resource manager>:8088/app/v1/services/redis-service?user.name=ambari-qa

Testing with Kerberos
Create a Kerberos principal of the format <username>/<hostname>@<realm>. The hostname portion of the principal is required.

Create a keytab for the principal and upload it to HDFS:

kadmin.local
>addprinc user1/host1.example.com@EXAMPLE.COM
...
>xst -k user1_host1.keytab user1/host1.example.com@EXAMPLE.COM
...
>exit
hadoop fs -put user1_host1.keytab hdfs:/user/user1/
hadoop fs -chown user1 hdfs:/user/user1/

Place the following service definition into a file (e.g. yarnservice.json):
{
"name": "redis-service",
"version": "1.0.0",
"description": "redis example",
"components" :
[
{
"name": "redis",
"number_of_containers": 1,
"artifact": {
"id": "library/redis",
"type": "DOCKER"
},
"launch_command": "",
"resource": {
"cpus": 1,
"memory": "256"
},
"configuration": {
"env": {
"YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "true"
}
}
}
],
"kerberos_principal": {
"principal_name": "user1/host1.example.com@EXAMPLE.COM",
"keytab": "hdfs:/user/user1/user1_host1.keytab"
}
}

Submit the service with the following curl command. YARN should respond with the applicationId. User1 will need permission to write into their HDFS home directory (hdfs:/user/user1).

curl --negotiate -u : -X POST -H "Content-Type: application/json" http://<resource manager>:8088/app/v1/services -d @yarnservice.json

The service status can be viewed on the YARN UI, or through the REST APIs (python makes it easier to read):

curl --negotiate -u : http://<resource manager>:8088/app/v1/services/redis-service | python -m json.tool

The service name must be unique in the cluster. If you need to delete your service, the following command can be used:

curl --negotiate -u : -X DELETE http://<resource manager>:8088/app/v1/services/redis-service

Adding a local docker registry
Each node in the cluster needs a way of downloading docker images when a service is run. It is possible to just use the public docker hub, but that is not always an option. Similar to creating a local repo for yum, a local registry can be created for Docker. Here is a quickstart that skips the security steps. In production, security best practices should be followed.

On a master node, create an instance of the docker registry container. This will bind the registry to port 5000 on the host machine:

docker run -d -p 5000:5000 --restart=always --name registry -v /mnt/registry:/var/lib/registry registry:2

Configure each machine to skip HTTPS checks: https://docs.docker.com/registry/insecure/. Here are commands for CentOS 7:

pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo echo '{\"insecure-registries\": [\"<registryHost>:5000\"]}' | sudo tee --append /etc/docker/daemon.json"
pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo systemctl restart docker"

The YARN service configuration docker_trusted_registries needs to be set to star (*) or needs to have this local registry in its list (e.g. library,<registryHost>:5000). Restart YARN.

Testing Local Docker Registry
Build, tag, and push an image to the registry:

docker build -t myImage:1 .
docker tag myImage:1 <registryHost>:5000/myImage:1
docker push <registryHost>:5000/myImage:1

View the image via REST:

curl <registryHost>:5000/v2/_catalog
curl <registryHost>:5000/v2/myImage/tags/list

Download the image to all hosts in the cluster (only necessary to demonstrate connectivity; Docker and YARN do this automatically):

pssh -i -h hostlist -l cloudbreak -x "-i ~/cloudbreak.pem" "sudo docker pull <registryHost>:5000/myImage:1"

Now, when an image with this registry prefix (e.g. <registryHost>:5000/myImage:1) is used in a YARN service definition, YARN will use the image from this local registry instead of trying to pull from the default public location.
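For illustration, a service definition that pulls from the local registry might look like the following. This is a minimal sketch for a non-kerberized cluster: the service and component names are made up, and everything else mirrors the redis example above.

cat > myimage-service.json <<'EOF'
{
  "name": "myimage-service",
  "version": "1.0.0",
  "description": "local registry example",
  "components" :
  [
    {
      "name": "myimage",
      "number_of_containers": 1,
      "artifact": {
        "id": "<registryHost>:5000/myImage:1",
        "type": "DOCKER"
      },
      "launch_command": "",
      "resource": {
        "cpus": 1,
        "memory": "256"
      }
    }
  ]
}
EOF

curl -X POST -H "Content-Type: application/json" http://<resource manager>:8088/app/v1/services?user.name=ambari-qa -d @myimage-service.json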
11-10-2018
01:08 AM
That worked. I just uploaded one of the keytabs into hdfs:/user/user1/user1_host1.keytab and updated the "kerberos_principal" section as follows. Is there a plan to remove the hostname requirement? Thanks, @Gour Saha!

"kerberos_principal": {
  "principal_name": "user1/host1.example.com@EXAMPLE.COM",
  "keytab": "hdfs:/user/user1/user1_host1.keytab"
}
11-10-2018
12:59 AM
Turns out, the guide that I was following was outdated. I tried this again on a different cluster and it worked perfectly with the Ambari default for yarn.nodemanager.linux-container-executor.cgroups.mount-path ("/cgroup")
11-10-2018
12:35 AM
I have been able to run Dockerized YARN services on a kerberized HDP 3.0.1 cluster using the following service configuration. However, this requires a service principal to be created for every node in the cluster in the format user1/hostname@EXAMPLE.COM. Additionally, the keytab for each of these principals must be distributed to their respective hosts. Is there a way around this? {
"name": "hello-world",
"version": "1.0.0",
"description": "hello world example",
"components" :
[
{
"name": "hello",
"number_of_containers": 5,
"artifact": {
"id": "library/redis",
"type": "DOCKER"
},
"launch_command": "",
"resource": {
"cpus": 1,
"memory": "256"
},
"configuration": {
"env": {
"YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "true"
}
}
}
],
"kerberos_principal": {
"principal_name": "user1/_HOST@EXAMPLE.COM",
"keytab": "file:///etc/security/keytabs/user1.keytab"
}
}

If I leave out the "kerberos_principal" section completely, I receive this error at service submission:

{"diagnostics":"Kerberos principal or keytab is missing."}

If I use a principal without the "_HOST" portion, I receive this error at service submission:

{"diagnostics":"Kerberos principal (user1@EXAMPLE.COM) does not contain a hostname."}

If the keytab does not exist on the worker node, I receive this error in the application log:

org.apache.hadoop.service.ServiceStateException: java.io.IOException:
SASL is configured for registry, but neither keytab/principal nor
java.security.auth.login.config system property are specified
11-08-2018
08:32 PM
Comma separating the launch_command fields and setting YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE=true in the service definition rather than yarn-site.env allowed me to use my custom entrypoint as expected in HDP 3.0.1. Strangely, exporting YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE=true in the yarn-site.env worked fine in my Apache Hadoop 3.1.1 environment, but not in my Ambari-installed HDP 3.0.1 environment. The stdout and stderr redirects are not included in the docker run command in the Apache release. Must be some other setting involved, but I am past my issue, so I will leave it here. Thanks, @Tarun Parimi!
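For reference, a minimal sketch of the component definition implied by this fix, reusing the image and arguments from this thread; the comma-separated launch_command and the per-service environment variable are the two changes described above:

{
  "name": "myappcontainers",
  "number_of_containers": 1,
  "artifact": {
    "id": "myapp:1.0-SNAPSHOT",
    "type": "DOCKER"
  },
  "launch_command": "input1,input2",
  "resource": {
    "cpus": 1,
    "memory": "256"
  },
  "configuration": {
    "env": {
      "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "true"
    }
  }
}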
11-05-2018
08:29 PM
@Tarun Parimi Thanks for the tip. I set "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE" in the yarn-env.sh, because I did not know I could set it on the service itself. This is a better solution, but does not address the main issue. Unfortunately, comma separating the input parameters did not help either. Here is the output: Docker run command: /usr/bin/docker run --name=container_e06_1541194419811_0015_01_000002 --user=1015:1015 --net=yarnnetwork -v /hadoop/yarn/local/filecache:/hadoop/yarn/local/filecache:ro -v /hadoop/yarn/local/usercache/shjelmfelt/filecache:/hadoop/yarn/local/usercache/admin/filecache:ro -v /hadoop/yarn/log/application_1541194419811_0015/container_e06_1541194419811_0015_01_000002:/hadoop/yarn/log/application_1541194419811_0015/container_e06_1541194419811_0015_01_000002 -v /hadoop/yarn/local/usercache/admin/appcache/application_1541194419811_0015:/hadoop/yarn/local/usercache/admin/appcache/application_1541194419811_0015 --cgroup-parent=/hadoop-yarn/container_e06_1541194419811_0015_01_000002 --cap-drop=ALL --cap-add=SYS_CHROOT --cap-add=MKNOD --cap-add=SETFCAP --cap-add=SETPCAP --cap-add=DAC_READ_SEARCH --cap-add=FSETID --cap-add=SYS_PTRACE --cap-add=CHOWN --cap-add=SYS_ADMIN --cap-add=AUDIT_WRITE --cap-add=SETGID --cap-add=NET_RAW --cap-add=FOWNER --cap-add=SETUID --cap-add=DAC_OVERRIDE --cap-add=KILL --cap-add=NET_BIND_SERVICE --hostname=myappcontainers-0.myapp.admin.EXAMPLE.COM --group-add 1015 --env-file /hadoop/yarn/local/nmPrivate/application_1541194419811_0015/container_e06_1541194419811_0015_01_000002/docker.container_e06_1541194419811_0015_01_0000022942289027318724111.env 172.26.224.119:5000/myapp:1.0-SNAPSHOT input1 input2 1>/hadoop/yarn/log/application_1541194419811_0015/container_e06_1541194419811_0015_01_000002/stdout.txt 2>/hadoop/yarn/log/application_1541194419811_0015/container_e06_1541194419811_0015_01_000002/stderr.txt
Received input: input1,input2 1>/hadoop/yarn/log/application_1541194419811_0015/container_e06_1541194419811_0015_01_000002/stdout.txt 2>/hadoop/yarn/log/application_1541194419811_0015/container_e06_1541194419811_0015_01_000002/stderr.txt
11-02-2018
11:11 PM
When running a Dockerized YARN service, YARN is not providing the correct input arguments. The service is defined as follows. The entry point in the docker file is ["java", "-jar", "myapp.jar"]. For debugging, it outputs the incoming arguments and exits. {
"name": "myapp",
"version": "1.0.0",
"description": "myapp",
"components" :
[
{
"name": "myappcontainers",
"number_of_containers": 1,
"artifact": {
"id": "myapp:1.0-SNAPSHOT",
"type": "DOCKER"
},
"launch_command": "input1 input2",
"resource": {
"cpus": 1,
"memory": "256"
}
}
]
}

Here is the output from YARN:

Launching docker container...
Docker run command: /usr/bin/docker run --name=container_e06_1541194419811_0006_01_000026 --user=1015:1015 --net=yarnnetwork -v /hadoop/yarn/local/filecache:/hadoop/yarn/local/filecache:ro -v /hadoop/yarn/local/usercache/admin/filecache:/hadoop/yarn/local/usercache/admin/filecache:ro -v /hadoop/yarn/log/application_1541194419811_0006/container_e06_1541194419811_0006_01_000026:/hadoop/yarn/log/application_1541194419811_0006/container_e06_1541194419811_0006_01_000026 -v /hadoop/yarn/local/usercache/admin/appcache/application_1541194419811_0006:/hadoop/yarn/local/usercache/admin/appcache/application_1541194419811_0006 --cgroup-parent=/hadoop-yarn/container_e06_1541194419811_0006_01_000026 --cap-drop=ALL --cap-add=SYS_CHROOT --cap-add=MKNOD --cap-add=SETFCAP --cap-add=SETPCAP --cap-add=DAC_READ_SEARCH --cap-add=FSETID --cap-add=SYS_PTRACE --cap-add=CHOWN --cap-add=SYS_ADMIN --cap-add=AUDIT_WRITE --cap-add=SETGID --cap-add=NET_RAW --cap-add=FOWNER --cap-add=SETUID --cap-add=DAC_OVERRIDE --cap-add=KILL --cap-add=NET_BIND_SERVICE --hostname=myappcontainers-3.myapp.admin.EXAMPLE.COM --group-add 1015 --env-file /hadoop/yarn/local/nmPrivate/application_1541194419811_0006/container_e06_1541194419811_0006_01_000026/docker.container_e06_1541194419811_0006_01_0000264842430064377299975.env myapp:1.0-SNAPSHOT input1 input2 1>/hadoop/yarn/log/application_1541194419811_0006/container_e06_1541194419811_0006_01_000026/stdout.txt 2>/hadoop/yarn/log/application_1541194419811_0006/container_e06_1541194419811_0006_01_000026/stderr.txt
Received input: input1 input2 1>/hadoop/yarn/log/application_1541194419811_0006/container_e06_1541194419811_0006_01_000026/stdout.txt 2>/hadoop/yarn/log/application_1541194419811_0006/container_e06_1541194419811_0006_01_000026/stderr.txt

The program itself is given the redirection commands. Is there a way to disable this behavior? The only two workarounds I have identified are:
- Change the ENTRYPOINT in the dockerfile to be ["sh", "-c"] and the launch_command to "java -jar myjar.jar"
- Change the program to use or ignore the "1>" and "2>" inputs

Both of these solutions require repackaging in a way that does not conform to Docker best practice.
11-02-2018
07:14 PM
1 Kudo
I was able to work around this error by running:

sudo mkdir /sys/fs/cgroup/blkio/hadoop-yarn
sudo chown -R yarn:yarn /sys/fs/cgroup/blkio/hadoop-yarn

I then received a very similar message for "/sys/fs/cgroup/memory/hadoop-yarn" and "/sys/fs/cgroup/cpu/hadoop-yarn". After creating these directories as well, the node managers came up. Here is the full work-around that was run on each node:

sudo mkdir /sys/fs/cgroup/blkio/hadoop-yarn
sudo chown -R yarn:yarn /sys/fs/cgroup/blkio/hadoop-yarn
sudo mkdir /sys/fs/cgroup/memory/hadoop-yarn
sudo chown -R yarn:yarn /sys/fs/cgroup/memory/hadoop-yarn
sudo mkdir /sys/fs/cgroup/cpu/hadoop-yarn
sudo chown -R yarn:yarn /sys/fs/cgroup/cpu/hadoop-yarn
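The same workaround can be scripted as a loop; a minimal sketch, assuming the three cgroup controllers reported in the error messages:

# Create the hadoop-yarn cgroup directory for each controller and hand it to the yarn user
for controller in blkio memory cpu; do
  sudo mkdir -p /sys/fs/cgroup/${controller}/hadoop-yarn
  sudo chown -R yarn:yarn /sys/fs/cgroup/${controller}/hadoop-yarn
done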
11-02-2018
07:10 PM
I am receiving the following message from each node manager when attempting to start YARN after enabling docker. What is the root cause?

2018-11-02 18:28:50,974 INFO recovery.NMLeveldbStateStoreService (NMLeveldbStateStoreService.java:checkVersion(1662)) - Loaded NM state version info 1.2
2018-11-02 18:28:51,174 INFO resources.ResourceHandlerModule (ResourceHandlerModule.java:initNetworkResourceHandler(182)) - Using traffic control bandwidth handler
2018-11-02 18:28:51,193 WARN resources.CGroupsBlkioResourceHandlerImpl (CGroupsBlkioResourceHandlerImpl.java:checkDiskScheduler(101)) - Device vda does not use the CFQ scheduler; disk isolation using CGroups will not work on this partition.
2018-11-02 18:28:51,199 INFO resources.CGroupsHandlerImpl (CGroupsHandlerImpl.java:initializePreMountedCGroupController(410)) - Initializing mounted controller blkio at /sys/fs/cgroup/blkio/hadoop-yarn
2018-11-02 18:28:51,199 INFO resources.CGroupsHandlerImpl (CGroupsHandlerImpl.java:initializePreMountedCGroupController(420)) - Yarn control group does not exist. Creating /sys/fs/cgroup/blkio/hadoop-yarn
2018-11-02 18:28:51,200 ERROR nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:init(323)) - Failed to bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Unexpected: Cannot create yarn cgroup Subsystem:blkio Mount points:/proc/mounts User:yarn Path:/sys/fs/cgroup/blkio/hadoop-yarn
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:425)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:377)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsBlkioResourceHandlerImpl.bootstrap(CGroupsBlkioResourceHandlerImpl.java:123)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
2018-11-02 18:28:51,205 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service NodeManager failed in state INITED
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:393)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:324)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
... 3 more
2018-11-02 18:28:51,207 ERROR nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(936)) - Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:393)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:933)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1013)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:324)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
... 3 more
10-30-2018
12:47 AM
This error was resolved by explicitly specifying the content type as JSON: curl ... -H "Content-Type: application/json"
10-30-2018
12:39 AM
I receive the following generic error when attempting to POST a YARN service definition using the api: "/app/v1/services":

2018-10-27 09:33:24,440 WARN webapp.GenericExceptionHandler (GenericExceptionHandler.java:toResponse(98)) - INTERNAL_SERVER_ERROR
javax.ws.rs.WebApplicationException
at com.sun.jersey.server.impl.uri.rules.TerminatingRule.accept(TerminatingRule.java:66)
at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178)
at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
at org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
at org.apache.hadoop.security.http.CrossOriginFilter.doFilter(CrossOriginFilter.java:98)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
10-04-2018
08:11 PM
2 Kudos
Most data movement use cases do not require a “shuffle phase” for redistributing FlowFiles across a NiFi cluster, but there are a few cases where it is useful. For example:

ListFile -> FetchFile
ListHDFS -> FetchHDFS
ListFTP -> FetchFTP
GenerateTableFetch -> ExecuteSQL
GetSQS -> FetchS3

In each case, the flow starts with a processor that generates tasks to run (e.g. filenames) followed by the actual execution of those tasks. To scale, tasks need to run on each node in the NiFi cluster, but for consistency, the task generation should only run on the primary node. The solution is to introduce a shuffle (aka load balancing) step in between task generation and task execution.

Processors can be configured to run on the primary node by going to “View Configuration” -> “Scheduling” and selecting “Primary node only” under “Execution”.

The shuffle step is not an explicit component on the NiFi canvas, but rather the combination of a Remote Input Port and a Remote Process Group pointing at the local cluster. FlowFiles that are sent to the Remote Process Group will be load balanced over Site-to-Site and come back into the flow via the Remote Input Port. Under “Manage Remote Ports” on the Remote Process Group there are batch settings that help control the load balancing.

Here are two example flows that use this design pattern:
05-14-2018
08:51 PM
7 Kudos
A quick glance at NiFi’s 252+ processors shows that it can solve a wide array of use cases out of the box. What is not immediately obvious is the flexibility that its attributes and expression language can provide. This allows it to quickly, easily, and efficiently solve complex use cases that would require significant customization in other solutions. For example, sending all of the incoming data to both Kafka and HDFS while sending 10% to a dev environment and a portion to a partner system based on the content of the data (e.g. CustomerName=ABC). These more complex routing scenarios are easily accommodated using UpdateAttribute, RouteOnAttribute, and RouteOnContent.

Another example of NiFi’s flexibility is the ability to multiplex data flows. In traditional ETL systems, the schema is tightly coupled to the data as it moves between systems, because transformations occur in transit. In more modern ELT scenarios, the data is often loaded into the destination with minimal transformations before the complex transformation step is kicked off. This has many advantages and allows NiFi to focus on the EL portion of the flow. When focused on EL, there is far less of a need for the movement engine to be schema aware, since it is generally focused on simple routing, filtering, format translation, and concatenation. One common scenario is loading data from many Kafka topics into their respective HDFS directories and/or Hive tables with only simple transformations. In traditional systems, this would require one flow per topic, but by parameterizing flows, one flow can be used for all topics.

In the image below you can see the configurations and attributes that make this possible. The ConsumeKafka processor can use a list of topics or a regular expression to consume from many topics at once. Each FlowFile (e.g. batch of Kafka messages) has an attribute added called "kafka.topic" to identify its source topic.

Next, in order to load streaming data into HDFS or Hive, it is recommended to use MergeContent to combine records into large files (e.g. every 1GB or every 15 minutes). In MergeContent, setting the “correlation attribute” configuration to “kafka.topic” ensures that only records from the same Kafka topic are combined (similar to a group-by clause). After the files are merged, the “directory” configuration in PutHDFS can be parameterized (e.g. /myDir/${kafka.topic}) in order to load the data into the correct directory based on the Kafka topic name.

Note that this diagram includes a retry and notify on failure process group. This type of solution is highly recommended for production flows. More information can be found here.

This example could easily be extended to include file format translation (e.g. ConvertAvroToORC), filtering (e.g. RouteOnContent), or kafka-topic to HDFS-directory mapping (e.g. UpdateAttribute). It can even trigger downstream processing (e.g. ExecuteSparkInteractive, PutHiveQL, ExecuteStreamCommand, etc.) or periodically update metrics and logging solutions such as Graphite, Druid, or Solr. Of course, this solution also applies to many more data stores than just Kafka and HDFS.

Overall, parameterizing flows in NiFi for multiplexing can reduce complexity for EL use cases and simplify administration. This design is straightforward to implement and uses core NiFi features. It is also easily extended to a variety of use cases.
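For reference, a rough sketch of the key property values described above; the processor and property names are standard NiFi, while the topics and the /myDir path are the placeholders used in the text:

ConsumeKafka   Topic Name(s):               topicA,topicB,topicC (or a regular expression)
MergeContent   Correlation Attribute Name:  kafka.topic
PutHDFS        Directory:                   /myDir/${kafka.topic}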
02-27-2018
08:25 PM
Hi Mitthu, Here is an article I wrote about handling failures in NiFi: https://community.hortonworks.com/articles/76598/nifi-error-handling-design-pattern-1.html It describes how to retry failures X times, then send an email, then wait for administrative input. This might help you address the requirements of your solution. You could also add a PutEmail processor on the "Success" relationship to send an email after processing succeeds.
02-27-2018
08:19 PM
Hi Sirisha, Have you tried this guide? It has a nice description of the process. https://community.hortonworks.com/articles/59635/one-way-trust-mit-kdc-to-active-directory.html
02-27-2018
08:17 PM
Have you tried this guide? It has a nice description on how to run spark jobs on a schedule with Oozie. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_spark-component-guide/content/ch_oozie-spark-action.html
02-27-2018
08:14 PM
While the multi-tenant features of HDP (e.g. YARN capacity scheduler, Ranger policies, HDFS quotas, etc.) could be used to combine Dev/QA/Prod environments into a single cluster, it is generally not recommended. Managing a single cluster instead of three seems easier on the surface, but it is really not worth it. First of all, where are developers going to test against new versions if you only have one cluster? Combining Dev and QA may be an option, but is more of an organizational decision. A configuration I like is Prod, DR/Ad-hoc, and Dev/QA. Most companies require a DR environment in sync with production. By making that DR environment read-only, you can run exploratory analytics and/or data science workloads using resources that would otherwise sit idle. Additionally, pulling the lower priority and unpredictable workloads out of production reduces the risk of missing SLAs. Of course, all of this is use case dependent, and your mileage may vary. The best thing about "big data" technologies is how customizable and broadly applicable they are, and the worst thing is how customizable and broadly applicable they are 🙂
02-27-2018
07:53 PM
I would suggest using the HandleHTTPRequest and HandleHTTPResponse processors instead of ListenHTTP. They provide much more flexibility. Here is a nice guide from @Chris Gambino: https://community.hortonworks.com/articles/55080/create-a-restful-for-nifi-walmart-case-study.html
01-05-2017
10:20 PM
8 Kudos
Many process groups have a success and failure output relationship. A common question is how to best handle these failures. For invalid data, it makes sense to output the flow files to an HDFS directory for analysis, but not when failure was caused by an external dependency (e.g. HDFS, Kafka, FTP). A simple solution might be to loop the failures back to retry, but then it may fail repeatedly without notifying an administrator. A better solution would be to retry three times, then, if it still has not succeeded, an administrator should be notified and the flow file should wait before trying again. This gives the administrator time to resolve the issue and the ability to quickly and easily retry the flow files.

Below (and attached) is a simple process group that implements this logic. The failed flow files come in through the input port. The UpdateAttribute processor sets the retryCount attribute to one or increments it if it has already been set. The RouteOnAttribute processor determines whether the retryCount attribute is over a threshold (e.g. three). If it is not over the threshold, the flow file is routed out through the retry port. If it is over the threshold, the flow file is routed to a PutEmail processor.

The last UpdateAttribute processor should be disabled at all times so that the flowfiles will queue up after the PutEmail processor to wait for the administrator to resolve the issue. Once the issue is resolved, the administrator simply enables, starts, stops, and disables this last processor. The retryCount attribute will be set to zero and the flow file will go out through the retry port. If the flow file still does not succeed, it will go back into this process group and the administrator will get another email. Note that a merge content processor could be used to reduce the number of emails, if necessary.
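As a rough sketch of the expression language this pattern typically relies on (the retryCount attribute and the threshold of three come from the description above; exact property names depend on how you build the flow):

# UpdateAttribute: increment the counter, treating a missing attribute as zero
retryCount = ${retryCount:replaceNull(0):plus(1)}

# RouteOnAttribute: route to the notification path once the threshold is exceeded
overThreshold = ${retryCount:gt(3)}

# Final (normally disabled) UpdateAttribute: reset the counter before retrying
retryCount = 0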
11-08-2016
07:59 PM
1 Kudo
You will have the option to select which services you want to install, similar to HDP. You can select only Zookeeper and NiFi, but I would recommend LogSearch, Ambari Metrics, and Ranger as they really augment the solution.
09-02-2016
11:48 PM
1 Kudo
The PutEmail processor can be configured to run only on the primary node. The configuration is available under the scheduling tab of the processor settings window.
04-25-2016
08:06 PM
I received the same error when using get_json_object. Replacing the dollar sign with \044 worked. For example: select get_json_object(jsonField,'$.subelement') from jsonTable;
Looks like Atlas is not handling '$' in hive queries correctly.
04-11-2016
09:30 PM
6 Kudos
An HTTPS endpoint for receiving data in NiFi requires two processors and two controller services: HandleHttpRequest, HandleHttpResponse, SSLContextService, and HttpContextMap.

Note: The HandleHttpRequest processor in NiFi 0.6 does not have functional client authentication, but a fix will be implemented in the next version (see NIFI-1753).

SSL Context Service
This service can be created during the setup of the HandleHttpRequest processor. The following properties should be set:

Name = AgentSSLContextService
Keystore Filename = <Path to Keystore>
Keystore Password = <Keystore Password>
Keystore Type = JKS
SSL Protocol = TLS

Since Client Authentication will be disabled in the HandleHttpRequest processor, the Truststore configurations are not necessary.

HTTP Context Map
This service can be created during the setup of the HandleHttpRequest processor. The name should be set to AgentSSLContextMap.

HandleHttpRequest
This processor receives HTTP requests. The following properties should be set:

Listening Port = 4444
SSL Context Service = AgentSSLContextService
HTTP Context Map = AgentSSLContextMap
Allow GET = false
Allow POST = true
Allow PUT = false
Allow DELETE = false
Allow HEAD = false
Allow OPTIONS = false
Client Authentication = No Authentication

HandleHttpResponse
This processor sends an HTTP response to the client. For this example, only one is needed with a status code set to 200. The HTTP Context Map must be set to AgentSSLContextMap in order to link it to the HandleHttpRequest processor.

Sample Client
The Java client will need a Truststore containing the certificate used by the SSLContextService. The following Java code sample demonstrates the process for posting data to the NiFi flow:
//Requires: java.io.DataOutputStream, java.net.URL, javax.net.ssl.HttpsURLConnection, javax.net.ssl.SSLSocketFactory
//NiFiHostname and port are placeholders for the NiFi host and the listening port configured above

//Set up SSL properties
System.setProperty("javax.net.ssl.trustStoreType","jks");
System.setProperty("javax.net.ssl.trustStore","agent_truststore.ts");
System.setProperty("javax.net.ssl.trustStorePassword","hadoop");
//System.setProperty("javax.net.debug","ssl"); //Verbose SSL logging

//Uncomment for client authentication
//System.setProperty("javax.net.ssl.keyStoreType","jks");
//System.setProperty("javax.net.ssl.keyStore","agent_keystore.jks");
//System.setProperty("javax.net.ssl.keyStorePassword","hadoop");

//Set up connection
SSLSocketFactory sslsocketfactory = (SSLSocketFactory) SSLSocketFactory.getDefault();
URL url = new URL("https://" + NiFiHostname + ":" + port);
HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
conn.setSSLSocketFactory(sslsocketfactory);

//Send POST
conn.setRequestMethod("POST");
conn.setReadTimeout(5000);
conn.setConnectTimeout(5000);
//Note: In NiFi, HTTP headers are added as attributes with the following pattern:
//http.headers.{headerName}
conn.setRequestProperty("attr1","value");
conn.setDoOutput(true);
DataOutputStream wr = new DataOutputStream(conn.getOutputStream());
wr.writeBytes("test123");
wr.flush();
wr.close();

//Get response code
int code = conn.getResponseCode();
System.out.println(code);
conn.disconnect();
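For a quick smoke test without the Java client, the same endpoint can be exercised with curl. A minimal sketch, assuming the flow above is listening on port 4444 and that the server certificate has been exported to a local PEM file (the file name is illustrative):

# Post a test payload with a custom header; NiFi exposes it as the http.headers.attr1 attribute
curl --cacert nifi-cert.pem -X POST -H "attr1: value" --data "test123" "https://<NiFiHostname>:4444/"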
02-13-2016
06:07 AM
8 Kudos
Repo Description
DockerDoop is an extension of the Hortonworks University solution for running multi-node HDP clusters in a single VM using Docker. It simplifies HDP dev/test workloads such as those related to HA, Kerberos, failover scenarios, and multi-node demos. This is especially useful for blueprint testing, because a blank cluster can be destroyed and recreated in less than 1 minute. 3-6 node clusters can be run with only 6-8GB of RAM and multiple clusters can exist in the same VM. All HDP ports on all nodes are externally accessible (i.e. from the workstation).

Repo Info
Github Repo URL: https://github.com/SamHjelmfelt/DockerDoop
Github account name: SamHjelmfelt
Repo name: DockerDoop