03-08-2017
02:21 AM
@Prasanna G When you import the VMware Sandbox and start the VM, the Docker container within that VM is started automatically. You don't normally have to start it. Is that process not working properly for you? Or are you using the native Docker Sandbox inside your own Linux VM in VMware?
03-06-2017
03:52 PM
@Yogesh Sharma Can you share your Elasticsearch index template that defines your field mappings? What type of analyzers and tokenizers are you using? You can define the mappings on a per-index basis: https://www.elastic.co/guide/en/elasticsearch/reference/5.2/analysis.html Are you using the standard analyzer? https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
03-05-2017
06:58 PM
4 Kudos
Objective
This tutorial will walk you through the process of using Ansible to deploy Hortonworks Data Platform (HDP) on Amazon Web Services (AWS). We will use the ansible-hadoop Ansible playbook from ObjectRocket to do this. You can find more information on that playbook here: ObjectRocket Ansible-Hadoop
This tutorial is part 2 of a 2 part series. Part 1 of the series shows you how to use Ansible to create instances on Amazon Web Services (AWS). Part 1 is available here: HCC Article Part 1
This tutorial was created as a companion to the Ansible + Hadoop talk I gave at the Ansible NOVA Meetup in February 2017. You can find the slides to that talk here: SlideShare
Prerequisites
You must have an existing AWS account.
You must have access to your AWS Access and Secret keys.
You are responsible for all AWS costs incurred.
You should have 3-6 instances created in AWS. If you completed Part 1 of this series, then you have an easy way to do that.
Scope
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6 and 10.12.3
Amazon Web Services
Anaconda 4.1.6 (Python 2.7.12)
Ansible 2.1.3.0
git 2.10.1
Steps
Create python virtual environment
We are going to create a Python virtual environment for installing the required Python modules. This will help eliminate module version conflicts between applications.
I prefer to use Continuum Anaconda for my Python distribution. Therefore the steps for setting up a python virtual environment will be based on that. However, you can use standard python and the virtualenv command to do something similar.
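If you are not using Anaconda, a rough sketch of the equivalent using the standard virtualenv tool would look something like the following (the environment name is just an example):
$ pip install virtualenv
$ virtualenv ansible-hadoop          # creates the environment in ./ansible-hadoop
$ source ansible-hadoop/bin/activate # activates it in the current shell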
To create a virtual environment using Anaconda Python, you use the conda create command. We will name our virtual environment ansible-hadoop. The following command, conda create --name ansible-hadoop python, will create our virtual environment with the name specified. You should see something similar to the following:
$ conda create --name ansible-hadoop python
Fetching package metadata .......
Solving package specifications: ..........
Package plan for installation in environment /Users/myoung/anaconda/envs/ansible-hadoop:
The following NEW packages will be INSTALLED:
openssl: 1.0.2k-1
pip: 9.0.1-py27_1
python: 2.7.13-0
readline: 6.2-2
setuptools: 27.2.0-py27_0
sqlite: 3.13.0-0
tk: 8.5.18-0
wheel: 0.29.0-py27_0
zlib: 1.2.8-3
Proceed ([y]/n)? y
Linking packages ...
cp: /Users/myoung/anaconda/envs/ansible-hadoop:/lib/libcrypto.1.0.0.dylib: No such file or directory
mv: /Users/myoung/anaconda/envs/ansible-hadoop/lib/libcrypto.1.0.0.dylib-tmp: No such file or directory
[ COMPLETE ]|################################################################################################| 100%
#
# To activate this environment, use:
# $ source activate ansible-hadoop
#
# To deactivate this environment, use:
# $ source deactivate
#
Switch python environments
Before installing python packages for a specific development environment, you should activate the environment. This is done with the command source activate <environment> . In our case the environment is the one we just created, ansible-hadoop . You should see something similar to the following:
$ source activate ansible-hadoop
As you can see, there is no output to indicate whether we successfully changed environments.
To verify, you can use the conda info --envs command to list the available environments. The active environment will have a * . You should see something similar to the following:
$ conda info --envs
# conda environments:
#
ansible-hadoop * /Users/myoung/anaconda/envs/ansible-hadoop
root /Users/myoung/anaconda
As you can see, the ansible-hadoop environment has the * which means it is the active environment.
If you want to remove your Python virtual environment, you can use the following command: conda remove --name <environment> --all . Removing the environment we just created would look similar to the following:
$ conda remove --name ansible-hadoop --all
Package plan for package removal in environment /Users/myoung/anaconda/envs/ansible-hadoop:
The following packages will be REMOVED:
openssl: 1.0.2k-1
pip: 9.0.1-py27_1
python: 2.7.13-0
readline: 6.2-2
setuptools: 27.2.0-py27_0
sqlite: 3.13.0-0
tk: 8.5.18-0
wheel: 0.29.0-py27_0
zlib: 1.2.8-3
Proceed ([y]/n)? y
Unlinking packages ...
[ COMPLETE ]|################################################################################################| 100%
HW11380:test myoung$ conda info --envs
# conda environments:
#
root * /Users/myoung/anaconda
Install Python modules in virtual environment
The ansible-hadoop playbook requires a specific version of Ansible. You need to install Ansible 2.1.3.0 before using the playbook. You can do that easily with the following command:
pip install ansible==2.1.3.0
Using a Python virtual environment allows us to easily use Ansible 2.1.3.0 for our playbook without impacting the default Python installation.
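To confirm that the playbook will pick up the expected version, you can check which Ansible binary is active inside the virtual environment (your paths will differ):
$ which ansible     # should resolve to a path inside the ansible-hadoop environment
$ ansible --version # should report 2.1.3.0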
Clone ansible-hadoop github repo
You need to clone the ansible-hadoop github repo to a working directory on your computer. I typically do this in ~/Development.
$ cd ~/Development
$ git clone https://github.com/objectrocket/ansible-hadoop.git
You should see something similar to the following:
$ git clone https://github.com/objectrocket/ansible-hadoop.git
Cloning into 'ansible-hadoop'...
remote: Counting objects: 3879, done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 3879 (delta 1), reused 0 (delta 0), pack-reused 3873
Receiving objects: 100% (3879/3879), 6.90 MiB | 0 bytes/s, done.
Resolving deltas: 100% (2416/2416), done.
Configure ansible-hadoop
You should make the ansible-hadoop repo directory your current working directory. There are a few configuration items we need to change.
$ cd ansible-hadoop
You should already have 3-6 instances available in AWS. You will need the public IP address of those instances.
Configure ansible-hadoop/inventory/static
We need to modify the inventory/static file to include the public IP addresses of our AWS instances. We need to assign master and slave nodes in the file. The instances all have the same configuration by default, so it doesn't matter which IP addresses you assign to master and slave.
The default version of the inventory/static file should look similar to the following:
[master-nodes]
master01 ansible_host=192.168.0.2 bond_ip=172.16.0.2 ansible_user=rack ansible_ssh_pass=changeme
#master02 ansible_host=192.168.0.2 bond_ip=172.16.0.2 ansible_user=root ansible_ssh_pass=changeme
[slave-nodes]
slave01 ansible_host=192.168.0.3 bond_ip=172.16.0.3 ansible_user=rack ansible_ssh_pass=changeme
slave02 ansible_host=192.168.0.4 bond_ip=172.16.0.4 ansible_user=rack ansible_ssh_pass=changeme
[edge-nodes]
#edge01 ansible_host=192.168.0.5 bond_ip=172.16.0.5 ansible_user=rack ansible_ssh_pass=changeme
I'm going to be using 6 instances in AWS. I will put 3 instances as master servers and 3 instances as slave servers. There are a couple of extra options in the default file we don't need. The only values we need are:
hostname : which should be master, slave or edge with a 1-up number like master01 and slave01
ansible_host : should be the AWS public IP address of the instances
ansible_user : should be the username you use to SSH into the instance with the private key.
You can easily get the public IP address of your instances from the AWS console. Here is what mine looks like:
If you followed the part 1 tutorial, then the username for your instances should be centos . Edit your inventory/static . You should have something similar to the following:
[master-nodes]
master01 ansible_host=#.#.#.# ansible_user=centos
master03 ansible_host=#.#.#.# ansible_user=centos
master03 ansible_host=#.#.#.# ansible_user=centos
[slave-nodes]
slave01 ansible_host=#.#.#.# ansible_user=centos
slave02 ansible_host=#.#.#.# ansible_user=centos
slave03 ansible_host=#.#.#.# ansible_user=centos
#[edge-nodes]
Your public IP addresses will be different. Also note the #[edge-nodes] value in the file. Because we are not using any edge nodes, we should comment out that host group line in the file.
Once you have all of your edits in place, save the file.
Configure ansible-hadoop/ansible.cfg
There are a couple of changes we need to make to the ansible.cfg file. This file provides overall configuration settings for Ansible. The default file in the playbook should look similar to the following:
[defaults]
host_key_checking = False
timeout = 60
ansible_keep_remote_files = True
library = playbooks/library/cloudera
We need to change the library line to be library = playbooks/library/site_facts . We will be deploying HDP which requires the site_facts module. We also need to tell Ansible where to find the private key file for connecting to the instances.
Edit the ansible.cfg file. You should modify the file to be similar to the following:
[defaults]
host_key_checking = False
timeout = 60
ansible_keep_remote_files = True
library = playbooks/library/site_facts
private_key_file=/Users/myoung/Development/ansible-hadoop/ansible.pem
Note the path of your private_key_file will be different. Once you have all of your edits in place, save the file.
Configure ansible-hadoop/group_vars/hortonworks
This step is optional. The group_vars/hortonworks file allows you to change how HDP is deployed. You can modify the version of HDP and Ambari. You can modify which components are installed. You can also specify custom repos and Ambari blueprints.
I will be using the default file, so there are no changes made.
Run bootstrap_static.sh
Before installing HDP, we need to ensure the OS configuration on the AWS instances meets the installation prerequisites. This includes things like ensuring DNS and NTP are working and all of the OS packages are updated. These are tasks that you often find people doing manually. This would obviously be tedious across 100s or 1000s of nodes. It would also introduce a far greater number of opportunities for human error. Ansible makes it incredibly easy to perform these kinds of tasks.
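Before running the bootstrap, it is also worth confirming that Ansible can actually reach every instance over SSH. A quick, optional check using Ansible's built-in ping module against the same static inventory might look like this (it assumes the private key configured in ansible.cfg):
$ ansible -i inventory/static all -m ping   # each host should answer with "pong"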
Running the bootstrap process is as easy as bash bootstrap_static.sh . This script essentially runs ansible-playbook -i inventory/static playbooks/boostrap.yml for you. This process will typically take 7-10 minutes depending on the size of the instances you selected.
When the script is finished, you should see something similar to the following;
PLAY RECAP *********************************************************************
localhost : ok=3 changed=2 unreachable=0 failed=0
master01 : ok=21 changed=15 unreachable=0 failed=0
master03 : ok=21 changed=15 unreachable=0 failed=0
slave01 : ok=21 changed=15 unreachable=0 failed=0
slave02 : ok=21 changed=15 unreachable=0 failed=0
slave03 : ok=21 changed=15 unreachable=0 failed=0
As you can see, each node had 21 total tasks performed. Of those, 15 tasks required modifications to bring the node into compliance with the desired configuration state.
Run hortonworks_static.sh
Now that the bootstrap process is complete, we can install HDP. The hortonworks_static.sh script is all you have to run to install HDP. This script essentially runs ansible-playbook -i inventory/static playbooks/hortonworks.yml for you. The script installs the Ambari Server on the last master node in our list. In my case, the last master node is master03. The script also installs the Ambari Agent on all of the nodes. The installation of HDP is performed by submitting a request to the Ambari Server API using an Ambari Blueprint.
This process will typically take 10-15 minutes depending on the size of the instances you selected, the number of master nodes and the list of HDP components you have enabled.
If you forgot to install the specific version of Ansible, you will likely see something similar to the following:
TASK [site facts processing] ***************************************************
fatal: [localhost]: FAILED! => {"failed": true, "msg": "ERROR! The module sitefacts.py dnmemory=\"31.0126953125\" mnmemory=\"31.0126953125\" cores=\"8\" was not found in configured module paths. Additionally, core modules are missing. If this is a checkout, run 'git submodule update --init --recursive' to correct this problem."}
PLAY RECAP *********************************************************************
localhost : ok=4 changed=2 unreachable=0 failed=1
master01 : ok=8 changed=0 unreachable=0 failed=0
master03 : ok=8 changed=0 unreachable=0 failed=0
slave01 : ok=8 changed=0 unreachable=0 failed=0
slave02 : ok=8 changed=0 unreachable=0 failed=0
slave03 : ok=8 changed=0 unreachable=0 failed=0
To resolve this, install the required version with pip install ansible==2.1.3.0 inside your Python virtual environment, then rerun the bash hortonworks_static.sh script.
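In other words, something along these lines (assuming the ansible-hadoop conda environment created earlier):
$ source activate ansible-hadoop   # skip if the environment is already active
$ pip install ansible==2.1.3.0
$ bash hortonworks_static.sh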
The last task of the playbook is to install HDP via an Ambari Blueprint. It is normal to see something similar to the following:
TASK [ambari-server : Create the cluster instance] *****************************
ok: [master03]
TASK [ambari-server : Wait for the cluster to be built] ************************
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (180 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (179 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (178 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (177 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (176 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (175 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (174 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (173 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (172 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (171 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (170 retries left).
Once you see 3-5 of the retry messages, you can access the Ambari interface via your web browser. The default login is admin and the default password is admin . You should see something similar to the following:
Click on the Operations icon that shows 10 operations in progress. You should see something similar to the following:
The installation tasks each take between 400-600 seconds. The start tasks each take between 20-300 seconds. The master servers typically take longer to install and start than the slave servers.
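If you prefer to watch the progress from a terminal instead of the Ambari UI, you can poll the Ambari REST API with curl. This is only a sketch: the cluster name is whatever the blueprint created, admin/admin are the default credentials mentioned above, and request 1 is assumed to be the blueprint install request:
$ curl -s -u admin:admin http://<ambari-server-public-ip>:8080/api/v1/clusters
$ curl -s -u admin:admin http://<ambari-server-public-ip>:8080/api/v1/clusters/<cluster-name>/requests/1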
When everything is running properly, you should see something similar to this:
If you look back at your terminal window, you should see something similar to the following:
ok: [master03]
TASK [ambari-server : Fail if the cluster create task is in an error state] ****
skipping: [master03]
TASK [ambari-server : Change Ambari admin user password] ***********************
skipping: [master03]
TASK [Cleanup the temporary files] *********************************************
changed: [master03] => (item=/tmp/cluster_blueprint)
changed: [master03] => (item=/tmp/cluster_template)
changed: [master03] => (item=/tmp/alert_targets)
ok: [master03] => (item=/tmp/hdprepo)
PLAY RECAP *********************************************************************
localhost : ok=5 changed=3 unreachable=0 failed=0
master01 : ok=8 changed=0 unreachable=0 failed=0
master03 : ok=30 changed=8 unreachable=0 failed=0
slave01 : ok=8 changed=0 unreachable=0 failed=0
slave02 : ok=8 changed=0 unreachable=0 failed=0
slave03 : ok=8 changed=0 unreachable=0 failed=0
Destroy the cluster
You should remember that you will incur AWS costs while the cluster is running. You can either shutdown or terminate the instances. If you want to use the cluster later, then use Ambari to stop all of the services before shutting down the instances.
Review
If you successfully followed along with this tutorial, you should have been able to easily deploy Hortonworks Data Platform 2.5 on AWS using the Ansible playbook. The process to deploy the cluster typically takes 10-20 minutes.
For more information on how the instance types and the number of master nodes impact the installation time, review the Ansible + Hadoop slides I linked at the top of the article.
03-04-2017
06:05 PM
4 Kudos
Objective
This tutorial will walk you through the process of using Ansible, an agent-less automation tool, to create instances on AWS. The Ansible playbook we will use is relatively simple; you can use it as a base to experiment with more advanced features. You can read more about Ansible here: Ansible.
Ansible is written in Python and is installed as a Python module on the control host. The only requirement for the hosts managed by Ansible is the ability to login with SSH. There is no requirement to install any software on the host managed by Ansible.
If you have never used Ansible, you can become more familiar with it by going through some basic tutorials. The following two tutorials are a good starting point:
Automate All Things With Ansible: Part One
Automate All Things With Ansible: Part Two
This tutorial is part 1 of a 2 part series. Part 2 in the series will show you how to use Ansible to deploy Hortonworks Data Platform (HDP) on Amazon Web Services (AWS).
This tutorial was created as a companion to the Ansible + Hadoop talk I gave at the Ansible NOVA Meetup in February 2017. You can find the slides to that talk here: SlideShare
You can get a copy of the playbook from this tutorial here: Github
Prerequisites
You must have an existing AWS account.
You must have access to your AWS Access and Secret keys.
You are responsible for all AWS costs incurred.
Scope
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6 and 10.12.3
Amazon Web Services
Anaconda 4.1.6 (Python 2.7.12)
Ansible 2.0.0.2 and 2.1.3.0
Steps
Create a project directory
You need to create a directory for your Ansible playbook. I prefer to create my project directories in ~/Development.
mkdir ~/Development/ansible-aws
cd ~/Development/ansible-aws
Install Ansible module
If you use the Anaconda version of Python, you already have access to Ansible. If you are not using Anaconda, then you can usually install Ansible using the following command:
pip install ansible
To read more about how to install Ansible: Ansible Installation
Overview of our Ansible playbook
Our playbook is relatively simple. It consists of a single inventory file, single group_vars file and a single playbook file. Here is the layout of the file and directory structure:
+- ansible-aws/
|
+- group_vars/
| +- all
|
+- inventory/
| +- hosts
|
+- playbooks/
| +- ansible-aws.yml
group_vars/all
You can use variables in your playbooks using the {{variable name}} syntax. These variables are populated based on values stored in your variable files. You can explicitly load variable files in your playbooks.
However, all playbooks will automatically load the variables in the group_vars/all variable file. The all variable file is loaded for all hosts regardless of the groups the host may be in. In our playbook, we are placing our AWS configuration values in the all file.
Edit the group_vars/all file. Copy and paste the following text into the file:
aws_access_key: <enter AWS access key>
aws_secret_key: <enter AWS secret key>
key_name: <enter private key file alias name>
aws_region: <enter AWS region>
vpc_id: <enter VPC ID>
ami_id: ami-6d1c2007
instance_type: m4.2xlarge
my_local_cidr_ip: <enter cidr_ip>
aws_access_key : You need to enter your AWS Access key
aws_secret_key : You need to enter your AWS Secret key
key_name : The alias name you gave to the AWS private key which you will use to SSH into the instances. In my case I created a key called ansible .
aws_region : The AWS region where you want to deploy your instances. In my case I am using us-east-1 .
vpc_id : The specific VPC in which you want to place your instances.
ami_id : The specific AMI you want to deploy for your instances. The ami-6d1c2007 AMI is a CentOS 7 image.
instance_type : The type of AWS instance. For deploying Hadoop, I recommend at least m4.2xlarge . A faster alternative is c4.4xlarge .
my_local_cidr_ip : Your local computer's CIDR IP address. This is used for creating the security rules that allow your local computer to access the instances. An example CIDR format is 192.168.1.1/32 . Make sure this is set to your computer's public IP address.
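If you are not sure what your current public IP address is, one quick way to look it up from the terminal (assuming outbound internet access) is the following; append /32 to the address it returns:
$ curl -s https://checkip.amazonaws.com
# e.g. 203.0.113.25, which you would enter as 203.0.113.25/32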
After you have entered your appropriate settings, save the file.
inventory/hosts
Ansible requires a list of known hosts against which playbooks and tasks are run. We will tell Ansible to use a specific host file with the -i inventory/hosts parameter.
Edit the inventory/hosts file. Copy and paste the following text into the file:
[local]
localhost ansible_python_interpreter=/Users/myoung/anaconda/bin/python
[local] : Defines the group the host belongs to. You have the option for a playbook to run against all hosts, a specific group of hosts, or an individual host. This AWS playbook only runs on your local computer. That is because it uses the AWS APIs to communicate with AWS.
localhost : This is the hostname. You can list multiple hosts, 1 per line under each group heading. A host can belong to multiple groups.
ansible_python_interpreter : Optional entry that tells Ansible which specific version of Python to run. Because I am using Anaconda Python, I've included that setting here.
After you have entered your appropriate settings, save the file.
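As an optional sanity check of the inventory (and of the ansible_python_interpreter path), you can run Ansible's built-in ping module against the local group:
$ ansible -i inventory/hosts local -m ping -c local   # localhost should answer with "pong"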
playbooks/ansible-aws.yml
The playbook is where we define the list of tasks we want to perform. Our playbook will consist of 2 tasks. The first task is to create a specific AWS Security Group. The second task is to create a specific configuration of 6 instances on AWS.
Edit the file playbooks/ansible-aws.yml . Copy and paste the following text into the file:
---
# Basic provisioning example
- name: Create AWS resources
  hosts: localhost
  connection: local
  gather_facts: False
  tasks:
    - name: Create a security group
      ec2_group:
        name: ansible
        description: "Ansible Security Group"
        region: "{{aws_region}}"
        vpc_id: "{{vpc_id}}"
        aws_access_key: "{{aws_access_key}}"
        aws_secret_key: "{{aws_secret_key}}"
        rules:
          - proto: all
            cidr_ip: "{{my_local_cidr_ip}}"
          - proto: all
            group_name: ansible
        rules_egress:
          - proto: all
            cidr_ip: 0.0.0.0/0
      register: firewall

    - name: Create an EC2 instance
      ec2:
        aws_access_key: "{{aws_access_key}}"
        aws_secret_key: "{{aws_secret_key}}"
        key_name: "{{key_name}}"
        region: "{{aws_region}}"
        group_id: "{{firewall.group_id}}"
        instance_type: "{{instance_type}}"
        image: "{{ami_id}}"
        wait: yes
        volumes:
          - device_name: /dev/sda1
            volume_type: gp2
            volume_size: 100
            delete_on_termination: true
        exact_count: 6
        count_tag:
          Name: aws-demo
        instance_tags:
          Name: aws-demo
      register: ec2
This playbook uses the Ansible ec2 and ec2_group modules. You can read more about the options available to those modules here:
ec2
ec2_group
The task to create the EC2 security group creates a group named ansible . It defines 2 ingress rules and 1 egress rule for that security group. The first ingress rule allows all inbound traffic from your local computer's IP address. The second ingress rule allows all inbound traffic from any host in the ansible security group. The egress rule allows all traffic out from all of the hosts.
The task to create the EC2 instances creates 6 hosts because of the exact_count setting. It creates a tag called aws-demo on each of the instances and uses that tag to determine how many hosts exist. You can choose to use a smaller number of hosts.
You can specify volumes to mount on each of the instances. The default volume size is 8 GB and is too small for deploying Hadoop later. I recommend setting the size to at least 100 GB as above. I also recommend you set delete_on_termination to true . This will tell AWS to delete the storage after you have deleted the instances. If you do not do this, then storage will be kept and you will be charged for it.
After you have entered your appropriate settings, save the file.
Running the Ansible playbook
Now that our 3 files have been created and saved with the appropriate settings, we can run the playbook. To run the playbook, you use the ansible-playbook -i inventory/hosts playbooks/ansible-aws.yml command. You should see something similar to the following:
$ ansible-playbook -i inventory/hosts playbooks/ansible-aws.yml
PLAY [Create AWS resources] ****************************************************
TASK [Create a security group] *************************************************
changed: [localhost]
TASK [Create an EC2 instance] **************************************************
changed: [localhost]
PLAY RECAP *********************************************************************
localhost : ok=2 changed=2 unreachable=0 failed=0
The changed lines indicate that Ansible found a configuration that needed to be modified to be consistent with our requested state. For the security group task, you would see this if your security group didn't exist or if you had a different set of ingress or egress rules. For the instance task, you would see this if there were fewer or more than 6 hosts tagged as aws-demo .
Check AWS console.
If you check your AWS console, you should be able to confirm the instances are created. You should see something similar to the following:
Review
If you successfully followed along with this tutorial, you have created a simple Ansible playbook with 2 tasks using the ec2 and ec2_group Ansible modules. The playbook creates an AWS security group and instances which can be used later for deploying HDP on AWS.
02-28-2017
04:40 AM
3 Kudos
Objective
This tutorial is designed to walk you through the process of creating a MiniFi flow to read data from a Sense HAT sensor on a Raspberry Pi 3. The MiniFi flow will push data to a remote NiFi instance running on your computer. The NiFi instance will push the data to Solr.
While there are other tutorials and examples of using NiFi/MiniFi with a Raspberry Pi, most of those tutorials tend to use a more complicated sensor implementation. The Sense HAT is very easy to install and use.
Prerequisites
You should have a Raspberry Pi 3 Model B: Raspberry Pi 3 Model B. I recommend a 16+GB SD card for your Raspberry Pi 3. Don't forget to expand the filesystem after the OS is installed: raspi-config
You should have a Sense HAT: Sense HAT. You should already have installed the Sense HAT on your Raspberry Pi 3.
You should already have installed Raspbian Jessie Lite on your Raspberry Pi 3 SD card: Raspbian Jessie Lite. The instructions for installing a Raspberry Pi OS can be found here: Raspberry PI OS Install. You may be able to use the NOOBS operating system that typically ships with the Raspberry Pi. However, the Raspbian Lite OS will ensure the most system resources are available to MiniFi or NiFi.
You should have enabled SSH on your Raspberry Pi: Enable SSH
You should have enabled WiFi on your Raspberry Pi (or use wired networking): Setup WiFi
You should have NiFi 1.x installed and working on your computer: NiFi
You should have the Java MiniFi Toolkit 0.1.0 installed and working on your computer: MiniFi ToolKit
You should have downloaded Solr 6.x on your computer: Solr Download
Scope
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6 and 10.12.3
MiniFi 1.0.2.1.1.0-2.1
MiniFi Toolkit 0.1.0
NiFi 1.1.1
Solr 6.4.1
Java JDK 1.8
Steps
Connect to Raspberry Pi using SSH
If you have completed all of the prerequisites, then you should be able to easily SSH into your Raspberry Pi. On my Mac, I connect using:
ssh pi@raspberrypi
The default username is pi and the password is raspberry .
If you get an unknown host or DNS error, then you need to specify the IP address of the Raspberry Pi. You can get that by logging directly into the Raspberry Pi console.
Now run the ifconfig command.
You should see something similar to the following:
pi@raspberrypi:~ $ ifconfig
eth0 Link encap:Ethernet HWaddr b8:27:eb:60:ff:5b
inet6 addr: fe80::ec95:e79b:3679:5159/64 Scope:Link
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
wlan0 Link encap:Ethernet HWaddr b8:27:eb:35:aa:0e
inet addr:192.168.1.204 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::21f6:bf0f:5f9f:d60d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:17280 errors:0 dropped:11506 overruns:0 frame:0
TX packets:872 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3414755 (3.2 MiB) TX bytes:133472 (130.3 KiB)
If you are using WiFi, then look at the wlan0 device. If you are using wired ethernet, then look at the eth0 device. Now you can connect using the IP address you found:
ssh pi@192.168.1.204
Your IP address will vary.
Update Raspberry Pi packages
It's always a good idea to ensure your installed packages are up to date. Raspbian Lite is based on Debian. Therefore you need to use apt-get to update and install packages.
First, we need to run sudo apt-get update to update the list of available packages and versions. You should see something similar to the following:
pi@raspberrypi:~ $ sudo apt-get update
Get:1 http://mirrordirector.raspbian.org jessie InRelease [14.9 kB]
Get:2 http://archive.raspberrypi.org jessie InRelease [22.9 kB]
Get:3 http://mirrordirector.raspbian.org jessie/main armhf Packages [8,981 kB]
Get:4 http://archive.raspberrypi.org jessie/main armhf Packages [145 kB]
Get:5 http://archive.raspberrypi.org jessie/ui armhf Packages [57.6 kB]
Get:6 http://mirrordirector.raspbian.org jessie/contrib armhf Packages [37.5 kB]
Get:7 http://mirrordirector.raspbian.org jessie/non-free armhf Packages [70.3 kB]
Get:8 http://mirrordirector.raspbian.org jessie/rpi armhf Packages [1,356 B]
Ign http://archive.raspberrypi.org jessie/main Translation-en_US
Ign http://archive.raspberrypi.org jessie/main Translation-en
Ign http://archive.raspberrypi.org jessie/ui Translation-en_US
Ign http://archive.raspberrypi.org jessie/ui Translation-en
Ign http://mirrordirector.raspbian.org jessie/contrib Translation-en_US
Ign http://mirrordirector.raspbian.org jessie/contrib Translation-en
Ign http://mirrordirector.raspbian.org jessie/main Translation-en_US
Ign http://mirrordirector.raspbian.org jessie/main Translation-en
Ign http://mirrordirector.raspbian.org jessie/non-free Translation-en_US
Ign http://mirrordirector.raspbian.org jessie/non-free Translation-en
Ign http://mirrordirector.raspbian.org jessie/rpi Translation-en_US
Ign http://mirrordirector.raspbian.org jessie/rpi Translation-en
Fetched 9,330 kB in 17s (542 kB/s)
Reading package lists... Done
Now we can update our installed packages using sudo apt-get dist-upgrade . You should see something similar to the following:
pi@raspberrypi:~ $ sudo apt-get dist-upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
The following packages will be upgraded:
bind9-host libbind9-90 libdns-export100 libdns100 libevent-2.0-5 libirs-export91 libisc-export95 libisc95 libisccc90
libisccfg-export90 libisccfg90 libjasper1 liblwres90 libpam-modules libpam-modules-bin libpam-runtime libpam0g login
passwd pi-bluetooth raspberrypi-sys-mods raspi-config vim-common vim-tiny
24 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 4,767 kB of archives.
After this operation, 723 kB disk space will be freed.
Do you want to continue? [Y/n] y
The list of packages and versions that need to be updated will vary. Enter y to update the installed packages.
Install additional Raspberry Pi packages
We need to install additional packages to interact with the Sense HAT sensor and run MiniFi.
You access the Sense HAT libraries using Python. Therefore the first package we need to install is Python.
sudo apt-get install python
The second package we need to install is the libraries for the Sense HAT device.
sudo apt-get install sense-hat
We will be using the Java version of MiniFi. Therefore the third package we need to install is the Oracle JDK 8.
sudo apt-get install oracle-java8-jdk
Verify Sense HAT functionality
Before we use MiniFi to collect any data, we need to ensure we can interact with the Sense HAT sensor. We will create a simple Python script to display a message on our Sense HAT.
Edit the file display_message.py using vi display_message.py . Now copy and paste the following text into your text editor (remember to go into insert mode first):
from sense_hat import SenseHat
sense = SenseHat()
sense.show_message("Hello")
Save the script using :wq! . Run this script using python display_message.py . You should see the word Hello scroll across the display of the Sense HAT in white text.
Now let's test reading the temperature from the Sense HAT. Edit the file get_temp.py using vi get_temp.py . Now copy and paste the following text into your text editor (remember to go into insert mode first):
from sense_hat import SenseHat
sense = SenseHat()
t = sense.get_temperature()
print('Temperature = {0:0.2f} C'.format(t))
Save the script using :wq! . Run the script using python get_temp.py . You should see something similar to the following (your values will vary):
pi@raspberrypi:~ $ python get_temp.py
Temperature = 31.58 C
For our MiniFi use case, we will be looking at temperature, pressure, and humidity data. We will not use the Sense HAT display for MiniFi, so we'll only print the data to the console.
You can read more about the Sense HAT functions here:
Sense HAT API
Now let's create a script which prints all 3 sensor values. Edit the file get_environment.py using vi get_environment.py . Copy and paste the following text into your text editor (remember to go into insert mode first):
from sense_hat import SenseHat
import datetime
sense = SenseHat()
t = sense.get_temperature()
p = sense.get_pressure()
h = sense.get_humidity()
print('Hostname = raspberrypi')
print('DateTime = ' + datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))
print('Temperature = {0:0.2f} C'.format(t))
print('Pressure = {0:0.2f} Millibars'.format(p))
print('Humidity = {0:0.2f} %rH'.format(h))
Save the script using :wq! . Run the script using python get_environment.py . You should see something similar to the following (your values will vary):
Hostname = raspberrypi
DateTime = 2017-02-27T21:20:55Z
Temperature = 32.90 C
Pressure = 1026.53 Millibars
Humidity = 25.36 %rH
As you can see from the script, we are printing our date output using UTC time via the utcnow() function. We also need to ensure the date format is consumable by Solr. That is why we are using %Y-%m-%dT%H:%M:%SZ , which is a format Solr can parse.
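You can produce the same timestamp format from the shell, which is handy for comparing against what the script prints (the -u flag gives UTC):
$ date -u +"%Y-%m-%dT%H:%M:%SZ"
# e.g. 2017-02-27T21:20:55Z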
Our MiniFi flow will use the ExecuteProcess processor to run the script. So we need to create a simple bash script to run the get_environment.py file. Edit the file get_environment.sh using vi get_environment.sh . Copy and paste the following text into your text editor (remember to go into insert mode first):
python /home/pi/get_environment.py
Save the script using :wq! . Make sure the script is executable by running chmod 755 get_environment.sh . Let's make sure the bash script works OK. Run the script using ./get_environment.sh . You should see something similar to the following (your values will vary):
Hostname = raspberrypi
DateTime = 2017-02-27T21:20:55Z
Temperature = 32.90 C
Pressure = 1026.53 Millibars
Humidity = 25.36 %rH
Install MiniFi
We are going to install MiniFi on the Raspberry Pi. First download the MiniFi release:
wget http://public-repo-1.hortonworks.com/HDF/2.1.1.0/minifi-1.0.2.1.1.0-2-bin.tar.gz
Now you can extract it using tar xvfz minifi-1.0.2.1.1.0-2-bin.tar.gz .
Now we are ready to create our NiFi and MiniFi flows.
Start NiFi
On your computer (not on the Raspberry Pi), start NiFi if you have not already done so. You do this by running <nifi installation dir>/bin/nifi.sh start . It may take a few minutes before NiFi is fully started. You can monitor the logs by running tail -f <nifi installation dir>/logs/nifi-app.log .
You should see something similar to the following when the UI is ready:
2017-02-26 14:10:01,199 INFO [main] org.eclipse.jetty.server.Server Started @40057ms
2017-02-26 14:10:01,695 INFO [main] org.apache.nifi.web.server.JettyServer NiFi has started. The UI is available at the following URLs:
2017-02-26 14:10:01,695 INFO [main] org.apache.nifi.web.server.JettyServer http://127.0.0.1:9091/nifi
2017-02-26 14:10:01,695 INFO [main] org.apache.nifi.web.server.JettyServer http://192.168.1.186:9091/nifi
2017-02-26 14:10:01,697 INFO [main] org.apache.nifi.BootstrapListener Successfully initiated communication with Bootstrap
2017-02-26 14:10:01,697 INFO [main] org.apache.nifi.NiFi Controller initialization took 11161419754 nanoseconds.
Now you should be able to access NiFi in your browser by going to <hostname>:8080/nifi . The default port is 8080 . If you have a port conflict, you can change the port.
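The HTTP port is controlled by the nifi.web.http.port property in conf/nifi.properties ; the log excerpt above shows an instance that was moved to port 9091. A rough sketch of checking and changing it:
$ grep 'nifi.web.http.port=' <nifi installation dir>/conf/nifi.properties
nifi.web.http.port=8080
# edit that value (for example to 9091), then restart NiFi
$ <nifi installation dir>/bin/nifi.sh restart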
You should see a blank NiFi canvas similar to the following:
NiFi Blank Canvas
Setup Solr
Before we start on our NiFi flow, let's make sure Solr is running. We are going to use schemaless mode. You can easily start Solr using solr -e schemaless .
You should see something similar to the following:
$ bin/solr -e schemaless
Creating Solr home directory /Users/myoung/Downloads/solr-6.4.1/example/schemaless/solr
Starting up Solr on port 8983 using command:
bin/solr start -p 8983 -s "example/schemaless/solr"
Waiting up to 180 seconds to see Solr running on port 8983 [\]
Started Solr server on port 8983 (pid=49659). Happy searching!
Copying configuration to new core instance directory:
/Users/myoung/Downloads/solr-6.4.1/example/schemaless/solr/gettingstarted
Creating new core 'gettingstarted' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=gettingstarted&instanceDir=gettingstarted
{
"responseHeader":{
"status":0,
"QTime":1371},
"core":"gettingstarted"}
Solr schemaless example launched successfully. Direct your Web browser to http://localhost:8983/solr to visit the Solr Admin UI
As you can see, Solr created a collection called gettingstarted . That is the name of the collection our NiFi PutSolrContentStream will use.
Create NiFi flow
Now we need to create our NiFi flow that will receive data from MiniFi.
Input Port
The MiniFi flow will send data to a Remote Process Group . The Remote Process Group requires an Input Port . From the NiFi menu, drag the Input Port icon to the canvas.
In the Add Port dialog that is displayed, type a name for your port. I used From Raspberry Pi . You should see something similar to the following:
Click the blue ADD button.
ExtractText
From the NiFi menu, drag the Processor icon to the canvas. In the Filter box, enter extract . You should see something similar to the following:
Select the ExtractText processor. Click on the blue ADD button to add the processor to the canvas.
Now we need to configure the ExtractText processor. Right click on the processor and select the Configure menu option.
On the SETTINGS tab of the ExtractText processor, you should check the unmatched box under Automatically Terminate Relationships . This will drop any records which we fail to extract text from. You should see something similar to the following:
On the PROPERTIES tab of the ExtractText processor, there are a few changes we need to make.
First, we want to set Enable Multiline Mode to true . This allows the Regular Expressions to match across multiple lines. This is important because our data is coming in as multiline data.
Second, we want to set Include Capture Group 0 to false . Each Regular Expression we are using has only a single group. If we left this value set to true, each field we extract would have a duplicate value which would go unused as <attribute name>.0 .
Third, we need to add additional fields to the processor, which allows us to define our Regular Expressions. If you click the + icon in the upper right corner of the dialog, you should see something similar to the following:
We are going to add a property called hostname . This will hold the value from the line Hostname = in the data. Click the blue OK button. Now you should see another dialog where you enter the regular expression. You should see something similar to the following:
Enter the following Regular Expression:
Hostname = (\w+)
We need to repeat this process for each of the other data elements coming from the Raspberry Pi. You should have the following extra properties defined as separate fields (a quick way to sanity-check these expressions is shown after the list):
property: hostname
value: Hostname = (\w+)
property: datetime
value: DateTime = (\d{4}\-\d{2}\-\d{2}T\d{2}\:\d{2}:\d{2}Z)
property: temperature
value: Temperature = (\d+\.\d+) C
property: humidity
value: Humidity = (\d+\.\d+) %rH
property: pressure
value: Pressure = (\d+\.\d+) Millibars
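If you want to sanity-check these expressions on the Raspberry Pi before wiring up the flow, you can run rough equivalents against the script output with grep -E (grep's extended syntax uses [0-9] where the NiFi properties use \d; sample.txt is just a throwaway file name):
$ ./get_environment.sh > sample.txt
$ grep -E 'Temperature = [0-9]+\.[0-9]+ C' sample.txt
$ grep -E 'DateTime = [0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z' sample.txt
# each grep should print the matching line; no output means the pattern did not match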
When you have entered each of these properties, you should see something similar to the following:
Click the blue APPLY button to save the changes.
AttributesToJSON
From the NiFi menu, drag the Processor icon to the canvas. In the Filter box, enter attributes . You should see something similar to the following:
Select the AttributesToJSON processor. Click on the blue ADD button to add the processor to the canvas.
Now we need to configure the AttributesToJSON processor. Right click on the processor and select the Configure menu option.
On the PROPERTIES tab of the AttributesToJSON processor, there are a few changes we need to make.
For the Attributes List property, we need to provide a comma-separated list of attributes we want the processor to pass on. Click inside the Value box next to Attributes List . Enter the following value:
hostname,datetime,temperature,pressure,humidity
For the Destination property, set the value to flowfile-content . We need the values to be in the flowfile content itself as JSON, which is needed by the PutSolrContentStream processor. Otherwise the flowfile content will contain the raw data (not JSON) coming from the Raspberry Pi. This will cause Solr to throw errors because it is not able to parse the request.
You should see something similar to the following:
Click the blue APPLY button to save the changes.
PutSolrContentStream
From the NiFi menu, drag the Processor icon to the canvas. In the Filter box, enter solr . You should see something similar to the following:
Select the PutSolrContentStream processor. Click on the blue ADD button to add the processor to the canvas.
Now we need to configure the PutSolrContentStream processor. Right click on the processor and select the Configure menu option.
On the SETTINGS tab of the PutSolrContentStream processor, you should check the connection_failure , failure , and success boxes under Automatically Terminate Relationships . Since this is the end of the flow, we can terminate everything. You could expand on this by retrying failures, or logging errors to a text file.
You should see something similar to the following:
On the PROPERTIES tab of the PutSolrContentStream processor, we need to make a few changes.
Set the Solr Type property to Standard . We don't need to run SolrCloud for our demo.
Set the Solr Location to http://192.168.1.186:8983/solr/gettingstarted . You should use the IP address of your computer. When we start Solr up, we'll be using the gettingstarted collection, so it's part of the URL. If we were using SolrCloud, we would put the collection name in the Collection property instead.
The first set of properties should look similar to the following:
Now we need to add fields for indexing in Solr. Click the + icon in the upper right corner of the processor. The Add Property dialog will be displayed. For the first field, enter f.1 and click the ADD button. For the value enter hostname_s:/hostname . The hostname_s part of the value says to store the content in the Solr field called hostname_s , which uses the dynamic schema to treat this field as a string. The /hostname part of the value says to pull the value from the root of the JSON where the JSON node is called hostname .
We need to repeat this process for each of the other data elements coming from the Raspberry Pi. You should have the following fields defined as separate fields:
property: f.1
value: hostname_s:/hostname
property: f.2
value: timestamp_dts:/datetime
property: f.3
value: temperature_f:/temperature
property: f.4
value: pressure_f:/pressure
property: f.5
value: humidity_f:/humidity
Click the blue APPLY button to save the changes.
Connect Processors
Now that we have our processors on the canvas, we need to connect them. Drag the connection icon from the Input Port to the ExtractText processor.
Drag the connection icon from the ExtractText processor to the AttributesToJSON processor.
Drag the connection icon from the AttributesToJSON processor to the PutSolrContentStream processor.
You should have something that looks similar to the following:
Create MiniFi flow
Now we can create our MiniFi flow. ExecuteProcess
The first thing we need to do is add a processor to execute the bash script we created on the Raspberry Pi.
Drag the Processor icon to the canvas. Enter execute in the Filter box. You should see something similar to the following:
Select the ExecuteProcess processor. Click on the blue ADD button to add the processor to the canvas.
Now we need to configure the ExecuteProcess processor. Right click on the processor and select the Configure menu option.
On the SETTINGS tab you should check the success box under Automatically Terminate Relationships . You should see something similar to the following:
On the Scheduling tab we want to set the Run Schedule to 5 sec . This will run the processor every 5 seconds. You should see something similar to the following:
On the Properties tab we want to set the Command to /home/pi/get_environment.sh . This assumes you created the scripts in the /home/pi directory on the Raspberry Pi.
Click the blue APPLY button to save the changes.
Remote Process Group
Now we need to add a Remote Process Group to our canvas. This is how the MiniFi flow is able to send data to NiFi. Drag the Remote Process Group icon to the canvas.
For the URL enter the URL you use to access your NiFi UI. In my case that is http://192.168.1.186:9090/nifi . Remember the default port for NiFi is 8080 . For the Transport Protocol select HTTP . You can leave the other settings as defaults. You should see something similar to the following:
Click the blue ADD button to add the Remote Process Group to the canvas.
Create Connection
Now we need to create a connection between our ExecuteProcess processor and our Remote Process Group on the canvas.
Hover your mouse over the ExecuteProcess processor. Click on the circle arrow icon and drag from the processor to the Remote Process Group .
Save Template
We need to save the MiniFi portion of the flow as a template. Select the ExecuteProcess , the Remote Process Group and the connection between them, using the shift key to allow multi-select.
Click on the Create Template icon (second icon from the right on the top row) in the Operate Box on the canvas. It looks like the following:
The Create Template dialog will be displayed. Give your template a name; I used raspberrypi. Click the blue CREATE button.
Now click on the main NiFi menu button in the upper right corner of the UI. You should see something like the following:
Now click the Templates option. This will open the NiFi Templates dialog. You will see a list of templates you have created. You should see something similar to the following:
Now find the template you just created and click on the Download button on the right hand side. This will save a copy of the flowfile in XML format on your local computer.
Convert NiFi Flow to MiniFi Flow
We need to convert the xml flowfile NiFi generated into a yml file that MiniFi uses. We will be using the minifi-toolkit to do this.
We need to run the minifi-toolkit transform command. The first option is the location of the NiFi flowfile you downloaded. The second option is the location to write the MiniFi flowfile to. MiniFi expects the flowfile name to be config.yml .
Run the transform command. You should see something similar to the following:
$ /Users/myoung/Downloads/minifi-toolkit-0.1.0/bin/config.sh transform ~/Downloads/raspberry.xml ~/Downloads/config.yml
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
MiNiFi Toolkit home: /Users/myoung/Downloads/minifi-toolkit-0.1.0
No validation errors found in converted configuration.
Copy MiniFi Flow to Raspberry Pi
Now we need to copy the flowfile to the Raspberry Pi. You can easily do that using the scp command. The config.yml file we generated needs to go in the /home/pi/minifi-1.0.2.1.1.0-2/conf/ directory.
You should see something similar to the following:
$ scp ~/Downloads/minifi.yml pi@raspberrypi:/home/pi/minifi-1.0.2.1.1.0-2/conf/config.yml
pi@raspberrypi's password:
minifi.yml 100% 1962 721.1KB/s 00:00
Start MiniFi
Now that the flowfile is in place, we can start MiniFi. You do that using the minifi.sh script with the start option. Remember that MiniFi will be running on the Raspberry Pi, not on your computer.
You should see something similar to the following:
$ /home/pi/minifi-1.0.2.1.1.0-2/minifi.sh start
minifi.sh: JAVA_HOME not set; results may vary
Bootstrap Classpath: /home/pi/minifi-1.0.2.1.1.0-2/conf:/home/pi/minifi-1.0.2.1.1.0-2/lib/bootstrap/*:/home/pi/minifi-1.0.2.1.1.0-2/lib/*
Java home:
MiNiFi home: /home/pi/minifi-1.0.2.1.1.0-2
Bootstrap Config File: /home/pi/minifi-1.0.2.1.1.0-2/conf/bootstrap.conf
Now MiniFi should be running on your Raspberry Pi. If you run into any issues, look at the logs in <minifi directory>/logs/minifi-app.log.
Start NiFi flow
Now that everything else is in place, we should be able to start our NiFi flow. Start the 4 NiFi processors, not the two MiniFi parts of the flow. If everything is working properly, you should start seeing records in Solr.
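One quick way to confirm documents are arriving is to query the gettingstarted collection directly. This is just a sketch; adjust the host and port if your Solr instance differs from the one started earlier:
$ curl -s 'http://localhost:8983/solr/gettingstarted/select?q=*:*&rows=5'
# the response should show numFound increasing as new sensor readings are indexed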
Dashboard
You can easily add Banana to Solr to create a dashboard. Here is an example:
Review
If you successfully followed along with this tutorial, you should have MiniFi collecting data from your Sense HAT sensor on your Raspberry Pi. The MiniFi flow should be sending that data to NiFi on your computer, which then sends the data to Solr.
02-24-2017
12:43 PM
@Mourad Chahri That is not currently a supported configuration. There are common components in HDF and HDP (Kafka, Storm, Ranger, to name a few). If you attempt to install both stacks on the same cluster of servers, you will experience conflicts.
02-24-2017
12:30 PM
@Mourad Chahri You cannot currently install HDP and HDF on the same cluster because there are conflicts with Ambari and the service configurations. You must install HDF and HDP on different servers.
02-24-2017
12:23 PM
@suresh krish Is this in a Sandbox or another cluster? What version of HDP are you using? In the middle of your log output you can see that Ranger is attempting to use the SolrCloudCli.sh to create the Ranger audit collection. That process fails as it isn't able to see any live Solr servers. Do you have the Ambari-Infra service running? That service is the Solr that Ranger uses by default.
02-24-2017
12:17 PM
1 Kudo
@rahul gulati See this article on best practices for deploying HDP on Azure: https://community.hortonworks.com/articles/22376/recommendations-for-microsoft-azure-hdp-deployment-1.html For most production clusters, we typically recommend enabling HA for services. That requires that you have at a minimum 2 master servers, although 3 would be better. You need 3 Zookeeper instances. While you can put Zookeeper on the data nodes, it would be better to put Zookeeper on the master nodes.