Community Articles



What we’re doing:

  1. Committing to <2 hours setup time
  2. Deploying a matched pair of 3 node HDF clusters into two different regions on AWS EC2
  3. Configuring some basic performance optimisations in EC2/NiFi
  4. Setting up site-to-site to work with EC2's public/private FQDNs

What you will need:

  1. AWS account with access / credits to deploy 3x EC2 machines in each of two different regions
  2. Approx. 80 GB of SSD disk per machine, for 480 GB total across the 6 machines
  3. 3x Elastic IPs per region
  4. (Preferable) iTerm or similar ssh client with broadcast capability
  5. You should use two different AWS regions for your clusters, I’ll refer to them as regionA and regionB


Notes:

  1. We’re going to use a script to set up OS prerequisites and deploy HDF on the server nodes to save time, as this article is specific to AWS EC2 setup rather than generic HDF deployment
  2. We’re not setting up HDF Security, though we will restrict access to our nodes to specific IPs
  3. This is not best practice for all use cases, particularly on security; you are advised to take what you learn here and apply it intelligently to your own environment needs and use cases
  4. You don't have to use Elastic IPs, but they'll persist through environment reboots and therefore prevent FQDN changes causing your services to need reconfiguration


Part 1: EC2 Public IPs, Instance & OS setup, and HDF packages deployment

Create Public IPs

  1. Login to your AWS:EC2 account and select regionA
  2. Select the ‘Elastic IPs’ interface
  3. Allocate 3x new addresses, note them down
  4. Switch to regionB and repeat
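If you prefer the AWS CLI, the allocations above can be sketched as a dry run; the region names below are hypothetical placeholders, and removing the leading ‘echo’ would perform the real allocation:

```shell
#!/bin/sh
# Hypothetical region names -- substitute the two regions you chose.
REGION_A=us-east-1
REGION_B=eu-west-1

# Three Elastic IPs per region, matching the manual steps above.
for region in "$REGION_A" "$REGION_B"; do
  for i in 1 2 3; do
    # Dry run: prints each command; drop 'echo' to allocate for real.
    echo aws ec2 allocate-address --domain vpc --region "$region"
  done
done
```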

Launch EC2 instances

  1. Run the ‘Launch Instance’ wizard
  2. From ‘AWS Marketplace’, select ‘CentOS 7 (x86_64) - with Updates HVM’
  3. Select ‘t2.xlarge’ (4 vCPU, 16 GB RAM)
    1. This is the minimum that will reasonably run the HDF stack, choose bigger if you prefer.
    2. Same for Node count and disk size below
  4. Set Number of Instances to ‘3’
  5. Set root volume to 20 GB General Purpose SSD
    1. Add 3x New Volumes of the same configuration
  6. Set the ‘Name’ tag to something meaningful
  7. Create a new security group called ‘HDF Performance Test’
    1. Set a rule to allow ‘All traffic’ to ‘My IP’
    2. Add 3x new rules to allow ‘All Traffic’ to the Elastic IPs in regionB (the other region) that you created earlier
    3. Add the local subnet for internode communication


    4. Optional: You could create rules for the specific ports required, at the cost of a much longer configuration. Ports 22 (ssh), 8080 (Ambari), 9090-9092 (NiFi) should be sufficient for a no-security install
  8. Review your configuration options, and hit Launch
  9. Select your ssh key preference; either use an existing key or create a new one and download it
  10. Once the launcher completes, go to Elastic IPs
  11. Associate each Instance with an Elastic IP
  12. Note down the Private and Public DNS (FQDN) for each instance; the Public DNS should contain values matching the Elastic IPs you allocated
  13. While the deployment finishes, repeat steps 1–12 in the other region to create your matching set of EC2 instances
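The launch steps above can also be sketched with the AWS CLI as a dry run; the AMI ID, key pair name, and security group ID below are hypothetical placeholders for your own values:

```shell
#!/bin/sh
# Hypothetical IDs -- substitute your CentOS 7 AMI, your key pair, and the
# 'HDF Performance Test' security group created in the wizard.
AMI_ID=ami-0123456789abcdef0
KEY_NAME=hdf-perf-test
SG_ID=sg-0123456789abcdef0

# Dry run: prints the launch command; drop 'echo' to launch for real.
# 3 instances, each with a 20 GB gp2 root volume plus 3x 20 GB gp2 data volumes.
echo aws ec2 run-instances \
  --image-id "$AMI_ID" \
  --instance-type t2.xlarge \
  --count 3 \
  --key-name "$KEY_NAME" \
  --security-group-ids "$SG_ID" \
  --block-device-mappings \
    'DeviceName=/dev/sda1,Ebs={VolumeSize=20,VolumeType=gp2}' \
    'DeviceName=/dev/xvdb,Ebs={VolumeSize=20,VolumeType=gp2}' \
    'DeviceName=/dev/xvdc,Ebs={VolumeSize=20,VolumeType=gp2}' \
    'DeviceName=/dev/xvdd,Ebs={VolumeSize=20,VolumeType=gp2}'
```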

OS Setup and package installs

  1. (optional) Launch iTerm, and open a broadcast tab for every node
    1. Otherwise issue commands to each node sequentially (much slower)
    2. Run the following as root on every node to update the OS, format the three new EBS volumes, create mount points for the NiFi repositories, mount them, and persist the mounts in /etc/fstab:
    yum update -y
    mkfs -t ext4 /dev/xvdb
    mkfs -t ext4 /dev/xvdc
    mkfs -t ext4 /dev/xvdd
    mkdir /mnt/nifi_content
    mkdir /mnt/nifi_flowfile
    mkdir /mnt/nifi_prov
    mount /dev/xvdb /mnt/nifi_content/
    mount /dev/xvdc /mnt/nifi_flowfile/
    mount /dev/xvdd /mnt/nifi_prov/
    echo "/dev/xvdb /mnt/nifi_content ext4 errors=remount-ro 0 1" >> /etc/fstab
    echo "/dev/xvdc /mnt/nifi_flowfile ext4 errors=remount-ro 0 1" >> /etc/fstab
    echo "/dev/xvdd /mnt/nifi_prov ext4 errors=remount-ro 0 1" >> /etc/fstab
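Once the commands above have run, a quick sanity check on each node confirms all three repository mounts are live (a sketch; the directory names match those created above):

```shell
#!/bin/sh
# Verify that each NiFi repository directory is an active mount point.
for d in /mnt/nifi_content /mnt/nifi_flowfile /mnt/nifi_prov; do
  if mountpoint -q "$d" 2>/dev/null; then
    echo "OK: $d is mounted"
  else
    echo "MISSING: $d is not mounted"
  fi
done
```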
  2. We’re going to use a script to do a default install of Ambari for HDF, as we’re interested in looking at NiFi rather than overall HDF setup
    1. Follow steps 1–3 in this guide, reproduced here for convenience
    2. Tip: running ‘hostname -f’ in your ssh session will print the FQDN of the current host for convenience
  3. Run these commands as root on the first node, which I assume will be running Ambari server (substitute the management pack URL and bootstrap script URL for your HDF version, as given in the guide above)
export hdf_ambari_mpack_url=<HDF Ambari management pack URL>
yum install -y git python-argparse
git clone <ambari-bootstrap repository URL>
export install_ambari_server=true
curl -sSL <ambari-bootstrap script URL> | sudo -E sh
ambari-server install-mpack --mpack=${hdf_ambari_mpack_url} --purge --verbose  # enter 'yes' to purge at prompt
ambari-server restart

  4. Assuming Ambari will run on the first node in the cluster, run these commands as root on every other node (again substituting the bootstrap script URL)
export ambari_server=<FQDN of host where ambari-server will be installed>
export install_ambari_server=false
curl -sSL <ambari-bootstrap script URL> | sudo -E sh
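Without a broadcast terminal, the per-node commands above can be scripted from your workstation over ssh; this dry-run sketch uses hypothetical FQDNs and keeps the bootstrap URL as a placeholder:

```shell
#!/bin/sh
# Hypothetical hosts -- substitute your Ambari node and the other nodes' FQDNs.
AMBARI_FQDN=ip-10-0-0-1.ec2.internal
WORKERS="ip-10-0-0-2.ec2.internal ip-10-0-0-3.ec2.internal"
BOOTSTRAP_URL="<ambari-bootstrap script URL>"

for host in $WORKERS; do
  # Dry run: prints each remote command; drop 'echo' to execute over ssh.
  echo ssh "centos@$host" \
    "export ambari_server=$AMBARI_FQDN; export install_ambari_server=false; curl -sSL $BOOTSTRAP_URL | sudo -E sh"
done
```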

Part 2: Deploy and Configure multi-node HDF Clusters on AWS EC2

HDF Cluster Deployment

  1. Open a browser to port 8080 of the public FQDN of the Ambari Server
  2. Login using the default credentials of admin/admin
  3. Select ‘Launch Install Wizard’
  4. Name your cluster
  5. Accept the default versions and repos
  6. Fill in these details:
    1. Provide the list of Private FQDNs in the ‘Target Hosts’ panel
    2. Select ‘Perform manual registration on hosts’ and accept the warning
  7. Wait while hosts are confirmed, then hit Next
    1. If this step fails, check you provided the Private FQDNs, and not the Public FQDNs
  8. Select the following services: ZooKeeper, Ambari Infra, Ambari Metrics, NiFi
  9. Service layout
    1. Accept the default Master Assignment
    2. Use the ‘+’ key next to the NiFi row to add NiFi instances until you have one on each Node
  10. Unselect the NiFi Certificate Service and continue
  11. Customize Services
    1. Provide a Grafana Admin Password in the ‘Ambari Metrics’ tab
    2. Provide Encryption Passphrases in the NiFi tab, they must be at least 12 characters
  12. When you hit Next you may get Configuration Warnings from Ambari; resolve any Errors and continue
  13. Hit Deploy and monitor the process
  14. Repeat steps 1 – 13 on the other cluster
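After deployment you can confirm cluster state from the command line via the Ambari REST API; a dry-run sketch with a hypothetical host and cluster name:

```shell
#!/bin/sh
# Hypothetical values -- substitute your Ambari host's public FQDN and cluster name.
AMBARI_HOST=ec2-203-0-113-10.compute-1.amazonaws.com
CLUSTER=hdfcluster

# Dry run: prints the checks; drop 'echo' to query the live server.
# List the clusters this Ambari server manages:
echo curl -s -u admin:admin "http://$AMBARI_HOST:8080/api/v1/clusters"
# Inspect the NiFi service; its state should be STARTED once deployment completes:
echo curl -s -u admin:admin "http://$AMBARI_HOST:8080/api/v1/clusters/$CLUSTER/services/NIFI"
```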

NiFi Service Configuration for multi-node cluster on AWS EC2

  1. Login to the cluster
  2. In the NiFi service panel, go to the Configs tab
  3. Enter ‘repository’ in the top right filter box, and change the following
    1. NiFi content repository default dir = /mnt/nifi_content
    2. NiFi flowfile repository dir = /mnt/nifi_flowfile
    3. NiFi provenance repository default dir = /mnt/nifi_prov
  4. Enter ‘mem’ in the filter box:
    1. Set Initial memory allocation = 2048m
    2. Set Max memory allocation = 8096m
  5. Enter ‘port’ in the filter box:
    1. Note down the NiFi HTTP port (Non-SSL), default is 9090
    2. Set nifi.remote.input.socket.port = 9092
  6. Save your changes
  7. Enter ‘nifi.remote.input.host’ in the filter box:
    1. Note that we must use specific config groups to work around EC2’s NAT configuration
    2. Set nifi.remote.input.host = <Public FQDN of first NiFi Node>
    3. Save this value
    4. Click the ‘+’ icon next to this field to Override the field
    5. Select to create a new NiFi Configuration Group, name it host02
    6. Set nifi.remote.input.host = <Public FQDN of the second NiFi node>
    7. Save this value
    8. Repeat for each NiFi node in the cluster


    9. When all values are set in config groups, go to the ‘Manage Config Groups’ link near the filter box
    10. Select each config group and use the plus key to assign a single host to it. The goal is that each host ends up with its own public FQDN assigned to this parameter


  8. Check your settings and restart the NiFi service
  9. You can watch the NiFi service startup and cluster voting process with the command ‘tail -f /var/log/nifi/nifi-app.log’ in an ssh session on one of the hosts
  10. NiFi is up when the jetty server reports the URLs it is listening on in the log; by default this is http://<public fqdn>:9090/nifi
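To confirm the per-node site-to-site settings took effect, NiFi’s REST API can be queried once the service is up; a dry-run sketch with a hypothetical FQDN (the site-to-site response should advertise the public FQDN and port 9092 you configured for that node):

```shell
#!/bin/sh
# Hypothetical FQDN -- substitute the public FQDN of one of your NiFi nodes.
NIFI_FQDN=ec2-203-0-113-11.compute-1.amazonaws.com

# Dry run: prints the checks; drop 'echo' to query the running node.
# Jetty answering here means the node is up:
echo curl -s "http://$NIFI_FQDN:9090/nifi-api/system-diagnostics"
# Site-to-site details advertised to remote clusters (input host and port):
echo curl -s "http://$NIFI_FQDN:9090/nifi-api/site-to-site"
```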


In this article we have deployed sets of AWS EC2 instances for HDF clusters, prepared and deployed the necessary packages, and set the necessary configuration parameters to allow NiFi SiteToSite to operate behind the AWS EC2 NAT implementation.

In the next article I will outline how to build the Dataflow for generating a small files performance test, and pushing that data efficiently via SiteToSite.

Version history: last updated 08-17-2019 01:48 PM.