Member since: 02-08-2016
Posts: 80
Kudos Received: 88
Solutions: 13

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3966 | 12-15-2018 08:40 PM
 | 3323 | 03-29-2018 10:15 AM
 | 1350 | 02-01-2018 08:08 AM
 | 2120 | 01-24-2018 09:23 PM
 | 1260 | 11-05-2017 03:24 PM
03-24-2017
09:14 AM
If you want to extract the data with HDF back into the dataflow, this is not going to be the best way to do it, but the steps I outlined above would probably work.
With the requirement you describe, I would instead suggest you implement the data movement and conversion sections of your Python script using a combination of the SQL and text processors, run the Python script only for the bits of logic that need it via the ExecuteScript processor, and then use HDF again to write the result to the right location.
Remotely executing a script and pulling back the output is a more brittle process and doesn't leverage the flexibility and power of NiFi; I suspect you would spend more time troubleshooting workarounds than you would using the default processors that achieve the same things.
03-24-2017
08:15 AM
If you use the setup above, put your shell script to be executed on the remote machine in the 'Replacement Value' of the ReplaceText processor. It will bring back everything printed to StdOut.
What is your Python script doing? There may be a better solution we could suggest if we knew the outcome you were trying to achieve.
03-23-2017
06:02 PM
3 Kudos
NiFi would have to decrypt any encrypted data before it can be pulled into an attribute for decision making.
This assumes that all of the data is encrypted and that you are not sending NiFi a payload that combines unencrypted metadata with an encrypted payload, an approach which works very well. You could use EncryptContent or a third-party protection service to encrypt the data within NiFi; this has the benefit that the action is included in the NiFi provenance record.
The data would appear in the NiFi content repository, so you would protect that, typically by placing it on an encrypted volume.
The other repositories would not hold sensitive data unless you copied it there, for example by using ExtractText to write the unencrypted card number to an attribute. Bonus option: use a tool that provides format-preserving encryption, either in or before NiFi, to allow routing decisions without exposing sensitive data.
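For illustration, a minimal sketch of the "encrypt before NiFi" approach using openssl; the file names and key file are hypothetical, and this is not a substitute for a proper protection service:

# Sketch only: encrypt just the sensitive payload before handing it to NiFi,
# keeping the routing metadata (e.g. a separate metadata.json) in the clear.
# payload.json and secret.key are placeholder names.
openssl enc -aes-256-cbc -pbkdf2 -salt \
  -in payload.json -out payload.json.enc \
  -pass file:./secret.key

NiFi can then make routing decisions on the unencrypted metadata or flowfile attributes without ever holding the clear-text payload.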
03-23-2017
05:15 PM
3 Kudos
In addition to the above answers, I've had success using the ExecuteStreamCommand processor to stream a script to a remote host and pull the results back into the dataflow. That way you can run it using parameters from the flow, without needing to deploy the script before calling it.
In this example, we kinit and run a Hive query against a remote kerberised cluster node.

Processor: ReplaceText
- Search Value: (?s:^.*$)
- Replacement Strategy: Always Replace
- Replacement Value:
  klist -s || kinit user@DOMAIN.COM -k -t ~/user.keytab
  hive -e "Some Hive Query"

Route into:

Processor: ExecuteStreamCommand
- Command Path: ssh
- Command Arguments: -o StrictHostKeyChecking=no user@hostname.fqdn
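For reference, the combination above behaves roughly like the following standalone command; this is a sketch, with the principal, keytab and hostname being the placeholder values from the table:

# ReplaceText sets the flowfile content to the script below, and
# ExecuteStreamCommand streams that content to ssh's stdin, returning the
# remote stdout back into the flow.
ssh -o StrictHostKeyChecking=no user@hostname.fqdn <<'EOF'
klist -s || kinit user@DOMAIN.COM -k -t ~/user.keytab
hive -e "Some Hive Query"
EOF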
03-23-2017
04:05 PM
7 Kudos
HDF Version: 2.1.2

Test Environment:
- Single AWS VPC with multiple subnets
- EC2 Security Groups with port rules controlling access between subnets
- AWS VPC Flow Logs to track accepted/rejected traffic
- 1 EC2 instance for management, 3 EC2 instances for worker nodes
- Laptop on a public IP for remote access

Deployment / Testing Method:
- Installed a baseline of Ambari, Ambari Infra, Ambari Metrics, NiFi Certificate Authority, ZooKeeper cluster, NiFi cluster, and clients
- Added Ranger
- Enabled SSL across all services
- Tested all commonly used interfaces and checked for rejected traffic

Not Tested:
- Using external LDAP/AD services
- Using an external Certificate Authority
- Connecting to auxiliary services

Deployment Topology

Zone | Members | Services
---|---|---
M(anagement) | Management Node | Ambari Infra (Infra Solr), Ranger, Metrics Collector, Grafana, Metrics Monitor, NiFi Certificate Authority, Clients (Infra Solr, ZooKeeper)
D(ata) | Worker Nodes 1, 2, 3 | NiFi, ZooKeeper, Metrics Monitor, Clients (ZooKeeper)
R(emote) | Other NiFi Cluster |
P(ublic) | Users | NiFi browser, ssh client

Firewall Rules:

Source Zone | Dest Zone | Port | Notes
---|---|---|---
M | D | 22 | ssh, if used for deployment
M | D | 8670 | Ambari Agent
M | D | 3000, 61300, 61310, 61330, 61320, 61388, 61288, 61181, 2181, 60200, 6188 | Ambari Metrics Service
M | D | 2181, 61181 | ZooKeeper
D | M | 8080 (http), 8443 (https) | Ambari Interface
D | M | 8440, 8441 | Ambari Agents
D | M | 6182, 6080, 8886 | Ranger Services
D | M | 3000, 61300, 61310, 61330, 61320, 61388, 61288, 61181, 2181, 60200, 6188 | Ambari Metrics Service
P | M | 8080 (http), 8443 (https) | Ambari
P | M | 6080 | Ranger
P | M | 3000 | Grafana
P | M | 8886 | Solr Admin
P | M | 22 | ssh
P | D | 22 | ssh (optional)
P | D | 9090 (http), 9091 (https) | NiFi Interface
D | R | 9090 (http), 9091 (https) | NiFi Interface & Data Transfer

Additional Port Considerations:
- Ports for all zones to connect to LDAP/AD, if used
- Ports for all zones to send logging and alerts (smtp etc.) to other systems
- Ports for NiFi to connect to target systems, e.g. HDFS, Hive, Kafka, etc.
- You will require access to your CA to generate and move certificates; it is probably not necessary to open a port for a direct connection
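If you implement these rules as EC2 Security Group entries, each row maps to an ingress rule. A minimal sketch for the P -> M Ambari rule, assuming a hypothetical security group ID and client IP:

# Hypothetical sketch: allow the Public zone (here a single client IP) to reach
# the Ambari UI on the Management security group. sg-0123456789abcdef0 and
# 203.0.113.10 are placeholders for your own security group ID and source address.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 8080 \
  --cidr 203.0.113.10/32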
03-14-2017
02:09 PM
2 Kudos
Introduction

What we're doing:
- Committing to <2 hours setup time
- Deploying a matched pair of 3-node HDF clusters into two different regions on AWS EC2
- Configuring some basic performance optimisations in EC2/NiFi
- Setting up Site-to-Site to work with EC2's public/private FQDNs

What you will need:
- An AWS account with access / credits to deploy 3x EC2 machines in two different regions
- Approx 80GB of SSD disk per machine, 480GB in total
- 3x Elastic IPs per region
- (Preferably) iTerm or a similar ssh client with broadcast capability
- You should use two different AWS regions for your clusters; I'll refer to them as regionA and regionB

Caveats:
- We're going to use a script to set up OS prerequisites and deploy HDF on the server nodes to save time, as this article is specific to AWS EC2 setup rather than generic HDF deployment
- We're not setting up HDF security, though we will restrict access to our nodes to specific IPs
- This is not best practice for all use cases, particularly on security; you are advised to take what you learn here and apply it intelligently to your own environment needs and use cases
- You don't have to use Elastic IPs, but they'll persist through environment reboots and therefore prevent FQDN changes from forcing you to reconfigure your services

Process Part 1: EC2 Public IPs, Instance & OS setup, and HDF packages deployment

Create Public IPs
1. Login to your AWS EC2 account and select regionA
2. Select the 'Elastic IPs' interface
3. Allocate 3x new addresses and note them down
4. Switch to regionB and repeat (a CLI sketch of the same allocation follows this list)
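If you prefer the CLI over the console, the same allocation can be scripted. A minimal sketch, assuming the AWS CLI is installed and configured with credentials for the target region:

# Allocate three Elastic IPs in the currently configured region and print
# their allocation IDs and public addresses; repeat with --region for regionB.
for i in 1 2 3; do
  aws ec2 allocate-address --domain vpc --query '[AllocationId,PublicIp]' --output text
done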
Launch EC2 instances

1. Run the 'Launch Instance' wizard
2. From 'AWS Marketplace', select 'CentOS 7 (x86_64) - with Updates HVM'
3. Select 't2.xlarge' (4 vCPU x 16GB). This is the minimum that will reasonably run the HDF stack; choose bigger if you prefer. The same goes for the node count and disk sizes below
4. Set Number of Instances to '3'
5. Set the root volume to 20GB General Purpose SSD
6. Add 3x new volumes of the same configuration
7. Set the 'Name' tag to something meaningful
8. Create a new security group called 'HDF Performance Test':
   - Set a rule to allow 'All traffic' from 'My IP'
   - Add 3x new rules to allow 'All traffic' from the Elastic IPs in regionB (the other region) that you created earlier
   - Add the local subnet for internode communication
   - Optional: you could instead create rules for just the specific ports required, at the cost of a much longer configuration. Ports 22 (ssh), 8080 (Ambari) and 9090-9092 (NiFi) should be sufficient for a no-security install
9. Review your configuration options and hit Launch
10. Select your ssh key preference; either use an existing key or create a new one and download it
11. Once the launcher completes, go to Elastic IPs
12. Associate each instance with an Elastic IP
13. Note down the Private and Public DNS (FQDN) for each instance; the Public values should correspond to the Elastic IPs you allocated
14. While the deployment finishes, repeat steps 1-13 in the other region to create your matching set of EC2 instances

OS Setup and package installs
(Optional) Launch iTerm and open a broadcast tab for every node; otherwise, issue the commands to each node sequentially (much slower). The following commands need root privileges on each node:
# Update the OS packages
yum update -y

# Create an ext4 filesystem on each of the three extra EBS volumes
mkfs -t ext4 /dev/xvdb
mkfs -t ext4 /dev/xvdc
mkfs -t ext4 /dev/xvdd

# Create mount points for the NiFi content, flowfile and provenance repositories, and mount them
mkdir /mnt/nifi_content
mkdir /mnt/nifi_flowfile
mkdir /mnt/nifi_prov
mount /dev/xvdb /mnt/nifi_content/
mount /dev/xvdc /mnt/nifi_flowfile/
mount /dev/xvdd /mnt/nifi_prov/

# Persist the mounts across reboots
echo "/dev/xvdb /mnt/nifi_content ext4 errors=remount-ro 0 1" >> /etc/fstab
echo "/dev/xvdc /mnt/nifi_flowfile ext4 errors=remount-ro 0 1" >> /etc/fstab
echo "/dev/xvdd /mnt/nifi_prov ext4 errors=remount-ro 0 1" >> /etc/fstab
We're going to use a script to do a default install of Ambari for HDF, as we're interested in looking at NiFi rather than overall HDF setup:
https://community.hortonworks.com/articles/56849/automate-deployment-of-hdf-20-clusters-using-ambar.html

Follow steps 1-3 in that guide; they are reproduced here for convenience.

Tip: running 'curl icanhazptr.com' will give you the FQDN of the machine in your current ssh session.

Run these commands as root on the first node, which I assume will be running the Ambari server:

export hdf_ambari_mpack_url=http://public-repo-1.hortonworks.com/HDF/centos7/2.x/updates/2.1.2.0/tars/hdf_ambari_mp/hdf-ambari-mpack-2.1.2.0-10.tar.gz
# Install prerequisites and fetch the bootstrap script
yum install -y git python-argparse
git clone https://github.com/seanorama/ambari-bootstrap.git

# Install and start Ambari server via the bootstrap script
export install_ambari_server=true
~/ambari-bootstrap/ambari-bootstrap.sh

# Add the HDF management pack and restart Ambari
ambari-server install-mpack --mpack=${hdf_ambari_mpack_url} --purge --verbose  # enter 'yes' to purge at the prompt
ambari-server restart
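Before moving on to the other nodes, it's worth confirming the Ambari server came back up cleanly; a quick check, assuming the default HTTP port of 8080:

# Ambari should report as running and the UI should answer on port 8080
ambari-server status
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080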
Assuming Ambari will run on the first node in the cluster, run these commands as root on every other node:

export ambari_server=<FQDN of the host where ambari-server will be installed>
export install_ambari_server=false
curl -sSL https://raw.githubusercontent.com/seanorama/ambari-bootstrap/master/ambari-bootstrap.sh | sudo -E sh

Deploy and Configure multi-node HDF Clusters on AWS EC2

HDF Cluster Deployment

1. Open a browser to port 8080 of the public FQDN of the Ambari server
2. Login using the default credentials of admin/admin
3. Select 'Launch Install Wizard'
4. Name your cluster
5. Accept the default versions and repos
6. Fill in these details:
   - Provide the list of Private FQDNs in the 'Target Hosts' panel
   - Select 'Perform manual registration on hosts' and accept the warning
   - Wait while the hosts are confirmed, then hit Next
   - If this step fails, check that you provided the Private FQDNs and not the Public FQDNs
7. Select the following services: ZooKeeper, Ambari Infra, Ambari Metrics, NiFi
8. Service layout:
   - Accept the default Master Assignment
   - Use the '+' key next to the NiFi row to add NiFi instances until you have one on each node
   - Unselect the NiFi Certificate Service and continue
9. Customize Services:
   - Provide a Grafana Admin Password in the 'Ambari Metrics' tab
   - Provide Encryption Passphrases in the NiFi tab; they must be at least 12 characters
10. When you hit Next you may get Configuration Warnings from Ambari; resolve any errors and continue
11. Hit Deploy and monitor the process
12. Repeat the above steps on the other cluster

NiFi Service Configuration for multi-node cluster on AWS EC2

1. Login to the cluster
2. In the NiFi service panel, go to the Configs tab
3. Enter 'repository' in the top right filter box and change the following:
   - NiFi content repository default dir = /mnt/nifi_content
   - NiFi flowfile repository dir = /mnt/nifi_flowfile
   - NiFi provenance repository default dir = /mnt/nifi_prov
4. Enter 'mem' in the filter box:
   - Set Initial memory allocation = 2048m
   - Set Max memory allocation = 8096m
5. Enter 'port' in the filter box:
   - Note down the NiFi HTTP port (Non-SSL); the default is 9090
   - Set nifi.remote.input.socket.port = 9092
   - Save your changes
6. Enter 'nifi.remote.input.host' in the filter box:
   - Note that we must use host-specific config groups to work around EC2's NAT configuration
   - Set nifi.remote.input.host = <Public FQDN of the first NiFi node>
   - Save this value
   - Click the '+' icon next to this field to override it
   - Select to create a new NiFi Configuration Group and name it host02
   - Set nifi.remote.input.host = <Public FQDN of the second NiFi node>
   - Save this value
   - Repeat for each NiFi node in the cluster
7. When all values are set in config groups, go to the 'Manage Config Groups' link near the filter box
8. Select each config group and use the '+' key to assign a single host to it. The goal is that each host has its own public FQDN assigned to this parameter
9. Check your settings and restart the NiFi service

You can watch the NiFi service startup and cluster voting process by running 'tail -f /var/log/nifi/nifi-app.log' in an ssh session on one of the hosts. NiFi is up when the Jetty server reports the URLs it is listening on in the log; by default this is http://<public fqdn>:9090/nifi

Summary

In this article we have deployed sets of AWS EC2 instances for HDF clusters, prepared and deployed the necessary packages, and set the configuration parameters needed for NiFi Site-to-Site to operate behind the AWS EC2 NAT implementation. In the next article I will outline how to build the dataflow for generating a small-files performance test, and how to push that data efficiently via Site-to-Site.
03-07-2017
05:07 PM
I haven't tested this recently, but I'm pretty sure that it won't like the ':' in the filename, and having spaces is bad practice generally.
I suggest you use 'yyyy-MM-dd_HH-mm-ss'
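For illustration only, here is the shell equivalent of that timestamp pattern (the format string above is a Java/NiFi date format, not a shell one):

# Prints something like 2017-03-07_17-05-07 - no colons or spaces,
# so it is safe to use in a filename.
date '+%Y-%m-%d_%H-%M-%S'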
03-07-2017
08:58 AM
2 Kudos
The filename attribute set in the UpdateAttribute processor looks like it has the wrong quote character, probably a backtick instead of a single quote. You can tell by the colour of the trailing ')}': the ')' is in the rusty string colour instead of the black used for control characters, and the '}' is bright red, indicating an error, instead of the grey used for attributes.
02-06-2017
10:09 AM
1 Kudo
@Naresh Kumar Korvi You want it to look a bit like this:
- Note the header, footer, and demarcator - this will aggregate your JSON records into a properly formatted document for later reading
- Set a max bin age so the final few messages will not get stuck in the queue
- Set a min size and a min number of entries to stop lots of little files being written
- Set a max size and max entries that generate a file of the size you want to work with
- Play with the values a bit, using the GenerateFlowFile processor to create appropriately sized test content if your Kafka dataflow is a bit slow

Your flow should be ConsumeKafka -> MergeContent -> UpdateAttribute (set filename, path) -> PutHDFS
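As a small illustration, assuming a header of '[', a demarcator of ',' and a footer of ']' (values not stated above, just one workable choice), three Kafka records merge into a single valid JSON document:

# Simulates what MergeContent would emit for three JSON records with those settings
printf '[%s,%s,%s]\n' '{"id":1}' '{"id":2}' '{"id":3}'
# Output: [{"id":1},{"id":2},{"id":3}]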
01-23-2017
10:41 AM
This doesn't work on RHEL 7, so I can't pass it on to customers as an easy process to follow.