10-03-2016
10:19 PM
5 Kudos
Objectives
Upon completion of this tutorial, you should have a 3-node NiFi cluster running on CentOS 7.2 using Vagrant and VirtualBox.
Prerequisites
You should already have installed VirtualBox 5.1.x. Read more here: VirtualBox
You should already have installed Vagrant 1.8.6. Read more here: Vagrant
NOTE: Version 1.8.6 fixes an annoying bug with permissions of the authorized_keys file and SSH. I highly recommend you upgrade to 1.8.6.
You should already have installed the vagrant-vbguest plugin. This plugin will keep the VirtualBox Guest Additions software current as you upgrade your kernel and/or VirtualBox versions. Read more here: Vagrant vbguest plugin
You should already have installed the vagrant-hostmanager plugin. This plugin will automatically manage the /etc/hosts file on your local Mac and in your virtual machines. Read more here: Vagrant hostmanager plugin
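If you haven't installed the two plugins yet, they can be added with Vagrant's plugin command (run on your Mac; the plugin versions you get may differ from the ones listed below):
$ vagrant plugin install vagrant-vbguest
$ vagrant plugin install vagrant-hostmanager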
Scope
This tutorial was tested in the following environment:
Mac OS X 10.11.6 (El Capitan)
VirtualBox 5.1.6
Vagrant 1.8.6
vagrant-vbguest plugin 0.13.0
vagrant-hostmanager plugin 1.8.5
Apache NiFi 1.0.0
Steps
Create Vagrant project directory
Before we get started, determine where you want to keep your Vagrant project files. Each Vagrant project should have its own directory. I keep my Vagrant projects in my ~/Development/Vagrant directory. You should also use a helpful name for each Vagrant project directory you create.
$ cd ~/Development/Vagrant
$ mkdir centos7-nifi-cluster
$ cd centos7-nifi-cluster
We will be using a CentOS 7.2 Vagrant box, so I include centos7 in the Vagrant project name to differentiate it from a CentOS 6 project. The project is for NiFi, so I include that in the name. And this project will have a cluster of machines. Thus we have a project directory name of centos7-nifi-cluster.
Create Vagrantfile
The Vagrantfile tells Vagrant how to configure your virtual machines. The name of the file is
Vagrantfile and it should be created in your Vagrant project directory. You can choose to copy/paste my Vagrantfile below or you can download it from the attachments to this article.
# -*- mode: ruby -*-
# vi: set ft=ruby :
# Using yaml to load external configuration files
require 'yaml'
Vagrant.configure("2") do |config|
  # Using the hostmanager vagrant plugin to update the host files
  config.hostmanager.enabled = true
  config.hostmanager.manage_host = true
  config.hostmanager.manage_guest = true
  config.hostmanager.ignore_private_ip = false
  # Loading in the list of commands that should be run when the VM is provisioned.
  commands = YAML.load_file('commands.yaml')
  commands.each do |command|
    config.vm.provision :shell, inline: command
  end
  # Loading in the VM configuration information
  servers = YAML.load_file('servers.yaml')
  servers.each do |server|
    config.vm.define server["name"] do |srv|
      srv.vm.box = server["box"]          # Specify the name of the Vagrant box file to use
      srv.vm.hostname = server["name"]    # Set the hostname of the VM
      srv.vm.network "private_network", ip: server["ip"], :adapter => 2    # Add a second adapter with a specified IP
      srv.vm.network :forwarded_port, guest: 22, host: server["port"]      # Add a port forwarding rule
      srv.vm.provision :shell, inline: "sed -i'' '/^127.0.0.1\t#{srv.vm.hostname}\t#{srv.vm.hostname}$/d' /etc/hosts"
      srv.vm.provider :virtualbox do |vb|
        vb.name = server["name"]      # Name of the VM in VirtualBox
        vb.cpus = server["cpus"]      # How many CPUs to allocate to the VM
        vb.memory = server["ram"]     # How much memory to allocate to the VM
        vb.customize ["modifyvm", :id, "--cpuexecutioncap", "33"]   # Limit the VM to 33% of available CPU
      end
    end
  end
end
Create a servers.yaml file
The servers.yaml file contains the configuration information for our virtual machines. The name of the file is
servers.yaml and it should be created in your Vagrant project directory. This file is loaded in from the Vagrantfile. You can choose to copy/paste my servers.yaml file below or you can download it from the attachments to this article.
---
- name: nifi01
  box: bento/centos-7.2
  cpus: 2
  ram: 2048
  ip: 192.168.56.101
  port: 10122
- name: nifi02
  box: bento/centos-7.2
  cpus: 2
  ram: 2048
  ip: 192.168.56.102
  port: 10222
- name: nifi03
  box: bento/centos-7.2
  cpus: 2
  ram: 2048
  ip: 192.168.56.103
  port: 10322
Create a commands.yaml file
The commands.yaml file contains a list of commands that should be run on each virtual machine when it is first provisioned. The name of the file is
commands.yaml and it should be created in your Vagrant project directory. This file is loaded in from the Vagrantfile and allows us to automate configuration tasks that would otherwise be tedious and/or repetitive. You can choose to copy/paste my commands.yaml file below or you can download it from the attachments to this article.
- "sudo yum -y install net-tools ntp wget java-1.8.0-openjdk java-1.8.0-openjdk-devel"
- "sudo systemctl enable ntpd && sudo systemctl start ntpd"
- "sudo systemctl disable firewalld && sudo systemctl stop firewalld"
- "sudo sed -i --follow-symlinks 's/^SELINUX=.*/SELINUX=disabled/g' /etc/sysconfig/selinux"
Start the virtual machines
Once you have created the 3 files in your Vagrant project directory, you are ready to start your cluster. Creating the cluster for the first time and starting it every time after that uses the same command:
$ vagrant up
Once the process is complete you should have 3 servers running. You can verify by looking at VirtualBox. You should notice I have 3 virtual machines running called nifi01, nifi02 and nifi03:
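You can also confirm the machines are up from the command line in your Vagrant project directory:
$ vagrant status
Each of nifi01, nifi02 and nifi03 should be reported as running (virtualbox).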
Connect to each virtual machine
You are able to login to each of the virtual machines via ssh using the
vagrant ssh command. You must specify the name of the virtual machine you want to connect to.
$ vagrant ssh nifi01
Verify that you can log in to each of the virtual machines: nifi01, nifi02, and nifi03.
Download Apache NiFi
We need to download the NiFi distribution file so that we can install it on each of our nodes. Instead of downloading it 3 times, we will download it once on our Mac. We'll copy the file to our Vagrant project directory where each of our virtual machines can access the file via the /vagrant mount point.
$ cd ~/Development/Vagrant/centos7-nifi-cluster
$ curl -O http://mirror.cc.columbia.edu/pub/software/apache/nifi/1.0.0/nifi-1.0.0-bin.tar.gz
NOTE: You may want to use a different mirror if you find your download speeds are too slow.
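Once the download finishes, you can quickly confirm the archive is visible inside a guest via the /vagrant shared folder (assuming the VMs are already running):
$ vagrant ssh nifi01 -c "ls -lh /vagrant/nifi-1.0.0-bin.tar.gz"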
Create nifi user
We will be running NiFi as the nifi user, so we need to create that account on each server.
$ vagrant ssh nifi01
$ sudo useradd nifi -d /home/nifi
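If you'd rather not log into each box separately, the same command can be run on all three guests from your Mac in one go (a sketch; it assumes all three VMs are up and reachable via vagrant ssh):
$ for h in nifi01 nifi02 nifi03; do vagrant ssh $h -c "sudo useradd nifi -d /home/nifi"; done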
Otherwise, repeat the process by hand for nifi02 and nifi03. I recommend having 3 terminal windows open from this point forward, one for each of the NiFi servers.
Verify Host Files
Now that you are logged into each server, you should verify the /etc/hosts file on each server. You should notice the Vagrant hostmanager plugin has updated the /etc/hosts file with the IP addresses and hostnames of the 3 servers.
NOTE: If you see 127.0.0.1 nifi01 (or nifi02, nifi03) at the top of the /etc/hosts file, delete that line. It will cause issues. The only entry with 127.0.0.1 should be the one for localhost.
UPDATE: If you use the updated Vagrantfile with the sed command, the extraneous entry at the top of the hosts file is removed automatically. The following line was added to the Vagrantfile to fix the issue:
srv.vm.provision :shell, inline: "sed -i'' '/^127.0.0.1\t#{srv.vm.hostname}\t#{srv.vm.hostname}$/d' /etc/hosts"
Extract nifi archive
We will be running NiFi from our /opt directory, which is where we will extract the archive. You should already be connected to the server from the previous step.
$ cd /opt
$ sudo tar xvfz /vagrant/nifi-1.0.0-bin.tar.gz
$ sudo chown -R nifi:nifi /opt/nifi-1.0.0
Repeat the process for nifi02 and nifi03.
Edit nifi.properties file
We need to modify the nifi.properties file to set up clustering. The nifi.properties file is the main configuration file and is located at <nifi install>/conf/nifi.properties. In our case it should be located at
/opt/nifi-1.0.0/conf/nifi.properties.
$ sudo su - nifi
$ cd /opt/nifi-1.0.0/conf
$ vi nifi.properties
You should edit the following lines in the file on each of the servers:
nifi.web.http.host=nifi01
nifi.state.management.embedded.zookeeper.start=true
nifi.cluster.is.node=true
nifi.cluster.node.address=nifi01
nifi.cluster.node.protocol.port=9999
nifi.zookeeper.connect.string=nifi01:2181,nifi02:2181,nifi03:2181
nifi.remote.input.host=nifi01
nifi.remote.input.secure=false
nifi.remote.input.socket.port=9998
NOTE: Make sure you enter the hostname value that matches the name of the host you are on.
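For example, on nifi02 the host-specific entries would read:
nifi.web.http.host=nifi02
nifi.cluster.node.address=nifi02
nifi.remote.input.host=nifi02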
You have the option to specify any port for nifi.cluster.node.protocol.port as long as there are no conflicts on the server and it matches the other server configurations. The same goes for nifi.remote.input.socket.port. The nifi.cluster.node.protocol.port and nifi.remote.input.socket.port values should be different.
If you used
vi to edit the file, press the following key sequence to save the file and exit vi:
:wq
Edit zookeeper.properties file
We need to modify the zookeeper.properties file on each of the servers. The zookeeper.properties file is the configuration file for zookeeper and is located at <nifi install>/conf/zookeeper.properties. In our case it should be located at
/opt/nifi-1.0.0/conf/zookeeper.properties. We are providing the list of known zookeeper servers.
$ cd /opt/nifi-1.0.0/conf
$ vi zookeeper.properties
Delete the line at the bottom of the file:
server.1=
Add these three lines at the bottom of the file:
server.1=nifi01:2888:3888
server.2=nifi02:2888:3888
server.3=nifi03:2888:3888
If you used
vi to edit the file, press the following key sequence to save the file and exit vi:
:wq
Create zookeeper state directory
Each NiFi server is running an embedded Zookeeper server. Each zookeeper instance needs a unique id, which is stored in the <nifi home>/state/zookeeper/myid file. In our case, that location is
/opt/nifi-1.0.0/state/zookeeper/myid. For each of the hosts, you need to create the myid file. The ids for each server are: nifi01 is 1, nifi02 is 2 and nifi03 is 3.
$ cd /opt/nifi-1.0.0
$ mkdir -p state/zookeeper
$ echo 1 > state/zookeeper/myid
Remember that on nifi02 you echo 2 and on nifi03 you echo 3.
Start NiFi
Now we should have everything in place to start NiFi. On each of the three servers run the following command:
$ cd /opt/nifi-1.0.0
$ bin/nifi.sh start
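You can check whether the process came up using the status command:
$ bin/nifi.sh status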
Monitor NiFi logs
You can monitor the NiFi logs by using the tail command:
$ tail -f logs/nifi-app.log
Once the servers connect to the cluster, you should notice log messages similar to this:
2016-09-30 13:22:59,260 INFO [Clustering Tasks Thread-2] org.apache.nifi.cluster.heartbeat Heartbeat created at 2016-09-30 13:22:59,253 and sent to nifi01:9999 at 2016-09-30 13:22:59,260; send took 5 millis
Access NiFi UI
Now you can access the NiFi web UI. You can log into any of the 3 servers using this URL:
http://nifi01:8080/nifi
You should see something similar to this:
Notice the cluster indicator in the upper left shows 3/3 which means that all 3 of our nodes are in the cluster. Notice the upper right has a post-it note icon. This icon gives you recent messages and will be colored red. You can see this screenshot showing a message about receiving a heartbeat.
Try accessing the nifi02 and nifi03 web interfaces.
Shutdown the cluster
To shutdown the cluster, you only need to run the vagrant command:
$ vagrant halt
Restarting the cluster
When you restart the cluster, you will need to log into each server and start NiFi, as it is not configured to auto start.
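If you want to avoid logging into each box by hand, a one-liner from your Mac can start NiFi on all three nodes (a sketch using the same vagrant ssh -c approach as earlier; it assumes the paths and nifi user from this tutorial):
$ for h in nifi01 nifi02 nifi03; do vagrant ssh $h -c "sudo -u nifi /opt/nifi-1.0.0/bin/nifi.sh start"; done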
Review
If you successfully worked through the tutorial, you should have a Vagrant configuration that you can bring up at any time using vagrant up and bring down using vagrant halt. You should also have Apache NiFi configured on 3 servers to run in a clustered configuration.
09-29-2016
01:59 AM
11 Kudos
Overview
This tutorial is intended to walk you through the process of creating a Process Group in NiFi to feed multiple Twitter streams to Elasticsearch.
This tutorial is the second part of a two part series. The first part can be found here: HCC Article. In this part of the series, we will create a process group which contains multiple Twitter feeds funneled to a single Elasticsearch instance. This allows you to have multiple feeds of data with different processing needs prior to pushing to Elasticsearch. We will be able to query Elasticsearch to see data from both of our example streams.
Admittedly this is a contrived example. However, the concept is fundamentally useful across a variety of NiFi use cases.
NOTE: The only required software components are NiFi and Elasticsearch which can be run in just about any Linux environment. However I recommend deploying these as part of your HDP sandbox or test cluster allowing for a broader integration of tools and capabilities such as Pig, Hive, Zeppelin, etc.
Prerequisites
You should already have completed the Using NiFi GetTwitter, UpdateAttributes and ReplaceText processors tutorial and associated prerequisites: HCC Article
Scope
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6
HDP 2.5 Tech Preview on Hortonworks Sandbox, although it should work for any HDP 2.5 deployments
Apache NiFi 1.0.0 (Read more here: Apache NiFi)
Elasticsearch 2.4.0, although it should work for any Elasticsearch version > 2.x (Read more here: Elasticsearch)
Steps
We are picking up from where the last tutorial left off. We currently have a single dataflow. Our GetTwitter processor combines multiple filters. We would like to clearly define two GetTwitter processors, each with their own filter.
This is what our current data flow looks like:
Create a process group
The first thing we are going to do is to add a Process Group to our NiFi canvas. To do this, drag the Process Group icon from the menu bar to the canvas area. Here is a screenshot showing the Process Group icon:
Once you drag the process group icon to the canvas, the Add Process Group dialog will be displayed. It should look similar to this:
Give the process group a meaningful name. In our case, we will call it Twitter Feed. Click the ADD button. The process group will be added to the canvas. You should see something similar to this:
You should drag the process group so that it is easier to see. You should see something similar to this:
Copy data flow to process group
Now we want to copy our existing flow to the process group we just created. Select all 4 of the processors in our current flow. Press the COMMAND-C (or CTRL-C on Windows) to copy the selected components. Now double click on the Twitter Feed process group. This will open the process group. You should see something similar to this:
Notice the canvas is now blank? You should also notice the bread crumb navigation in the lower left of the screen. NiFi Flow >> Twitter Feed is your indication that you are inside the process group.
Now we can paste our copied flow files on the canvas. You should be able to press COMMAND-V (CTRL-V on Windows) to paste our flow into the process group. You should see something similar to this:
You should see the 4 processors copied to the canvas. You should also notice the connections are missing. We need to reestablish the connections. Before doing that, we are going to delete the PutElasticsearch processor. It already exists outside of the process group and we don't need a copy inside.
Delete PutElasticsearch processor
Inside of your processor group (missing connections and bread crumb in lower left will confirm this), select the PutElasticsearch processor by clicking on it. Now you can delete it by pressing the delete key. You should see something similar to this:
Create connections between processors
Now we are going to create connections between the 3 processors. Drag the circle arrow icon from the GetTwitter processor to the UpdateAttribute processor. You don't need to change anything; click the ADD button. Drag the circle arrow icon from the UpdateAttribute processor to the ReplaceText processor. You don't need to change anything; click the ADD button. You should see something similar to this:
You should notice a red triangle in the upper left of the ReplaceText processor. That is because we haven't connected it to anything yet. We'll get to that shortly.
Edit GetTwitter processor
This first dataflow will be for our elasticsearch-related tweets. We need to edit the GetTwitter processor to filter only on elasticsearch. Right click on the processor and select configure. Click the PROPERTIES tab. Click on the Terms to Filter On value field to edit the value. Enter elasticsearch as the only term. Click the OK button to save the change. You should see something similar to this:
Click the APPLY button to save the change.
Copy data flow
We need a similar data flow within this process group. The second data flow should be filtering on the term solr. To do that, select all 3 processors and press the COMMAND-C keys. Now press the COMMAND-V keys to paste a copy of the processors. You should see something similar to this:
The processors you copied should still be selected. Let's move them so it's easier to see the two flows. Drag the selected processors to the right. You should see something similar to this:
Edit GetTwitter processor
We need to edit the GetTwitter processor for the second data flow. Follow the same procedure we did the first time, only this time use the term solr . You should have something that looks like this:
Create connections between processors
As we did before, create the connections between the processors in the second flow. Drag the circle arrow icon from the GetTwitter processor to the UpdateAttribute processor. You don't need to change anything; click the ADD button. Drag the circle arrow icon from the UpdateAttribute processor to the ReplaceText processor. You don't need to change anything; click the ADD button. You should see something similar to this:
Create Output Port
We need the data flow from this process group to be sent outside of the group to enable connections to our Elasticsearch processor. To enable this, we are going to add an Output Port. Drag the Output Port icon from the menu bar to the canvas area. Here is a screenshot showing the Output Port icon:
An Add Port dialog should be displayed. You should see something similar to this:
This is a user-friendly name for the port that will be created. We'll call our port From Twitter Feed . Click the ADD button to add the port. You should see something similar to this:
You should notice a red triangle in the upper left of our From Twitter Feed Output Port. This is because there is no connection defined yet.
Create connections to Output Port
Now we need to create a connection from each of the ReplaceText processors to the Output Port. To do this, drag the circle arrow icon from the ReplaceText processor to the Output Port. A Create Connection dialog will be displayed. Select the success relationship. Click the ADD button to create the connection. Do this for both ReplaceText processors. Now you should see something similar to this:
Create connection between Process Group and PutElasticsearch processor
Now we are ready to create the connection between our Process Group and our PutElasticsearch processor. Using the bread crumb navigation in the lower left, click on the NiFi Flow link to go up a level.
You should see something similar to this:
We no longer need the GetTwitter, UpdateAttribute and ReplaceText processors on the main canvas. Select each of the connections between the processors and delete the connections with the delete key. You should see something similar to this:
Now delete the GetTwitter, UpdateAttribute and ReplaceText processors from the main canvas. We want to keep the PutElasticsearch processor. You should see something similar to this:
Create a connection between the process group and the PutElasticsearch processor by dragging the circle arrow icon from the process group to the PutElasticsearch processor. A Create Connection dialog will be displayed. You don't need to change any options, so click the ADD button to create the connection. You should have something that looks similar to this:
If you look inside your process group now, you should notice the red triangle is gone for the Output Port. That is because a connection exists now.
Start processors
Now we can start our processors to test our flow. If you click on the process group and then click the start arrow icon, that will start all of the processors inside the process group. You should notice the number next to the start arrow icon in the process group goes from 0 to 7 and the number next to the stop square icon goes from 7 to 0.
Because we are filtering on specific terms, it may take 20 or 30 minutes before any matching tweets are pulled in. Be patient. Once tweets start coming in you should see something similar to this:
You should notice the tweets are queuing up. We have not yet started our PutElasticsearch processor. Go ahead and do that now. Click on the PutElasticsearch processor and click on the start arrow icon. You should see something similar to this:
You should notice the queued tweets have been processed and are now in Elasticsearch.
Query Elasticsearch
We can now query Elasticsearch using the custom field we created, twitterFilterAttribute. If you let the data flow run long enough, you should have at least a few tweets for each GetTwitter processor.
In your browser window, query Elasticsearch using the following URL: http://sandbox.hortonworks.com:9200/twitter_new/_search?pretty. You should see something similar to this:
{
"took" : 26,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 252831,
"max_score" : 1.0,
"hits" : [ {
"_index" : "twitter_new",
"_type" : "default",
"_id" : "0827ce3c-21ab-4dfa-9d17-0ba90c116142",
"_score" : 1.0,
"_source" : {
"created_at" : "Thu Sep 15 13:56:06 +0000 2016",
"id" : 776419323955048448,
"id_str" : "776419323955048448",
"text" : "RT @cymia: I have the biggest heart I swear.",
"source" : "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
"truncated" : false,
"in_reply_to_status_id" : null,
"in_reply_to_status_id_str" : null,
"in_reply_to_user_id" : null,
"in_reply_to_user_id_str" : null,
"in_reply_to_screen_name" : null,
"user" : {
"id" : 997413793,
"id_str" : "997413793",
"name" : "#######",
"screen_name" : "#######",
"location" : "Future In The Present",
"url" : null,
"description" : "~~_Trust No Bitch⚔ Y'all Opinions Doesn't Define Who I Am ✨ C/o '17 \uD83D\uDC10\uD83C\uDF93",
"protected" : false,
"verified" : false,
"followers_count" : 453,
"friends_count" : 391,
"listed_count" : 0,
"favourites_count" : 1448,
"statuses_count" : 6803,
"created_at" : "Sat Dec 08 15:41:26 +0000 2012",
"utc_offset" : -14400,
"time_zone" : "#######",
"geo_enabled" : false,
"lang" : "#######",
"contributors_enabled" : false,
"is_translator" : false,
"profile_background_color" : "BADFCD",
"profile_background_image_url" : "#######",
"profile_background_image_url_https" : "#######",
"profile_background_tile" : false,
"profile_link_color" : "FF0000",
"profile_sidebar_border_color" : "F2E195",
"profile_sidebar_fill_color" : "FFF7CC",
"profile_text_color" : "0C3E53",
"profile_use_background_image" : true,
"profile_image_url" : "#######",
"profile_image_url_https" : "#######",
"profile_banner_url" : "#######",
"default_profile" : false,
"default_profile_image" : false,
"following" : null,
"follow_request_sent" : null,
"notifications" : null
},
...
You should have a large number of tweets. In my case I have 252831. Now let's query against our new field. In your browser, enter the following URL: http://sandbox.hortonworks.com:9200/twitter_new/_search?q=twitterFilterAttribute:elasticsearch&pretty. You should get a much smaller number of tweets. In my case I got 2 documents back. Here is my output:
{
"took" : 142,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 12.341856,
"hits" : [ {
"_index" : "twitter_new",
"_type" : "default",
"_id" : "a4226ef1-5bfe-4aff-84aa-dd357e874356",
"_score" : 12.341856,
"_source" : {
"created_at" : "Tue Sep 27 21:56:45 +0000 2016",
"id" : 780888938483425288,
"id_str" : "780888938483425288",
"text" : "Build a Search Engine with Node.js and Elasticsearch#######",
"source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
"truncated" : false,
"in_reply_to_status_id" : null,
"in_reply_to_status_id_str" : null,
"in_reply_to_user_id" : null,
"in_reply_to_user_id_str" : null,
"in_reply_to_screen_name" : null,
"user" : {
"id" : 35983221,
"id_str" : "35983221",
"name" : "#######",
"screen_name" : "#######",
"location" : "#######",
"url" : "#######",
"description" : "Full stack web developer (PHP, Java, Rails), Docker enthusiast, gamer, also addicted to eletronic, photography and technology.",
"protected" : false,
"verified" : false,
"followers_count" : 398,
"friends_count" : 660,
"listed_count" : 84,
"favourites_count" : 348,
"statuses_count" : 2532,
"created_at" : "Tue Apr 28 04:04:57 +0000 2009",
"utc_offset" : -14400,
"time_zone" : "#######",
"geo_enabled" : true,
"lang" : "#######",
"contributors_enabled" : false,
"is_translator" : false,
"profile_background_color" : "5D7382",
"profile_background_image_url" : "#######",
"profile_background_image_url_https" : "#######",
"profile_background_tile" : false,
"profile_link_color" : "CC0000",
"profile_sidebar_border_color" : "000000",
"profile_sidebar_fill_color" : "EFEFEF",
"profile_text_color" : "333333",
"profile_use_background_image" : true,
"profile_image_url" : "#######",
"profile_image_url_https" : "https://pbs.twimg.com/profile_images/760303804092968965/9mekDmQy_normal.jpg",
"profile_banner_url" : "#######",
"default_profile" : false,
"default_profile_image" : false,
"following" : null,
"follow_request_sent" : null,
"notifications" : null
},
"geo" : null,
"coordinates" : null,
"place" : null,
"contributors" : null,
"is_quote_status" : false,
"retweet_count" : 0,
"favorite_count" : 0,
"entities" : {
"hashtags" : [ ],
"urls" : [ {
"url" : "#######",
"expanded_url" : "#######",
"display_url" : "sitepoint.com/search-engine-…",
"indices" : [ 53, 76 ]
} ],
"user_mentions" : [ ],
"symbols" : [ ]
},
"favorited" : false,
"retweeted" : false,
"possibly_sensitive" : false,
"filter_level" : "low",
"lang" : "en",
"timestamp_ms" : "1475013405806",
"twitterFilterAttribute" : "elasticsearch"
}
...
Notice the new field is present in the data and it contains elasticsearch as the value. Now let's query for solr. Type the following URL into your browser: http://sandbox.hortonworks.com:9200/twitter_new/_search?q=twitterFilterAttribute:solr&pretty. You should get a similarly small number of results. In my case I got 2 documents returned. Here is my output:
{
"took" : 28,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 12.341865,
"hits" : [ {
"_index" : "twitter_new",
"_type" : "default",
"_id" : "b8b618db-99d8-4ac8-910b-84a20fa58396",
"_score" : 12.341865,
"_source" : {
"created_at" : "Tue Sep 27 21:56:15 +0000 2016",
"id" : 780888813157711872,
"id_str" : "780888813157711872",
"text" : "RT @shalinmangar: #Docker image for @ApacheSolr 6.2.1 is now available. https://t.co/lrakkMMhJn #solr",
"source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
"truncated" : false,
"in_reply_to_status_id" : null,
"in_reply_to_status_id_str" : null,
"in_reply_to_user_id" : null,
"in_reply_to_user_id_str" : null,
"in_reply_to_screen_name" : null,
"user" : {
"id" : 362698158,
"id_str" : "362698158",
"name" : "#######",
"screen_name" : "#######",
"location" : "#######",
"url" : "#######",
"description" : "Digital Photography, Information Retrieval, Data Warehousing, Big Data, Cloud Computing. Solutions Engineer @ Hortonworks.",
"protected" : false,
"verified" : false,
"followers_count" : 310,
"friends_count" : 732,
"listed_count" : 48,
"favourites_count" : 108,
"statuses_count" : 5003,
"created_at" : "Fri Aug 26 20:53:04 +0000 2011",
"utc_offset" : null,
"time_zone" : null,
"geo_enabled" : true,
"lang" : "en",
"contributors_enabled" : false,
"is_translator" : false,
"profile_background_color" : "C6E2EE",
"profile_background_image_url" : "#######",
"profile_background_image_url_https" : "#######",
"profile_background_tile" : false,
"profile_link_color" : "1B95E0",
"profile_sidebar_border_color" : "C6E2EE",
"profile_sidebar_fill_color" : "DAECF4",
"profile_text_color" : "663B12",
"profile_use_background_image" : true,
"profile_image_url" : "#######",
"profile_image_url_https" : "#######",
"profile_banner_url" : "#######,
"default_profile" : false,
"default_profile_image" : false,
"following" : null,
"follow_request_sent" : null,
"notifications" : null
},
"geo" : null,
"coordinates" : null,
"place" : null,
"contributors" : null,
"retweeted_status" : {
"created_at" : "Tue Sep 27 12:45:52 +0000 2016",
"id" : 780750304060899328,
"id_str" : "780750304060899328",
"text" : "#Docker image for @ApacheSolr 6.2.1 is now available. https://t.co/lrakkMMhJn #solr",
"source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
"truncated" : false,
"in_reply_to_status_id" : null,
"in_reply_to_status_id_str" : null,
"in_reply_to_user_id" : null,
"in_reply_to_user_id_str" : null,
"in_reply_to_screen_name" : null,
"user" : {
"id" : 7057932,
"id_str" : "7057932",
"name" : "#######",
"screen_name" : "#######",
"location" : "#######",
"url" : "#######",
"description" : "Engineer at Lucidworks, Committer on Apache Lucene/Solr, ex-AOLer",
"protected" : false,
"verified" : false,
"followers_count" : 1431,
"friends_count" : 388,
"listed_count" : 106,
"favourites_count" : 903,
"statuses_count" : 3758,
"created_at" : "Sun Jun 24 22:50:00 +0000 2007",
"utc_offset" : 19800,
"time_zone" : "New Delhi",
"geo_enabled" : true,
"lang" : "en",
"contributors_enabled" : false,
"is_translator" : false,
"profile_background_color" : "EDECE9",
"profile_background_image_url" : "#######",
"profile_background_image_url_https" : "#######",
"profile_background_tile" : false,
"profile_link_color" : "088253",
"profile_sidebar_border_color" : "D3D2CF",
"profile_sidebar_fill_color" : "E3E2DE",
"profile_text_color" : "634047",
"profile_use_background_image" : false,
"profile_image_url" : "#######",
"profile_image_url_https" : "#######",
"default_profile" : false,
"default_profile_image" : false,
"following" : null,
"follow_request_sent" : null,
"notifications" : null
},
"geo" : null,
"coordinates" : null,
"place" : null,
"contributors" : null,
"is_quote_status" : false,
"retweet_count" : 8,
"favorite_count" : 9,
"entities" : {
"hashtags" : [ {
"text" : "Docker",
"indices" : [ 0, 7 ]
}, {
"text" : "solr",
"indices" : [ 78, 83 ]
} ],
"urls" : [ {
"url" : "#######",
"expanded_url" : "#######",
"display_url" : "#######",
"indices" : [ 54, 77 ]
} ],
"user_mentions" : [ {
"screen_name" : "ApacheSolr",
"name" : "Apache Solr",
"id" : 22742048,
"id_str" : "22742048",
"indices" : [ 18, 29 ]
} ],
"symbols" : [ ]
},
"favorited" : false,
"retweeted" : false,
"possibly_sensitive" : false,
"filter_level" : "low",
"lang" : "en"
},
"is_quote_status" : false,
"retweet_count" : 0,
"favorite_count" : 0,
"entities" : {
"hashtags" : [ {
"text" : "Docker",
"indices" : [ 18, 25 ]
}, {
"text" : "solr",
"indices" : [ 96, 101 ]
} ],
"urls" : [ {
"url" : "#######",
"expanded_url" : "#######",
"display_url" : "h#######",
"indices" : [ 72, 95 ]
} ],
"user_mentions" : [ {
"screen_name" : "shalinmangar",
"name" : "Shalin Mangar",
"id" : 7057932,
"id_str" : "7057932",
"indices" : [ 3, 16 ]
}, {
"screen_name" : "ApacheSolr",
"name" : "Apache Solr",
"id" : 22742048,
"id_str" : "22742048",
"indices" : [ 36, 47 ]
} ],
"symbols" : [ ]
},
"favorited" : false,
"retweeted" : false,
"possibly_sensitive" : false,
"filter_level" : "low",
"lang" : "en",
"timestamp_ms" : "1475013375926",
"twitterFilterAttribute" : "solr"
}
...
Look for the twitterFilterAttribute field. You should see it has the value solr .
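If you prefer the command line, the same queries can be run with curl; quote the URL so the shell doesn't interpret the & character:
$ curl 'http://sandbox.hortonworks.com:9200/twitter_new/_search?q=twitterFilterAttribute:solr&pretty'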
Review
If you were able to successfully work through the tutorial, you should have a good understanding of how to create multiple flows within a process group and how to feed that data to an output port. In this tutorial, we created 2 feeds for different Twitter filters which added a new field called twitterFilterAttribute to the Twitter JSON data. This field is now searchable within Elasticsearch to easily filter sources of data using a single index.
Next Steps
For next steps, you could try using the RouteOnAttribute processor to direct the flow to different Elasticsearch processors which write to different indexes.
09-22-2016
01:51 PM
7 Kudos
Objective
This tutorial is intended to walk you through the process of using the GetTwitter, UpdateAttribute, ReplaceText and PutElasticsearch processors in Apache NiFi to modify Twitter JSON data before sending it to Elasticsearch.
This tutorial is the first part of a two part series. In this part of the series, we will create a single data flow that adds an additional field to the JSON data called twitterFilterAttribute using the ReplaceText processor. This will allow us to query Elasticsearch using a fielded query like q=twitterFilterAttribute:elasticsearch. The second part of the series will build on this example to create a process group with two GetTwitter feeds: one with an elasticsearch term filter and the other with a solr term filter.
Admittedly this is a contrived example. However, the concept is fundamentally useful across a variety of NiFi use cases.
NOTE: The only required software components are NiFi and Elasticsearch, which can be run in just about any Linux environment. However, I recommend deploying these as part of your HDP sandbox or test cluster, allowing for a broader integration of tools and capabilities such as Pig, Hive, Zeppelin, etc.
Prerequisites
You should already have completed the NiFi + Twitter + Elasticsearch tutorial and associated prerequisites: HCC Article
Scope
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6
HDP 2.5 Tech Preview on Hortonworks Sandbox, although it should work for any HDP 2.5 deployment
Apache NiFi 1.0.0 (Read more here: Apache NiFi)
Elasticsearch 2.4.0, although it should work for any Elasticsearch version > 2.x (Read more here: Elasticsearch)
Steps
Stop NiFi processors
If your NiFi workflow from the previous tutorial is running, then you should stop your GetTwitter and PutElasticsearch processors. Your NiFi data flow should look similar to this (NOTE: my processors are running in this screenshot):
As you can see, we have two NiFi processors. This is a very simple data flow.
Remove Connection
Before we add the UpdateAttribute processor, we are going to remove the connection between the GetTwitter and PutElasticsearch processors. Click on the connection between the two processors to select the connection. Now press the delete or backspace key to delete the connection.
NOTE: You must have both processors stopped before you can delete the connection.
You should now see something similar to this:
Add UpdateAttribute Processor
Now we are going to add the UpdateAttribute processor. Drag the processor icon from the NiFi menu bar to the data flow canvas. You will see the Add Processor dialog. Type updateattr in the filter box to filter the list of processors. You should see something similar to this:
Select the UpdateAttribute processor and click the ADD button. You should see something similar to this:
Rearrange the processors on the canvas to make it easier to follow/trace connections later. You should have something similar to this:
Configure UpdateAttribute Processor
We are now going to configure the UpdateAttribute processor. Right click on the UpdateAttribute processor and select the Configure menu option. Click on the PROPERTIES tab. You should see something similar to this:
We are going to add a new property. Click on the + (plus) icon. You should see something similar to this:
For the Property Name, enter twitterFilterAttribute. This will add a property called twitterFilterAttribute to the flow files coming through this processor. Now click the OK button and you should see something similar to this:
For the Value, enter elasticsearch. This is the value that will be added to the twitterFilterAttribute property. Now click the OK button and then the APPLY button.
Add Connection Between GetTwitter and UpdateAttribute Processors
We need to add a connection between the GetTwitter and UpdateAttribute processors. You do this by hovering over the GetTwitter processor until you see the circle-arrow icon. Drag the icon to the UpdateAttribute processor. You should see something similar to this:
You do not need to change any settings here. Click the ADD button to add the connection.
Add ReplaceText Processor
We are now going to add the ReplaceText processor. Drag the processor icon from the NiFi menu bar to the data flow canvas. You will see the Add Processor dialog. Type replace in the filter box to filter the list of processors. You should see something similar to this:
Select the ReplaceText processor and click the ADD button.
Configure ReplaceText Processor
We are now going to configure the ReplaceText processor. We want to change the JSON message data and we are going to do it using Regular Expressions, which is enabled with the ReplaceText processor. You can read more on regular expressions here: Wikipedia.
Here is what the message looks like coming in:
{
...
"filter_level" : "low",
"lang" : "en",
"timestamp_ms" : "1473786418611"
}
Here is what the message should look like going out:
{
...
"filter_level" : "low",
"lang" : "en",
"timestamp_ms" : "1473786418611",
"twitterFilterAttribute"" : "elasticsearch"
}
We need to add a , and our new field twitterFilterAttribute with our value after the last entry in the JSON, but before the last } character.
Right click on the ReplaceText processor and select the Configure menu option. Click on the PROPERTIES tab. You should see something similar to this:
We need to change the Search Value and Replacement Value settings. Click on the Value box for the Search Value line. You should see something similar to this:
The value in this box is a regular expression. We are going to replace the entire value with: (?s:(^.*)}$) This regular expression looks for anything from the beginning of the line to a } character at the end of the line. Any match it finds is put into a regular expression group by the () characters. We are looking for a } at the end of the line because that is the last part of the Twitter JSON message data. You will notice that we don't include the } in the () group. This is because we need to add a value before the closing }, which we'll do in the Replacement Value section.
This regular expression will match everything up to the last line of the incoming message data:
...
"timestamp_ms" : "1473786418611"
}
Once you have entered the regular expression, click the OK button. Now we are going to change the Replacement Value setting. Click on the Value box for the Replacement Value line. You should see something similar to this:
The value in this box is a regular expression group. We are going to replace the entire value with: $1,"twitterFilterAttribute":"${twitterFilterAttribute}"} This will replace the entire text of the incoming data with the first matching group, which is all of the JSON twitter text without the last }. We then add a , because each JSON node needs to be separated by a comma. The "twitterFilterAttribute" text is a literal string. The ${} in the second part of that string is NiFi Expression Language. This adds the value of the attribute twitterFilterAttribute to the string.
Once you have entered the regular expression, click the OK button. You should see something similar to this:
You don't need to change any other settings. Click the APPLY button.
NOTE: Be careful using copy/paste as sometimes smart quotes will be inserted instead of standard quotes. This will cause Elasticsearch JSON parsing to have issues.
Add Connection Between UpdateAttribute and ReplaceText Processors
We need to add a connection between the UpdateAttribute and ReplaceText processors. The process is the same as before. You do this by hovering over the UpdateAttribute processor until you see the circle-arrow icon. Drag the icon to the ReplaceText processor. You should see something similar to this:
You do not need to change any settings here. Click the ADD button to add the connection.
Add Connection Between ReplaceText and PutElasticsearch Processors
We need to add a connection between the ReplaceText and PutElasticsearch processors. The process is similar to before. You do this by hovering over the ReplaceText processor until you see the circle-arrow icon. Drag the icon to the PutElasticsearch processor. You should see something similar to this:
You should notice this dialog doesn't look exactly the same as before. The For Relationships section gives you both success and failure options; the last two times we did this, there was only the success option. For this connection, we are going to check the success box. You do not need to change any other settings here. Click the ADD button to add the connection.
Now we need to go back to the ReplaceText processor and make a change. You should notice a red triangle icon on this processor. That is because there is a failure relationship that we haven't handled. Right click on the processor and click the Configure option. Click the SETTINGS tab. For the Auto Terminate Relationships setting, check the failure option. You should see something similar to this:
This setting will drop any records where the ReplaceText processor was not successful. The connection to PutElasticsearch only accepts the successful replacement attempts. Click the APPLY button to save the settings.
Your final data flow should look similar to this:
Turn On processors
Now we can turn on all of our processors to make sure everything works. Make sure you have started Elasticsearch.
You can select all of the processors by pressing the CMD-A (CTRL-A if you are on Windows) keys. You should see something similar to this:
Then you can click the play arrow icon to start the flow.
Verify New Field
Now we should be able to query Elasticsearch and verify the new field exists. You can type the following into a browser window to query Elasticsearch:
http://sandbox.hortonworks.com:9200/twitter/default/_search?q=twitterFilterAttribute:elasticsearch&pretty
You should get results from Elasticsearch using this query.
Troubleshooting
If you do not get any results when querying Elasticsearch, verify the query above. With the default schema, it may be case sensitive. In other words, twitterFilterAttribute is not the same as twitterfilterattribute.
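You can also confirm Elasticsearch picked up the new field in the index mapping (it is added dynamically once the first matching document has been indexed), for example:
$ curl -s 'http://sandbox.hortonworks.com:9200/twitter/_mapping?pretty' | grep twitterFilterAttribute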
If you experience any errors writing to Elasticsearch, the problem is likely one of two things: 1) you have not started Elasticsearch or 2) you have a copy/paste issue with smart quotes in your ReplaceText processor settings. Here are the kinds of messages you may see if you have a smart quote issue:
,"twitterFilterAttribute":"elasticsearch”} ]}
MapperParsingException[failed to parse]; nested: JsonEOFException[Unexpected end-of-input in VALUE
_STRING
at [Source: org.elasticsearch.common.io.stream.InputStreamStreamInput@681b0c9a; line: 1, column: 4443]];
at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:156)
Review
We have modified our existing, simple data flow to create a new field in our Twitter JSON data. This field is being captured in Elasticsearch as twitterFilterAttribute and allows us to query Elasticsearch based on the values stored in this field.
Next Steps
Look for the next article in the series, which will use process groups in NiFi with multiple Twitter streams using different filters and values for twitterFilterAttribute being written to Elasticsearch.
09-19-2016
11:15 PM
1 Kudo
Objectives
This tutorial will walk you through the process of starting Atlas on the Hortonworks Sandbox for HDP 2.5. By default the service is disabled; however, manually starting the service will fail unless you start its dependencies. Atlas depends on Ambari Infra (which provides Solr), Kafka and HBase. While Atlas will start with just the Ambari Infra service running, you won't have proper functionality without Kafka and HBase.
Scope
This has been tested on the following:
VirtualBox 5.1.6
Hortonworks Sandbox for HDP 2.5
Steps
Start Atlas Service
Get your sandbox up and running and log into Ambari. Click on the Atlas service link. You should see something similar to this:
Because Atlas is in maintenance mode, it will not automatically start. When you try to start it by going to Service Actions -> Start like this:
You will see the following error:
If you look at the error message provided, you will see the problem is related to Solr:
Client is connected to ZooKeeper
Using default ZkACLProvider
Updating cluster state from ZooKeeper...
No live SolrServers available to handle this request
org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request
at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:350)
at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1100)
at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:870)
at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:806)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:166)
This is because Atlas is using the Solr instance in the Ambari Infra service, which is in maintenance mode and does not auto start. Let's start the service.
Start Ambari Infra Service
If you click on the Ambari Infra service you should see something like this:
Click the Service Actions -> Start button. This should start Ambari Infra.
Start Atlas Service
Now that Ambari Infra is running, you should be able to start the Atlas service.
Start Kafka and HBase Service
While Atlas did start with only the Ambari Infra service running, it also depends on Kafka and HBase for full functionality. You should start both of those services the same way we started the Ambari Infra service.
Review
The Ambari Infra service provides a Solr instance for core HDP component access. By default this service is in maintenance mode and does not start, which causes the Atlas service to fail. By starting the Ambari Infra service before Atlas, you will be able to start Atlas. If you turn off maintenance mode for Ambari Infra, it will auto start.
09-15-2016
02:59 PM
7 Kudos
Objectives:
This article will walk you through the process of creating a dashboard in Kibana using Twitter data that was pushed to Elasticsearch via NiFi. The tutorial will also cover basics of Elasticsearch mappings and templates.
Prerequisites:
You should already have installed the Hortonworks Sandbox (HDP 2.5 Tech Preview).
You should already have completed the NiFi + Twitter + Elasticsearch tutorial here: HCC Article
Make sure your GetTwitter and PutElasticsearch processors in your NiFi data flow are stopped.
NOTE: While not required, I highly recommend using Vagrant to manage multiple Virtualbox environments. You can read more about converting the HDP Sandbox Virtualbox virtual machine into a Vagrant box here: HCC Article
Scope
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6
HDP 2.5 Tech Preview on Hortonworks Sandbox
Apache NiFi 1.0.0 (Read more here: Apache NiFi)
Elasticsearch 2.3.5 and Elasticsearch 2.4.0 (Read more here: Elasticsearch)
Kibana 4.6.1 (Read more here: Kibana)
Vagrant 1.8.5 (Read more here: Vagrant)
VirtualBox 5.1.4 and VirtualBox 5.1.6 (Read more here: VirtualBox)
Steps
Download Kibana
I assume you are using Vagrant to connect to your sandbox. As noted in the prerequisites, it is not required but is very handy.
$ vagrant ssh
Now download the Kibana software:
$ cd ~
$ curl -O https://download.elastic.co/kibana/kibana/kibana-4.6.1-linux-x86_64.tar.gz
Install Kibana
We will be running Kibana out of the /opt directory. So extract the archive there:
$ cd /opt
$ sudo tar xvfz ~/kibana-4.6.1-linux-x86_64.tar.gz
We will be using the elastic user, which you should have created in the prerequisite tutorial. So we need to change ownership of the Kibana files to the elastic user:
$ sudo chown -R elastic:elastic /opt/kibana-4.6.1-linux-x86_64
Configure Kibana
Before making any configuration changes, switch over to the elastic user:
$ sudo su - elastic
The Kibana configuration file is kibana.yml and it's located in the config directory. We need to edit this file:
$ cd /opt/kibana-4.6.1-linux-x86_64
$ vi config/kibana.yml
Kibana defaults to port 5601, but we want to set it explicitly. This port should not conflict with anything on the sandbox.
Uncomment this line:
#server.port: 5601
It should look like this:
server.port: 5601
We want to explicitly tell Kibana to listen to the host ip address of sandbox.hortonworks.com:
Uncomment this line:
#server.host: "0.0.0.0"
Change it to this:
server.host: sandbox.hortonworks.com
We also want to explicitly set the Elasticsearch host.
Uncomment this line:
#elasticsearch.url: "http://localhost:9200"
Change it to this:
elasticsearch.url: "http://sandbox.hortonworks.com:9200"
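After these three edits, the changed lines in config/kibana.yml should read:
server.port: 5601
server.host: sandbox.hortonworks.com
elasticsearch.url: "http://sandbox.hortonworks.com:9200"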
Save the file
Press Esc
:wq
Start Kibana
Now we can start Kibana. The archive file does not provide a service script, so we'll run it by hand.
$ bin/kibana
You should see something similar to:
$ bin/kibana
log [17:32:15.415] [info][status][plugin:kibana@1.0.0] Status changed from uninitialized to green - Ready
log [17:32:15.473] [info][status][plugin:elasticsearch@1.0.0] Status changed from uninitialized to yellow - Waiting for Elasticsearch
log [17:32:15.498] [info][status][plugin:kbn_vislib_vis_types@1.0.0] Status changed from uninitialized to green - Ready
log [17:32:15.518] [info][status][plugin:markdown_vis@1.0.0] Status changed from uninitialized to green - Ready
log [17:32:15.523] [info][status][plugin:metric_vis@1.0.0] Status changed from uninitialized to green - Ready
log [17:32:15.527] [info][status][plugin:spyModes@1.0.0] Status changed from uninitialized to green - Ready
log [17:32:15.532] [info][status][plugin:statusPage@1.0.0] Status changed from uninitialized to green - Ready
log [17:32:15.535] [info][status][plugin:table_vis@1.0.0] Status changed from uninitialized to green - Ready
log [17:32:15.540] [info][listening] Server running at http://sandbox.hortonworks.com:5601
log [17:32:20.594] [info][status][plugin:elasticsearch@1.0.0] Status changed from yellow to yellow - No existing Kibana index found
log [17:32:23.379] [info][status][plugin:elasticsearch@1.0.0] Status changed from yellow to green - Kibana index ready
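Running bin/kibana this way ties Kibana to your terminal session. If you want it to keep running after you log out, one option (not part of the original steps) is to launch it in the background and send the output to a log file:
$ nohup bin/kibana > /tmp/kibana.log 2>&1 &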
Access Kibana Web UI
You can access the Kibana web user interface via your browser:
http://sandbox.hortonworks.com:5601
When you first start Kibana, it will create a new Elasticsearch index called .kibana where it stores the visualizations and dashboards. Because this is the first time starting it, you should be prompted to configure an index pattern. You should see something similar to:
We are going to use our Twitter data which is stored in the twitter index. Uncheck the Index contains time-based events option. Our data is not yet properly set up to handle time-based events. We'll fix this later. Replace logstash-* with twitter.
You should see something similar to this:
If everything looks correct, click the green Create button.
Kibana will now show you the index definition for the twitter index. You can see all of the field names, their data types, and whether the fields are analyzed and indexed. This provides a good high level overview of the data configuration in the index. You can also filter fields in the filter box. You should see something similar to this:
At this point, your index pattern is saved. You can now start discovering your data.
Discover Twitter Data
Click on the Discover link in the top navigation bar. This opens the Discover view which is a very helpful way to dive into your raw data. You should see something similar to this:
At the top of the screen is the query box which allows you to filter your data based on search terms. Enter coordinates.coordinates:* in the filter box to filter results that only contain that field. On the left of the screen is the list of fields in the index. Each field has an icon to the left of the field name that indicates the data type for the field. If you click on the field name, it will expand to show you sample data for that field in the index. Look for the field named coordinates.coordinates. The icon to the left of that field indicates that it is a number field. If you click the name of the field, you can see that it expands. You should see something similar to this:
It shows the percentage of documents where the value is present. You can experiment with other fields. Some fields will tell you the field is present in the mapping, but there are no values in the documents. Elasticsearch does not create empty fields.
The area to the right of the screen shows your current search results. The small triangle icon will expand the search result to show a more user friendly and detailed view of the record in the index. Click the arrow icon for the first result. You should see something similar to this:
Nested objects in the twitter data are easily seen as JSON objects. You can see entities.media, entities.urls, and entities.hashtags as examples of nested objects. This will depend on the data in your twitter index.
Update Elasticsearch Configuration
Before we can move on to creating visualizations, we need to update our Elasticsearch configuration. When we pushed Twitter data to Elasticsearch, you should remember that we didn't have to create the Elasticsearch index or define a mapping. Out of the box, Elasticsearch is very user friendly by dynamically evaluating your data and creating a best-guess data mapping for you. This is great for testing and evaluation as it makes the data discovery process much quicker.
To see what the twitter index mapping looks like, enter this url into your browser:
http://sandbox.hortonworks.com:9200/twitter/_mapping?pretty
You can read more about Elasticsearch index mapping here: Elasticsearch Mapping. You should see something similar to this:
As you can see, the mapping is very long and somewhat complex. It can be very time consuming to go through this entire mapping and make changes to the fields that require a different analyzer or data type. You absolutely should do this for production data sets. However, for testing and evaluation of a new data set, we can start by creating a mapping with a much smaller subset of known fields. If you want to perform specific analysis such as geospatial queries or time-series queries, then you need to ensure your Elasticsearch index mapping is properly configured for those fields.
We can do this using Templates in Elasticsearch. You define a template with the mappings you care about. When a new index is created, Elasticsearch first looks for a matching template. If it finds one, it will create the index using the mapping in the template. For our purposes we will create a simple template mapping. By default Elasticsearch will fill in any of the new fields contained in the data that isn't in our mapping. It will do this by dynamically determining the data type. You can read more about templates here: Elasticsearch Templates
For our dashboard, we want to do time-series analysis, geo-spatial analysis and aggregations (grouping and counts) of fields. Each of these requires the field definition to be properly configured in the index mapping. You push the template to Elasticsearch as a JSON document using curl. Here is the template and command we'll be using:
$ curl -XPUT sandbox.hortonworks.com:9200/_template/twitter -d '
{
  "template" : "twitter*",
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings" : {
    "default" : {
      "properties" : {
        "created_at" : {
          "type" : "date",
          "format" : "EEE MMM dd HH:mm:ss Z YYYY"
        },
        "coordinates" : {
          "properties" : {
            "coordinates" : {
              "type" : "geo_point"
            },
            "type" : {
              "type" : "string"
            }
          }
        },
        "user" : {
          "properties" : {
            "screen_name" : {
              "type" : "string",
              "index" : "not_analyzed"
            },
            "lang" : {
              "type" : "string",
              "index" : "not_analyzed"
            }
          }
        }
      }
    }
  }
}'
The twitter* in the "template" : section is a regular expression match. Any new index created with a name that starts with twitter, such as twitter_20160914 or twitter2016, will match the template. Elasticsearch will create those indexes with the mappings defined in the template.
The created_at field is the field on which we will do our time-series analysis. This field should use the date data type. We also need to specify the date format to ensure Elasticsearch correctly parses it. Here is that configuration for our data:
"created_at" : {
"type" : "date",
"format" : "EEE MMM dd HH:mm:ss Z YYYY"
}
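For reference, a raw created_at value from Twitter looks like the following (the timestamp itself is only an example), which is what the EEE MMM dd HH:mm:ss Z YYYY pattern describes:
"created_at" : "Wed Sep 14 18:23:05 +0000 2016"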
Twitter data has two fields with geospatial data: geo.coordinates and coordinates.coordinates. The geo.coordinates field is in [lat,lon] format. The coordinates.coordinates field is in the [lon,lat] format. Elasticsearch requires data to be in the [lon,lat] format when passed as an array, as in our Twitter data. You can read more about it here (see note 4 for the example): Elasticsearch Geo-Point. Note this is a nested field which requires a nested mapping. Here is that configuration for our data:
"coordinates" : {
"properties" : {
"coordinates" : {
"type" : "geo_point"
},
"type" : {
"type" : "string"
}
}
}
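For reference, the incoming coordinates object from Twitter looks roughly like this (the values here are made up); note that the longitude comes first in the array:
"coordinates" : {
"type" : "Point",
"coordinates" : [ -122.4194, 37.7749 ]
}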
For aggregations where we want to do counts, we also need a special configuration. We are going to use user.screen_name and user.lang. These fields were originally mapped as analyzed strings. Elasticsearch tokenizes and lowercases analyzed strings. For simple single-token values this generally isn't a problem, but it causes issues for values we want to treat as a single exact term. For example, a screen_name such as My_Screen_Name would be lowercased, and values containing spaces or punctuation would be split into separate tokens, so aggregations on that field would not group by the original value. Read more about that here: Elasticsearch Languages. To handle this scenario we need to tell Elasticsearch not to analyze those fields. This is done using the "index" : "not_analyzed" setting for a field. Here is that configuration for our data:
"user" : {
"properties" : {
"screen_name" : {
"type" : "string",
"index" : "not_analyzed"
},
"lang" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
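If you are curious about what analysis actually does to a value, you can ask Elasticsearch directly with the _analyze API (the sample text here is arbitrary). The response lists the tokens that would be indexed, which is what an aggregation on an analyzed field operates on:
$ curl -XGET sandbox.hortonworks.com:9200/_analyze?pretty -d '
{
"analyzer" : "standard",
"text" : "My_Screen_Name"
}'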
You should have pushed this template to Elasticsearch using curl as shown above. You should get a response from Elasticsearch that looks like this:
{"acknowledged":true}
If you receive an error, double check your ' and " characters to ensure they were not replaced with "smart" versions when you copy/pasted the command.
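To confirm the template was stored, you can optionally retrieve it with a GET request; the response should echo back the settings and mappings you defined above:
$ curl -XGET sandbox.hortonworks.com:9200/_template/twitter?pretty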
You can't update the field mappings of data already in an index. You need to create a new index which will use the updated mapping and template. Elasticsearch has a handy Reindex API to copy data from one index to another. The cool thing about this approach is that the raw data is copied, but the new index will use the mapping created from the new template. You can read more about it here: Elasticsearch Reindex API. You can reindex using the new template with a simple command:
$ curl -XPOST sandbox.hortonworks.com:9200/_reindex -d '
{
"source": {
"index": "twitter"
},
"dest": {
"index": "twitter_new"
}
}'
Depending on the size of your index, this could take a few minutes. When it is complete, you should see something similar to this:
{"took":34022,"timed_out":false,"total":37114,"updated":0,"created":37114,"batches":38,"version_conflicts":0,"noops":0,"retries":0,"throttled_millis":0,"requests_per_second":"unlimited","throttled_until_millis":0,"failures":[]}
On my virtual machine it took 34 seconds for 37,114 records.
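As an optional sanity check, you can compare the document counts of the old and new indexes with the _count API; the numbers should match once the reindex has finished:
$ curl -XGET sandbox.hortonworks.com:9200/twitter/_count?pretty
$ curl -XGET sandbox.hortonworks.com:9200/twitter_new/_count?pretty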
Update Kibana Index Pattern
Now that we have a new index, we need to create a new index pattern in Kibana. Click on the Settings link in the navigation bar in Kibana. You should see the Configure an index pattern screen. If you do not, click on the Indices link. You should see something similar to this:
You should notice our existing Twitter* Index Pattern on the left. We need to create an index pattern for the twitter_new index we created with the Reindex API. Keep the Index contains time-based events option checked this time. Enter twitter_new for the index name. You will notice that Kibana auto-populates the Time-field name with created_at, which is what we want. Click the green Create button.
As before, you will see the index definition from Elasticsearch. In the filter query box, enter coordinates. This will filter the fields and only display those fields that have coordinates in the field name. You should see something similar to this:
Notice how the coordinates.coordinates field has type geo_point? Now filter on created_at. You should see something similar to this:
Notice how created_at has a clock icon and the type is date. This tells us the new index has the updated field mappings. Now we can create some visualizations.
Create Kibana Visualizations
Now that our index is properly configured, we can create some visualizations. Click on the Visualize link in the navigation bar. You should see something similar to this:
We are going to create a time-series chart. So click on Vertical bar chart. You should see something similar to this:
Now click on From a new search. You should see something similar to this:
Kibana knows you have two index patterns and is prompting you to specify which one to use for the visualization. In the Select an index pattern drop down, choose twitter_new. You should see something similar to this:
The Y-Axis defaults to Count, which is fine. We need to tell Kibana what field to use for the X-Axis. Under Select buckets type click on X-Axis. You should see something similar to this:
Under Aggregation, expand the drop down and select Date Histogram. Your screen should look like this:
You will notice that it defaulted to using the created_at field which is the time-series field we selected for the index pattern. This is what we want. By default, Kibana will filter events to show only the last hour. In the upper right corner of the UI, click the Last 1h rounded to the hour text. This will expand the time picker. You should see something like this:
Change the time option from 1 hours ago to 1 days ago. Click the gray Go button. Now we can save this visualization. Click the floppy disk icon to the right of the query box. You should see something similar to this:
Give the visualization a name of Created At and click the gray Save button.
Now we are going to create a map visualization. Click the Visualize link in the navigation Bar. You should see something like this:
You should note our previously saved visualization is listed at the bottom. Now we are going to click on the Tile map option. You want to choose From a new search and use the index pattern twitter_new. You should have something that looks similar to this:
Under the Select buckets type, click on Geo Coordinates. Kibana should auto populate the values with the only geo-enabled field we have, coordinates.coordinates. You should see something similar to this:
Click the Green run arrow. You should see something similar to this:
Now save the visualization, like we did the last one. Name it Twitter Location.
Create Kibana Dashboard
Now we are ready to create a Kibana dashboard. Click on Dashboard in the navigation bar. You should see something similar to this:
Click the + icon where it says Ready to get started?. You should see something similar to this:
You should see the two visualizations listed that we saved. You add them to the dashboard by clicking the name of the visualization. Click each of the visualizations one time. You should see something similar to this:
You can resize each of the dashboard tiles by dragging the lower right corner of each tile. Increase the size of each tile so they take up half of the vertical space. You will notice the tiles show a shaded area where they will auto size to the next closest size. You can click the ^ icon at the bottom of the visualization list to close it. You can do the same thing at the top for the date picker. You should have something similar to this:
Now we can save this dashboard. Click the floppy icon to the right of the query box. Save the dashboard as "Twitter Dashboard". Leave the Store time with dashboard option unchecked. Checking that option would store the currently selected time range (the last 1 day) with the dashboard; we want the dashboard to default to the last 1 hour of data when it is opened.
Review
We have successfully walked through installing Kibana on our sandbox. We created a custom template for the Twitter data in Elasticsearch. We used the Elasticsearch Reindex API to copy our index with the new mappings. We created two visualizations and a dashboard.
Next Steps
Now go back to your NiFi data flow and update the configuration of your PutElasticsearch processor to use the new index twitter_new instead of twitter. Turn on your data flow and refresh your dashboard to see how it changes over time.
For extra credit, see if you can create a dashboard that looks similar to this:
09-13-2016
01:07 AM
8 Kudos
Objective:
The purpose of this tutorial is to walk you through the process of using NiFi to pull data from Twitter and push it to Elasticsearch. I also show an example Zeppelin dashboard which queries the Twitter data in Elasticsearch.
This is the second of two articles covering Elasticsearch on HDP. The first article covers manually creating Movie data in Zeppelin and pushing that data to Elasticsearch. You can find that article here: HCC Article
Note: The Zeppelin Elasticsearch interpreter is a community provided interpreter. It is not yet considered GA by Hortonworks and should only be used for development and testing purposes.
Prerequisites:
You should already have installed the Hortonworks Sandbox (HDP 2.5 Tech Preview).
You should already have enabled the Elasticsearch interpreter in Zeppelin. See this article: HCC Article
You should already have twitter access keys. You create your access keys here: Twitter Apps. Read more here: Twitter Docs.
Note: While not required, I recommend using Vagrant to manage multiple versions of the Sandbox. Follow my tutorial here to set that up: HCC Article
Scope:
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6
HDP 2.5 Tech Preview on Hortonworks Sandbox
Apache NiFi 1.0.0
Elasticsearch 2.3.5 and Elasticsearch 2.4.0
Steps:
Download Elasticsearch
We need to download Elasticsearch. The current version is 2.4.0. You can read more about Elasticsearch here: Elasticsearch Website
You can use curl to download Elasticsearch to your sandbox.
$ cd ~
$ curl -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.4.0/elasticsearch-2.4.0.tar.gz
Install Elasticsearch
Next we need to extract Elasticsearch to the /opt directory, which is where we'll run it.
$ cd /opt
$ sudo tar xvfz ~/elasticsearch-2.4.0.tar.gz
Configure Elasticsearch
We need to make a couple of changes to the Elasticsearch configuration file /opt/elasticsearch-2.4.0/config/elasticsearch.yml.
$ cd elasticsearch-2.4.0/config
$ vi elasticsearch.yml
We need to set the cluster.name setting to "elasticsearch". This is the default Zeppelin expects, however you can change this value in the Zeppelin configuration.
cluster.name: elasticsearch
We need to set the network.host setting to our sandbox hostname or ip. Elastic will default to binding to 127.0.0.1 which won't allow us to easily access it from outside of the sandbox.
network.host: sandbox.hortonworks.com
Make sure you have removed the # character at the start of the line for these two settings. Once you have completed these two changes, save the file:
Press the esc key
:wq
Create Elasticsearch user
We are going to create an elastic user to run the application.
$ sudo useradd elastic -d /home/elastic
Change Ownership of Elasticsearch directories
We are going to change the ownership of the elastic directories to the elastic user:
$ sudo chown -R elastic:elastic /opt/elasticsearch-2.4.0
Start Elasticsearch
We want to run Elasticsearch as the elastic user so first we'll switch to that user.
$ sudo su - elastic
$ cd /opt/elasticsearch-2.4.0
$ bin/elasticsearch
You will see something similar to :
$ bin/elasticsearch
[2016-09-02 19:44:34,905][WARN ][bootstrap ] unable to install syscall filter: seccomp unavailable: CONFIG_SECCOMP not compiled into kernel, CONFIG_SECCOMP and CONFIG_SECCOMP_FILTER are needed
[2016-09-02 19:44:35,168][INFO ][node ] [Skyhawk] version[2.4.0], pid[22983], build[ce9f0c7/2016-08-29T09:14:17Z]
[2016-09-02 19:44:35,168][INFO ][node ] [Skyhawk] initializing ...
[2016-09-02 19:44:35,807][INFO ][plugins ] [Skyhawk] modules [lang-groovy, reindex, lang-expression], plugins [], sites []
[2016-09-02 19:44:35,856][INFO ][env ] [Skyhawk] using [1] data paths, mounts [[/ (/dev/mapper/vg_sandbox-lv_root)]], net usable_space [26.2gb], net total_space [42.6gb], spins? [possibly], types [ext4]
[2016-09-02 19:44:35,856][INFO ][env ] [Skyhawk] heap size [990.7mb], compressed ordinary object pointers [true]
[2016-09-02 19:44:35,856][WARN ][env ] [Skyhawk] max file descriptors [4096] for elasticsearch process likely too low, consider increasing to at least [65536]
[2016-09-02 19:44:38,032][INFO ][node ] [Skyhawk] initialized
[2016-09-02 19:44:38,032][INFO ][node ] [Skyhawk] starting ...
[2016-09-02 19:44:38,115][INFO ][transport ] [Skyhawk] publish_address {172.28.128.4:9300}, bound_addresses {172.28.128.4:9300}
[2016-09-02 19:44:38,119][INFO ][discovery ] [Skyhawk] elasticsearch/31d3OvlZT5WRnqYUW-GJwA
[2016-09-02 19:44:41,157][INFO ][cluster.service ] [Skyhawk] new_master {Skyhawk}{31d3OvlZT5WRnqYUW-GJwA}{172.28.128.4}{172.28.128.4:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-09-02 19:44:41,206][INFO ][http ] [Skyhawk] publish_address {172.28.128.4:9200}, bound_addresses {172.28.128.4:9200}
[2016-09-02 19:44:41,207][INFO ][node ] [Skyhawk] started
[2016-09-02 19:44:41,223][INFO ][gateway ] [Skyhawk] recovered [0] indices into cluster_state
Verify access to Elasticsearch
Using your web browser, verify you get a response from Elasticsearch by using the following address: http://sandbox.hortonworks.com:9200/
You should see something similar to:
Alternatively, you can use curl:
$ curl -XGET http://sandbox.hortonworks.com:9200
You will see a similar json output message:
$ curl -XGET http://sandbox.hortonworks.com:9200
{
"name" : "Echo",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "2.4.0",
"build_hash" : "ce9f0c7394dee074091dd1bc4e9469251181fc55",
"build_timestamp" : "2016-08-29T09:14:17Z",
"build_snapshot" : false,
"lucene_version" : "5.5.2"
},
"tagline" : "You Know, for Search"
}
Install OpenJDK 1.8
NiFi 1.0.0 requires JDK 1.8. You can install it on the sandbox using:
$ sudo yum install java-1.8.0-openjdk
$ sudo yum install java-1.8.0-openjdk-devel
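You can confirm the JDK installed correctly before continuing; the reported version should start with 1.8:
$ java -version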
Download NiFi 1.0.0
We need to download NiFi. The latest version is 1.0.0. You can read more about NiFi here: NiFi Website
You can use curl to download NiFi to your sandbox.
$ cd ~
$ curl -O http://mirrors.ibiblio.org/apache/nifi/1.0.0/nifi-1.0.0-bin.tar.gz
Note: You may want to use a mirror location closest to you by visiting: Apache Mirrors
Install NiFi
We need to extract NiFi to /opt directory, which is where we'll run it.
$ cd /opt
$ sudo tar xvfz ~/nifi-1.0.0-bin.tar.gz
Configure NiFi
We need to change the web port of NiFi. The default port is 8080, which will conflict with Ambari. We will change the port to 9090.
$ cd nifi-1.0.0/conf
$ vi nifi.properties
Edit the nifi.web.http.port property to change the default port.
nifi.web.http.port=9090
Save the file:
Press the esc key
:wq
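Before starting NiFi, you can optionally confirm that nothing else on the sandbox is already listening on port 9090. This is just a quick sanity check and assumes the net-tools package (which provides netstat) is installed:
$ sudo netstat -tlnp | grep 9090
If the command returns no output, the port is free.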
Create NiFi user
We are going to create a nifi user to run the application.
$ sudo useradd nifi -d /home/nifi
Change Ownership of NiFi directories
We are going to change the ownership of the nifi directories to the nifi user:
$ sudo chown -R nifi:nifi /opt/nifi-1.0.0
Start NiFi
We want to run NiFi as the nifi user so first we'll switch to that user.
$ sudo su - nifi
$ cd /opt/nifi-1.0.0
$ bin/nifi.sh start
You should see something similar to this:
$ ./nifi.sh start
Java home: /usr/lib/jvm/java
NiFi home: /opt/nifi-1.0.0
Bootstrap Config File: /opt/nifi-1.0.0/conf/bootstrap.conf
2016-09-03 02:45:47,909 INFO [main] org.apache.nifi.bootstrap.Command Starting Apache NiFi...
2016-09-03 02:45:47,909 INFO [main] org.apache.nifi.bootstrap.Command Working Directory: /opt/nifi-1.0.0
2016-09-03 02:45:47,909 INFO [main] org.apache.nifi.bootstrap.Command Command: /usr/lib/jvm/java/bin/java -classpath /opt/nifi-1.0.0/./conf:/opt/nifi-1.0.0/./lib/jul-to-slf4j-1.7.12.jar:/opt/nifi-1.0.0/./lib/nifi-documentation-1.0.0.jar:/opt/nifi-1.0.0/./lib/logback-core-1.1.3.jar:/opt/nifi-1.0.0/./lib/nifi-runtime-1.0.0.jar:/opt/nifi-1.0.0/./lib/slf4j-api-1.7.12.jar:/opt/nifi-1.0.0/./lib/nifi-properties-loader-1.0.0.jar:/opt/nifi-1.0.0/./lib/jcl-over-slf4j-1.7.12.jar:/opt/nifi-1.0.0/./lib/log4j-over-slf4j-1.7.12.jar:/opt/nifi-1.0.0/./lib/bcprov-jdk15on-1.54.jar:/opt/nifi-1.0.0/./lib/nifi-framework-api-1.0.0.jar:/opt/nifi-1.0.0/./lib/nifi-nar-utils-1.0.0.jar:/opt/nifi-1.0.0/./lib/nifi-properties-1.0.0.jar:/opt/nifi-1.0.0/./lib/nifi-api-1.0.0.jar:/opt/nifi-1.0.0/./lib/commons-lang3-3.4.jar:/opt/nifi-1.0.0/./lib/logback-classic-1.1.3.jar -Dorg.apache.jasper.compiler.disablejsr199=true -Xmx512m -Xms512m -Dsun.net.http.allowRestrictedHeaders=true -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true -XX:+UseG1GC -Djava.protocol.handler.pkgs=sun.net.www.protocol -Dnifi.properties.file.path=/opt/nifi-1.0.0/./conf/nifi.properties -Dnifi.bootstrap.listen.port=43343 -Dapp=NiFi -Dorg.apache.nifi.bootstrap.config.log.dir=/opt/nifi-1.0.0/logs org.apache.nifi.NiFi
Access NiFi Web UI
Note: It will take a couple of minutes for NiFi to start up.
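While you wait, you can check on startup progress from the command line; both of these are standard NiFi commands run from the NiFi home directory:
$ bin/nifi.sh status
$ tail -f logs/nifi-app.log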
The NiFi web UI should be available via http://sandbox.hortonworks.com:9090/nifi. You should see the default canvas:
Add Processors
You add processors to NiFi by dragging the Processor icon from the left icon bar to the canvas. This screenshot shows the Processor icon:
Once you drag the icon to the canvas, you will see the Add Processor dialog box. This screenshot shows the dialog box:
Add GetTwitter Processor
Drag the Processor icon to the canvas. In the Add Processor dialog, enter "twitter" in the Filter box. This will filter the list of processors to show you matching processor names. You should see something similar to this:
Select the GetTwitter processor and click the ADD button. This will add the processor to the canvas. Your canvas should look similar to this:
Configure GetTwitter Processor
Right click on the GetTwitter processor to display the context menu. It should look similar to this:
Click the Configure menu option. This will display the Configure Processor dialog box. It should look similar to this:
Click the Properties tab. You should set the following settings:
Twitter Endpoint -> Set to Filter Endpoint
Consumer Key -> From twitter app
Consumer Secret -> From twitter app
Access Token -> From twitter app
Access Token Secret -> From twitter app
Terms to Filter on -> Set to "nifi,hadoop,hortonworks,elasticsearch"
Once the settings are correct, click the APPLY button.
Add PutElasticsearch Processor
Drag the Processor icon to the canvas. In the Add Processor dialog, enter "elastic" in the Filter box. This will filter the list of processors to show you matching processor names. You should see something similar to this:
Select the PutElasticsearch processor and click the ADD button. This will add the processor to the canvas. Your canvas should look similar to this:
Configure PutElasticsearch Processor
Right click on the PutElasticsearch processor to display the context menu. It should look similar to this:
Click the Configure menu option. This will display the Configure Processor dialog box. It should look similar to this:
Under Auto Terminate Relationships, check the success box. Leave the failure and retry boxes unchecked; those relationships will be handled via a connection later.
Now click the Properties tab. You should see something similar to this:
Set the following settings:
Cluster Name -> Set to elasticsearch (this should match the cluster name in the elasticsearch.yml configuration file)
ElasticSearch Hosts -> Set to sandbox.hortonworks.com:9300 (note the port is 9300, not 9200)
Identifier Attribute -> Set to uuid (this uses the unique id NiFi generates for each FlowFile as the Elasticsearch document id)
Index -> Set this to twitter (this can be any index name you want, but you need to know it for the Zeppelin queries)
Type -> Set this to default (we are not using types, so it can be any type name you want)
All other settings left at defaults
Once the settings are correct, click the APPLY button.
Connect Processors
Now we need to create a connection between our processors. Hover over the GetTwitter processor. You should see a dark circle with a white arrow in the middle. Drag this icon down on top of the PutElasticsearch processor. The Create Connection dialog should open. It should look similar to this:
You can click the ADD button, as you don't need to make any changes.
Now hover over the PutElasticsearch processor. Drag the arrow icon out past the PutElasticsearch processor, then back over top of it and release. This will display the Create Connection dialog. This connection is needed for fail or retry operations. Select the failure and retry options under Relationships. It should look similar to this:
Click the ADD button.
Start Processors
Right click on the GetTwitter processor to display the context menu. Click the Start option. This will start the processor. Right click on the PutElasticsearch processor to display the context menu. Click the Start option. This will start the processor.
Verify Workflow
The two processors should be running. You should see something similar to this:
You can verify that tweets are being written to elasticsearch by typing the following in a browser window:
http://sandbox.hortonworks.com:9200/twitter/_search?pretty
You should see something similar to this:
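If you just want a quick count of how many tweets have been indexed rather than the full search results, you can also use the _count API:
$ curl -XGET http://sandbox.hortonworks.com:9200/twitter/_count?pretty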
Create Zeppelin Notebook
Now you can create a Zeppelin notebook to query Elasticsearch. If you follow the Zeppelin article I linked in the Prerequisites, you should be able to use the %elasticsearch interpreter. Here is an example dashboard I created against Twitter data.
You should notice that I'm using the Elasticsearch DSL to run aggregation queries against the data. Aggregations are ideally suited for charts and graphs.
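If you want a starting point for your own notebook, here is a minimal aggregation sketch; it assumes the twitter index and the user.lang field used in this tutorial and simply counts tweets per language:
%elasticsearch
search /twitter { "aggs": {
"languages": {
"terms": { "field": "user.lang", "size": 10 }
}
} }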
Troubleshooting
If you see a red icon in the upper right of your processors, that indicates there is a problem. If you hover over the icon, you should see relevant error message information. Here is an example screenshot where I incorrectly set my port to 9200 instead of 9300 on the PutElasticsearch processor:
Review:
This tutorial walked you through installing Apache NiFi and Elasticsearch. You made the necessary configuration changes so that NiFi and Elasticsearch would run. You created a NiFi workflow using the GetTwitter and PutElasticsearch processors. The processors should have successfully pulled data from Twitter and pushed data to Elasticsearch. Finally, you should have been able to query Elasticsearch using Zeppelin.
09-03-2016
05:49 AM
6 Kudos
Objective:
The purpose of this tutorial is to walk you through the process of enabling the Elasticsearch interpreter for Zeppelin on the HDP 2.5 TP sandbox. As part of this process, we will install Elasticsearch and use Zeppelin to index and query data in Elasticsearch. This is the first of two articles covering Elasticsearch on HDP. The second article covers pushing Twitter data to Elasticsearch using NiFi and provides a sample Zeppelin dashboard. You can find that article here: HCC Article
Note: The Zeppelin Elasticsearch interpreter is a community provided interpreter. It is not yet considered GA by Hortonworks and should only be used for development and testing purposes.
Prerequisites:
You should already have installed the Hortonworks Sandbox (HDP 2.5 Tech Preview).
Note: While not required, I recommend using Vagrant to manage multiple versions of the Sandbox. Follow my tutorial here to set that up: HCC Article
Scope:
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6
HDP 2.5 Tech Preview on Hortonworks Sandbox
Elasticsearch 2.3.5 and Elasticsearch 2.4.0
Note: This has also been tested on HDP 2.5 deployed with Cloudbreak on AWS. The specific steps may vary depending on your environment, but the high level process is the same.
Steps:
Here is the online documentation for the Elasticsearch interpreter for Zeppelin:
Elasticsearch Interpreter. If you follow the steps provided in this documentation, you will find that adding the Elasticsearch interpreter is not possible as the documentation shows. That is because the interpreter is not enabled.
If you try to add the interpreter, you will see it is not in the list. You should see something similar to:
Verify Elasticsearch Interpreter is available
The first thing we are going to do is ensure the Elasticsearch interpreter is available within the Zeppelin installation. You can verify the Elasticsearch interpreter is available by looking in the interpreter directory:
$ ls -la /usr/hdp/current/zeppelin-server/interpreter/
total 76
drwxr-xr-x 19 zeppelin zeppelin 4096 2016-06-24 00:00 .
drwxr-xr-x 8 zeppelin zeppelin 4096 2016-08-31 02:57 ..
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-23 23:59 alluxio
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-23 23:59 angular
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 cassandra
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 elasticsearch
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 file
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 flink
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 hbase
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 ignite
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 jdbc
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 kylin
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 lens
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 livy
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 md
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 psql
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 python
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 sh
drwxr-xr-x 3 zeppelin zeppelin 4096 2016-06-24 00:00 spark
Note: This process is easy on the sandbox. If you are using a different HDP environment, then you need to perform this step on the server on which Zeppelin is installed.
If you do not see a directory for elasticsearch, you may have to run an interpreter install script. Here are the steps to run the interpreter install script:
$ cd /usr/hdp/current/zeppelin-server/bin
$ sudo ./install-interpreter.sh --name elasticsearch
Add Elasticsearch Interpreter to the Zeppelin configuration
Now we need to add the Elasticsearch interpreter to the Zeppelin configuration, which enables access to it. You need to modify the zeppelin.interpreters parameter.
Click on the Zeppelin Notebook service in Ambari:
Now, click on the Configs link:
Expand Advanced zeppelin-config:
Add the following string to the end of the zeppelin.interpreters parameter:
,org.apache.zeppelin.elasticsearch.ElasticsearchInterpreter
Note: The comma is not a typo. It is required to separate our added value from the previous value.
It should look similar to this:
Now click the Save button to save the settings. You should see an indication that you need to restart the Zeppelin service. It should look similar to this:
Restart the Zeppelin Notebook service.
Configure Zeppelin Interpreter
Now you should be able to follow the documentation I linked previously for setting up the Elasticsearch interpreter. You should have something similar to this:
The elasticsearch.host value will correspond to your ip address or sandbox.hortonworks.com if you have edited your local /etc/hosts file.
Download Elasticsearch
Now that Zeppelin is configured, we need to download Elasticsearch. The latest version is 2.4.0. You can read more about Elasticsearch here:
Elasticsearch Website
You can use curl to download Elasticsearch to your sandbox.
$ cd ~
$ curl -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.4.0/elasticsearch-2.4.0.tar.gz
Note: If you are using Vagrant, you can download the file on your local computer and simply copy it to your Vagrant directory. The file will be visible within the sandbox in the /vagrant directory.
Install Elasticsearch
Next we need to extract Elasticsearch to the /opt directory, which is where we'll run it.
$ cd /opt
$ sudo tar xvfz ~/elasticsearch-2.4.0.tar.gz
Configure Elasticsearch
We need to make a couple of changes to the Elasticsearch configuration file /opt/elasticsearch-2.4.0/config/elasticsearch.yml.
$ cd elasticsearch-2.4.0/config
$ vi elasticsearch.yml
We need to set the cluster.name setting to "elasticsearch". This is the default Zeppelin expects, however you can change this value in the Zeppelin configuration.
cluster.name: elasticsearch
We need to set the network.host setting to our sandbox hostname or ip. Elastic will default to binding to 127.0.0.1 which won't allow us to easily access it from outside of the sandbox.
network.host: sandbox.hortonworks.com
Make sure you have removed the # character at the start of the line for these two settings. Once you have completed these two changes, save the file:
Press the esc key
:wq
Create Elasticsearch user
We are going to create an elastic user to run the application.
$ sudo useradd elastic -d /home/elastic
Change Ownership of Elasticsearch directories
We are going to change the ownership of the elastic directories to the elastic user:
$ sudo chown -R elastic:elastic /opt/elasticsearch-2.4.0
Start Elasticsearch
We want to run Elasticsearch as the elastic user so first we'll switch to that user.
$ sudo su - elastic
$ cd /opt/elasticsearch-2.4.0
$ bin/elasticsearch
You will see something similar to :
$ bin/elasticsearch
[2016-09-02 19:44:34,905][WARN ][bootstrap ] unable to install syscall filter: seccomp unavailable: CONFIG_SECCOMP not compiled into kernel, CONFIG_SECCOMP and CONFIG_SECCOMP_FILTER are needed
[2016-09-02 19:44:35,168][INFO ][node ] [Skyhawk] version[2.4.0], pid[22983], build[ce9f0c7/2016-08-29T09:14:17Z]
[2016-09-02 19:44:35,168][INFO ][node ] [Skyhawk] initializing ...
[2016-09-02 19:44:35,807][INFO ][plugins ] [Skyhawk] modules [lang-groovy, reindex, lang-expression], plugins [], sites []
[2016-09-02 19:44:35,856][INFO ][env ] [Skyhawk] using [1] data paths, mounts [[/ (/dev/mapper/vg_sandbox-lv_root)]], net usable_space [26.2gb], net total_space [42.6gb], spins? [possibly], types [ext4]
[2016-09-02 19:44:35,856][INFO ][env ] [Skyhawk] heap size [990.7mb], compressed ordinary object pointers [true]
[2016-09-02 19:44:35,856][WARN ][env ] [Skyhawk] max file descriptors [4096] for elasticsearch process likely too low, consider increasing to at least [65536]
[2016-09-02 19:44:38,032][INFO ][node ] [Skyhawk] initialized
[2016-09-02 19:44:38,032][INFO ][node ] [Skyhawk] starting ...
[2016-09-02 19:44:38,115][INFO ][transport ] [Skyhawk] publish_address {172.28.128.4:9300}, bound_addresses {172.28.128.4:9300}
[2016-09-02 19:44:38,119][INFO ][discovery ] [Skyhawk] elasticsearch/31d3OvlZT5WRnqYUW-GJwA
[2016-09-02 19:44:41,157][INFO ][cluster.service ] [Skyhawk] new_master {Skyhawk}{31d3OvlZT5WRnqYUW-GJwA}{172.28.128.4}{172.28.128.4:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-09-02 19:44:41,206][INFO ][http ] [Skyhawk] publish_address {172.28.128.4:9200}, bound_addresses {172.28.128.4:9200}
[2016-09-02 19:44:41,207][INFO ][node ] [Skyhawk] started
[2016-09-02 19:44:41,223][INFO ][gateway ] [Skyhawk] recovered [0] indices into cluster_state
Verify access to Elasticsearch
Using your web browser, verify you get a response from Elasticsearch by using the following address:
http://sandbox.hortonworks.com:9200
You should see something similar to:
Alternatively, you can use curl:
curl -XGET http://sandbox.hortonworks.com:9200
You will see a similar json output message.
Add data to Elasticsearch
Now we are going to create a notebook in Zeppelin. You should have a note for each index operation in the notebook. Let's use the %elasticsearch interpreter and the index command to index some data:
%elasticsearch
index movies/default/1 {
"title": "The Godfather",
"director": "Francis Ford Coppola",
"year": 1972,
"genres": ["Crime", "Drama"]
}
%elasticsearch
index movies/default/2 {
"title": "Lawrence of Arabia",
"director": "David Lean",
"year": 1962,
"genres": ["Adventure", "Biography", "Drama"]
}
%elasticsearch
index movies/default/3 {
"title": "To Kill a Mockingbird",
"director": "Robert Mulligan",
"year": 1962,
"genres": ["Crime", "Drama", "Mystery"]
}
%elasticsearch
index movies/default/4 {
"title": "Apocalypse Now",
"director": "Francis Ford Coppola",
"year": 1979,
"genres": ["Drama", "War"]
}
%elasticsearch
index movies/default/5 {
"title": "Kill Bill: Vol. 1",
"director": "Quentin Tarantino",
"year": 2003,
"genres": ["Action", "Crime", "Thriller"]
}
%elasticsearch
index movies/default/6 {
"title": "The Assassination of Jesse James by the Coward Robert Ford",
"director": "Andrew Dominik",
"year": 2007,
"genres": ["Biography", "Crime", "Drama"]
}
You should have a notebook that looks similar to this:
For each of the index notes, click the play button to insert the data.
Query Elasticsearch data
Once the data is in Elasticsearch, we can search it from Zeppelin like this:
%elasticsearch
search /movies/default
For this note, click the play button to run the query. You should see something similar to this:
The Elasticsearch interpreter has great support for the Elasticsearch Query DSL (Domain Specific Language). You have the ability to easily filter the fields returned, create buckets and aggregations.
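For example, here is a minimal aggregation note (a sketch using the movies index created above) that counts movies per year; run it with the play button just like the search note:
%elasticsearch
search /movies/default { "aggs": {
"by_year": {
"terms": { "field": "year", "size": 10 }
}
} }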
Review:
We have enabled the Elasticsearch interpreter in Zeppelin, indexed data into Elasticsearch and queried data from Elasticsearch using Zeppelin. Try indexing and querying your own data using a different index name.
08-22-2016
10:00 PM
@Binu Mathew I am curious how https://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html would compare to the Pig RegEx approach.
08-20-2016
10:42 PM
@gkeys This is a great article and filled with helpful tips!
08-18-2016
07:00 PM
7 Kudos
Objective
The objective of this tutorial is to walk you through the process of updating your Cloudbreak Deployer on Amazon AWS from 1.3 to 1.4. Once this is complete, you can deploy an HDP 2.5 TP cluster.
Prerequisites
You should have deployed Cloudbreak 1.3 on Amazon AWS using the instructions found here: Cloudbreak Documentation 1.3 - AWS.
You should add TCP port 443 to the security group on Amazon AWS as Cloudbreak 1.4+ appears to proxy requests from port 443 now.
Do not run cbd start.
Note: This process will update the Cloudbreak Deployer to the latest available version. At the time of initial writing, this was 1.4. As of September 1, 2016, the latest version is 1.6.
Steps
1. Connect to your Cloudbreak Amazon AWS instance
You should have access to your key file from amazon. Log into your Cloudbreak deployer instance using:
ssh -i <amazon key file> cloudbreak@<amazon instance public ip>
Note: If you have permission issues connecting via ssh, make sure you set your key file permissions to 0600
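For example, to set the permissions on the key file before connecting:
$ chmod 600 <amazon key file>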
2. All commands should be run from the Cloudbreak Deployer root directory:
cd /var/lib/cloudbreak-deployer
3. Before you can start Cloudbreak, you need to initialize the environment by running cbd init. You should see something similar to:
$ cbd init
===> Deployer doctor: Checks your environment, and reports a diagnose.
uname: Linux ip-172-31-15-160.ec2.internal 3.10.0-327.10.1.el7.x86_64 #1 SMP Sat Jan 23 04:54:55 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
local version:1.3.0
latest release:1.3.0
docker images:
docker command exists: OK
docker client version: 1.9.1
docker client version: 1.9.1
ping 8.8.8.8 on host: OK
ping github.com on host: OK
ping 8.8.8.8 in container: OK
ping github.com in container: OK
Note: If you previously ran cbd start, then you should run cbd kill before upgrading Cloudbreak.
4. If the initialization completed successfully, now you can update Cloudbreak to version 1.4 using cbd update master && cbd regenerate && cbd pull-parallel. You should see something similar to:
$ cbd update master && cbd regenerate && cbd pull-parallel
Update /usr/bin/cbd from: https://x.example.com/0//tmp/circle-artifacts.VMI9cYT/cbd-linux.tgz
mv: try to overwrite '/usr/bin/cbd', overriding mode 0755 (rwxr-xr-x)? y
* removing old docker-compose binary
* Dependency required, installing docker-compose 1.7.1 ...
Generating Cloudbreak client certificate and private key in /var/lib/cloudbreak-deployment/certs.
generating docker-compose.yml
/tmp/bashenv.793850575: line 674: /var/lib/cloudbreak-deployment/.deps/tmp/uaa-delme.yml: No such file or directory
diff: /var/lib/cloudbreak-deployment/.deps/tmp/uaa-delme.yml: No such file or directory
renaming: uaa.yml to: uaa-20160817-135202.yml
generating uaa.yml
latest: Pulling from catatnight/postfix
d64336e52f9a: Pulling fs layer
be760b6bdfc8: Pulling fs layer
2bed4b6dfef0: Pulling fs layer
d64336e52f9a: Downloading 3.784 MB/67.5 MB
be760b6bdfc8: Download complete
2bed4b6dfef0: Download complete
d64336e52f9a: Downloading 11.32 MB/67.5 MB
9e3e55be1c1f: Download complete
089c741cddd5: Download complete
d64336e52f9a: Downloading 12.4 MB/67.5 MB
0d03ba7124d6: Download complete
7ab5db5b9418: Downloading 5.93 MB/20.09 MB
d0f7cd1223d0: Downloading 1.506 MB/14.72 MB
a7aff224adfb: Download complete
7ab5db5b9418: Downloading 14.66 MB/20.09 MB
d0f7cd1223d0: Downloading 10.67 MB/14.72 MB
Digest: sha256:028b5f6f49d87a10e0c03208156ffedcef6da2e7f59efa8886640ba15cbe0e69
7ab5db5b9418: Downloading 15.52 MB/20.09 MB
d0f7cd1223d0: Downloading 11.55 MB/14.72 MB
Digest: sha256:87cf35f319f40f657a68e21e924dd5ba182d8253005c86a116992f2e17570765
Status: Image is up to date for gliderlabs/registrator:v5
v1.0.0: Pulling from library/traefik
cb056548b9bb: Pulling fs layer
1.0.0: Pulling from sequenceiq/socat
1.2.0: Pulling from sequenceiq/cbdb
0ec8b08ed2db: Pulling fs layer
e052a65b2e55: Pulling fs layer
462cae4bc514: Pulling fs layer
f0adfc577336: Pulling fs layer
1.4.0: Pulling from hortonworks/cloudbreak-web
88d2a45c4dd8: Pulling fs layer
d66b5ca0a6a8: Pulling fs layer
d30d76b6140e: Pulling fs layer
a339cf4fec1c: Pulling fs layer
1b32f6c5164a: Pulling fs layer
dd079b923689: Pulling fs layer
ef55927d23dc: Pulling fs layer
09097ea804bd: Pulling fs layer
ff9a6f48abb7: Pulling fs layer
Digest: sha256:8e2ec7a47b17ff50583e05224ca1243ed188aff8087bb546e406effb82b691fe
Status: Image is up to date for sequenceiq/socat:1.0.0
1.4.0: Pulling from hortonworks/cloudbreak-auth
88d2a45c4dd8: Pulling fs layer
17611781a601: Pulling fs layer
9777e7c06cfb: Pulling fs layer
d64336e52f9a: Downloading 15.65 MB/67.5 MB
033171f95048: Pulling fs layer
7ab5db5b9418: Downloading 16.37 MB/20.09 MB
d0f7cd1223d0: Downloading 11.99 MB/14.72 MB
3311ad4fbcb0: Pulling fs layer
1e870254f4fa: Pulling fs layer
13d69d98d1f7: Pulling fs layer
d64336e52f9a: Downloading 16.73 MB/67.5 MB
7ab5db5b9418: Download complete
d0f7cd1223d0: Download complete
d64336e52f9a: Downloading 17.27 MB/67.5 MB
ad3025da7360: Pulling fs layer
v2.7.1: Pulling from sequenceiq/uaadb
1.4.0: Pulling from sequenceiq/periscope
d34921bc2709: Pulling fs layer
7062b3d97728: Pulling fs layer
767584930cea: Pulling fs layer
c05d09cea848: Pulling fs layer
597fe94dd548: Pulling fs layer
Digest: sha256:270e87a90add32c69d8cb848c7455256f4c0a73e14a9ba2c9335b11853f688a6
Status: Image is up to date for sequenceiq/uaadb:v2.7.1
2.7.1: Pulling from sequenceiq/uaa
1.1: Pulling from sequenceiq/haveged
1.4.0: Pulling from sequenceiq/cloudbreak
d34921bc2709: Pulling fs layer
7062b3d97728: Pulling fs layer
767584930cea: Pulling fs layer
c05d09cea848: Pulling fs layer
d64336e52f9a: Downloading 17.81 MB/67.5 MB
c277e7f5e8b7: Pulling fs layer
a30f653c4d56: Pulling fs layer
d64336e52f9a: Downloading 18.35 MB/67.5 MB
7f0c2637ebf6: Pulling fs layer
9e1df59c970a: Pulling fs layer
d64336e52f9a: Downloading 19.97 MB/67.5 MB
6f980013dd43: Pulling fs layer
5f32c66af8ea: Pulling fs layer
eacee569f539: Pulling fs layer
a4c72beb2675: Pulling fs layer
1.2.0: Pulling from sequenceiq/pcdb
Digest: sha256:a64d40d0d51b001d2e0cb8490fcf04da59e0c8ede5121038a175d9bf2374cb6a
Status: Image is up to date for sequenceiq/haveged:1.1
0123c5510cfa: Pulling fs layer
Digest: sha256:361163496cde9183235355b6d043908c96af57a56db4b7d7b2cf40e255026716
Status: Image is up to date for sequenceiq/uaa:2.7.1
c277e7f5e8b7: Pulling fs layer
447edeb914d3: Pulling fs layer
e75814ea06f9: Pulling fs layer
6b4d47a92a9b: Pulling fs layer
Extracting 11.01 MB/18.53 MB
35cab74c8aa7: Downloading 8.466 MB/10.42 MB
c05d09cea848: 71 MB/42.5 MB
c05d09cea848: Downloading 42.18 MB/42.5 MB
Downloading 16.71 MB/18.53 MBownloading 8.623 MB/108.1 MB
d34921bc2709: Pull complete
7062b3d97728: Pull complete
767584930cea: Pull complete
767584930cea: Pull complete
c05d09cea848: Download complete
c05d09cea848: Pull complete
597fe94dd548: Pull complete
c277e7f5e8b7: Pull complete
c277e7f5e8b7: Pull complete
Downloading 50.25 MB/108.1 MB07 MB/130.1 MB
6b4d47a92a9b: Downloading 5.14 MB/10.42 MB
Downloading 2.741 MB/10.42 MBownloading 7.001 MB/13.19 MB
a30f653c4d56: Pull complete
Extracting 117.5 MB/130.1 MB
5cb6cc1fb08d: Pull complete
eacee569f539: Extracting 284.4 kB/284.4 kB
d66b5ca0a6a8: Pull complete
Extracting 24.71 MB/42.5 MB
7f0c2637ebf6: Pull complete
9e1df59c970a: Pull complete
42396e8dcbcd: Pull complete
6f980013dd43: Extracting 32 B/32 B
6f980013dd43: Extracting 32 B/32 B
6f980013dd43: Pull complete
5f32c66af8ea: Pull complete
eacee569f539: Pull complete
a4c72beb2675: Pull complete
447edeb914d3: Pull complete
e75814ea06f9: Pull complete
6b4d47a92a9b: Pull complete
4b4f74f41ebf: Pull complete
be602741d584: Pull complete
07b4015931e0: Pull complete
5cab237e98a9: Pull complete
3a12055ee388: Extracting 32 B/32 B
73fb6d32d5e3: Pull complete
3a12055ee388: Pull complete
18afddb9bf55: Pull complete
Digest: sha256:8085718c474c40ce4dcc5f64b9ccf23a3f91b3cb2f7fe2e8572fc549a25e6953
Status: Downloaded newer image for sequenceiq/cloudbreak:1.4.0
5. Once the upgrade process is complete, start Cloudbreak using cbd start. You should see something similar to:
$ cbd start
generating docker-compose.yml
generating uaa.yml
Creating cbreak_haveged_1...
Creating cbreak_uluwatu_1...
Creating cbreak_cbdb_1...
Creating cbreak_consul_1...
Creating cbreak_cloudbreak_1...
Creating cbreak_registrator_1...
Creating cbreak_pcdb_1...
Creating cbreak_periscope_1...
Creating cbreak_sultans_1...
Creating cbreak_uaadb_1...
Creating cbreak_logsink_1...
Creating cbreak_logspout_1...
Creating cbreak_identity_1...
Uluwatu (Cloudbreak UI) url:
http://54.164.138.139:3000
login email:
admin@example.com
password:
cloudbreak
6. Login to the Cloudbreak UI.
Note: As I mentioned in the prerequisites, Cloudbreak appears to proxy requests from port 443 now. The url to access the Cloudbreak UI will be https://<amazon cloudbreak instance ip>. As of Cloudbreak 1.6, the proper URL is displayed for the UI.
7. Create a platform definition. This is done by expanding the manage platforms area of the Cloudbreak UI. We are using AWS, so create a platform by selecting AWS. The UI will look similar to this:
You can provide any Name and Description you like.
8. Create a credential. This is done by expanding the manage credentials area of the Cloudbreak UI. We are using AWS, so select AWS. For ease of configuration, change the AWS Credential Type to Key Based. The Select Platform option should be set to the platform you created in the previous step. The UI will look similar to this:
You can provide any Name and Description you like. The Access Key and Secret Access Key are from your Amazon account. See this documentation to setup an access key: AWS Credentials. The SSH Public Key is found in the /home/cloudbreak/.ssh/id_rsa.pub file that was created when you followed the steps for creating the Cloudbreak instance. You can see the key by using cat on the file like this:
$ cat /home/cloudbreak/.ssh/id_rsa.pub
Note: Remember to download your credentials from Amazon. If you forget this step, there is no way to determine your Secret Access Key. You will have to delete those credentials and create new ones.
9. Once the credential is created, you need to select it.
10. Now we will create our own blueprint by copying one of the existing ones. We are doing this to deploy HDP 2.5. The default blueprints will currently deploy HDP 2.4.
Note: As of Cloudbreak 1.6, the default version is HDP 2.5 so this step is not necessary.
Expand the manage blueprints section of the UI. The UI will look similar to this:
Now select the hdp-small-default blueprint. We will copy this for our HDP 2.5 blueprint. The UI will look similar to this:
Click the copy & edit button to create a copy of the blueprint. The UI will look similar to this:
You can provide any Name and Description that you like. In the JSON Text field, scroll down to the bottom. Change the "stack.version": "2.4" to "stack.version": "2.5". Click the green create blueprint button.
11. Now you can create your cluster. Click the green Create cluster button.
Provide a Cluster Name and select the appropriate AWS Region. Click the Setup Network and Security button.
You don't need to change anything here. Click the Choose Blueprint button.
Select the Blueprint we created in the previous steps. You will notice there is an Ambari Server check box on each of the servers listed. You need to determine where you want to deploy Ambari. Select the checkbox for that server. The UI will look similar to this:
Click the Review and Launch button. This will provide a final confirmation screen with a summary of the cluster.
If everything looks good, click the green create and start cluster button.
12. Cloudbreak will now start creating the cluster. The UI will look similar to this:
13. If you click on the test1 cluster name, you can see more information on the cluster. The UI will look similar to this:
14. Once the cluster build is complete, the UI should look similar to this:
You can see more information during the cluster build process by expanding the Event History section. The UI will look similar to this:
15. Once the cluster build is complete, you can log into Ambari using the Ambari Server Address link provided.
16. Once you are logged in to Ambari, select the Stacks and Versions view.
17. You can see by the components listed there are new HDP 2.5 components like Log Search and Spark2. You should see something similar to:
18. And finally, you can see the HDP version by clicking the Versions tab. You should see something similar to:
Review
We successfully upgraded Cloudbreak 1.3 to Cloudbreak 1.4 on Amazon AWS. Using Cloudbreak 1.4, we were able to clone a blueprint, change the stack version to 2.5 and deploy an HDP 2.5 TP cluster.