Created on 07-06-2017 08:34 AM
Apache Falcon is a framework that simplifies data pipeline processing and management on Hadoop clusters.
It makes it much simpler to onboard new workflows/pipelines, with support for late data handling and retry policies. It allows you to easily define relationships between various data and processing elements and to integrate with a metastore/catalog such as Hive/HCatalog. Finally, it lets you capture lineage information for feeds and processes.
Once you have downloaded the Hortonworks Sandbox and started the VM, navigate to the Ambari interface on port 8080 of your Sandbox VM's IP address. Log in with the username admin and the password that you set for the Ambari admin user. You should see a screen similar to the one below:
In this tutorial, we are going to create a Falcon cluster so that we can configure data pipelines and then perform feed management services such as feed retention, data replication across clusters, and archival. This tutorial is the starting point of all Falcon tutorials: we create two cluster entities that define where the data and the processes for your data pipeline are stored. Allow yourself about an hour to complete this tutorial.
By default, Falcon is not started on the Sandbox. You can start the Falcon service from Ambari by clicking the Falcon icon in the left-hand pane:
Then click the Service Actions button at the top right and click Start:
Once Falcon starts, Ambari should clearly indicate as below that the service has started:
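If you prefer a terminal, you can also ask Ambari's REST API for the service state. This is only a sketch: the cluster name Sandbox and the admin password are assumptions you should adjust for your setup.
<code># Query Falcon's state via the Ambari REST API (cluster name and password are assumptions)
curl -u admin:<admin-password> \
  "http://127.0.0.1:8080/api/v1/clusters/Sandbox/services/FALCON?fields=ServiceInfo/state"
The response should report a state of STARTED once the service is up.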
Click the admin drop-down menu at the top of the Ambari screen and click Manage Ambari.
Click the blue Users button in the box at the bottom, as shown below:
Click the Create Local User button at the top of the page.
Enter falcon as the user name, set its password, enter the password again for confirmation, and Save the user.
You can see the newly added falcon user. Click on it to assign it a group so that it can access Ambari views.
Write "views"
and select it in Local Group Membership
box and then click on tick
mark to add a falcon user in the "views"
group.
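As an aside, the same local user could be created through Ambari's REST API instead of the UI. This is only a sketch with placeholder passwords, and the group assignment above would still need to be done separately:
<code># Create the falcon local user via Ambari's REST API (passwords are placeholders)
curl -u admin:<admin-password> -H "X-Requested-By: ambari" -X POST \
  -d '{"Users/user_name":"falcon","Users/password":"<falcon-password>","Users/active":true}' \
  http://127.0.0.1:8080/api/v1/users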
Now log out of Ambari as the admin user and log back in as the falcon user.
Select the Files View, as shown below:
The Files View interface will appear with the following default folders.
We need to create the directories on HDFS representing the two clusters that we are going to define, namely primaryCluster and backupCluster.
Navigate to the /apps/falcon folder and click the New Folder button. In the Add New Folder window that appears, name the folder primaryCluster and press Enter or click Add.
Similarly, create another folder called backupCluster. You will see your new directories created successfully:
Click on the row of the primaryCluster directory and then click the Permissions button:
Add Write permission for both Group and Others and then click Save.
Do the same for the backupCluster directory. Now navigate down into the primaryCluster directory and create two new directories: staging and working. Click on the row for the staging directory and add Write permission for both Group and Others. Refresh the page and then navigate to /apps/falcon/primaryCluster to see the changes:
Repeat the same steps for backupCluster: create the two directories staging and working, and then assign Write permission on the staging directory for Group and Others.
Alternatively, you can do all of this directory setup from the command line. First SSH into the Hortonworks Sandbox with the command:
<code>ssh root@127.0.0.1 -p 2222
The default password is hadoop. If you have changed it earlier, enter the new one.
First, check from the command line whether the Falcon server is running. Switch to the falcon user:
<code>su - falcon
Change to the Falcon server directory for your HDP version:
<code>cd /usr/hdp/current/falcon-server
Then run the script below to check the status of the Falcon server:
<code>./bin/falcon-status
Next, use hdfs dfs -mkdir commands to create the directories /apps/falcon/primaryCluster and /apps/falcon/backupCluster on HDFS.
<code> hdfs dfs -mkdir /apps/falcon/primaryCluster
hdfs dfs -mkdir /apps/falcon/backupCluster
Then create a directory called staging inside each of the directories we created above:
<code> hdfs dfs -mkdir /apps/falcon/primaryCluster/staging
hdfs dfs -mkdir /apps/falcon/backupCluster/staging
Next, create the working directories for primaryCluster and backupCluster:
<code> hdfs dfs -mkdir /apps/falcon/primaryCluster/working
hdfs dfs -mkdir /apps/falcon/backupCluster/working
Finally, you need to set the proper permissions on the staging and working directories; Falcon expects staging to be world-writable (777) while working stays at 755:
<code>hdfs dfs -chmod 777 /apps/falcon/primaryCluster/staging
hdfs dfs -chmod 755 /apps/falcon/primaryCluster/working
hdfs dfs -chmod 777 /apps/falcon/backupCluster/staging
hdfs dfs -chmod 755 /apps/falcon/backupCluster/working
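To confirm the layout, you can list the tree recursively; you should see staging with permissions drwxrwxrwx and working with drwxr-xr-x under each cluster directory:
<code># Recursively list the Falcon directories to verify structure and permissions
hdfs dfs -ls -R /apps/falcon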
Let’s open the Falcon Web UI. You can navigate to it directly in your browser at 127.0.0.1:15000 (the Falcon UI listens on port 15000 by default). The default username is ambari-qa.
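If the page does not come up, a quick sanity check is to hit the Falcon server's admin REST endpoint from the Sandbox; a minimal sketch:
<code># Ask the Falcon server for its version over REST (default port 15000)
curl http://127.0.0.1:15000/api/admin/version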
You can also launch the Falcon Web UI from Ambari: navigate to the Falcon Summary page and click Quick Links > Falcon Web UI.
This UI allows us to create and manage the various entities: Cluster, Feed, Process, and Mirror. Each of these entities is represented by an XML file that you either upload directly or generate by filling out the various fields. You can also search for existing entities and then edit them, change their state, and so on.
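Everything we do in the UI below can also be done with the falcon command-line client. For example, a cluster entity saved as an XML file could be submitted as sketched here, assuming you run it as a suitable user such as ambari-qa and that /tmp/primaryCluster.xml is a hypothetical path where you saved the definition:
<code># Submit a cluster entity definition from an XML file via the Falcon CLI
falcon entity -type cluster -file /tmp/primaryCluster.xml -submit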
Let’s first create a couple of cluster entities. To create a cluster entity, click on the Create dropdown at the top and click Cluster.
NOTE: If you want to create the entity from XML, skip the wizard section and move on to the next one.
A cluster entity defines the default access points for various resources on the cluster as well as default working directories to be used by Falcon jobs.
To define a cluster entity, we must specify a unique name by which we can identify the cluster. In this tutorial, we use:
<code>primaryCluster
Next, enter a data center name or location of the cluster and a description for the cluster. The data center name can be used by Falcon to improve the performance of jobs that run locally or across data centers. Enter primaryColo in the Colo field and this is primary cluster in the description.
All entities defined in Falcon can be grouped and located using tags. To clearly identify and locate entities, we assign the tag:
<code>EntityType
With the value
<code>Cluster
Next, we enter the URI for the various resources Falcon requires to manage data on the clusters. These include the NameNode dfs.http.address, the NameNode IPC address used for file system metadata operations, the Yarn client IPC address used for executing jobs on Yarn, the Oozie address used for running Falcon Feeds and Processes, and the Falcon messaging address. The values we will use are the defaults for the Hortonworks Sandbox; if you run this tutorial on your own test cluster, modify the addresses to match those defined in Ambari:
<code>Namenode DFS Address - hftp://sandbox.hortonworks.com:50070
File System Default Address - hdfs://sandbox.hortonworks.com:8020
YARN Resource Manager Address - sandbox.hortonworks.com:8050
Workflow Address - http://sandbox.hortonworks.com:11000/oozie/
Message Broker Address - tcp://sandbox.hortonworks.com:61616?daemon=true
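Before saving, it is worth confirming that these endpoints actually answer. For instance, the File System address can be probed through WebHDFS on the NameNode's HTTP port; a sketch, assuming the Sandbox hostname resolves:
<code># Probe the NameNode over WebHDFS to confirm the endpoint is reachable
curl "http://sandbox.hortonworks.com:50070/webhdfs/v1/apps/falcon?op=LISTSTATUS"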
You can also override cluster properties for a specific cluster. This can be useful for test or backup clusters which may have different physical configurations. In this tutorial, we’ll just use the properties defined in Ambari. After the resources are defined, you must define default staging, temporary, and working directories for use by Falcon jobs based on the HDFS directories you created earlier in the tutorial. These can be overridden by specific jobs, but will be used in the event no directories are defined at the job level. In the current version of the UI, these directories must exist, be owned by Falcon, and have the proper permissions.
<code>Staging* - /apps/falcon/primaryCluster/staging
Temp* - /tmp
Working* - /apps/falcon/primaryCluster/working
We then need to specify the owner and permissions for the cluster. Click on the Advanced Options drop-down menu and enter:
<code>Owner: ambari-qa
Group: users
Permissions: 755
Owner - check Read, Write and Execute
Group - check Read and Execute
Others - check Read and Execute
(These checkboxes correspond to the 755 entered above: rwx for the owner and r-x for Group and Others.)
If you want to view the XML for the values you are entering, you can click on XML Preview. Click Next to view the summary.
Click Save to persist the entity.
After clicking on the Create drop-down menu, select Cluster and click the Edit XML button above the XML Preview area. Replace the XML content with the XML document below:
<code><?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="primaryCluster" description="this is primary cluster" colo="primaryColo" xmlns="uri:falcon:cluster:0.1">
    <tags>primaryKey=primaryValue</tags>
    <interfaces>
        <interface type="readonly" endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
        <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
        <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
        <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
        <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/apps/falcon/primaryCluster/working"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0x755"/>
    <properties>
        <property name="test" value="value1"/>
    </properties>
</cluster>
Click Finish at the top of the XML Preview area to save the XML.
The Falcon UI should have automatically parsed the values from the XML and populated the corresponding fields. Once you have verified that these are the correct values, press Next.
Click Save to persist the entity.
You should receive a notification that the operation was successful.
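You can also double-check that the entity was persisted by asking Falcon for its definition from the command line; a minimal sketch:
<code># Fetch the stored definition of the cluster entity we just saved
falcon entity -type cluster -name primaryCluster -definition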
Falcon jobs require a source cluster and a destination, or target, cluster. For some jobs this may be the same cluster; for others, such as Mirroring and Disaster Recovery, the source and target clusters will be different.
NOTE: If you want to create the entity from XML, skip the wizard section and move on to the next one.
Let’s go ahead and create a second cluster entity with the name:
<code>backupCluster
Enter backupColo in the Colo field and this is backup cluster in the description.
Re-enter the same information you used above, except for the directory information. For the directories, use the backupCluster directories created earlier in the tutorial.
<code>Staging* - /apps/falcon/backupCluster/staging
Temp* - /tmp
Working* - /apps/falcon/backupCluster/working
Click Save to persist the backupCluster entity.
Click on the Create drop-down menu and click Cluster to open the form for creating a cluster entity, then click the Edit XML button above the XML Preview area. Replace the XML content with the XML document below:
<code><?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="backupCluster" description="this is backup colo" colo="backupColo" xmlns="uri:falcon:cluster:0.1">
    <tags>backupKey=backupValue</tags>
    <interfaces>
        <interface type="readonly" endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
        <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
        <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
        <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
        <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/backupCluster/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/apps/falcon/backupCluster/working"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0x755"/>
    <properties>
        <property name="test2" value="value2"/>
    </properties>
</cluster>
Click Finish at the top of the XML Preview area to save the XML. The Falcon UI should again parse the values into the form fields; once you have verified that these are correct, press Next.
Click Save to persist the backupCluster entity.
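As a final sanity check, you can list the cluster entities Falcon now knows about from the command line; both primaryCluster and backupCluster should appear:
<code># List all cluster entities registered with Falcon
falcon entity -type cluster -list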
In this tutorial, we learned how to create cluster entities in Apache Falcon using the Falcon UI. Now go ahead and start creating feeds and processes by exploring more Falcon tutorials.
You can go to the following links to explore other Falcon tutorials:
Created on 03-15-2018 06:48 PM
I am getting the following error while creating a cluster. Note that YARN is showing perfectly fine in Ambari. I would appreciate it if somebody can help.
2018-03-15 18:43:17,171 INFO - [1241480588@qtp-1990385139-0 - d41607f2-0715-4016-8a10-952937969cd0:amber-qa:POST//entities/submit/cluster] ~ Retrying connect to server: sandbox-hdp.hortonworks.com/172.17.0.2:8050. Already tried 20 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) (Client:906)