Created on 07-06-2017 09:03 AM
Apache Falcon simplifies the configuration of data motion with: replication; lifecycle management; lineage and traceability. This provides data governance consistency across Hadoop components.
In this tutorial we will walk through a scenario where email data gets processed on multiple HDP 2.2 clusters around the country then gets backed up hourly on a cloud hosted cluster . In our example:
The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis.
To simulate this scenario, we have a pig script grabbing the freely available Enron emails from the internet and feeding it into the pipeline.
The easiest way to meet the above prerequisites is to download the HDP Sandbox
After downloading the environment, confirm that Apache Falcon is running. Below are the steps to validate that:
http://127.0.0.1:8000/about/and enable Ambari.
http://127.0.0.1:8080, login with username and password of
adminrespectively. Then check if Falcon is running.
We have already created the necessary xml files. In this step we are going to download the specifications and use them to define the topology and submit the storm job.
In this step we will stage the pig script and the necessary folder structure for inbound and outbound feeds on the HDFS:
Navigate using your browser to the Hue – File Browser interface at http://127.0.0.1:8000/filebrowser/ to explore the HDFS.
/user/ambari-qa folder like below:
Now we will upload the zip file we just downloaded:
This should also unzip the zip file and create a folder structure with a folder called
Login to the Windows Azure portal at http://manage.windowsazure.com
Create a storage account
Wait for the storage account to be provisioned
Copy the access key and the account name in a text document. We will use the access key and the account name in later steps
The other information you will want to note down is the blob endpoint of the storage account we just created
Click on the
Containers tab and create a new container called
Login to Ambari – http://127.0.0.1:8080 with the credentials
Then click on HDFS from the bar on the left and then select the
Scroll down to the bottom of the page to the
Custom hdfs-site section and click on
Add Property dialog, the key name will start with
fs.azure.account.key. followed by your blob endpoint that you noted down in a previous step. The value will be the Azure storage key that you noted down in a previous step. Once you have filled in the values click the
Once you are back out of the new key dialog you will have to
Save it by clicking on the green
Then restart all the service by clicking on the orange
Wait for all the restart to complete
Now let’s test if we can access our container on the Azure Blob Store.
SSH in to the VM:
ssh firstname.lastname@example.org -p 2222;
The password is
hdfs dfs -ls -R wasb://email@example.com/
Issue the command from our cluster on the SSH’d terminal
From the SSH session, first we will change our user to
Go to the users home directory:
Download the topology, feed and process definitions:
Unzip the file:
Change Directory to the folder created:
Now let’s modify the
cleansedEmailFeed.xml to point the backup cluster to our Azure Blob Store container.
vi to edit the file:
Modify the value of
location element of the
to look like this:
Then save it and quit vi.
Cluster specification is one per cluster.
See below for a sample cluster specification file.
Back to our scenario, lets submit the ‘oregon cluster’ entity to Falcon. This signifies the primary Hadoop cluster located in the Oregon data center.
falcon entity -type cluster -submit -file oregonCluster.xml
Then lets submit the ‘virginia cluster’ entity to Falcon. This signifies the backup Hadoop cluster located in the Virginia data center
falcon entity -type cluster -submit -file virginiaCluster.xml
If you view the XML file you will see how the cluster location and purpose has been captured in the XML file.
A feed (a.k.a dataset) signifies a location of data and its associated replication policy and late arrival cut-off time.
See below for a sample feed (a.k.a dataset) specification file.
Back to our scenario, let’s submit the source of the raw email feed. This feed signifies the raw emails that are being downloaded into the Hadoop cluster. These emails will be used by the email cleansing process.
falcon entity -type feed -submit -file rawEmailFeed.xml
Now let’s define the feed entity which will handle the end of the pipeline to store the cleansed email. This feed signifies the emails produced by the cleanse email process. It also takes care of replicating the cleansed email dataset to the backup cluster (virginia cluster)
falcon entity -type feed -submit -file cleansedEmailFeed.xml
A process defines configuration for a workflow. A workflow is a directed acyclic graph(DAG) which defines the job for the workflow engine. A process definition defines the configurations required to run the workflow job. For example, process defines the frequency at which the workflow should run, the clusters on which the workflow should run, the inputs and outputs for the workflow, how the workflow failures should be handled, how the late inputs should be handled and so on.
Here is an example of what a process specification looks like:
Back to our scenario, let’s submit the ingest and the cleanse process respectively:
The ingest process is responsible for calling the Oozie workflow that downloads the raw emails from the web into the primary Hadoop cluster under the location specified in the rawEmailFeed.xml It also takes care of handling late data arrivals
falcon entity -type process -submit -file emailIngestProcess.xml
The cleanse process is responsible for calling the pig script that cleans the raw emails and produces the clean emails that are then replicated to the backup Hadoop cluster
falcon entity -type process -submit -file cleanseEmailProcess.xml
So, all that is left now is to schedule the feeds and processes to get it going.
falcon entity -type feed -schedule -name rawEmailFeed
falcon entity -type process -schedule -name rawEmailIngestProcess
falcon entity -type feed -schedule -name cleansedEmailFeed
falcon entity -type process -schedule -name cleanseEmailProcess
In a few seconds you should notice that that Falcon has started ingesting files from the internet and dumping them to new folders like below on HDFS:
In a couple of minutes you should notice a new folder called processed under which the files processed through the data pipeline are being emitted:
We just created an end-to-end data pipeline to process data. The power of the Apache Falcon framework is its flexibility to work with pretty much any open source or proprietary data processing products out there.