Member since: 10-19-2016
Posts: 151
Kudos Received: 59
Solutions: 17
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1747 | 03-22-2018 11:48 AM |
| | 2829 | 01-12-2018 06:25 PM |
| | 5081 | 01-12-2018 03:56 AM |
| | 7070 | 01-12-2018 03:38 AM |
| | 3630 | 01-02-2018 10:29 PM |
11-27-2017
06:48 PM
2 Kudos
@Weiren Ding Sorry for the confusion. The immediate problem here is that you're using the latest "start_sandbox-hdp.sh" script, which was meant for the latest version of the HDP sandbox (HDP 2.6.3). Since you're using the HDP 2.6.1 sandbox, they're not necessarily compatible. Since this is the first time you're deploying the sandbox, I'd suggest getting the latest sandbox (https://hortonworks.com/downloads/#sandbox). Otherwise, if you want to stick with HDP 2.6.1, the startup script you want is here (but I recommend updating :p): https://raw.githubusercontent.com/hortonworks/data-tutorials/21503170e85fe330beffdd95f984a323a18a5a42/tutorials/hdp/sandbox-deployment-and-install-guide/assets/start_sandbox-hdp.sh
11-27-2017
04:49 PM
1 Kudo
@Weiren Ding Could you open up start_sandbox-hdp.sh and look for a line (around line 101) that starts with "echo Waiting for ambari agent to ..."? The very next line probably has something like "... --user admin:4o12t0n ..." Change the value after the colon to whatever password you set when you reset the ambari-server credentials. Let me know how it goes!
11-02-2017
03:31 PM
2 Kudos
Hey @Divya Sodha, good concern. We're working on a slight redesign of this tutorial to ameliorate at least part of the issue of needing so much RAM, though some of the services are simply so beefy that there is a lower bound on the required memory. One way to get by with less memory for now is to stop any unused services via Ambari - that'll provide the largest savings. Pinging the designer of this particular tutorial, @jmedel, to see about including this as a step in the tutorial for the next update. Regarding your other question, about whether using SSD/flash drives would help - unfortunately, not really. The bottleneck here is RAM, and while increased disk space would help with swap, performance gains are very likely minimal considering average available disk space. Another alternative would be to leverage a single, large platform rather than two sandboxes (HDP + HDF). I'll bring this up to the tutorial designer and get a discussion going.
10-27-2017
10:52 PM
Heya, I just wanted to point out that while the names of the repositories are the same, the one that OP mentioned in the original post is further along than the one currently in the hortonworks repository (as of this posting). People following along with the tutorial should refer to the repository specified in the tutorial. Unfortunately, it went missing momentarily and so the link broke 😞 Should be back up, though 🙂
10-27-2017
10:47 PM
2 Kudos
Hey @ashok koti, as you pointed out there was an issue with the reference github repo, but it should be resolved. Following the steps in the tutorial should work once again. Let me know if you run into any other issues with the tutorial. 🙂
07-06-2017
09:03 AM
1 Kudo
Incremental Backup of Data from HDP to Azure using Falcon for Disaster Recovery and Burst capacity
Introduction
Apache Falcon simplifies the configuration of data motion with: replication; lifecycle management; lineage and traceability. This provides data governance consistency across Hadoop components.
Scenario
In this tutorial we will walk through a scenario where email data gets processed on multiple HDP 2.2 clusters around the country and then gets backed up hourly on a cloud-hosted cluster. In our example:
This cluster is hosted on Windows Azure.
Data arrives from all the West Coast production servers. The input data feeds are often late for up to 4 hrs.
The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis.
To simulate this scenario, we have a pig script grabbing the freely available Enron emails from the internet and feeding it into the pipeline.
Prerequisite
A cluster with Apache Hadoop 2.2 configured
A cluster with Apache Falcon configured
The easiest way to meet the above prerequisites is to download the HDP Sandbox
After downloading the environment, confirm that Apache Falcon is running. Below are the steps to validate that:
If Ambari is not configured on your Sandbox, go to http://127.0.0.1:8000/about/ and enable Ambari.
Once Ambari is enabled, navigate to Ambari at http://127.0.0.1:8080 and log in with the username admin and the password admin. Then check if Falcon is running.
If Falcon is not running, start Falcon:
Steps for the Scenario
Create cluster specification XML file
Create feed (aka dataset) specification XML file
* Reference cluster specification
Create the process specification XML file
* Reference cluster specification – defines where the process runs
* Reference feed specification – defines the datasets that the process manipulates
We have already created the necessary XML files. In this step we are going to download the specifications and use them to define the data pipeline and submit the Falcon entities.
Staging the component of the App on HDFS
In this step we will stage the pig script and the necessary folder structure for inbound and outbound feeds on the HDFS:
First download this zip file called falcon.zip to your local host machine.
Navigate using your browser to the Hue – File Browser interface at http://127.0.0.1:8000/filebrowser/ to explore the HDFS.
Navigate to /user/ambari-qa folder like below:
Now we will upload the zip file we just downloaded:
This should also unzip the zip file and create a folder structure with a folder called falcon.
Setting up the destination storage on Microsoft Azure
Login to the Windows Azure portal at http://manage.windowsazure.com
Create a storage account
Wait for the storage account to be provisioned
Copy the access key and the account name in a text document. We will use the access key and the account name in later steps
The other information you will want to note down is the blob endpoint of the storage account we just created
Click on the Containers tab and create a new container called myfirstcontainer.
Configuring access to Azure Blob store from Hadoop
Log in to Ambari at http://127.0.0.1:8080 with the credentials admin and admin.
Then click on HDFS from the bar on the left and then select the Configs tab.
Scroll down to the bottom of the page to the Custom hdfs-site section and click on Add property...
In the Add Property dialog, the key name will start with fs.azure.account.key. followed by your blob endpoint that you noted down in a previous step. The value will be the Azure storage key that you noted down in a previous step. Once you have filled in the values click the Add button:
Once you are back out of the new key dialog you will have to Save it by clicking on the green Save button:
Then restart all the affected services by clicking on the orange Restart button:
Wait for all the restarts to complete.
Now let’s test if we can access our container on the Azure Blob Store.
SSH in to the VM:
ssh root@127.0.0.1 -p 2222
The password is hadoop.
Issue the following command from our cluster on the SSH'd terminal, replacing saptak with your own storage account name:
hdfs dfs -ls -R wasb://myfirstcontainer@saptak.blob.core.windows.net/
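As a quick sanity check before editing hdfs-site, the sketch below assembles the property name and the wasb:// URI from the container and storage account names used above (myfirstcontainer and saptak). The helper function names are hypothetical and exist only for illustration, and the blob.core.windows.net suffix assumes the public Azure cloud; substitute your own account name.

```shell
# Hypothetical helpers; these names are illustrative and not part of Hadoop.

# Blob endpoint host for a storage account (public Azure cloud assumed).
azure_blob_host() {
    printf '%s.blob.core.windows.net' "$1"
}

# Name of the custom hdfs-site property that holds the storage access key.
azure_key_property() {
    printf 'fs.azure.account.key.%s' "$(azure_blob_host "$1")"
}

# wasb:// URI for a container ($1) in an account ($2), with an optional path ($3).
wasb_uri() {
    printf 'wasb://%s@%s/%s' "$1" "$(azure_blob_host "$2")" "${3:-}"
}

azure_key_property saptak; echo
wasb_uri myfirstcontainer saptak; echo
```

Note that the property name uses only the host part of the blob endpoint, without the http:// prefix or a trailing slash.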
Staging the specifications
From the SSH session, first we will change our user to ambari-qa. Type:
su ambari-qa
Go to the user's home directory:
cd ~
Download the topology, feed and process definitions:
wget http://hortonassets.s3.amazonaws.com/tutorial/falcon/falconDemo.zip
Unzip the file:
unzip ./falconDemo.zip
Change directory to the folder created:
cd falconChurnDemo/
Now let’s modify the cleansedEmailFeed.xml to point the backup cluster to our Azure Blob Store container.
Use vi to edit the file:
Modify the value of location element of the backupCluster
to look like this:
Then save it and quit vi.
Submit the entities to the cluster:
Cluster Specification
There is one cluster specification per cluster.
See below for a sample cluster specification file.
Back to our scenario, let’s submit the ‘oregon cluster’ entity to Falcon. This signifies the primary Hadoop cluster located in the Oregon data center.
falcon entity -type cluster -submit -file oregonCluster.xml
Then let’s submit the ‘virginia cluster’ entity to Falcon. This signifies the backup Hadoop cluster located in the Virginia data center.
falcon entity -type cluster -submit -file virginiaCluster.xml
If you view the XML file you will see how the cluster location and purpose have been captured in it.
Feed Specification
A feed (a.k.a dataset) signifies a location of data and its associated replication policy and late arrival cut-off time.
See below for a sample feed (a.k.a dataset) specification file.
Back to our scenario, let’s submit the source of the raw email feed. This feed signifies the raw emails that are being downloaded into the Hadoop cluster. These emails will be used by the email cleansing process.
falcon entity -type feed -submit -file rawEmailFeed.xml
Now let’s define the feed entity which will handle the end of the pipeline to store the cleansed email. This feed signifies the emails produced by the cleanse email process. It also takes care of replicating the cleansed email dataset to the backup cluster (virginia cluster).
falcon entity -type feed -submit -file cleansedEmailFeed.xml
Process
A process defines configuration for a workflow. A workflow is a directed acyclic graph (DAG) which defines the job for the workflow engine. A process definition defines the configurations required to run the workflow job. For example, a process defines the frequency at which the workflow should run, the clusters on which the workflow should run, the inputs and outputs for the workflow, how workflow failures should be handled, how late inputs should be handled and so on.
Here is an example of what a process specification looks like:
Back to our scenario, let’s submit the ingest and the cleanse process respectively:
The ingest process is responsible for calling the Oozie workflow that downloads the raw emails from the web into the primary Hadoop cluster under the location specified in rawEmailFeed.xml. It also takes care of handling late data arrivals.
falcon entity -type process -submit -file emailIngestProcess.xml
The cleanse process is responsible for calling the Pig script that cleans the raw emails and produces the clean emails that are then replicated to the backup Hadoop cluster.
falcon entity -type process -submit -file cleanseEmailProcess.xml
Schedule the Falcon entities
So, all that is left now is to schedule the feeds and processes to get it going.
Ingest the feed
falcon entity -type feed -schedule -name rawEmailFeed
falcon entity -type process -schedule -name rawEmailIngestProcess
Cleanse the emails
falcon entity -type feed -schedule -name cleansedEmailFeed
falcon entity -type process -schedule -name cleanseEmailProcess
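The submit and schedule steps above can be collected into one dry-run script. This is a minimal sketch, not part of the original demo: the helper names (print_submit, print_schedule, falcon_plan) are made up for illustration, and the script only prints the commands in the order used in this tutorial so you can review them before running anything.

```shell
# Dry run: print the Falcon CLI calls from this tutorial in dependency order.
# Nothing is executed here.

print_submit() {   # $1 = entity type, remaining args = specification files
    etype=$1; shift
    for file in "$@"; do
        echo "falcon entity -type $etype -submit -file $file"
    done
}

print_schedule() { # $1 = entity type, remaining args = entity names
    etype=$1; shift
    for name in "$@"; do
        echo "falcon entity -type $etype -schedule -name $name"
    done
}

falcon_plan() {
    # Clusters first, then feeds (which reference clusters),
    # then processes (which reference clusters and feeds).
    print_submit cluster oregonCluster.xml virginiaCluster.xml
    print_submit feed rawEmailFeed.xml cleansedEmailFeed.xml
    print_submit process emailIngestProcess.xml cleanseEmailProcess.xml
    print_schedule feed rawEmailFeed
    print_schedule process rawEmailIngestProcess
    print_schedule feed cleansedEmailFeed
    print_schedule process cleanseEmailProcess
}

falcon_plan
```

Once the printed commands look right, they can be run one at a time from the falconChurnDemo directory as the ambari-qa user.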
Processing
In a few seconds you should notice that Falcon has started ingesting files from the internet and dumping them to new folders like below on HDFS:
In a couple of minutes you should notice a new folder called processed under which the files processed through the data pipeline are being emitted:
We just created an end-to-end data pipeline to process data. The power of the Apache Falcon framework is its flexibility to work with pretty much any open source or proprietary data processing products out there.
07-06-2017
08:42 AM
Define and Process Data Pipelines in Hadoop With Apache Falcon
Introduction
Apache Falcon is a framework to simplify data pipeline processing and management on Hadoop clusters.
It makes it much simpler to onboard new workflows/pipelines, with support for late data handling and retry policies. It allows you to easily define relationships between various data and processing elements and integrate with a metastore/catalog such as Apache Hive/HCatalog. Finally, it also lets you capture lineage information for feeds and processes. In this tutorial we are going to walk through the process of:
Defining the feeds and processes
Defining and executing a data pipeline to ingest, process and persist data continuously
Prerequisites
Download Hortonworks Sandbox
Complete the Learning the Ropes of the Hortonworks Sandbox tutorial; you will need it for logging into Ambari as an administrator user.
Complete the Creating Falcon Cluster tutorial to start the Falcon service, prepare HDFS directories for the Falcon cluster and create Falcon cluster entities.
Once you have downloaded the Hortonworks sandbox and run the VM, navigate to the Ambari interface on port 8080 of the host IP address of your Sandbox VM. Login with the username of admin and password that you set for the Ambari admin user as part of the Learning the Ropes of the Hortonworks Sandbox tutorial:
Outline
Scenario
Starting Falcon
Download and stage the dataset
Define the rawEmailFeed entity using Wizard
Define the rawEmailFeed entity using XML
Define the rawEmailIngestProcess entity using Wizard
Define the rawEmailIngestProcess entity using XML
Define the cleansedEmailFeed using Wizard
Define the cleansedEmailFeed using XML
Define the cleanseEmailProcess using Wizard
Define the cleanseEmailProcess using XML
Run the feeds
Run the processes
Input and Output of the pipeline
Summary
For this tutorial, we will use the VirtualBox environment. For VMware and Azure Sandbox users, the instructions may be slightly different.
Scenario
In this tutorial, we will walk through a scenario where email data lands hourly on a cluster. In our example:
This cluster is the primary cluster located in the Oregon data center.
Data arrives from all the West Coast production servers. The input data feeds are often late for up to 4 hrs.
The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis.
To simulate this scenario, we have a Pig script grabbing the freely available Enron Corpus emails from the internet and feeding it into the pipeline.
Starting Falcon
By default, Falcon is not started on the sandbox, but you should have started the service while completing the Creating a Falcon Cluster tutorial. Do the following to verify that the Falcon service is started, or to start it if it was disabled.
In the Ambari UI, click on the Falcon icon in the left hand pane.
Then click on the Service Actions button on the top right.
Then, if the service is disabled, click on Start.
Once Falcon starts, Ambari should clearly indicate as below that the service has started:
Download and stage the dataset
Now let’s stage the dataset you will use for this tutorial. Although we perform many of these file operations below using the command line, you can also do the same with the HDFS Files View in Ambari.
Tip: You can copy and paste the commands from this tutorial.
First, enter the shell with your preferred shell client. For this tutorial, we will SSH into Hortonworks Sandbox with the command:
ssh root@127.0.0.1 -p 2222
The default password is hadoop.
Then log in as user hdfs:
su - hdfs
Then download the file falcon.zip with the following command:
wget http://hortonassets.s3.amazonaws.com/tutorial/falcon/falcon.zip
and then unzip with the command:
unzip falcon.zip
Now let’s give ourselves permission to upload files:
hadoop fs -chmod -R 777 /user/ambari-qa
Then let’s create a folder falcon under ambari-qa with the command:
hadoop fs -mkdir /user/ambari-qa/falcon
Now let’s upload the decompressed folder with the command:
hadoop fs -copyFromLocal demo /user/ambari-qa/falcon/
Define the rawEmailFeed entity using Wizard
To create a feed entity click on the Feed button on the top of the main page on the Falcon Web UI.
NOTE : If you want to create it from XML, skip this section, and move on to the next one.
Then enter the definition for the feed by giving the feed a unique name and a description. For this tutorial we will use
rawEmailFeed
and
Raw customer email feed.
Let’s also enter a tag key and value, so we can easily locate this Feed later:
externalSystem=USWestEmailServers
Feeds can be further categorised by identifying them with one or more groups. In this demo, we will group all the Feeds together by defining the group:
churnAnalysisDataPipeline
We then set the ownership information for the Feed:
Owner: ambari-qa
Group: users
Permissions: 755
For the Schema Location and Provider, enter “/none”, then click Next.
On the Properties page, specify that the job runs hourly by setting the frequency to 1 hour, check the Late Arrival checkbox and specify the value as 1 hour. Change the timezone to UTC and click Next.
Enter the path of our data set on the Locations page:
/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
We will set the stats and meta paths to /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR} for now. Click Next.
On the Clusters page select the cluster you created, then enter today’s date and the current time for the validity start time, and enter an hour or two later for the end time. The validity time specifies the period during which the feed will run. For many feeds, the validity start time will be set to the time the feed is scheduled to go into production and the end time will be set far in the future. Because we are running this tutorial on the Sandbox, we want to limit the time the process will run to conserve resources.
DO NOT forget to provide the validity time in the timezone you have selected.
Specify the retention period of 90 hours, then click Next.
Save the feed.
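To see what the location template resolves to for a given feed instance, here is a small sketch that substitutes the ${YEAR}, ${MONTH}, ${DAY} and ${HOUR} variables by hand. expand_feed_path is a hypothetical helper written for this illustration; Falcon performs the substitution itself at runtime.

```shell
# Expand a Falcon location template for one feed instance.
# $1 = path template (pass it in single quotes so the shell leaves ${YEAR} alone)
# $2 = instance time in UTC, formatted as YYYY-MM-DDTHH:MMZ
expand_feed_path() {
    y=$(printf '%s' "$2" | cut -c1-4)
    m=$(printf '%s' "$2" | cut -c6-7)
    d=$(printf '%s' "$2" | cut -c9-10)
    h=$(printf '%s' "$2" | cut -c12-13)
    printf '%s\n' "$1" | sed \
        -e 's/\${YEAR}/'"$y"'/g' \
        -e 's/\${MONTH}/'"$m"'/g' \
        -e 's/\${DAY}/'"$d"'/g' \
        -e 's/\${HOUR}/'"$h"'/g'
}

expand_feed_path \
    '/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}' \
    '2016-06-05T05:00Z'
# prints /user/ambari-qa/falcon/demo/primary/input/enron/2016-06-05-05
```

With an hourly frequency, each instance therefore lands in its own folder named after the hour it belongs to.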
Define the rawEmailFeed entity using XML
Click on the Edit button above the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="rawEmailFeed" description="Raw customer email feed" xmlns="uri:falcon:feed:0.1">
  <tags>externalSystem=USWestEmailServers</tags>
  <groups>churnAnalysisDataPipeline</groups>
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <late-arrival cut-off="hours(1)"/>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
      <retention limit="hours(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="ambari-qa" group="users" permission="0755"/>
  <schema location="/none" provider="/none"/>
</feed>
NOTE: Do not copy the validity start and end times as they are. Change them to match your current time.
Click Finish on the top of the XML Preview area.
The Falcon UI should have automatically parsed the values from the XML and populated the right fields. Once you have verified that these are the correct values, save the feed.
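Since the validity window has to be set to your own current time in UTC, a small helper can print a ready-to-paste validity element. This is a sketch under two assumptions: validity_element is a hypothetical helper name, and the date command is GNU date (as on the CentOS-based Sandbox), because it relies on the -d option.

```shell
# Print a <validity> element that starts now and ends N hours later, in UTC.
# $1 = window length in hours (default 2)
# $2 = optional start time as epoch seconds (defaults to the current time)
validity_element() {
    hours=${1:-2}
    now=${2:-$(date -u +%s)}
    start=$(date -u -d "@$now" +%Y-%m-%dT%H:%MZ)
    end=$(date -u -d "@$((now + hours * 3600))" +%Y-%m-%dT%H:%MZ)
    printf '<validity start="%s" end="%s"/>\n' "$start" "$end"
}

# Window of two hours starting at the current time.
validity_element 2
```

Paste the printed line over the validity element in the XML before clicking Finish.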
Define the rawEmailIngestProcess entity using Wizard
Now let's define the rawEmailIngestProcess.
To create a process entity click on the Process button on the top of the main page on the Falcon Web UI.
NOTE : If you want to create it from XML, skip this section, and move on to the next one.
This job will run on the primaryCluster. Use the information below to create the process:
Process name: rawEmailIngestProcess
Tag: email, with the value testemail
And assign the workflow the name:
emailIngestWorkflow
Select Oozie as the execution engine and provide the following path:
/user/ambari-qa/falcon/demo/apps/ingest/fs
Accept the default values for the Access Control List and click Next.
On the Properties page, select the time zone, and set the frequency to 1 hour. Set the number of parallel processes to 1. This prevents a new instance from starting prior to the previous one completing. Specify the order as first-in, first-out (FIFO).
Select the exp-backoff retry policy, then set the attempts to 3 and the delay to 3 minutes. Click Next.
On the Clusters page, select primaryCluster again. As you did previously, modify the validity to a time slice which is in the very near future; set the validity to start now and end in an hour or two. Click Next.
Note : Time should be specified as per the timezone selected.
On the Inputs & Outputs page, ignore the Inputs section. For the output, enter a name and select the rawEmailFeed we created in the previous step and specify now(0,0) for the instance.
now(hours,minutes): now refers to the instance start time. The hours and minutes are given relative to the start time of the instance. For example, now(-2,40) corresponds to the feed instance at -2 hours and +40 minutes, i.e. the feed instance 80 minutes before the instance start time. now(0,0) corresponds to the feed instance at the instance start time itself.
Let's save the process.
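The arithmetic behind now(hours,minutes) can be checked with a few lines of shell. This is only a sketch of the offset calculation described above: the function names are hypothetical, it does not model how Falcon matches the result to an actual feed instance, and resolve_now assumes GNU date.

```shell
# Offset in minutes of now(hours, minutes) relative to the instance start time.
now_offset_minutes() {
    echo $(( $1 * 60 + $2 ))
}

# Resolve now(hours, minutes) to a UTC timestamp.
# $1 = instance start time as epoch seconds, $2 = hours, $3 = minutes
resolve_now() {
    date -u -d "@$(( $1 + $(now_offset_minutes "$2" "$3") * 60 ))" +%Y-%m-%dT%H:%MZ
}

now_offset_minutes -2 40          # prints -80
resolve_now 1465102800 -2 40      # prints 2016-06-05T03:40Z
resolve_now 1465102800 0 0        # prints 2016-06-05T05:00Z
```

Here 1465102800 is the epoch value of 2016-06-05T05:00Z, the validity start used in the sample XML in this tutorial.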
Define the rawEmailIngestProcess entity using XML
Click on the Edit button above the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
  <tags>email=testemail</tags>
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <outputs>
    <output name="output" feed="rawEmailFeed" instance="now(0,0)"/>
  </outputs>
  <workflow name="emailIngestWorkflow" version="4.0.1" engine="oozie" path="/user/ambari-qa/falcon/demo/apps/ingest/fs"/>
  <retry policy="exp-backoff" delay="minutes(3)" attempts="3"/>
  <ACL owner="ambari-qa" group="users" permission="0755"/>
</process>
NOTE: Do not copy the validity start and end times as they are. Change them to match your current time.
Click Finish on the top of the XML Preview area.
Verify all the values and click Next.
Save the process.
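To get a feel for what the retry element with policy exp-backoff, a delay of 3 minutes and 3 attempts could mean in practice, the sketch below prints a doubling wait schedule. The doubling rule is an assumption made for this illustration, and exp_backoff_delays is a hypothetical helper; check the Falcon entity specification for your version for the exact behaviour.

```shell
# Print the wait before each retry, assuming the delay doubles every attempt.
# $1 = base delay in minutes, $2 = number of attempts
exp_backoff_delays() {
    delay=$1
    attempt=1
    while [ "$attempt" -le "$2" ]; do
        echo "attempt $attempt: wait $delay min"
        delay=$((delay * 2))      # assumed backoff factor of 2
        attempt=$((attempt + 1))
    done
}

exp_backoff_delays 3 3
```

Under that assumption the three attempts wait 3, 6 and 12 minutes, so a failing instance gives up well within the hourly frequency.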
Define the cleansedEmailFeed using Wizard
As the name suggests, the raw feeds which you have created earlier get cleansed in the Process and the feed which comes out from that Process is the cleansed feed. So, to create a cleansed feed entity click on the Feed button on the top of the main page on the Falcon Web UI.
NOTE : If you want to create it from XML, skip this section, and move on to the next one.
Use the following information to create the feed:
name cleansedEmailFeed
description Cleansed customer emails
tag cleanse with value cleaned
Group churnAnalysisDataPipeline
We then set the ownership information for the Feed:
Owner: ambari-qa
Group: users
Permissions: 755
For the Schema Location and Provider, enter “/none”, then click Next.
On the Properties page, specify that the job runs hourly by setting the frequency to 1 hour, check the Late Arrival checkbox and specify the value as 4 hours. Change the timezone to UTC, accept the default values and click Next.
Set the default storage location to
/user/ambari-qa/falcon/demo/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
Keep stats path and meta path as /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}.
Click Next.
On the Clusters page, select primaryCluster as the source cluster and ensure you modify the validity to a time slice which is in the very near future. Provide the time in UTC. Specify the path for the data as:
/user/ambari-qa/falcon/demo/primary/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
And enter /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR} for the stats and meta data locations.
Keep the retention period as 90 hours.
Set the target cluster as backupCluster and again set the validity start to the current time and the end time to an hour or two from now.
Specify the data path for the target as:
/falcon/demo/bcp/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
Set the statistics and meta data locations to /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}
Enter the retention period as 90 hours for target cluster as well.
Click Next.
Verify all the entries you have filled out and then click Save.
Define the cleansedEmailFeed using XML
Click on the Edit button above the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="cleansedEmailFeed" description="Cleansed customer emails" xmlns="uri:falcon:feed:0.1">
  <tags>cleanse=cleaned</tags>
  <groups>churnAnalysisDataPipeline</groups>
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <late-arrival cut-off="hours(4)"/>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
      <retention limit="hours(90)" action="delete"/>
      <locations>
        <location type="data" path="/user/ambari-qa/falcon/demo/primary/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
    <cluster name="backupCluster" type="target">
      <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
      <retention limit="hours(90)" action="delete"/>
      <locations>
        <location type="data" path="/falcon/demo/bcp/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/user/ambari-qa/falcon/demo/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="ambari-qa" group="users" permission="0755"/>
  <schema location="/none" provider="/none"/>
</feed>
NOTE: Do not copy the validity start and end times as they are. Change them to match your current time.
Click Finish on the top of the XML Preview area. Save the feed.
Define the cleanseEmailProcess using Wizard
Now let's define the cleanseEmailProcess.
Again, to create a process entity click on the Process button on the top of the main page on the Falcon Web UI.
NOTE : If you want to create it from XML, skip this section, and move on to the next one.
Create this process with the following information
process name cleanseEmailProcess
Tag cleanse with the value yes
Then assign the workflow the name:
emailCleanseWorkflow
Select Pig as the execution engine and provide the following path:
/user/ambari-qa/falcon/demo/apps/pig/id.pig
Accept the default values for the Access Control List and click Next.
For the properties, set the number of parallel processes to 1. This prevents a new instance from starting prior to the previous one completing.
Specify the order as first-in, first-out (FIFO).
Set the frequency to 1 hour.
Select the exp-backoff retry policy, then set the attempts to 3 and the delay to 3 minutes. Click Next.
On the Clusters page, ensure you modify the validity to a time slice which is in the very near future and then click Next.
NOTE : Time should be specified as per the timezone selected.
On the Inputs & Outputs page, select rawEmailFeed as the input feed and cleansedEmailFeed as the output feed, then click Next.
Verify all the values you have entered and then click Save.
Define the cleanseEmailProcess using XML
Click on the Edit button above the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
  <tags>cleanse=yes</tags>
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <inputs>
    <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
  </outputs>
  <workflow name="emailCleanseWorkflow" version="pig-0.13.0" engine="pig" path="/user/ambari-qa/falcon/demo/apps/pig/id.pig"/>
  <retry policy="exp-backoff" delay="minutes(3)" attempts="3"/>
  <ACL owner="ambari-qa" group="users" permission="0755"/>
</process>
NOTE: Do not copy the validity start and end times as they are. Change them to match your current time.
Click Finish on the top of the XML Preview area, verify all the values and save the process.
Run the feeds
From the Falcon Web UI home page search for the Feeds we created
Select the rawEmailFeed by clicking on the checkbox
Then click on the Schedule button on the top of the search results
Next run the cleansedEmailFeed in the same way
Run the processes
From the Falcon Web UI home page search for the Process we created
Select the cleanseEmailProcess by clicking on the checkbox
Then click on the Schedule button on the top of the search results
Next run the rawEmailIngestProcess in the same way
If you visit the Oozie process page, you can see the processes running.
Input and Output of the pipeline
Now that the feeds and processes are running, we can check the dataset being ingressed and the dataset egressed on HDFS.
Here is the data being ingressed
and here is the data being egressed from the pipeline
Summary
In this tutorial we walked through a scenario in which a data pipeline defined with Apache Falcon cleans the raw data, removing sensitive information like credit card numbers, and makes it available to our marketing data science team for customer churn analysis. You may suspend the running feeds or processes if they are no longer required.
07-06-2017
08:38 AM
1 Kudo
Mirroring Datasets Between Hadoop Clusters with Apache Falcon
Introduction
Apache Falcon is a framework to simplify data pipeline processing and management on Hadoop clusters.
It provides data management services such as retention, replication across clusters, archival, etc. It makes it much simpler to onboard new workflows/pipelines, with support for late data handling and retry policies. It allows you to easily define relationships between various data and processing elements and integrate with a metastore/catalog such as Hive/HCatalog. Finally, it also lets you capture lineage information for feeds and processes.
In this tutorial we are going to walk through the process of mirroring datasets between Hadoop clusters.
Prerequisites
Download Hortonworks Sandbox 2.5
Complete the Learning the Ropes of the Hortonworks Sandbox tutorial; you will need it for logging into Ambari as an administrator user.
Complete the Leveraging Apache Falcon with Your Hadoop Clusters tutorial to start the Falcon service, prepare HDFS directories for the Falcon cluster and create Falcon cluster entities.
Outline
1. Create ambari-qa user
2. Preparing HDFS Directories
3. Setting up the Mirroring Job
4. Running the Job
Summary
1. Create ambari-qa user
After creating the cluster entities, let us go back to Ambari as the admin user. Click on the admin drop-down menu and then Manage Ambari:
Click the blue Users button in the bottom box as given below:
Click the Create Local User button at the top of the page. Enter ambari-qa as the user name and then set the password for it. Enter it again for confirmation and Save the user.
You can see the newly added ambari-qa user. Click on it to assign it a group so that it can access Ambari views.
Type "views" in the Local Group Membership box, select it, and then click on the tick mark to add the ambari-qa user to the "views" group.
Now log out of Ambari as the admin user and log in to Ambari as the ambari-qa user.
2. Preparing HDFS Directories
Select the Files View and you can view the following default folders:
Navigate to /user/ambari-qa and create a new directory called falcon.
Click on the row of the falcon directory and then click on the Permissions button:
Add Write permission for both Group and Others and then click Save.
Now create the directories mirrorSrc and mirrorTgt under /user/ambari-qa/falcon as the source and target of the mirroring job we are about to create.
<!--
After creating cluster entities, let’s go back to the SSH terminal, switch the user to root and then to ambari-qa :
hadoop fs -mkdir /user/ambari-qa/falcon
hadoop fs -mkdir /user/ambari-qa/falcon/mirrorSrc
hadoop fs -mkdir /user/ambari-qa/falcon/mirrorTgt
Now we need to set permissions to allow access. You must be logged in as the owner of the directory /user/ambari-qa/falcon/
hadoop fs -chmod -R 777 /user/ambari-qa/falcon
-->
3. Setting up the Mirroring Job
To create the mirroring job, go back to the Falcon UI in your browser and click on the Create drop-down.
Click Mirror in the drop-down menu; you will see a page like this:
Provide a name of your choice. The name must be unique to the system. We named the Mirror Job MirrorTest .
Ensure the File System mirror type is selected, then select the appropriate Source and Target and type in the appropriate paths. In our case the source cluster is primaryCluster and the HDFS path on that cluster is /user/ambari-qa/falcon/mirrorSrc .
The target cluster is backupCluster and the HDFS path on that cluster is /user/ambari-qa/falcon/mirrorTgt .
Also set the validity start of the job to your current time, so that when you attempt to run the job in a few minutes, the job is still within the validity period. Keep the default values in Advanced Options and then click Next .
Verify the summary information, then click Save :
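If you prefer to work out the validity window before filling in the form, the sketch below shows one way to do it. It is only an illustration in Python and not part of the Falcon UI: it assumes the UTC, minute-precision timestamp layout used in Falcon entity definitions (for example 2017-07-06T08:34Z), and the helper name validity_window and its duration_hours parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout assumed here: UTC with minute precision, e.g. 2017-07-06T08:34Z.
FALCON_TIME_FORMAT = "%Y-%m-%dT%H:%MZ"


def validity_window(now, duration_hours=1):
    """Return (start, end) strings for a window that begins at `now`.

    `validity_window` and `duration_hours` are hypothetical names used
    only for this illustration.
    """
    # Convert to UTC and drop seconds, since the layout has minute precision.
    start = now.astimezone(timezone.utc).replace(second=0, microsecond=0)
    end = start + timedelta(hours=duration_hours)
    return start.strftime(FALCON_TIME_FORMAT), end.strftime(FALCON_TIME_FORMAT)


if __name__ == "__main__":
    start, end = validity_window(datetime.now(timezone.utc))
    print("Validity start:", start)
    print("Validity end:  ", end)
```

Starting the window at the current minute keeps the job inside its validity period when you schedule it a few minutes later; how the date picker in the UI displays time zones may differ, so check the values it shows before saving.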
4. Running the Job
Before we can run the job, we need some data to test on HDFS.
<!--Let’s give ourselves permission to upload some data using the HDFS View in Ambari.
su - root
su hdfs
hadoop fs -chmod -R 775 /user/ambari-qa
Open Ambari from your browser at port 8080.
Then launch the HDFS view from the top right hand corner.
-->
Stay logged in as ambari-qa and, from the Files View in the Ambari console, navigate to the directory /user/ambari-qa/falcon/mirrorSrc .
Click the Upload button and upload any file you want to use.
Once uploaded, the file should appear in the directory.
Now navigate to the Falcon UI and search for the job we created. The name of the Mirror job we had created was MirrorTest .
Select the MirrorTest job by clicking the checkbox and then click on Schedule .
The state of the job should change from SUBMITTED to RUNNING .
After a few minutes, use the HDFS View in the Ambari console to check the /user/ambari-qa/falcon/mirrorTgt directory and you should see that your data is mirrored.
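For more than a handful of files, checking the target directory by eye gets tedious. The sketch below is a hedged illustration in Python of how the comparison could be automated from saved listings. It assumes the listings come from hdfs dfs -ls -R and follow the usual eight-column layout (permissions, replication, owner, group, size, date, time, path); the function names and the sample listings are made up for this example.

```python
def relative_paths(ls_output, root):
    """Extract file paths under `root` from `hdfs dfs -ls -R` style output."""
    root = root.rstrip("/") + "/"
    paths = set()
    for line in ls_output.splitlines():
        # Assumed columns: permissions, replication, owner, group, size, date, time, path.
        fields = line.split(None, 7)
        # Skip anything that is not a listing row, and skip directory rows.
        if len(fields) < 8 or fields[0].startswith("d"):
            continue
        path = fields[7]
        if path.startswith(root):
            paths.add(path[len(root):])
    return paths


def missing_on_target(src_listing, tgt_listing, src_root, tgt_root):
    """Return files present under the source root but absent under the target root."""
    source = relative_paths(src_listing, src_root)
    target = relative_paths(tgt_listing, tgt_root)
    return sorted(source - target)


if __name__ == "__main__":
    # Made-up sample listings, for illustration only.
    src = (
        "-rw-r--r--   1 ambari-qa hdfs 1234 2017-07-06 08:34 /user/ambari-qa/falcon/mirrorSrc/data.csv\n"
        "-rw-r--r--   1 ambari-qa hdfs   10 2017-07-06 08:35 /user/ambari-qa/falcon/mirrorSrc/notes.txt\n"
    )
    tgt = "-rw-r--r--   1 ambari-qa hdfs 1234 2017-07-06 08:40 /user/ambari-qa/falcon/mirrorTgt/data.csv\n"
    print(missing_on_target(src, tgt,
                            "/user/ambari-qa/falcon/mirrorSrc",
                            "/user/ambari-qa/falcon/mirrorTgt"))
```

An empty result means every file listed under mirrorSrc also appears under mirrorTgt. The check compares names only, not sizes or contents.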
Summary
In this tutorial we walked through the process of mirroring the datasets between two cluster entities.
07-06-2017
08:34 AM
Create a Falcon Cluster
Introduction
Apache Falcon is a framework to simplify data pipeline processing and management on Hadoop clusters. It makes it much simpler to onboard new workflows/pipelines, with support for late data handling and retry policies. It allows you to easily define relationships between various data and processing elements and integrate with a metastore/catalog such as Hive/HCatalog. Finally, it also lets you capture lineage information for feeds and processes.
In this tutorial, we are going to create a Falcon cluster by:
Preparing HDFS directories
Creating two cluster entities (primaryCluster and backupCluster)
Prerequisite
Download Hortonworks Sandbox 2.5
Complete the Learning the Ropes of the Hortonworks Sandbox tutorial; you will need it to log in to Ambari as an administrator user.
Once you have downloaded the Hortonworks Sandbox and run the VM, navigate to the Ambari interface on port 8080 of the IP address of your Sandbox VM. Log in with the username admin and the password that you set for the Ambari admin user. You should see a screen similar to the image below:
Outline
1. Scenario
2. Starting Falcon
3. Create an Ambari falcon user
4. Preparing HDFS Directories
5. Creating Cluster Entities
5.1 Creating primaryCluster Entity using Wizard
5.2 Creating primaryCluster Entity using XML
5.3 Creating backupCluster Entity using Wizard
5.4 Creating backupCluster Entity using XML
Summary
Further Reading
1. Scenario
In this tutorial, we are going to create a Falcon cluster so that we can configure data pipelines and then perform feed management services such as feed retention, data replication across clusters, and archival. This tutorial is the starting point of all Falcon tutorials: here we create two cluster entities that define where the data and the processes for your data pipeline are stored. Allow yourself about an hour to complete this tutorial.
2. Starting Falcon
By default, Falcon is not started on the Sandbox. You can start the Falcon service from Ambari by clicking on the Falcon icon in the left-hand pane:
Then click on the Service Actions button on the top right:
Then click on Start :
Once Falcon starts, Ambari should clearly indicate, as below, that the service has started:
3. Create an Ambari falcon user
Click the admin drop-down menu at the top of the Ambari screen and click Manage Ambari .
Click the blue Users button in the bottom box as shown below:
Click the Create Local User button at the top of the page. Enter falcon as the user name and set a password for it. Enter the password again for confirmation and Save the user.
You can see the newly added falcon user. Click on it to assign it a group so that it can access Ambari views.
Type "views" in the Local Group Membership box, select it, and then click the tick mark to add the falcon user to the "views" group.
Now log out of Ambari as the admin user and log in as the falcon user.
4. Preparing HDFS Directories
Select the Files View as shown below:
The Files View interface will appear with the following default folders.
We need to create the directories on HDFS representing the two clusters that we are going to define, namely primaryCluster and backupCluster .
Navigate to the /apps/falcon folder and click the New Folder button. An add new folder window appears; name the folder primaryCluster and press Enter or click Add .
Similarly, create another folder called backupCluster . You will see your new directories created successfully:
Click the row of the primaryCluster directory and then click the Permissions button:
Add Write permission for both Group and Others and then click Save . Do the same for the backupCluster directory.
Now navigate into the primaryCluster directory and create two new directories: staging and working . Click the row for the staging directory and add Write permission for both Group and Others. Refresh the page and then navigate to /apps/falcon/primaryCluster to see the changes:
Repeat the same steps for backupCluster : create two directories, staging and working , and then assign Write permission on the staging directory for Group and Others.
<!--
First SSH into the Hortonworks Sandbox with the command:
ssh root@127.0.0.1 -p 2222
The default password is hadoop . If you have changed it earlier, then enter the new one.
We need to create the directories on HDFS representing the two clusters that we are going to define, namely primaryCluster and backupCluster . First, from the command line, check whether the Falcon server is running or not.
Switch the user to falcon using:
su - falcon
Change the directory to the Falcon server directory of your HDP version:
cd /usr/hdp/current/falcon-server
And run the script below to find the status of the Falcon server:
./bin/falcon-status
Next, use hdfs dfs -mkdir commands to create the directories /apps/falcon/primaryCluster and /apps/falcon/backupCluster on HDFS.
hdfs dfs -mkdir /apps/falcon/primaryCluster
hdfs dfs -mkdir /apps/falcon/backupCluster
Further create directories called staging inside each of the directories we created above:
hdfs dfs -mkdir /apps/falcon/primaryCluster/staging
hdfs dfs -mkdir /apps/falcon/backupCluster/staging
Next, create the working directories for primaryCluster and backupCluster :
hdfs dfs -mkdir /apps/falcon/primaryCluster/working
hdfs dfs -mkdir /apps/falcon/backupCluster/working
Finally you need to set the proper permissions on the staging/working directories:
hdfs dfs -chmod 777 /apps/falcon/primaryCluster/staging
hdfs dfs -chmod 755 /apps/falcon/primaryCluster/working
hdfs dfs -chmod 777 /apps/falcon/backupCluster/staging
hdfs dfs -chmod 755 /apps/falcon/backupCluster/working
-->
5. Creating Cluster Entities
Let’s open the Falcon Web UI. You can navigate to the Falcon Web UI directly in the browser by typing 127.0.0.1:15000 . The Falcon UI is by default at port 15000. The default username is ambari-qa .
<!--
You can easily launch the Falcon Web UI from Ambari:
Navigate to the Falcon Summary page and click Quick Links>Falcon Web UI .
-->
This UI allows us to create and manage the various entities like Cluster, Feed, Process and Mirror. Each of these entities is represented by an XML file that you either directly upload or generate by filling out the various fields.
You can also search for existing entities and then edit them, change their state, etc.
Let’s first create a couple of cluster entities. To create a cluster entity, click on the Create dropdown and then click Cluster at the top.
NOTE : If you want to create it from XML, skip the wizard section and move on to the next one.
5.1 Creating primaryCluster Entity using Wizard
A cluster entity defines the default access points for various resources on the cluster as well as default working directories to be used by Falcon jobs. To define a cluster entity, we must specify a unique name by which we can identify the cluster. In this tutorial, we use:
primaryCluster
Next, enter a data center name or location of the cluster and a description for the cluster. The data center name can be used by Falcon to improve performance of jobs that run locally or across data centers. Enter primaryColo in Colo and this is primary cluster in the description.
All entities defined in Falcon can be grouped and located using tags. To clearly identify and locate entities, we assign the tag:
EntityType
With the value:
Cluster
Next, we enter the URI for the various resources Falcon requires to manage data on the clusters. These include the NameNode dfs.http.address, the NameNode IPC address used for file system metadata operations, the Yarn client IPC address used for executing jobs on Yarn, the Oozie address used for running Falcon Feeds and Processes, and the Falcon messaging address. The values we will use are the defaults for the Hortonworks Sandbox; if you run this tutorial on your own test cluster, modify the addresses to match those defined in Ambari:
Namenode DFS Address - hftp://sandbox.hortonworks.com:50070
File System Default Address - hdfs://sandbox.hortonworks.com:8020
YARN Resource Manager Address - sandbox.hortonworks.com:8050
Workflow Address - http://sandbox.hortonworks.com:11000/oozie/
Message Broker Address - tcp://sandbox.hortonworks.com:61616?daemon=true
You can also override cluster properties for a specific cluster. This can be useful for test or backup clusters which may have different physical configurations. In this tutorial, we’ll just use the properties defined in Ambari.
After the resources are defined, you must define default staging, temporary, and working directories for use by Falcon jobs based on the HDFS directories you created earlier in the tutorial. These can be overridden by specific jobs, but will be used in the event no directories are defined at the job level. In the current version of the UI, these directories must exist, be owned by Falcon, and have the proper permissions.
Staging* - /apps/falcon/primaryCluster/staging
Temp* - /tmp
Working* - /apps/falcon/primaryCluster/working
We then need to specify the owner and permissions for the cluster. Click on the Advanced Options drop-down menu and enter:
Owner: ambari-qa
Group: users
Permissions: 755
Owner - Check box Read, Write and Execute
Group - Check box Read and Execute
Others - Check box Read and Execute
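The check boxes above are simply the octal value 755 spelled out: each digit covers one class (Owner, Group, Others) and is the sum of Read (4), Write (2) and Execute (1). The short Python sketch below illustrates that mapping; describe_mode is a hypothetical helper written for this explanation and is not part of Falcon or Ambari.

```python
def describe_mode(mode):
    """Map an octal mode string such as "755" to the rights of each class.

    `describe_mode` is a hypothetical helper used only for illustration.
    """
    classes = ("Owner", "Group", "Others")
    rights = ((4, "Read"), (2, "Write"), (1, "Execute"))
    digits = mode[-3:]  # keep the last three digits, ignoring any prefix
    result = {}
    for name, digit in zip(classes, digits):
        bits = int(digit, 8)  # raises ValueError for anything that is not 0-7
        result[name] = [label for bit, label in rights if bits & bit]
    return result


if __name__ == "__main__":
    for name, granted in describe_mode("755").items():
        print(name, "-", ", ".join(granted))
```

The same arithmetic explains the 777 used for the staging directories earlier: every class gets Read, Write and Execute.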
If you want to view the XML preview of the values you are entering, you can click on XML preview. Click Next to view the summary. Click Save to persist the entity.
5.2 Creating primaryCluster Entity using XML
After clicking on the Create drop-down menu, select the Cluster button and click on the Edit XML button over the XML Preview area. Replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="primaryCluster" description="this is primary cluster" colo="primaryColo" xmlns="uri:falcon:cluster:0.1">
  <tags>primaryKey=primaryValue</tags>
  <interfaces>
    <interface type="readonly" endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
    <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/primaryCluster/working"/>
  </locations>
  <ACL owner="ambari-qa" group="users" permission="0x755"/>
  <properties>
    <property name="test" value="value1"/>
  </properties>
</cluster>
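Before pasting an edited copy of this XML into the UI, it can help to confirm that nothing was dropped by accident. The sketch below is an illustration in Python using only the standard library; it checks for the interface types and location names that appear in this tutorial's XML. The expected sets are an assumption taken from this example, not a statement of what Falcon itself requires, and check_cluster_entity is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <cluster> element in this tutorial's XML.
NS = {"f": "uri:falcon:cluster:0.1"}

# Names taken from this tutorial's entity; an assumption for illustration,
# not a statement of what Falcon itself requires.
EXPECTED_INTERFACES = {"readonly", "write", "execute", "workflow", "messaging"}
EXPECTED_LOCATIONS = {"staging", "temp", "working"}

PRIMARY_CLUSTER_XML = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="primaryCluster" description="this is primary cluster" colo="primaryColo" xmlns="uri:falcon:cluster:0.1">
  <tags>primaryKey=primaryValue</tags>
  <interfaces>
    <interface type="readonly" endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
    <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/primaryCluster/working"/>
  </locations>
  <ACL owner="ambari-qa" group="users" permission="0x755"/>
  <properties>
    <property name="test" value="value1"/>
  </properties>
</cluster>
"""


def check_cluster_entity(xml_text):
    """Return a list of problems found in a cluster entity document."""
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xml_text.strip().encode("utf-8"))
    interfaces = {i.get("type") for i in root.findall("f:interfaces/f:interface", NS)}
    locations = {loc.get("name") for loc in root.findall("f:locations/f:location", NS)}
    problems = [f"missing interface: {t}" for t in sorted(EXPECTED_INTERFACES - interfaces)]
    problems += [f"missing location: {n}" for n in sorted(EXPECTED_LOCATIONS - locations)]
    return problems


if __name__ == "__main__":
    print(check_cluster_entity(PRIMARY_CLUSTER_XML) or "no problems found")
```

This only checks that the expected elements are present; it does not validate the endpoints or confirm that the HDFS directories exist.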
Click Finish on top of the XML Preview area to save the XML. The Falcon UI should automatically parse the values from the XML and populate the right fields. Once you have verified that these are the correct values, press Next . Click Save to persist the entity. You should receive a notification that the operation was successful.
Falcon jobs require a source cluster and a destination, or target, cluster. For some jobs, this may be the same cluster; for others, such as Mirroring and Disaster Recovery, the source and target clusters will be different.
NOTE : If you want to create it from XML, skip the wizard section and move on to the next one.
5.3 Creating backupCluster Entity using Wizard
Let’s go ahead and create a second cluster with the name:
backupCluster
Enter backupColo in Colo and this is backup cluster in the description. Re-enter the same information you used above except for the directory information. For the directories, use the backupCluster directories created earlier in the tutorial.
Staging* - /apps/falcon/backupCluster/staging
Temp* - /tmp
Working* - /apps/falcon/backupCluster/working
Click Save to persist the backupCluster entity.
5.4 Creating backupCluster Entity using XML
Click on the Create drop-down menu and click the Cluster button to open the form to create the cluster entity.
Click on the Edit XML button over the XML Preview area. Replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="backupCluster" description="this is backup colo" colo="backupColo" xmlns="uri:falcon:cluster:0.1">
  <tags>backupKey=backupValue</tags>
  <interfaces>
    <interface type="readonly" endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
    <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/backupCluster/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/backupCluster/working"/>
  </locations>
  <ACL owner="ambari-qa" group="users" permission="0x755"/>
  <properties>
    <property name="test2" value="value2"/>
  </properties>
</cluster>
Click Finish on top of the XML Preview area to save the XML, then check the values that the Falcon UI populated from it. Once you have verified that these are the correct values, press Next . Click Save to persist the backupCluster entity.
Summary
In this tutorial we learned how to create cluster entities in Apache Falcon using the Falcon UI. Now go ahead and start creating feeds and processes by exploring more Falcon tutorials.
Further Reading
You can go to the following links to explore other Falcon tutorials:
Mirroring Datasets between Hadoop Clusters with Apache Falcon
Define and Process Data Pipelines in Hadoop with Apache Falcon
Incremental Backup of data from HDP to Azure using Falcon for Disaster Recovery and Burst Capacity
Processing Data Pipeline using Apache Falcon
06-19-2017
09:24 PM
Hey @Ronny Lempel, thanks for reporting that! Getting that cleaned up as I type this. Here's the new tutorial that should be taking its place: https://hortonworks.com/tutorial/realtime-event-processing-in-hadoop-with-nifi-kafka-and-storm/