Member since 10-19-2016 | 151 Posts | 59 Kudos Received | 17 Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1727 | 03-22-2018 11:48 AM
 | 2788 | 01-12-2018 06:25 PM
 | 5004 | 01-12-2018 03:56 AM
 | 7007 | 01-12-2018 03:38 AM
 | 3581 | 01-02-2018 10:29 PM
11-28-2017
05:36 AM
1 Kudo
Thanks! It works with 2.6.3. Didn't realize both script and sandbox were updated just a few days ago.
10-28-2017
04:01 AM
Thanks for not leaving the thread without a proper solution. @Rafael Coss HCC rocks 🙂
07-06-2017
09:03 AM
1 Kudo
Incremental Backup of Data from HDP to Azure using Falcon for Disaster Recovery and Burst capacity
Introduction
Apache Falcon simplifies the configuration of data motion with: replication; lifecycle management; lineage and traceability. This provides data governance consistency across Hadoop components.
Scenario
In this tutorial we will walk through a scenario where email data is processed on multiple HDP 2.2 clusters around the country and then backed up hourly on a cloud-hosted cluster. In our example:
This cluster is hosted on Windows Azure.
Data arrives from all the West Coast production servers. The input data feeds often arrive up to 4 hours late.
The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis.
To simulate this scenario, we have a Pig script grabbing the freely available Enron emails from the internet and feeding them into the pipeline.
Prerequisite
A cluster with Apache Hadoop 2.2 configured
A cluster with Apache Falcon configured
The easiest way to meet the above prerequisites is to download the HDP Sandbox
After downloading the environment, confirm that Apache Falcon is running. Below are the steps to validate that:
If Ambari is not configured on your Sandbox, go to http://127.0.0.1:8000/about/ and enable Ambari.
Once Ambari is enabled, navigate to Ambari at http://127.0.0.1:8080 and log in with the username admin and the password admin. Then check whether Falcon is running.
If Falcon is not running, start Falcon:
Steps for the Scenario
Create cluster specification XML file
Create feed (aka dataset) specification XML file
* Reference cluster specification
Create the process specification XML file
* Reference cluster specification – defines where the process runs
* Reference feed specification – defines the datasets that the process manipulates
We have already created the necessary XML files. In this step we are going to download the specifications and use them to define the data pipeline and submit the Falcon entities.
Staging the components of the app on HDFS
In this step we will stage the Pig script and the necessary folder structure for inbound and outbound feeds on HDFS:
First download this zip file called falcon.zip to your local host machine.
Navigate using your browser to the Hue – File Browser interface at http://127.0.0.1:8000/filebrowser/ to explore the HDFS.
Navigate to the /user/ambari-qa folder.
Now upload the zip file we just downloaded.
The upload should also unzip the file and create a folder structure with a folder called falcon.
Setting up the destination storage on Microsoft Azure
Log in to the Windows Azure portal at http://manage.windowsazure.com
Create a storage account
Wait for the storage account to be provisioned
Copy the access key and the account name into a text document. We will use both in later steps
Also note down the blob endpoint of the storage account we just created
Click on the Containers tab and create a new container called myfirstcontainer.
Configuring access to Azure Blob store from Hadoop
Log in to Ambari at http://127.0.0.1:8080 with the credentials admin and admin.
Then click on HDFS in the bar on the left and select the Configs tab.
Scroll down to the bottom of the page to the Custom hdfs-site section and click on Add property...
In the Add Property dialog, the key name starts with fs.azure.account.key. followed by the blob endpoint that you noted down in a previous step. The value is the Azure storage access key that you noted down in a previous step. Once you have filled in the values, click the Add button.
Back on the Configs page, save the change by clicking the green Save button.
Then restart all the affected services by clicking the orange Restart button.
Wait for all the restarts to complete.
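The property name entered in the Add Property dialog follows a fixed pattern, so it can help to generate it rather than type it by hand. The following is a minimal sketch, assuming the blob endpoint is the host name of the form <account>.blob.core.windows.net (without an https:// prefix). The helper names and the account name mystorageacct are hypothetical and are not part of Hadoop or Azure tooling:

```shell
#!/bin/sh
# Sketch only: helper names and the example account name are illustrative.

# Print the Custom hdfs-site property name for a given storage account.
azure_key_property() {
    printf 'fs.azure.account.key.%s.blob.core.windows.net\n' "$1"
}

# Print the wasb:// URL for a container (first argument) in a storage
# account (second argument).
wasb_url() {
    printf 'wasb://%s@%s.blob.core.windows.net/\n' "$1" "$2"
}

azure_key_property mystorageacct
wasb_url myfirstcontainer mystorageacct
```

The first line of output is the key to paste into the dialog, and the second is the URL form used when listing the container from the cluster.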
Now let’s test if we can access our container on the Azure Blob Store.
SSH in to the VM:
ssh root@127.0.0.1 -p 2222
The password is hadoop
Issue the following command on the SSH terminal to list the container, replacing saptak with your own storage account name:
hdfs dfs -ls -R wasb://myfirstcontainer@saptak.blob.core.windows.net/
Staging the specifications
From the SSH session, first change the user to ambari-qa. Type:
su ambari-qa
Go to the user's home directory:
cd ~
Download the cluster, feed and process definitions:
wget http://hortonassets.s3.amazonaws.com/tutorial/falcon/falconDemo.zip
Unzip the file:
unzip ./falconDemo.zip
Change Directory to the folder created:
cd falconChurnDemo/
Now let’s modify cleansedEmailFeed.xml to point the backup cluster to our Azure Blob Store container.
Use vi to edit the file:
vi cleansedEmailFeed.xml
Modify the value of the location element of the backupCluster so that it points to your Azure Blob Store container.
Then save the file and quit vi.
Submit the entities to the cluster:
Cluster Specification
There is one cluster specification per cluster.
See below for a sample cluster specification file.
Back to our scenario, let’s submit the ‘oregon cluster’ entity to Falcon. This signifies the primary Hadoop cluster located in the Oregon data center.
falcon entity -type cluster -submit -file oregonCluster.xml
Then let’s submit the ‘virginia cluster’ entity to Falcon. This signifies the backup Hadoop cluster located in the Virginia data center.
falcon entity -type cluster -submit -file virginiaCluster.xml
If you view the XML files you will see how the cluster location and purpose have been captured.
Feed Specification
A feed (a.k.a. dataset) signifies a location of data and its associated replication policy and late arrival cut-off time.
See below for a sample feed (a.k.a. dataset) specification file.
Back to our scenario, let’s submit the source of the raw email feed. This feed signifies the raw emails that are being downloaded into the Hadoop cluster. These emails will be used by the email cleansing process.
falcon entity -type feed -submit -file rawEmailFeed.xml
Now let’s define the feed entity that handles the end of the pipeline and stores the cleansed email. This feed signifies the emails produced by the cleanse email process. It also takes care of replicating the cleansed email dataset to the backup cluster (virginia cluster).
falcon entity -type feed -submit -file cleansedEmailFeed.xml
Process
A process defines the configuration for a workflow. A workflow is a directed acyclic graph (DAG) that defines the job for the workflow engine. A process definition specifies the configuration required to run the workflow job: for example, the frequency at which the workflow should run, the clusters on which it should run, its inputs and outputs, how workflow failures should be handled, how late inputs should be handled and so on.
Here is an example of what a process specification looks like:
Back to our scenario, let’s submit the ingest and the cleanse process respectively:
The ingest process is responsible for calling the Oozie workflow that downloads the raw emails from the web into the primary Hadoop cluster under the location specified in rawEmailFeed.xml. It also takes care of handling late data arrivals.
falcon entity -type process -submit -file emailIngestProcess.xml
The cleanse process is responsible for calling the Pig script that cleans the raw emails and produces the clean emails that are then replicated to the backup Hadoop cluster.
falcon entity -type process -submit -file cleanseEmailProcess.xml
Schedule the Falcon entities
So, all that is left now is to schedule the feeds and processes to get it going.
Ingest the feed
falcon entity -type feed -schedule -name rawEmailFeed
falcon entity -type process -schedule -name rawEmailIngestProcess
Cleanse the emails
falcon entity -type feed -schedule -name cleansedEmailFeed
falcon entity -type process -schedule -name cleanseEmailProcess
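The submit and schedule commands above have to run in dependency order: clusters first, then the feeds that reference them, then the processes that reference the feeds. The following sketch collects them into one script using the file and entity names from this tutorial. The run wrapper, the DRY_RUN variable and the function name are assumptions added for illustration. By default the script only prints the commands, so it can be reviewed before anything is sent to Falcon:

```shell
#!/bin/sh
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

submit_and_schedule() {
    # Clusters first, because feeds reference them.
    for f in oregonCluster.xml virginiaCluster.xml; do
        run falcon entity -type cluster -submit -file "$f" || return 1
    done
    # Feeds next, because processes reference them.
    for f in rawEmailFeed.xml cleansedEmailFeed.xml; do
        run falcon entity -type feed -submit -file "$f" || return 1
    done
    # Processes last.
    for f in emailIngestProcess.xml cleanseEmailProcess.xml; do
        run falcon entity -type process -submit -file "$f" || return 1
    done
    # Schedule each feed together with the process that writes to it.
    run falcon entity -type feed -schedule -name rawEmailFeed || return 1
    run falcon entity -type process -schedule -name rawEmailIngestProcess || return 1
    run falcon entity -type feed -schedule -name cleansedEmailFeed || return 1
    run falcon entity -type process -schedule -name cleanseEmailProcess || return 1
}

submit_and_schedule
```

Stopping at the first failure (the `|| return 1` on each step) avoids submitting a process whose feed or cluster was rejected.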
Processing
In a few seconds you should notice that Falcon has started ingesting files from the internet and writing them to new folders on HDFS.
In a couple of minutes you should notice a new folder called processed, under which the files processed through the data pipeline are being emitted.
We just created an end-to-end data pipeline to process data. The power of the Apache Falcon framework is its flexibility to work with pretty much any open source or proprietary data processing products out there.
07-06-2017
08:42 AM
Define and Process Data Pipelines in Hadoop With Apache Falcon
Introduction
Apache Falcon is a framework to simplify data pipeline processing and management on Hadoop clusters.
It makes it much simpler to onboard new workflows/pipelines, with support for late data handling and retry policies. It allows you to easily define relationships between various data and processing elements and integrate with a metastore/catalog such as Apache Hive/HCatalog. Finally, it also lets you capture lineage information for feeds and processes. In this tutorial we are going to walk through the process of:
Defining the feeds and processes
Defining and executing a data pipeline to ingest, process and persist data continuously
Prerequisites
Download Hortonworks Sandbox
Complete the Learning the Ropes of the Hortonworks Sandbox tutorial; you will need it for logging into Ambari as an administrator user.
Complete the Creating Falcon Cluster tutorial to start the Falcon service, prepare HDFS directories for the Falcon cluster and create Falcon cluster entities.
Once you have downloaded the Hortonworks Sandbox and started the VM, navigate to the Ambari interface on port 8080 of the host IP address of your Sandbox VM. Log in with the username admin and the password that you set for the Ambari admin user as part of the Learning the Ropes of the Hortonworks Sandbox tutorial.
Outline
Scenario
Starting Falcon
Download and stage the dataset
Define the rawEmailFeed entity using Wizard
Define the rawEmailFeed entity using XML
Define the rawEmailIngestProcess entity using Wizard
Define the rawEmailIngestProcess entity using XML
Define the cleansedEmailFeed using Wizard
Define the cleansedEmailFeed using XML
Define the cleanseEmailProcess using Wizard
Define the cleanseEmailProcess using XML
Run the feeds
Run the processes
Input and Output of the pipeline
Summary
For this tutorial, we will use the VirtualBox environment. If you are using the VMware or Azure Sandbox, your instructions may be slightly different.
Scenario
In this tutorial, we will walk through a scenario where email data lands hourly on a cluster. In our example:
This cluster is the primary cluster located in the Oregon data center.
Data arrives from all the West Coast production servers. The input data feeds are often late for up to 4 hrs.
The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis.
To simulate this scenario, we have a Pig script grabbing the freely available Enron Corpus emails from the internet and feeding them into the pipeline.
Starting Falcon
By default, Falcon is not started on the sandbox, but you should have started the service while completing the Creating a Falcon Cluster tutorial. Do the following to verify that the Falcon service is started, or to start it if it was disabled.
In the Ambari UI, click on the Falcon icon in the left hand pane.
Then click on the Service Actions button on the top right.
Then, if the service is disabled, click on Start.
Once Falcon starts, Ambari should clearly indicate that the service has started.
Download and stage the dataset
Now let’s stage the dataset you will use for this tutorial. Although we perform many of these file operations below using the command line, you can also do the same with the HDFS Files View in Ambari.
Tip: You can copy and paste the commands from this tutorial.
First, open a shell with your preferred client. For this tutorial, we will SSH into the Hortonworks Sandbox with the command:
ssh root@127.0.0.1 -p 2222
The default password is hadoop.
Then switch to the user hdfs:
su - hdfs
Then download the file falcon.zip with the following command:
wget http://hortonassets.s3.amazonaws.com/tutorial/falcon/falcon.zip
and then unzip with the command:
unzip falcon.zip
Now let’s give ourselves permission to upload files:
hadoop fs -chmod -R 777 /user/ambari-qa
Then let’s create a folder falcon under ambari-qa with the command:
hadoop fs -mkdir /user/ambari-qa/falcon
Now let’s upload the decompressed folder with the command:
hadoop fs -copyFromLocal demo /user/ambari-qa/falcon/
Define the rawEmailFeed entity using Wizard
To create a feed entity click on the Feed button on the top of the main page on the Falcon Web UI.
NOTE : If you want to create it from XML, skip this section, and move on to the next one.
Then enter the definition for the feed by giving the feed a unique name and a description. For this tutorial we will use
rawEmailFeed
and
Raw customer email feed.
Let’s also enter a tag key and value, so we can easily locate this Feed later:
externalSystem=USWestEmailServers
Feeds can be further categorised by identifying them with one or more groups. In this demo, we will group all the Feeds together by defining the group:
churnAnalysisDataPipeline
We then set the ownership information for the Feed:
Owner: ambari-qa
Group: users
Permissions: 755
For the Schema Location and Provider, enter “/none”, then click Next.
On the Properties page, set the job to run hourly by specifying the frequency as 1 hour, check the Late Arrival checkbox and specify the value as 1 hour. Change the timezone to UTC and click Next.
Enter the path of our data set on the Locations page:
/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
We will set the stats and meta paths to /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR} for now. Click Next.
On the Clusters page, select the cluster you created, then enter today’s date and the current time for the validity start time, and a time an hour or two later for the end time. The validity time specifies the period during which the feed will run. For many feeds, the start time is set to the time the feed is scheduled to go into production and the end time is set far in the future. Because we are running this tutorial on the Sandbox, we limit the time the feed will run to conserve resources.
DO NOT forget to provide the validity time in the timezone you have selected.
Specify a retention period of 90 hours, then click Next.
Save the feed.
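The ${YEAR}, ${MONTH}, ${DAY} and ${HOUR} placeholders in the location paths are filled in per feed instance, so each hourly instance maps to its own HDFS directory. As a rough illustration of that mapping, the sketch below expands a path template for one given instance time. The helper expand_feed_path is hypothetical and not part of Falcon, and the date values are just an example:

```shell
#!/bin/sh
# Sketch only: expand_feed_path is an illustrative helper, not part of Falcon.

# expand_feed_path TEMPLATE YEAR MONTH DAY HOUR
# Replaces the ${YEAR}, ${MONTH}, ${DAY} and ${HOUR} placeholders in TEMPLATE.
expand_feed_path() {
    printf '%s\n' "$1" | sed \
        -e 's/\${YEAR}/'"$2"'/g' \
        -e 's/\${MONTH}/'"$3"'/g' \
        -e 's/\${DAY}/'"$4"'/g' \
        -e 's/\${HOUR}/'"$5"'/g'
}

# Single quotes keep the shell from expanding the placeholders itself.
# Prints /user/ambari-qa/falcon/demo/primary/input/enron/2016-06-05-05
expand_feed_path '/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}' 2016 06 05 05
```

The printed path can be passed to hadoop fs -ls to check whether the directory for a particular hour exists.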
Define the rawEmailFeed entity using XML
To create the feed from XML instead of the wizard, click on the Edit button above the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="rawEmailFeed" description="Raw customer email feed" xmlns="uri:falcon:feed:0.1">
    <tags>externalSystem=USWestEmailServers</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <late-arrival cut-off="hours(1)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
            <retention limit="hours(90)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0755"/>
    <schema location="/none" provider="/none"/>
</feed>
NOTE : DO NOT copy the validity start and end times. Set them to match your current time, in the timezone you selected.
Click Finish on the top of the XML Preview area.
The Falcon UI should now have parsed the values from the XML and populated the corresponding fields. Once you have verified that these are the correct values, save the feed.
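Since the validity start and end must be replaced with current UTC values, a small helper can print a ready-to-paste pair in the same yyyy-MM-ddTHH:mmZ form used in the XML. This is a sketch under the assumption that GNU date is available (as on the CentOS-based Sandbox). The function name and its arguments are made up for illustration:

```shell
#!/bin/sh
# Sketch only: relies on GNU date (-d "@epoch"); validity_window is an
# illustrative helper, not part of Falcon.

# validity_window [HOURS] [START_EPOCH]
# Prints "start end" in UTC. HOURS defaults to 2 and START_EPOCH to now.
validity_window() {
    hours="${1:-2}"
    start="${2:-$(date -u +%s)}"
    end=$(( $start + $hours * 3600 ))
    printf '%s %s\n' \
        "$(date -u -d "@$start" +%Y-%m-%dT%H:%MZ)" \
        "$(date -u -d "@$end" +%Y-%m-%dT%H:%MZ)"
}

# Start now and end two hours from now.
validity_window 2
```

The first value goes into the start attribute of the validity element and the second into the end attribute.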
Define the rawEmailIngestProcess entity using Wizard
Now let’s define the rawEmailIngestProcess.
To create a process entity click on the Process button on the top of the main page on the Falcon Web UI.
NOTE : If you want to create it from XML, skip this section, and move on to the next one.
This job will run on the primaryCluster. Use the information below to create the process:
process name rawEmailIngestProcess
Tags email
With the value: testemail
And assign the workflow the name:
emailIngestWorkflow
Select Oozie as the execution engine and provide the following path:
/user/ambari-qa/falcon/demo/apps/ingest/fs
Accept the default values for the Access Control List and click Next.
On the Properties page, select the time zone and set the frequency to 1 hour. Set the number of parallel processes to 1. This prevents a new instance from starting before the previous one completes. Specify the order as first-in, first-out (FIFO).
Select the exp-backoff retry policy, then set the attempts to 3 and the delay to 3 minutes. Click Next.
On the Clusters page, select primaryCluster again. As you did previously, set the validity to a time slice in the very near future: start now and end in an hour or two. Click Next.
Note : Time should be specified as per the timezone selected.
On the Inputs & Outputs page, ignore the Inputs section. For the output, enter a name, select the rawEmailFeed we created in the previous step and specify now(0,0) for the instance.
now(hours,minutes) : now refers to the instance start time. The hours and minutes given are offsets relative to the start time of the instance. For example, now(-2,40) corresponds to the feed instance at -2 hours and +40 minutes, i.e. the feed instance 80 minutes before the instance start time. now(0,0) corresponds to the feed instance at the instance start time itself.
Let’s save the process.
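The arithmetic behind now(hours,minutes) is simply hours times 60 plus minutes, with negative results pointing before the instance start time. A tiny sketch of that calculation follows. The helper name now_offset_minutes is made up for illustration and is not a Falcon command:

```shell
#!/bin/sh
# Sketch only: now_offset_minutes is an illustrative helper, not part of Falcon.

# now_offset_minutes HOURS MINUTES
# Prints the offset of now(HOURS,MINUTES) from the instance start time, in minutes.
now_offset_minutes() {
    echo $(( $1 * 60 + $2 ))
}

now_offset_minutes -2 40   # prints -80, i.e. 80 minutes before the instance start
now_offset_minutes 0 0     # prints 0, i.e. the instance start time itself
```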
Define the rawEmailIngestProcess entity using XML
Click on the Edit button above the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
    <tags>email=testemail</tags>
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <outputs>
        <output name="output" feed="rawEmailFeed" instance="now(0,0)"/>
    </outputs>
    <workflow name="emailIngestWorkflow" version="4.0.1" engine="oozie" path="/user/ambari-qa/falcon/demo/apps/ingest/fs"/>
    <retry policy="exp-backoff" delay="minutes(3)" attempts="3"/>
    <ACL owner="ambari-qa" group="users" permission="0755"/>
</process>
NOTE : DO NOT copy the validity start and end times. Set them to match your current time, in the timezone you selected.
Click Finish on the top of the XML Preview area.
Verify all the values and click Next .
Save the process.
Define the cleansedEmailFeed using Wizard
As the name suggests, the raw feed you created earlier is cleansed in the process, and the feed that comes out of that process is the cleansed feed. To create the cleansed feed entity, click on the Feed button on the top of the main page on the Falcon Web UI.
NOTE : If you want to create it from XML, skip this section, and move on to the next one.
Use the following information to create the feed:
name cleansedEmailFeed
description Cleansed customer emails
tag cleanse with value cleaned
Group churnAnalysisDataPipeline
We then set the ownership information for the Feed:
Owner: ambari-qa
Group: users
Permissions: 755
For the Schema Location and Provider, enter “/none”, then click Next .
On the Properties page, set the job to run hourly by specifying the frequency as 1 hour, check the Late Arrival checkbox and specify the value as 4 hours. Change the timezone to UTC, accept the remaining default values and click Next.
Set the default storage location to
/user/ambari-qa/falcon/demo/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
Keep stats path and meta path as /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}.
Click Next .
On the Clusters page, select the primary cluster as the source cluster and set the validity to a time slice in the very near future. Provide the time in current UTC. Specify the path for the data as:
/user/ambari-qa/falcon/demo/primary/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
And enter /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR} for the stats and meta data locations.
Keep the retention period as 90 hours.
Set the target cluster to backupCluster and again set the validity start to the current time and the end time to an hour or two from now.
Specify the data path for the target as:
/falcon/demo/bcp/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
Set the statistics and meta data locations to /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}
Enter the retention period as 90 hours for the target cluster as well.
Click Next .
Verify all the entries you have filled out and then click Save .
Define the cleansedEmailFeed using XML
Click on the Edit button above the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="cleansedEmailFeed" description="Cleansed customer emails" xmlns="uri:falcon:feed:0.1">
    <tags>cleanse=cleaned</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <late-arrival cut-off="hours(4)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
            <retention limit="hours(90)" action="delete"/>
            <locations>
                <location type="data" path="/user/ambari-qa/falcon/demo/primary/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
                <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
                <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
            </locations>
        </cluster>
        <cluster name="backupCluster" type="target">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
            <retention limit="hours(90)" action="delete"/>
            <locations>
                <location type="data" path="/falcon/demo/bcp/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
                <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
                <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
            </locations>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/ambari-qa/falcon/demo/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0755"/>
    <schema location="/none" provider="/none"/>
</feed>
NOTE : DO NOT copy the validity start and end times. Set them to match your current time, in the timezone you selected.
Click Finish on the top of the XML Preview area. Save the feed.
Define the cleanseEmailProcess using Wizard
Now let’s define the cleanseEmailProcess.
Again, to create a process entity click on the Process button on the top of the main page on the Falcon Web UI.
NOTE : If you want to create it from XML, skip this section, and move on to the next one.
Create this process with the following information
process name cleanseEmailProcess
Tag cleanse with the value yes
Then assign the workflow the name:
emailCleanseWorkflow
Select Pig as the execution engine and provide the following path:
/user/ambari-qa/falcon/demo/apps/pig/id.pig
Accept the default values for the Access Control List and click Next.
For the properties, set the number of parallel processes to 1. This prevents a new instance from starting before the previous one completes.
Specify the order as first-in, first-out (FIFO).
Set the frequency to 1 hour.
Select the exp-backoff retry policy, then set the attempts to 3 and the delay to 3 minutes. Click Next .
On the Clusters page, set the validity to a time slice in the very near future and then click Next.
NOTE : Time should be specified as per the timezone selected.
On the Inputs & Outputs page, select rawEmailFeed as the input feed and cleansedEmailFeed as the output feed, then click Next.
Verify all the values you have entered and then click Save .
Define the cleanseEmailProcess using XML
Click on the Edit button above the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
    <tags>cleanse=yes</tags>
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <inputs>
        <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
    </inputs>
    <outputs>
        <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
    </outputs>
    <workflow name="emailCleanseWorkflow" version="pig-0.13.0" engine="pig" path="/user/ambari-qa/falcon/demo/apps/pig/id.pig"/>
    <retry policy="exp-backoff" delay="minutes(3)" attempts="3"/>
    <ACL owner="ambari-qa" group="users" permission="0755"/>
</process>
NOTE : DO NOT copy the validity start and end times. Set them to match your current time, in the timezone you selected.
Click Finish on the top of the XML Preview area, verify all the values and save the process.
Run the feeds
From the Falcon Web UI home page search for the Feeds we created
Select the rawEmailFeed by clicking on the checkbox
Then click on the Schedule button on the top of the search results
Next run the cleansedEmailFeed in the same way
Run the processes
From the Falcon Web UI home page, search for the processes we created
Select the cleanseEmailProcess by clicking on the checkbox
Then click on the Schedule button on the top of the search results
Next run the rawEmailIngestProcess in the same way
If you visit the Oozie process page, you can see the processes running
Input and Output of the pipeline
Now that the feeds and processes are running, we can check the datasets being ingested and emitted on HDFS.
Here is the data being ingested
and here is the data being emitted from the pipeline
Summary
In this tutorial we defined a data pipeline with Apache Falcon that cleans raw data to remove sensitive information like credit card numbers and makes it available to our marketing data science team for customer churn analysis. You may suspend the running feeds or processes if they are no longer required.
07-06-2017
08:38 AM
1 Kudo
Mirroring Datasets Between Hadoop Clusters with Apache Falcon
Introduction
Apache Falcon is a framework to simplify data pipeline processing and management on Hadoop clusters.
It provides data management services such as retention, replication across clusters, archival etc. It makes it much simpler to onboard new workflows/pipelines, with support for late data handling and retry policies. It allows you to easily define relationships between various data and processing elements and integrate with a metastore/catalog such as Hive/HCatalog. Finally, it also lets you capture lineage information for feeds and processes.
In this tutorial we are going to walk through the process of mirroring datasets between Hadoop clusters.
Prerequisites
Download Hortonworks Sandbox 2.5
Complete the Learning the Ropes of the Hortonworks Sandbox tutorial; you will need it for logging into Ambari as an administrator user.
Complete the Leveraging Apache Falcon with Your Hadoop Clusters tutorial to start the Falcon service, prepare HDFS directories for the Falcon cluster and create Falcon cluster entities.
Outline
1. Create ambari-qa user
2. Preparing HDFS Directories
3. Setting up the Mirroring Job
4. Running the Job
Summary
1. Create ambari-qa user
After creating cluster entities, let us go back to Ambari as the admin user. Click on the admin menu drop-down and then Manage Ambari.
Click the blue Users button in the bottom box.
Click the Create Local User button at the top of the page. Enter ambari-qa as the user name and then set a password for it. Enter the password again for confirmation and save the user.
You can now see the newly added ambari-qa user. Click on it to assign it a group so that it can access Ambari views.
Type "views" in the Local Group Membership box, select it and then click on the tick mark to add the ambari-qa user to the "views" group.
Now log out of Ambari as the admin user and log in to Ambari as the ambari-qa user.
2. Preparing HDFS Directories
Select the Files View and you can view the following default folders:
Navigate to /user/ambari-qa and create a new directory falcon
Click on the row of the falcon directory and then click on the Permissions button.
Add Write permission for both Group and Others and then click Save.
Now create the directories mirrorSrc and mirrorTgt under /user/ambari-qa/falcon as the source and target of the mirroring job we are about to create.
Alternatively, you can create the same directories from the command line. After creating the cluster entities, go back to the SSH terminal, switch the user to root and then to ambari-qa, and run:
hadoop fs -mkdir /user/ambari-qa/falcon
hadoop fs -mkdir /user/ambari-qa/falcon/mirrorSrc
hadoop fs -mkdir /user/ambari-qa/falcon/mirrorTgt
Now set permissions to allow access. You must be logged in as the owner of the directory /user/ambari-qa/falcon/:
hadoop fs -chmod -R 777 /user/ambari-qa/falcon
3. Setting up the Mirroring Job
To create the mirroring job, go back to the Falcon UI on your browser and click on the Create drop down.
Click Mirror in the drop-down menu to open the mirror setup page.
Provide a name of your choice. The name must be unique to the system. We named the mirror job MirrorTest.
Ensure the File System mirror type is selected, then select the appropriate Source and Target and type in the appropriate paths. In our case the source cluster is primaryCluster and the HDFS path on that cluster is /user/ambari-qa/falcon/mirrorSrc.
The target cluster is backupCluster and the HDFS path on that cluster is /user/ambari-qa/falcon/mirrorTgt.
Also set the validity of the job to start at your current time, so that when you attempt to run the job in a few minutes, the job is still within the validity period. Keep the default values in Advanced Options and then click Next.
Verify the summary information, then click Save.
4. Running the Job
Before we can run the job, we need some data to test on HDFS.
Let’s give ourselves permission to upload some data using the HDFS View in Ambari:
su - root
su hdfs
hadoop fs -chmod -R 775 /user/ambari-qa
Open Ambari from your browser at port 8080.
Then launch the HDFS view from the top right-hand corner.
Stay logged in as ambari-qa and, from the view on the Ambari console, navigate to the directory /user/ambari-qa/falcon/mirrorSrc.
Click the Upload button and upload any file you want to use.
Once uploaded, the file should appear in the directory.
Now navigate to the Falcon UI and search for the job we created. The name of the mirror job we created was MirrorTest.
Select the MirrorTest job by clicking the checkbox and then click on Schedule.
The state of the job should change from SUBMITTED to RUNNING.
After a few minutes, use the HDFS View in the Ambari console to check the /user/ambari-qa/falcon/mirrorTgt directory and you should see that your data is mirrored.
Summary
In this tutorial we walked through the process of mirroring the datasets between two cluster entities.
03-15-2018
06:48 PM
I am getting the following error while creating a cluster. Note that YARN is showing perfectly fine in Ambari. I would appreciate it if somebody could help.
2018-03-15 18:43:17,171 INFO - [1241480588@qtp-1990385139-0 - d41607f2-0715-4016-8a10-952937969cd0:amber-qa:POST//entities/submit/cluster] ~ Retrying connect to server: sandbox-hdp.hortonworks.com/172.17.0.2:8050. Already tried 20 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) (Client:906)
01-23-2018
01:25 PM
1 Kudo
After some time, with the issue showing up from time to time, I discovered it was an IP problem. Using VirtualBox fixed it because it uses a different default IP. The IP problem was caused by some restrictions set up on my computer by my company.
04-27-2017
04:16 AM
1 Kudo
hive.server2.transport.mode accepts either binary or http. hive.server2.authentication sets the authentication mode (default NONE); the options are NONE (uses plain SASL), NOSASL, KERBEROS, LDAP, PAM and CUSTOM. Read more at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-Authentication/SecurityConfiguration . I believe Tableau is asking you to select the authentication mode and not the transport mode.
09-20-2017
05:50 PM
ssh maria_dev@xxx.xxx.xxx.xxx -p 2222
Enter the password maria_dev
04-10-2017
06:15 PM
@ssanthosh Perfect, thanks!