Member since
10-19-2016
151
Posts
59
Kudos Received
17
Solutions
My Accepted Solutions
Views | Posted |
---|---|
1001 | 03-22-2018 11:48 AM |
1526 | 01-12-2018 06:25 PM |
2648 | 01-12-2018 03:56 AM |
4230 | 01-12-2018 03:38 AM |
2301 | 01-02-2018 10:29 PM |
11-28-2017
04:27 PM
1 Kudo
@Kaliyug Antagonist Hmm, could you check to make sure that the VM has been allocated enough memory? In VirtualBox, select your sandbox VM -> Settings -> System. Make sure that base memory is set to at least 8GB. Sometimes people reduce this by accident during deployment. Additionally, could you verify that the MD5 checksum of the sandbox that you downloaded is the same as what is posted on the Hortonworks download page? The MD5 checksum for the HDP 2.6.3 sandbox should be "2c5cca94a4d6fd1e8383abd930646f2a".
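If you don't have an md5 tool handy, here's a minimal Python sketch of the checksum comparison. The function names are my own, and the path you pass in is whatever your downloaded sandbox image is actually called (I'm not assuming a particular file name).

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks
    so that a multi-gigabyte sandbox image does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare the file's digest with the published one, ignoring case and
    surrounding whitespace in the published value."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Call `matches_expected(<path to your download>, "2c5cca94a4d6fd1e8383abd930646f2a")`; if it returns False, the download is probably corrupted and worth fetching again.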
11-27-2017
06:48 PM
2 Kudos
@Weiren Ding Sorry for the confusion. The immediate problem here is that you're using the latest "start_sandbox-hdp.sh" script, which was meant for the latest version of the HDP sandbox (HDP 2.6.3). Since you're using the HDP 2.6.1 sandbox, they're not necessarily compatible. Since this is the first time you're deploying the sandbox, I'd suggest getting the latest sandbox (https://hortonworks.com/downloads/#sandbox). Otherwise, if you want to stick with HDP 2.6.1, the startup script you want is here (but I recommend updating :p): https://raw.githubusercontent.com/hortonworks/data-tutorials/21503170e85fe330beffdd95f984a323a18a5a42/tutorials/hdp/sandbox-deployment-and-install-guide/assets/start_sandbox-hdp.sh
11-27-2017
05:13 PM
2 Kudos
@Abhilash Anil You should still be able to deploy HDP Sandbox with Docker, though instead of following the deployment tutorial on Hortonworks' site, you'd do the equivalent using Docker Toolbox tools.
11-27-2017
04:49 PM
1 Kudo
@Weiren Ding Could you open up start_sandbox-hdp.sh and look for a line (around line 101) that starts with "echo Waiting for ambari agent to ..."? The very next line probably has something like "... --user admin:4o12t0n ..." Change the value after the colon to whatever password you set when you reset the ambari-server credentials. Let me know how it goes!
11-27-2017
04:37 PM
2 Kudos
@Muhammad idrees Before following the suggestion in the post you listed, I would suggest considering updating to the latest HDP sandbox. It's currently at HDP 2.6.3. If you do, then installing Solr is walked through here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_solr-search-installation/content/ch_hdp-search-install-ambari.html If you want to stick with HDP 2.5, then the fix suggested in the post you listed is to edit the file: /var/lib/ambari-server/resources/stacks/HDP/2.5/repos/repoinfo.xml and remove the "-centos6" from the different entries. Let me know how it goes.
11-27-2017
04:28 PM
@Mustafa Hasanbulli Hmm, looks like it can be an SSH issue. Are you connecting to your sandbox via SSH tunneling? How are you connecting (using a config file, or SSH'ing with a -L option from a shell)? For reference
11-27-2017
04:12 PM
@Aaron Rebello Jay's answer looks good - I just wanted to post a note for future reference: When given the option of CentOS versions (or more accurately, kernels), go with the latest one when dealing with the sandbox. Future versions of the sandbox should be removing unused kernels (3.x in this case), but if you are given an option when booting the sandbox - go with the latest. Although, not making a selection will default to the first (and latest), so all is good either way.
11-18-2017
03:28 AM
1 Kudo
@Emre Berk AKIRMAK Heya, hoping you got this issue resolved but just in case: Is this happening before or after you put in a password? The first prompt is asking you to put in the current password, it may or may not (I forget, at this point) close your session if you get the current password wrong. In case I'm wrong, then it looks like an issue with ShellInABox, the service that is exposing SSH on the browser. If it is kicking you off before you get a chance to put any password in, I can take a closer look. In the meantime, if you're putting in the current password and it's still kicking you out, try SSH'ing in through a shell (or download "git bash" if you're on Windows). Let me know how it goes 🙂
11-18-2017
02:58 AM
4 Kudos
@Mustafa Hasanbulli Oh, that tutorial you linked to should no longer exist. I'll get that fixed, thanks for posting. Here's what you're looking for - it's an updated tutorial: https://hortonworks.com/tutorial/sandbox-port-forwarding-guide/section/4/
11-18-2017
02:56 AM
1 Kudo
@Weiren Ding Found this, looks promising if you're using an external drive - https://forums.virtualbox.org/viewtopic.php?f=8&t=56587
11-18-2017
02:54 AM
1 Kudo
@Steven Tam @Dave Welden Thanks for reporting this (and the resolution using Hive). We'll get this fixed and rolled back out - I reported this on the repo we use to develop tutorials/docs https://github.com/hortonworks/data-tutorials/issues/404
11-18-2017
02:43 AM
@Srinivasan Hariharan In addition to trying what @Douglas Lunt suggested, could you post any relevant log files? /var/log/mysql for example
11-18-2017
02:39 AM
@Anca Alecsandru 1) Are the other services in the sandbox running? That is, is Ambari (port 8080) running? 2) I'm guessing you're using HDP 2.4 for a specific reason, but in case you're looking to try the HDP 2.6.3 sandbox, it's available to download.
11-18-2017
02:33 AM
1 Kudo
@amir sanjar Here's something that you might also find useful: https://github.com/orendain/hortonworks-sandbox
11-18-2017
02:31 AM
1 Kudo
@amir sanjar For internal use, I have something you'd probably like - would you mind pinging me (eorendain@)? To answer this question in general, and for everyone else: With the current HDP sandbox, there is no source Dockerfile. Instead, anyone can base their own Docker image off of the HDP SB by using the "FROM" directive in their own Dockerfile once they have the sandbox Docker image (from the Hortonworks download page). It's not ideal, but it is a workaround in light of no source Dockerfile.
11-18-2017
02:19 AM
1 Kudo
@Jack Oney In VirtualBox, right click on the existing virtual machine you want to remove and hit "Remove". I've attached a screengrab (43431.png).
11-07-2017
05:43 AM
1 Kudo
@Dave Welden @Tok Luo @Rad Andric After verifying that Ambari is up and that a particular service is up (it's got the green light in Ambari), let's say NiFi for example, can you navigate to http://localhost:9090 ? Let's also go ahead and try http://localhost:19090 (notice the '1' in the port), in case you're working in a specific version of the HDF sandbox. If the localhost domain works, but sandbox-hdf.hortonworks.com does not, then it's an issue with the host aliasing. Any issues using localhost? Let's get to the bottom of this! Edgar
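If it helps, here's a rough Python sketch for checking which of those two ports is actually answering on your machine. It only tests whether a TCP connection can be opened (it does not confirm that NiFi itself is healthy), and the function names are just ones I made up for illustration.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


def first_open_port(host, candidates):
    """Return the first port in `candidates` that accepts a connection, or None."""
    for port in candidates:
        if port_is_open(host, port):
            return port
    return None
```

Try `first_open_port("localhost", [9090, 19090])` first, then the same call with "sandbox-hdf.hortonworks.com". If localhost gives you a port and the hostname gives you None, that points to the host aliasing.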
11-05-2017
01:35 AM
1 Kudo
Hey @Saravana S Good question, this may be the Docker equivalent of this question: https://community.hortonworks.com/questions/146780/i-am-getting-blank-screen-in-horton-works-when-i-c.html?childToView=146797#answer-146797 Try the process again, and when it seems to hang, open a browser and navigate to: http://localhost:8080 or http://localhost:8888 Does anything come up? If so, it may just be that the last process in the Docker startup process is hanging, but by that time all necessary sandbox services should already be up. There are a number of enhancements on the way for Sandbox, but in the meantime try that out 🙂
11-02-2017
03:39 PM
1 Kudo
@Neeraj Morar To echo what Jay had posted below, if you haven't made many changes to the sandbox, the simplest solution may be to remove the VM and redeploy a fresh instance. Otherwise, it does indeed look like a filesystem issue which may be more difficult to diagnose than a fresh reboot would be.
11-02-2017
03:31 PM
2 Kudos
Hey @Divya Sodha, good concern. We're working on a slight redesign of this tutorial to ameliorate at least part of the issue of needing as much RAM, though some of the services are simply so beefy that there is a lower bound to required memory. One way to get by with less memory for now is stopping any unused services via Ambari - that'll provide the largest savings. Pinging the designer of this particular tutorial, @jmedel, to see about including this as a step in the tutorial for the next update. Regarding your other question, about whether using SSD/flash drives would help - unfortunately, not really. The bottleneck here is RAM, and while increased disk space would help with swap, performance gains are very likely minimal considering average available disk space. Another alternative would be to leverage a single, large platform rather than two sandboxes (HDP + HDF). I'll bring this up to the tutorial designer and get a discussion going.
11-01-2017
12:53 AM
@Neeraj Morar Ah, okay - there's the culprit. See where it says "Exited ... 8 hours ago"? The internal sandbox went ahead and exited for some reason. In that same shell you had open, run: "/root/start_sandbox.sh" ... or if that file doesn't exist, then "/root/start_scripts/start_sandbox.sh" (can't quite remember at the moment). Alternatively, you can power off and power on your VM. That should bring it back up.
11-01-2017
12:46 AM
@Chiranjeevi Nimmala I have to agree with Jay in that it seems like a filesystem issue. What you're seeing may be a consequence of running Docker (which powers the sandbox) on a not-very-compatible host (your Azure machine). It may be that the kernel needs updating or the filesystem isn't recommended for Docker. If it's not too much trouble, I'd suggest deploying the HDP sandbox on a different Azure VM - try a different size VM or one with different specs. While it may very well be something else, and I haven't really seen this issue come up on Azure boxes, this seems like the culprit.
11-01-2017
12:29 AM
1 Kudo
Heya, I just wanted to point out that while the names of the repositories are the same, the one that OP mentioned in the original post is much further along than the one currently in the hortonworks repository (as of this posting). People following along with the tutorial should refer to the repository specified in the tutorial. During the time of OP's posting, the repo went momentarily missing and so the link broke, but it's back up and the tutorial should be good to go.
10-31-2017
08:47 PM
1 Kudo
Heya @Neeraj Morar. Hmm, this is odd indeed. Let's get to the bottom of this! Could you try navigating to: http://localhost:8080 ? Anything show up? How are you logging into the VM? Do you have access to a terminal? If so, punch in the following: ssh root@localhost -p 2222 (password is hadoop) Once in there, run "ambari-server restart" and report back 🙂 If nothing comes up on port 8080, it's possible Ambari is failing to come up. Here, we attempt to restart it. If you don't have access to a terminal, no biggie. Log back into the VM as you had done before and run the following: docker ps -a And let me know what gets printed. It's possible that the VM is running, but the sandbox somehow isn't running underneath. Thanks
10-31-2017
08:37 PM
Hey @Harish Kumar, sometimes VirtualBox does not display the final screen of the sequence. Could you try firing up your browser and navigating to http://localhost:8888 ? Does the sandbox splash page come up? If not, try giving it some time. If you're short on RAM, it may take a while for all the services to come up. We recommend at least 8GB for the HDP sandbox.
10-27-2017
10:52 PM
Heya, I just wanted to point out that while the names of the repositories are the same, the one that OP mentioned in the original post is further along than the one currently in the hortonworks repository (as of this posting). People following along with the tutorial should refer to the repository specified in the tutorial. Unfortunately, it went missing momentarily and so the link broke 😞 Should be back up, though 🙂
10-27-2017
10:47 PM
2 Kudos
Hey @ashok koti, as you pointed out there was an issue with the reference github repo, but it should be resolved. Following the steps in the tutorial should work once again. Let me know if you run into any other issues with the tutorial. 🙂
08-16-2017
12:50 AM
@Lucy Tan Could you verify the NiFi URL that you're trying? Is it "sandbox-hdf.hortonworks.com:9090/nifi"? Or using port 19090?
07-06-2017
09:03 AM
1 Kudo
Incremental Backup of Data from HDP to Azure using Falcon for Disaster Recovery and Burst capacity
Introduction
Apache Falcon simplifies the configuration of data motion with: replication; lifecycle management; lineage and traceability. This provides data governance consistency across Hadoop components.
Scenario
In this tutorial we will walk through a scenario where email data gets processed on multiple HDP 2.2 clusters around the country and then gets backed up hourly on a cloud-hosted cluster. In our example:
This cluster is hosted on Windows Azure.
Data arrives from all the West Coast production servers. The input data feeds are often late by up to 4 hours.
The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis.
To simulate this scenario, we have a Pig script grabbing the freely available Enron emails from the internet and feeding them into the pipeline.
Prerequisite
A cluster with Apache Hadoop 2.2 configured
A cluster with Apache Falcon configured
The easiest way to meet the above prerequisites is to download the HDP Sandbox
After downloading the environment, confirm that Apache Falcon is running. Below are the steps to validate that:
If Ambari is not configured on your Sandbox, go to http://127.0.0.1:8000/about/ and enable Ambari.
Once Ambari is enabled, navigate to Ambari at http://127.0.0.1:8080 and log in with the username admin and the password admin. Then check if Falcon is running.
If Falcon is not running, start Falcon:
Steps for the Scenario
Create cluster specification XML file
Create feed (aka dataset) specification XML file
* Reference cluster specification
Create the process specification XML file
* Reference cluster specification – defines where the process runs
* Reference feed specification – defines the datasets that the process manipulates
We have already created the necessary XML files. In this step we are going to download the specifications and use them to define the pipeline and submit the Falcon entities.
Staging the components of the App on HDFS
In this step we will stage the Pig script and the necessary folder structure for inbound and outbound feeds on HDFS:
First download this zip file called falcon.zip to your local host machine.
Navigate using your browser to the Hue – File Browser interface at http://127.0.0.1:8000/filebrowser/ to explore the HDFS.
Navigate to /user/ambari-qa folder like below:
Now we will upload the zip file we just downloaded:
This should also unzip the zip file and create a folder structure with a folder called falcon .
Setting up the destination storage on Microsoft Azure
Login to the Windows Azure portal at http://manage.windowsazure.com
Create a storage account
Wait for the storage account to be provisioned
Copy the access key and the account name in a text document. We will use the access key and the account name in later steps
The other information you will want to note down is the blob endpoint of the storage account we just created
Click on the Containers tab and create a new container called myfirstcontainer .
Configuring access to Azure Blob store from Hadoop
Login to Ambari – http://127.0.0.1:8080 with the credentials admin and admin .
Then click on HDFS from the bar on the left and then select the Configs tab.
Scroll down to the bottom of the page to the Custom hdfs-site section and click on Add property...
In the Add Property dialog, the key name will start with fs.azure.account.key. followed by your blob endpoint that you noted down in a previous step. The value will be the Azure storage key that you noted down in a previous step. Once you have filled in the values click the Add button:
Once you are back out of the new key dialog you will have to Save it by clicking on the green Save button:
Then restart all the services by clicking on the orange Restart button:
Wait for all the restarts to complete.
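The naming of the property key and of the container URL used in the next step can be sketched in Python as follows. This is only an illustration of how the strings are put together: the helper names are hypothetical, and the account saptak is simply the example account that appears in this tutorial, so substitute your own.

```python
from urllib.parse import urlparse


def blob_host(blob_endpoint):
    """Reduce a blob endpoint such as 'https://saptak.blob.core.windows.net/'
    to its bare host name, 'saptak.blob.core.windows.net'."""
    if "://" not in blob_endpoint:
        # urlparse only fills in netloc when the value starts with '//'.
        blob_endpoint = "//" + blob_endpoint
    return urlparse(blob_endpoint).netloc


def account_key_property(blob_endpoint):
    """Name of the custom hdfs-site property that holds the storage access key."""
    return "fs.azure.account.key." + blob_host(blob_endpoint)


def wasb_url(container, blob_endpoint, path=""):
    """Build the wasb:// URL for a path inside a container."""
    return "wasb://{}@{}/{}".format(container, blob_host(blob_endpoint), path.lstrip("/"))
```

For the example account, `account_key_property("https://saptak.blob.core.windows.net/")` gives the key to enter in the Add Property dialog, and `wasb_url("myfirstcontainer", "saptak.blob.core.windows.net")` gives the URL to list with hdfs dfs -ls.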
Now let’s test if we can access our container on the Azure Blob Store.
SSH in to the VM:
ssh root@127.0.0.1 -p 2222;
The password is hadoop
Issue the following command from our cluster in the SSH'd terminal, replacing saptak with your own storage account name:
hdfs dfs -ls -R wasb://myfirstcontainer@saptak.blob.core.windows.net/
Staging the specifications
From the SSH session, first we will change our user to ambari-qa . Type:
su ambari-qa
Go to the users home directory:
cd ~
Download the topology, feed and process definitions:
wget http://hortonassets.s3.amazonaws.com/tutorial/falcon/falconDemo.zip
Unzip the file:
unzip ./falconDemo.zip
Change Directory to the folder created:
cd falconChurnDemo/
Now let’s modify the cleansedEmailFeed.xml to point the backup cluster to our Azure Blob Store container.
Use vi to edit the file:
Modify the value of location element of the backupCluster
to look like this:
Then save it and quit vi.
Submit the entities to the cluster:
Cluster Specification
Cluster specification is one per cluster.
See below for a sample cluster specification file.
Back to our scenario, let's submit the 'oregon cluster' entity to Falcon. This signifies the primary Hadoop cluster located in the Oregon data center.
falcon entity -type cluster -submit -file oregonCluster.xml
Then let's submit the 'virginia cluster' entity to Falcon. This signifies the backup Hadoop cluster located in the Virginia data center.
falcon entity -type cluster -submit -file virginiaCluster.xml
If you view the XML file you will see how the cluster location and purpose have been captured in it.
Feed Specification
A feed (a.k.a dataset) signifies a location of data and its associated replication policy and late arrival cut-off time.
See below for a sample feed (a.k.a dataset) specification file.
Back to our scenario, let’s submit the source of the raw email feed. This feed signifies the raw emails that are being downloaded into the Hadoop cluster. These emails will be used by the email cleansing process.
falcon entity -type feed -submit -file rawEmailFeed.xml
Now let’s define the feed entity which will handle the end of the pipeline to store the cleansed email. This feed signifies the emails produced by the cleanse email process. It also takes care of replicating the cleansed email dataset to the backup cluster (virginia cluster)
falcon entity -type feed -submit -file cleansedEmailFeed.xml
Process
A process defines the configuration for a workflow. A workflow is a directed acyclic graph (DAG) which defines the job for the workflow engine. A process definition defines the configurations required to run the workflow job. For example, a process defines the frequency at which the workflow should run, the clusters on which the workflow should run, the inputs and outputs for the workflow, how workflow failures should be handled, how late inputs should be handled, and so on.
Here is an example of what a process specification looks like:
Back to our scenario, let’s submit the ingest and the cleanse process respectively:
The ingest process is responsible for calling the Oozie workflow that downloads the raw emails from the web into the primary Hadoop cluster under the location specified in rawEmailFeed.xml. It also takes care of handling late data arrivals.
falcon entity -type process -submit -file emailIngestProcess.xml
The cleanse process is responsible for calling the Pig script that cleans the raw emails and produces the clean emails that are then replicated to the backup Hadoop cluster.
falcon entity -type process -submit -file cleanseEmailProcess.xml
Schedule the Falcon entities
So, all that is left now is to schedule the feeds and processes to get it going.
Ingest the feed
falcon entity -type feed -schedule -name rawEmailFeed
falcon entity -type process -schedule -name rawEmailIngestProcess
Cleanse the emails
falcon entity -type feed -schedule -name cleansedEmailFeed
falcon entity -type process -schedule -name cleanseEmailProcess
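To make the ordering explicit (clusters first, then feeds, then processes, and only then scheduling), the sequence of commands from this tutorial can be assembled with a small Python sketch. It only builds the argument lists and does not run anything; the helper names are illustrative, while the flags, file names and entity names are the ones used above.

```python
def submit_command(entity_type, xml_file):
    """Argument list for submitting one entity definition to Falcon."""
    return ["falcon", "entity", "-type", entity_type, "-submit", "-file", xml_file]


def schedule_command(entity_type, name):
    """Argument list for scheduling an already submitted feed or process."""
    return ["falcon", "entity", "-type", entity_type, "-schedule", "-name", name]


# Entities in the order this tutorial submits them.
SUBMISSIONS = [
    ("cluster", "oregonCluster.xml"),
    ("cluster", "virginiaCluster.xml"),
    ("feed", "rawEmailFeed.xml"),
    ("feed", "cleansedEmailFeed.xml"),
    ("process", "emailIngestProcess.xml"),
    ("process", "cleanseEmailProcess.xml"),
]

# Feeds and processes in the order this tutorial schedules them.
SCHEDULES = [
    ("feed", "rawEmailFeed"),
    ("process", "rawEmailIngestProcess"),
    ("feed", "cleansedEmailFeed"),
    ("process", "cleanseEmailProcess"),
]


def pipeline_commands():
    """All submit commands followed by all schedule commands."""
    commands = [submit_command(kind, xml) for kind, xml in SUBMISSIONS]
    commands += [schedule_command(kind, name) for kind, name in SCHEDULES]
    return commands
```

Printing `" ".join(cmd)` for each entry of `pipeline_commands()` reproduces the command lines shown in the steps above. If you wanted to execute them, each list could be passed to `subprocess.run` from the SSH session as the ambari-qa user.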
Processing
In a few seconds you should notice that Falcon has started ingesting files from the internet and dumping them to new folders like below on HDFS:
In a couple of minutes you should notice a new folder called processed under which the files processed through the data pipeline are being emitted:
We just created an end-to-end data pipeline to process data. The power of the Apache Falcon framework is its flexibility to work with pretty much any open source or proprietary data processing products out there.
07-06-2017
08:42 AM
Define and Process Data Pipelines in Hadoop With Apache Falcon
Introduction
Apache Falcon is a framework to simplify data pipeline processing and management on Hadoop clusters.
It makes it much simpler to onboard new workflows/pipelines, with support for late data handling and retry policies. It allows you to easily define relationships between various data and processing elements and integrate with a metastore/catalog such as Apache Hive/HCatalog. Finally, it also lets you capture lineage information for feeds and processes. In this tutorial we are going to walk through the process of:
Defining the feeds and processes
Defining and executing a data pipeline to ingest, process and persist data continuously
Prerequisites
Download Hortonworks Sandbox
Complete the Learning the Ropes of the Hortonworks Sandbox tutorial; you will need it to log in to Ambari as an administrator user.
Complete the Creating Falcon Cluster tutorial to start the Falcon service, prepare HDFS directories for the Falcon cluster, and create Falcon cluster entities.
Once you have downloaded the Hortonworks sandbox and run the VM, navigate to the Ambari interface on port 8080 of the host IP address of your Sandbox VM. Login with the username of admin and password that you set for the Ambari admin user as part of the Learning the Ropes of the Hortonworks Sandbox tutorial:
Outline
Scenario
Starting Falcon
Download and stage the dataset
Define the rawEmailFeed entity using Wizard
Define the rawEmailFeed entity using XML
Define the rawEmailIngestProcess entity using Wizard
Define the rawEmailIngestProcess entity using XML
Define the cleansedEmailFeed using Wizard
Define the cleansedEmailFeed using XML
Define the cleanseEmailProcess using Wizard
Define the cleanseEmailProcess using XML
Run the feeds
Run the processes
Input and Output of the pipeline
Summary
For this tutorial, we will use the VirtualBox environment. If you are using the VMware or Azure Sandbox, your instructions may be slightly different.
Scenario
In this tutorial, we will walk through a scenario where email data lands hourly on a cluster. In our example:
This cluster is the primary cluster located in the Oregon data center.
Data arrives from all the West Coast production servers. The input data feeds are often late by up to 4 hours.
The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis.
To simulate this scenario, we have a Pig script grabbing the freely available Enron Corpus emails from the internet and feeding it into the pipeline.
Starting Falcon
By default, Falcon is not started on the sandbox, but you should have started the service while completing the Creating a Falcon Cluster tutorial. Do the following to verify that the Falcon service is started, or to start it if it was disabled.
In the Ambari UI, click on the Falcon icon in the left hand pane.
Then click on the Service Actions button on the top right.
Then, if the service is disabled, click on Start .
Once Falcon starts, Ambari should clearly indicate as below that the service has started:
Download and stage the dataset
Now let’s stage the dataset you will use for this tutorial. Although we perform many of these file operations below using the command line, you can also do the same with the HDFS Files View in Ambari.
Tip: You can copy and paste the commands from this tutorial.
First, enter the shell with your preferred shell client. For this tutorial, we will SSH into Hortonworks Sandbox with the command:
ssh root@127.0.0.1 -p 2222;
The default password is hadoop .
Then login as user hdfs :
su - hdfs
Then download the file falcon.zip with the following command:
wget http://hortonassets.s3.amazonaws.com/tutorial/falcon/falcon.zip
and then unzip with the command:
unzip falcon.zip
Now let’s give ourselves permission to upload files:
hadoop fs -chmod -R 777 /user/ambari-qa
Then let’s create a folder falcon under ambari-qa with the command:
hadoop fs -mkdir /user/ambari-qa/falcon
Now let’s upload the decompressed folder with the command
hadoop fs -copyFromLocal demo /user/ambari-qa/falcon/
Define the rawEmailFeed entity using Wizard
To create a feed entity click on the Feed button on the top of the main page on the Falcon Web UI.
NOTE : If you want to create it from XML, skip this section, and move on to the next one.
Then enter the definition for the feed by giving the feed a unique name and a description. For this tutorial we will use
rawEmailFeed
and
Raw customer email feed.
Let’s also enter a tag key and value, so we can easily locate this Feed later:
externalSystem=USWestEmailServers
Feeds can be further categorised by identifying them with one or more groups. In this demo, we will group all the Feeds together by defining the group:
churnAnalysisDataPipeline
We then set the ownership information for the Feed:
Owner: ambari-qa
Group: users
Permissions: 755
For the Schema Location and Provider, enter “/none”, then click Next.
On the Properties page, specify to run the job hourly by specifying the frequency as 1 hour, check Late Arrival Checkbox and specify the value as 1 hour. Change the timezone to UTC and click Next .
Enter the path of our data set on the Locations page:
/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
We will set the stats and meta paths to /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR} for now. Click Next .
On the Clusters page select the cluster you created, then enter today’s date and the current time for the validity start time, and enter an hour or two later for the end time. The validity time specifies the period during which the feed will run. For many feeds, validity time will be set to the time the feed is scheduled to go into production and the end time will be set into the far future. Because we are running this tutorial on the Sandbox, we want to limit the time the process will run to conserve resources.
DO NOT forget to provide the validity time in the timezone you have selected.
Specify the retention period of 90 hours, then click Next .
Save the feed.
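To see what the ${YEAR}-${MONTH}-${DAY}-${HOUR} pattern in the Locations page resolves to for a given feed instance, here is a small Python sketch. It is an illustration only: Falcon performs this substitution itself, and the function name is hypothetical.

```python
from datetime import datetime
from string import Template


def expand_feed_path(template, instance_time):
    """Fill the ${YEAR}, ${MONTH}, ${DAY} and ${HOUR} placeholders of a feed
    location with zero-padded values taken from `instance_time`."""
    values = {
        "YEAR": "{:04d}".format(instance_time.year),
        "MONTH": "{:02d}".format(instance_time.month),
        "DAY": "{:02d}".format(instance_time.day),
        "HOUR": "{:02d}".format(instance_time.hour),
    }
    # string.Template understands the same ${NAME} syntax as the feed path.
    return Template(template).substitute(values)


RAW_EMAIL_PATH = (
    "/user/ambari-qa/falcon/demo/primary/input/enron/"
    "${YEAR}-${MONTH}-${DAY}-${HOUR}"
)
```

For the instance at 05:00 on 5 June 2016, `expand_feed_path(RAW_EMAIL_PATH, datetime(2016, 6, 5, 5))` returns /user/ambari-qa/falcon/demo/primary/input/enron/2016-06-05-05, so each hourly instance of the feed lands in its own folder.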
Define the rawEmailFeed entity using XML
Then click on the Edit button over XML Preview area on the right hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="rawEmailFeed" description="Raw customer email feed" xmlns="uri:falcon:feed:0.1">
    <tags>externalSystem=USWestEmailServers</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <late-arrival cut-off="hours(1)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
            <retention limit="hours(90)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0755"/>
    <schema location="/none" provider="/none"/>
</feed>
NOTE : DO NOT copy the validity start and end times. Change them to match your current time.
Click Finish on the top of the XML Preview area.
The Falcon UI should have automatically parsed the values from the XML and populated the right fields. Once you have verified that these are the correct values, save the feed.
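Since the validity start and end must be changed to your own time and expressed in UTC, a short Python sketch like the following can produce values in the same format as the validity element above. The function name and the default window length are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def validity_window(start, hours=2):
    """Return (start, end) strings such as '2016-06-05T05:00Z' for a window
    that begins at `start` and lasts `hours` hours, both expressed in UTC.

    A naive `start` is interpreted as local time by astimezone()."""
    start_utc = start.astimezone(timezone.utc).replace(second=0, microsecond=0)
    end_utc = start_utc + timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%MZ"
    return start_utc.strftime(fmt), end_utc.strftime(fmt)
```

Calling `validity_window(datetime.now(timezone.utc))` gives a start of the current minute and an end two hours later, which you can paste into the start and end attributes.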
Define the rawEmailIngestProcess entity using Wizard
Now let's define the rawEmailIngestProcess .
To create a process entity click on the Process button on the top of the main page on the Falcon Web UI.
NOTE : If you want to create it from XML, skip this section, and move on to the next one.
This job will run on the primaryCluster. Use the information below to create the process:
Process name: rawEmailIngestProcess
Tag: email, with the value testemail
And assign the workflow the name:
emailIngestWorkflow
Select Oozie as the execution engine and provide the following path:
/user/ambari-qa/falcon/demo/apps/ingest/fs
Accept the default values for the Access Control List and click Next.
On the Properties page, select the time zone, and set the frequency to 1 hour. Set the number of parallel processes to 1. This prevents a new instance from starting prior to the previous one completing. Specify the order as first-in, first-out (FIFO).
Select the exp-backoff retry policy, then set the attempts to 3 and the delay to 3 minutes. Click Next .
On the Clusters page, select primaryCluster again. As you did previously, modify the validity to a time slice which is in the very near future; set the validity to start now and end in an hour or two. Click Next .
Note : Time should be specified as per the timezone selected.
On the Inputs & Outputs page, ignore the Inputs section. For the output, enter a name and select the rawEmailFeed we created in the previous step and specify now(0,0) for the instance.
now(hours,minutes) : now refers to the instance start time. The hours and minutes given are relative to the start time of the instance. For example, now(-2,40) corresponds to the feed instance at -2 hr and +40 minutes, i.e. the feed instance 80 mins before the instance start time. now(0,0) corresponds to the feed instance at the current time.
Let's save the process.
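The arithmetic behind now(hours,minutes) can be checked with a few lines of Python. This is only a sketch of the offset calculation described above, not Falcon's own implementation, and the function name is made up for the example.

```python
from datetime import datetime, timedelta


def now_instance(instance_start, hours, minutes):
    """Time of the feed instance selected by now(hours, minutes): the hours and
    minutes are added, with their signs, to the process instance start time."""
    return instance_start + timedelta(hours=hours, minutes=minutes)
```

With a process instance starting at `datetime(2016, 6, 5, 5, 0)`, `now_instance(start, -2, 40)` is 03:40 on the same day, which is 80 minutes before the start, and `now_instance(start, 0, 0)` is the start time itself.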
Define the rawEmailIngestProcess entity using XML
Click on the Edit button over XML Preview area on the right hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
    <tags>email=testemail</tags>
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <outputs>
        <output name="output" feed="rawEmailFeed" instance="now(0,0)"/>
    </outputs>
    <workflow name="emailIngestWorkflow" version="4.0.1" engine="oozie" path="/user/ambari-qa/falcon/demo/apps/ingest/fs"/>
    <retry policy="exp-backoff" delay="minutes(3)" attempts="3"/>
    <ACL owner="ambari-qa" group="users" permission="0755"/>
</process>
NOTE : DO NOT copy the validity start and end times. Change them to match your current time.
Click Finish on the top of the XML Preview area.
Verify all the values and click Next .
Save the process.
Define the cleansedEmailFeed using Wizard
As the name suggests, the raw feeds which you have created earlier get cleansed in the Process and the feed which comes out from that Process is the cleansed feed. So, to create a cleansed feed entity click on the Feed button on the top of the main page on the Falcon Web UI.
NOTE : If you want to create it from XML, skip this section, and move on to the next one.
Use the following information to create the feed:
name cleansedEmailFeed
description Cleansed customer emails
tag cleanse with value cleaned
Group churnAnalysisDataPipeline
We then set the ownership information for the Feed:
Owner: ambari-qa
Group: users
Permissions: 755
For the Schema Location and Provider, enter “/none”, then click Next .
On the Properties page, set the job to run hourly by specifying a frequency of 1 hour, check the Late Arrival checkbox and set its value to 4 hours. Change the timezone to UTC, accept the remaining default values and click Next .
Set the default storage location to
/user/ambari-qa/falcon/demo/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
Keep stats path and meta path as /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}.
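To see what these location templates look like for a concrete instance, the substitution can be sketched in Python. The helper expand_path is hypothetical and only mimics the four variables used in this tutorial; Falcon performs the real substitution itself.

```python
from datetime import datetime, timezone

def expand_path(template: str, instance_time: datetime) -> str:
    """Substitute the ${YEAR}, ${MONTH}, ${DAY} and ${HOUR} variables in a path."""
    values = {
        "${YEAR}": f"{instance_time:%Y}",
        "${MONTH}": f"{instance_time:%m}",
        "${DAY}": f"{instance_time:%d}",
        "${HOUR}": f"{instance_time:%H}",
    }
    for variable, value in values.items():
        template = template.replace(variable, value)
    return template

template = "/user/ambari-qa/falcon/demo/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"
instance = datetime(2016, 6, 5, 5, 0, tzinfo=timezone.utc)
print(expand_path(template, instance))
```

Each hourly instance therefore gets its own directory, which is what lets retention remove old instances one at a time.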
Click Next .
On the Clusters page, select primaryCluster as the source cluster and modify the validity to a time slice in the very near future, given in current UTC. Specify the path for the data as:
/user/ambari-qa/falcon/demo/primary/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
And enter /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR} for the stats and meta data locations.
Keep the retention period as 90 hours.
Set the target cluster to backupCluster, and again set the validity start to the current time and the end time to an hour or two from now.
Specify the data path for the target as:
/falcon/demo/bcp/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
Set the statistics and meta data locations to /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}.
Enter the retention period as 90 hours for target cluster as well.
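As a rough illustration of what a 90 hour retention period means, the Python sketch below picks out the instances that would be older than the limit at a given moment. The function expired_instances is hypothetical, and the strict "older than" comparison is an assumption; Falcon's own boundary handling may differ.

```python
from datetime import datetime, timedelta, timezone

def expired_instances(instance_times, now, retention_hours=90):
    """Return the instance times older than the retention limit."""
    cutoff = now - timedelta(hours=retention_hours)
    return [t for t in instance_times if t < cutoff]

now = datetime(2016, 6, 10, 0, 0, tzinfo=timezone.utc)
instances = [
    datetime(2016, 6, 5, 5, 0, tzinfo=timezone.utc),  # 115 hours old
    datetime(2016, 6, 8, 0, 0, tzinfo=timezone.utc),  # 48 hours old
]
for t in expired_instances(instances, now):
    print("eligible for deletion:", t)
```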
Click Next .
Verify all the entries you have filled out and then click Save .
Define the cleansedEmailFeed using XML
Click the Edit button above the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="cleansedEmailFeed" description="Cleansed customer emails" xmlns="uri:falcon:feed:0.1">
    <tags>cleanse=cleaned</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <late-arrival cut-off="hours(4)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
            <retention limit="hours(90)" action="delete"/>
            <locations>
                <location type="data" path="/user/ambari-qa/falcon/demo/primary/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
                <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
                <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
            </locations>
        </cluster>
        <cluster name="backupCluster" type="target">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
            <retention limit="hours(90)" action="delete"/>
            <locations>
                <location type="data" path="/falcon/demo/bcp/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
                <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
                <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
            </locations>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/ambari-qa/falcon/demo/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0755"/>
    <schema location="/none" provider="/none"/>
</feed>
NOTE : DO NOT copy the validity start and end times. Change them to match your current time in UTC.
Click Finish at the top of the XML Preview area, then save the feed.
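After editing the validity times by hand, it can be useful to check locally that the XML is still well-formed and that each cluster carries the values you expect. The sketch below uses Python's standard xml.etree.ElementTree on a trimmed copy of the feed definition (stats and meta locations omitted); the function summarize_feed is hypothetical and does not replace Falcon's own validation.

```python
import xml.etree.ElementTree as ET

NS = {"f": "uri:falcon:feed:0.1"}

# Trimmed copy of the cleansedEmailFeed definition, for illustration only.
FEED_XML = """<feed name="cleansedEmailFeed" xmlns="uri:falcon:feed:0.1">
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
      <locations>
        <location type="data" path="/user/ambari-qa/falcon/demo/primary/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
    <cluster name="backupCluster" type="target">
      <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
      <locations>
        <location type="data" path="/falcon/demo/bcp/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>
</feed>"""

def summarize_feed(xml_text):
    """Return the feed name and, per cluster, its type, validity and data path."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is not well-formed
    summary = {}
    for cluster in root.findall("f:clusters/f:cluster", NS):
        validity = cluster.find("f:validity", NS)
        data = cluster.find("f:locations/f:location[@type='data']", NS)
        summary[cluster.get("name")] = {
            "type": cluster.get("type"),
            "start": validity.get("start"),
            "end": validity.get("end"),
            "data_path": data.get("path"),
        }
    return root.get("name"), summary

name, clusters = summarize_feed(FEED_XML)
print(name)
for cluster_name, info in clusters.items():
    print(cluster_name, info)
```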
Define the cleanseEmailProcess using Wizard
Now let's define the cleanseEmailProcess .
Again, to create a process entity, click the Process button at the top of the main page of the Falcon Web UI.
NOTE : If you want to create it from XML, skip this section, and move on to the next one.
Create this process with the following information:
process name cleanseEmailProcess
Tag cleanse with the value yes
Then assign the workflow the name:
emailCleanseWorkflow
Select Pig as the execution engine and provide the following path:
/user/ambari-qa/falcon/demo/apps/pig/id.pig
Accept the default values for Access Control List and click Next .
For the properties, set the number of parallel processes to 1; this prevents a new instance from starting before the previous one completes.
Specify the order as first-in, first-out (FIFO).
Set the Frequency to 1 hour.
Select the exp-backoff retry policy, then set the attempts to 3 and the delay to 3 minutes. Click Next .
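To get a feel for what an exponential backoff with 3 attempts and a 3 minute delay looks like, here is a small Python sketch. It assumes the wait simply doubles on each retry, starting from the base delay; the exact formula Falcon uses for exp-backoff is not stated in this tutorial and may differ, and the function backoff_delays is hypothetical.

```python
def backoff_delays(base_minutes=3, attempts=3, factor=2):
    """Return the wait before each retry, assuming the delay multiplies each time."""
    return [base_minutes * factor ** attempt for attempt in range(attempts)]

for number, delay in enumerate(backoff_delays(), start=1):
    print(f"retry {number}: wait {delay} minutes")
```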
On the Clusters page, modify the validity to a time slice in the very near future and then click Next .
NOTE : Time should be specified as per the timezone selected.
Select rawEmailFeed as the input feed and cleansedEmailFeed as the output feed, then click Next.
Verify all the values you have entered and then click Save .
Define the cleanseEmailProcess using XML
Click the Edit button above the XML Preview area on the right-hand side of the screen and replace the XML content with the XML document below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
    <tags>cleanse=yes</tags>
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <inputs>
        <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
    </inputs>
    <outputs>
        <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
    </outputs>
    <workflow name="emailCleanseWorkflow" version="pig-0.13.0" engine="pig" path="/user/ambari-qa/falcon/demo/apps/pig/id.pig"/>
    <retry policy="exp-backoff" delay="minutes(3)" attempts="3"/>
    <ACL owner="ambari-qa" group="users" permission="0755"/>
</process>
NOTE : DO NOT copy the validity start and end times. Change them to match your current time in UTC.
Click Finish at the top of the XML Preview area, verify all the values and save the process.
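With both processes defined, a quick way to confirm the pipeline is wired as intended is to read the input and output feed names out of each process definition. The Python sketch below does this on trimmed copies of the two process documents; process_io is a hypothetical helper, and the check is only a local convenience.

```python
import xml.etree.ElementTree as ET

NS = {"p": "uri:falcon:process:0.1"}

# Trimmed copies of the two process definitions, keeping only inputs and outputs.
INGEST_XML = """<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
  <outputs>
    <output name="output" feed="rawEmailFeed" instance="now(0,0)"/>
  </outputs>
</process>"""

CLEANSE_XML = """<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
  <inputs>
    <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
  </outputs>
</process>"""

def process_io(xml_text):
    """Return (process name, input feed names, output feed names)."""
    root = ET.fromstring(xml_text)
    inputs = [e.get("feed") for e in root.findall("p:inputs/p:input", NS)]
    outputs = [e.get("feed") for e in root.findall("p:outputs/p:output", NS)]
    return root.get("name"), inputs, outputs

for xml_text in (INGEST_XML, CLEANSE_XML):
    name, inputs, outputs = process_io(xml_text)
    print(f"{inputs or ['(none)']} -> {name} -> {outputs}")
```

The output of rawEmailIngestProcess should appear as the input of cleanseEmailProcess; if the feed names do not match, the second process will never find its data.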
Run the feeds
From the Falcon Web UI home page, search for the feeds we created.
Select the rawEmailFeed by clicking on its checkbox.
Then click on the Schedule button at the top of the search results.
Next, run the cleansedEmailFeed in the same way.
Run the processes
From the Falcon Web UI home page, search for the processes we created.
Select the cleanseEmailProcess by clicking on its checkbox.
Then click on the Schedule button at the top of the search results.
Next, run the rawEmailIngestProcess in the same way.
If you visit the Oozie process page, you can see the processes running.
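The same scheduling steps can also be driven through Falcon's REST API rather than the Web UI. The Python sketch below only builds and prints the schedule URLs in the order used above; it does not send any request. The host name, port 15000 and the ambari-qa user are assumptions about your sandbox, and schedule_url is a hypothetical helper, so check them against your own Falcon server before issuing an HTTP POST to each URL.

```python
from urllib.parse import quote, urlencode

FALCON_URL = "http://sandbox.hortonworks.com:15000"  # assumed host and port

# Same order as in the steps above: feeds first, then processes.
ENTITIES = [
    ("feed", "rawEmailFeed"),
    ("feed", "cleansedEmailFeed"),
    ("process", "cleanseEmailProcess"),
    ("process", "rawEmailIngestProcess"),
]

def schedule_url(base_url, entity_type, entity_name, user="ambari-qa"):
    """Build the schedule URL for one entity; the user name is an assumption."""
    path = f"/api/entities/schedule/{quote(entity_type)}/{quote(entity_name)}"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'user.name': user})}"

for entity_type, entity_name in ENTITIES:
    print("POST", schedule_url(FALCON_URL, entity_type, entity_name))
```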
Input and Output of the pipeline
Now that the feeds and processes are running, we can check on HDFS both the dataset being ingested and the dataset coming out of the pipeline.
Here is the data being ingested.
And here is the data coming out of the pipeline.
Summary
In this tutorial we defined a data pipeline with Apache Falcon that cleans raw email data, removing sensitive information such as credit card numbers, and makes it available to our marketing data science team for customer churn analysis. You may suspend the running feeds or processes once they are no longer required.