Member since: 10-19-2016

151 Posts · 59 Kudos Received · 17 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2507 | 03-22-2018 11:48 AM |
|  | 3954 | 01-12-2018 06:25 PM |
|  | 7112 | 01-12-2018 03:56 AM |
|  | 9229 | 01-12-2018 03:38 AM |
|  | 4763 | 01-02-2018 10:29 PM |
			
    
	
		
		
Posted 11-28-2017 05:36 AM · 1 Kudo
					
							 Thanks! It works with 2.6.3. Didn't realize both script and sandbox were updated just a few days ago.  
						
					
Posted 10-28-2017 04:01 AM
					
Thanks for not leaving the thread without a proper solution. @Rafael Coss HCC rocks 🙂
						
					
Posted 07-06-2017 09:03 AM · 1 Kudo
					
							 Incremental Backup of Data from HDP to Azure using Falcon for Disaster Recovery and Burst capacity 
 Introduction 
 Apache Falcon simplifies the configuration of data motion with replication, lifecycle management, lineage, and traceability, providing data governance consistency across Hadoop components. 
 Scenario 
 In this tutorial we will walk through a scenario where email data gets processed on multiple HDP 2.2 clusters around the country and then gets backed up hourly on a cloud-hosted cluster. In our example: 
 
   This cluster is hosted on Windows Azure. 
   Data arrives from all the West Coast production servers. The input data feeds are often late by up to 4 hours. 
 
 The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis. 
 To simulate this scenario, we have a Pig script that grabs the freely available Enron emails from the internet and feeds them into the pipeline. 
    
 Prerequisite 
 
   A cluster with Apache Hadoop 2.2 configured 
   A cluster with Apache Falcon configured 
 
 The easiest way to meet the above prerequisites is to download the HDP Sandbox. 
 After downloading the environment, confirm that Apache Falcon is running. Below are the steps to validate that: 
 
   If Ambari is not enabled on your Sandbox, go to  http://127.0.0.1:8000/about/  and enable Ambari. 
 
    
 
   Once Ambari is enabled, navigate to Ambari at  http://127.0.0.1:8080  and log in with the username  admin  and password  admin . Then check whether Falcon is running. 
 
    
 
   If Falcon is not running, start Falcon: 
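 If you prefer the command line over the Ambari UI, one option is the Ambari REST API. The snippet below is a rough sketch only: it assumes the default sandbox credentials (admin/admin) and a cluster named Sandbox, both of which may differ in your environment.

  # Sketch: start the Falcon service through the Ambari REST API.
  # "Sandbox" as the cluster name and admin/admin credentials are assumptions.
  curl -u admin:admin -H "X-Requested-By: ambari" -X PUT \
    -d '{"RequestInfo":{"context":"Start Falcon"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
    http://127.0.0.1:8080/api/v1/clusters/Sandbox/services/FALCON

 Otherwise, simply use the Start action for Falcon from the Ambari Services page. 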
 
    
 Steps for the Scenario 
 
   Create the cluster specification XML file 
   Create the feed (aka dataset) specification XML file 
     * Reference the cluster specification 
   Create the process specification XML file 
     * Reference the cluster specification – defines where the process runs 
     * Reference the feed specification – defines the datasets that the process manipulates 
 
 We have already created the necessary XML files. In this step we are going to download the specifications and use them to define the data pipeline and submit the Falcon entities. 
 Staging the components of the app on HDFS 
 In this step we will stage the Pig script and the necessary folder structure for inbound and outbound feeds on HDFS: 
 First download this zip file called  falcon.zip  to your local host machine. 
 Navigate using your browser to the Hue – File Browser interface at http://127.0.0.1:8000/filebrowser/ to explore HDFS. 
 Navigate to the  /user/ambari-qa  folder as shown below: 
    
 Now we will upload the zip file we just downloaded: 
    
 This should also unzip the zip file and create a folder structure with a folder called  falcon . 
 Setting up the destination storage on Microsoft Azure 
 Login to the Windows Azure portal at http://manage.windowsazure.com 
    
 Create a storage account 
    
 Wait for the storage account to be provisioned 
    
 Copy the access key and the account name into a text document. We will use the access key and the account name in later steps. 
    
 The other information you will want to note down is the blob endpoint of the storage account we just created 
    
 Click on the  Containers  tab and create a new container called  myfirstcontainer . 
    
 Configuring access to Azure Blob store from Hadoop 
 Login to Ambari – http://127.0.0.1:8080 with the credentials  admin  and  admin . 
    
 Then click on HDFS from the bar on the left and then select the  Configs  tab. 
    
 Scroll down to the bottom of the page to the  Custom hdfs-site  section and click on  Add property...  
    
 In the  Add Property  dialog, the key name will start with  fs.azure.account.key.  followed by your blob endpoint that you noted down in a previous step. The value will be the Azure storage key that you noted down in a previous step. Once you have filled in the values click the  Add  button: 
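 For illustration only, assuming a hypothetical storage account named mystorageaccount, the resulting custom hdfs-site entry would be equivalent to the following property (the value is the access key you copied earlier):

  <!-- "mystorageaccount" is an example; use your own blob endpoint and access key -->
  <property>
    <name>fs.azure.account.key.mystorageaccount.blob.core.windows.net</name>
    <value>YOUR_AZURE_STORAGE_ACCESS_KEY</value>
  </property>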
    
 Once you are back out of the new key dialog you will have to  Save  it by clicking on the green  Save  button: 
    
 Then restart all the services by clicking on the orange  Restart  button: 
    
 Wait for all the restarts to complete 
    
 Now let's test whether we can access our container on the Azure Blob Store. 
 SSH into the VM: 
  ssh root@127.0.0.1 -p 2222;  
 The password is  hadoop . 
 Then, from the SSH session, issue the following command against our cluster: 
  hdfs dfs -ls -R wasb://myfirstcontainer@saptak.blob.core.windows.net/  
    
 Staging the specifications 
 From the SSH session, first we will change our user to  ambari-qa . Type: 
  su ambari-qa  
 Go to the user's home directory: 
  cd ~  
 Download the topology, feed and process definitions: 
  wget http://hortonassets.s3.amazonaws.com/tutorial/falcon/falconDemo.zip  
    
 Unzip the file: 
  unzip ./falconDemo.zip  
 Change Directory to the folder created: 
  cd falconChurnDemo/  
 Now let’s modify the  cleansedEmailFeed.xml  to point the backup cluster to our Azure Blob Store container. 
 Use  vi  to edit the file: 
    
 Modify the value of the  location  element of the  backupCluster  
    
 to look like this: 
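 The original screenshot is not preserved here; as a sketch, reusing the container and storage account from the earlier Azure steps (the sub-path is illustrative), the backupCluster data location would point at a WASB path along these lines:

  <!-- illustrative: backupCluster data location redirected to the Azure Blob Store container -->
  <location type="data" path="wasb://myfirstcontainer@saptak.blob.core.windows.net/falcon/demo/bcp/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>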
    
 Then save it and quit vi. 
 Submit the entities to the cluster: 
 Cluster Specification 
 Cluster specification is one per cluster. 
 See below for a sample cluster specification file. 
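 Since the sample file image is not preserved here, a representative Falcon cluster entity is sketched below; the cluster name, endpoints, and version numbers are illustrative and should match your own environment:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Sketch of a Falcon cluster entity; endpoints below are illustrative sandbox values -->
  <cluster name="primaryCluster" description="Primary cluster in the Oregon data center" colo="USWestOregon" xmlns="uri:falcon:cluster:0.1">
      <interfaces>
          <interface type="readonly" endpoint="hftp://sandbox.hortonworks.com:50070" version="2.2.0"/>
          <interface type="write" endpoint="hdfs://sandbox.hortonworks.com:8020" version="2.2.0"/>
          <interface type="execute" endpoint="sandbox.hortonworks.com:8050" version="2.2.0"/>
          <interface type="workflow" endpoint="http://sandbox.hortonworks.com:11000/oozie/" version="4.0.0"/>
          <interface type="messaging" endpoint="tcp://sandbox.hortonworks.com:61616?daemon=true" version="5.1.6"/>
      </interfaces>
      <locations>
          <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
          <location name="temp" path="/tmp"/>
          <location name="working" path="/apps/falcon/primaryCluster/working"/>
      </locations>
  </cluster>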
    
 Back to our scenario, let's submit the ‘oregon cluster’ entity to Falcon. This signifies the primary Hadoop cluster located in the Oregon data center. 
  falcon entity -type cluster -submit -file oregonCluster.xml  
 Then let's submit the ‘virginia cluster’ entity to Falcon. This signifies the backup Hadoop cluster located in the Virginia data center. 
  falcon entity -type cluster -submit -file virginiaCluster.xml  
 If you view the XML files you will see how the cluster location and purpose have been captured. 
 Feed Specification 
 A feed (a.k.a dataset) signifies a location of data and its associated replication policy and late arrival cut-off time. 
 See below for a sample feed (a.k.a dataset) specification file. 
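 Since the sample file image is not preserved here, an abbreviated feed entity is sketched below (adapted from the rawEmailFeed used elsewhere in these tutorials; the cluster name, validity window, and paths are illustrative). It shows the frequency, late-arrival cut-off, retention, and data location that make up a feed:

  <!-- Abbreviated sketch of a feed entity; names, dates, and paths are illustrative -->
  <feed name="rawEmailFeed" description="Raw customer email feed" xmlns="uri:falcon:feed:0.1">
      <frequency>hours(1)</frequency>
      <timezone>UTC</timezone>
      <late-arrival cut-off="hours(1)"/>
      <clusters>
          <cluster name="primaryCluster" type="source">
              <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
              <retention limit="hours(90)" action="delete"/>
          </cluster>
      </clusters>
      <locations>
          <location type="data" path="/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
      <ACL owner="ambari-qa" group="users" permission="0755"/>
      <schema location="/none" provider="/none"/>
  </feed>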
    
 Back to our scenario, let’s submit the source of the raw email feed. This feed signifies the raw emails that are being downloaded into the Hadoop cluster. These emails will be used by the email cleansing process. 
  falcon entity -type feed -submit -file rawEmailFeed.xml  
 Now let’s define the feed entity which will handle the end of the pipeline to store the cleansed email. This feed signifies the emails produced by the cleanse email process. It also takes care of replicating the cleansed email dataset to the backup cluster (virginia cluster) 
  falcon entity -type feed -submit -file cleansedEmailFeed.xml  
 Process 
 A process defines the configuration for a workflow. A workflow is a directed acyclic graph (DAG) which defines the job for the workflow engine. A process definition specifies the configuration required to run the workflow job: for example, the frequency at which the workflow should run, the clusters on which it should run, the inputs and outputs for the workflow, how workflow failures should be handled, how late inputs should be handled, and so on. 
 Here is an example of what a process specification looks like: 
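 Since the example image is not preserved here, the general shape of a process entity is sketched below (adapted from the emailIngestProcess used in these tutorials; the cluster name, validity window, and workflow path are illustrative):

  <!-- Abbreviated sketch of a process entity; names, dates, and paths are illustrative -->
  <process name="emailIngestProcess" xmlns="uri:falcon:process:0.1">
      <clusters>
          <cluster name="primaryCluster">
              <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
          </cluster>
      </clusters>
      <parallel>1</parallel>
      <order>FIFO</order>
      <frequency>hours(1)</frequency>
      <timezone>UTC</timezone>
      <outputs>
          <output name="output" feed="rawEmailFeed" instance="now(0,0)"/>
      </outputs>
      <workflow name="emailIngestWorkflow" version="4.0.1" engine="oozie" path="/user/ambari-qa/falcon/demo/apps/ingest/fs"/>
      <retry policy="exp-backoff" delay="minutes(3)" attempts="3"/>
  </process>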
    
 Back to our scenario, let’s submit the ingest and the cleanse process respectively: 
 The ingest process is responsible for calling the Oozie workflow that downloads the raw emails from the web into the primary Hadoop cluster under the location specified in rawEmailFeed.xml. It also takes care of handling late data arrivals. 
  falcon entity -type process -submit -file emailIngestProcess.xml  
 The cleanse process is responsible for calling the Pig script that cleans the raw emails and produces the clean emails, which are then replicated to the backup Hadoop cluster. 
  falcon entity -type process -submit -file cleanseEmailProcess.xml  
 Schedule the Falcon entities 
 So, all that is left now is to schedule the feeds and processes to get it going. 
 Ingest the feed 
  falcon entity -type feed -schedule -name rawEmailFeed  
  falcon entity -type process -schedule -name rawEmailIngestProcess  
 Cleanse the emails 
  falcon entity -type feed -schedule -name cleansedEmailFeed  
  falcon entity -type process -schedule -name cleanseEmailProcess  
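 If you want to confirm that everything was scheduled, the Falcon CLI can report the status of each entity; as a quick sketch:

  falcon entity -type feed -status -name cleansedEmailFeed 
  falcon entity -type process -status -name cleanseEmailProcess 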
 Processing 
 In a few seconds you should notice that Falcon has started ingesting files from the internet and dumping them into new folders like the ones below on HDFS: 
    
 In a couple of minutes you should notice a new folder called processed under which the files processed through the data pipeline are being emitted: 
    
 We just created an end-to-end data pipeline to process data. The power of the Apache Falcon framework is its flexibility to work with pretty much any open source or proprietary data processing products out there. 
						
					
Posted 07-06-2017 08:42 AM
					
							 Define and Process Data Pipelines in Hadoop With Apache Falcon 
 Introduction 
 Apache Falcon is a framework to simplify data pipeline processing and management on Hadoop clusters. 
 It makes it much simpler to onboard new workflows/pipelines, with support for late data handling and retry policies. It allows you to easily define relationships between various data and processing elements and to integrate with a metastore/catalog such as Apache Hive/HCatalog. Finally, it also lets you capture lineage information for feeds and processes. In this tutorial we are going to walk through the process of: 
 
   Defining the feeds and processes 
   Defining and executing a data pipeline to ingest, process and persist data continuously 
 
 Prerequisites 
 
   Download Hortonworks Sandbox 
   Complete the Learning the Ropes of the Hortonworks Sandbox tutorial; you will need it for logging into Ambari as an administrator user. 
   Complete the Creating Falcon Cluster tutorial to start the Falcon service, prepare HDFS directories for the Falcon cluster, and create the Falcon cluster entities. 
 
 Once you have downloaded the Hortonworks sandbox and run the VM, navigate to the Ambari interface on port  8080  of the host IP address of your Sandbox VM. Login with the username of  admin  and password that you set for the Ambari admin user as part of the  Learning the Ropes of the Hortonworks Sandbox  tutorial: 
    
 Outline 
 
   Scenario 
   Starting Falcon 
   Download and stage the dataset 
   Define the rawEmailFeed entity using Wizard 
   Define the rawEmailFeed entity using XML 
   Define the rawEmailIngestProcess entity using Wizard 
   Define the rawEmailIngestProcess entity using XML 
   Define the cleansedEmailFeed using Wizard 
   Define the cleansedEmailFeed using XML 
   Define the cleanseEmailProcess using Wizard 
   Define the cleanseEmailProcess using XML 
   Run the feeds 
   Run the processes 
   Input and Output of the pipeline 
   Summary 
 
 For this tutorial, we will use the VirtualBox environment; for VMware and Azure Sandbox users, the instructions may differ slightly. 
 Scenario  
 In this tutorial, we will walk through a scenario where email data lands hourly on a cluster. In our example: 
 
   This cluster is the primary cluster located in the Oregon data center. 
   Data arrives from all the West Coast production servers. The input data feeds are often late by up to 4 hours. 
 
 The goal is to clean the raw data to remove sensitive information like credit card numbers and make it available to our marketing data science team for customer churn analysis. 
 To simulate this scenario, we have a Pig script that grabs the freely available Enron Corpus emails from the internet and feeds them into the pipeline. 
    
 Starting Falcon  
 By default, Falcon is not started on the sandbox, but you should have started the service while completing the  Creating a Falcon Cluster  tutorial. Do the following to verify that the Falcon service is started, or to start it if it was disabled. 
 In the Ambari UI,  click on the Falcon icon in the left hand pane. 
 Then click on the Service Actions button on the top right. 
 Then, if the service is disabled, click on  Start . 
 Once Falcon starts, Ambari should clearly indicate as below that the service has started: 
    
 Download and stage the dataset  
 Now let’s stage the dataset you will use for this tutorial. Although we perform many of these file operations below using the command line, you can also do the same with the  HDFS Files  View  in Ambari. 
 
   Tip: You can copy and paste the commands from this tutorial. 
 
 First, enter the shell with your preferred shell client. For this tutorial, we will SSH into Hortonworks Sandbox with the command: 
   ssh root@127.0.0.1 -p 2222;
  
 
    
 The default password is  hadoop . 
 Then login as user  hdfs : 
   su - hdfs
  
 
 Then download the file falcon.zip with the following command: 
   wget http://hortonassets.s3.amazonaws.com/tutorial/falcon/falcon.zip
  
 
 and then unzip with the command: 
   unzip falcon.zip
  
 
    
 Now let’s give ourselves permission to upload files: 
   hadoop fs -chmod -R 777 /user/ambari-qa
  
 
 Then let’s create a folder  falcon  under  ambari-qa  with the command: 
   hadoop fs -mkdir /user/ambari-qa/falcon
  
 
    
 Now let’s upload the decompressed folder with the command 
   hadoop fs -copyFromLocal demo /user/ambari-qa/falcon/
  
 
    
 Define the rawEmailFeed entity using Wizard  
 To create a feed entity click on the  Feed  button on the top of the main page on the Falcon Web UI. 
 NOTE : If you want to create it from XML, skip this section, and move on to the next one. 
 Then enter the definition for the feed by giving the feed a unique name and a description.  For this tutorial we will use 
   rawEmailFeed
  
 
 and 
   Raw customer email feed.
  
 
 Let’s also enter a tag key and value, so we can easily locate this Feed later: 
   externalSystem=USWestEmailServers
  
 
 Feeds can be further categorised by identifying them with one or more groups.  In this demo, we will group all the Feeds together by defining the group: 
   churnAnalysisDataPipeline
  
 
 We then set the ownership information for the Feed: 
   Owner:  ambari-qa
Group:  users
Permissions: 755
  
 
 For the Schema Location and Provider, enter “/none”, then click Next. 
    
 On the Properties page, set the frequency to 1 hour so the job runs hourly, check the Late Arrival checkbox and set the value to 1 hour. Change the timezone to UTC and click  Next . 
    
 Enter the path of our data set on the Locations page: 
   /user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
  
 
 We will set the stats and meta paths to /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR} for now. Click  Next . 
    
 On the Clusters page select the cluster you created, then enter today’s date and the current time for the validity start time, and enter an hour or two later for the end time.  The validity time specifies the period during which the feed will run.  For many feeds, validity time will be set to the time the feed is scheduled to go into production and the end time will be set into the far future.  Because we are running this tutorial on the Sandbox, we want to limit the time the process will run to conserve resources. 
 DO NOT forget to provide the validity time in the timezone you have selected. 
 Specify the retention period of 90 hours, then click  Next . 
    
 Save the feed. 
    
 Define the rawEmailFeed entity using XML  
 Then click on the  Edit  button over XML Preview area on the right hand side of the screen and replace the XML content with the XML document below: 
   <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="rawEmailFeed" description="Raw customer email feed" xmlns="uri:falcon:feed:0.1">
    <tags>externalSystem=USWestEmailServers</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <late-arrival cut-off="hours(1)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
            <retention limit="hours(90)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0755"/>
    <schema location="/none" provider="/none"/>
</feed>
  
 
 NOTE : DO NOT copy the validity start and end time. Change it as per your time. 
 Click  Finish  on the top of the XML Preview area. 
    
 The Falcon UI should have automatically parsed the values out of the XML and populated the corresponding fields. Once you have verified that these are the correct values, save the feed. 
 Define the rawEmailIngestProcess entity using Wizard  
 Now let's define the  rawEmailIngestProcess . 
 To create a process entity click on the  Process  button on the top of the main page on the Falcon Web UI. 
 NOTE : If you want to create it from XML, skip this section, and move on to the next one. 
 This job will run on the primaryCluster. Use the information below to create the process: 
   process name rawEmailIngestProcess
Tags email
With the value: testemail
  
 
 And assign the workflow the name: 
   emailIngestWorkflow
  
 
 Select Oozie as the execution engine and provide the following path: 
   /user/ambari-qa/falcon/demo/apps/ingest/fs
  
 
 Accept the default values for the Access Control List and click Next. 
    
 On the Properties page, select the time zone, and set the frequency to 1 hour. Set the number of parallel processes to 1. This prevents a new instance from starting prior to the previous one completing. Specify the order as first-in, first-out (FIFO).
Select the exp-backoff retry policy, then set the attempts to 3 and the delay to 3 minutes. Click  Next . 
    
 On the Clusters page, select primaryCluster again. As you did previously, modify the validity to a time slice which is in the very near future; set the validity to start now and end in an hour or two. Click  Next . 
 Note : Time should be specified as per the timezone selected. 
    
 On the Inputs & Outputs page, ignore the Inputs section. For the output, enter a name and select the  rawEmailFeed  we created in the previous step and specify now(0,0) for the instance. 
  now(hours,minutes) :  now  refers to the instance start time, and the hours and minutes given are offsets relative to that start time. For example, now(-2,40) corresponds to the feed instance at -2 hours and +40 minutes, i.e. the feed instance 80 minutes before the instance start time, while now(0,0) corresponds to the feed instance at the current time. 
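 As a hypothetical illustration (the feed name matches this tutorial; the offsets are arbitrary), an input that reaches back 80 minutes before the instance start would be written in the XML as:

  <input name="input" feed="rawEmailFeed" start="now(-2,40)" end="now(0,0)"/>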
    
 Let's save the process. 
    
 Define the rawEmailIngestProcess entity using XML  
 Click on the  Edit  button over XML Preview area on the right hand side of the screen and replace the XML content with the XML document below: 
   <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
    <tags>email=testemail</tags>
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <outputs>
        <output name="output" feed="rawEmailFeed" instance="now(0,0)"/>
    </outputs>
    <workflow name="emailIngestWorkflow" version="4.0.1" engine="oozie" path="/user/ambari-qa/falcon/demo/apps/ingest/fs"/>
    <retry policy="exp-backoff" delay="minutes(3)" attempts="3"/>
    <ACL owner="ambari-qa" group="users" permission="0755"/>
</process>
  
 
 NOTE : DO NOT copy the validity start and end time. Change it as per your time. 
 Click  Finish  on the top of the XML Preview area. 
    
 Verify all the values and click  Next . 
 Save the process. 
 Define the cleansedEmailFeed using Wizard  
 As the name suggests, the raw feeds which you have created earlier get cleansed in the Process and the feed which comes out from that Process is the cleansed feed. So, to create a cleansed feed entity click on the  Feed  button on the top of the main page on the Falcon Web UI. 
 NOTE : If you want to create it from XML, skip this section, and move on to the next one. 
 Use the following information to create the feed: 
   name cleansedEmailFeed
description Cleansed customer emails
tag cleanse with value cleaned
Group churnAnalysisDataPipeline
  
 
 We then set the ownership information for the Feed: 
   Owner:  ambari-qa
Group:  users
Permissions: 755
  
 
 For the Schema Location and Provider, enter “/none”, then click  Next . 
    
 On the Properties page, set the frequency to 1 hour so the job runs hourly, check the Late Arrival checkbox and set the value to 4 hours. Change the timezone to UTC, accept the remaining default values and click  Next . 
    
 Set the default storage location to 
   /user/ambari-qa/falcon/demo/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
  
 
 Keep stats path and meta path as /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}.
Click  Next . 
    
 On the Clusters page, select the source cluster as primaryCluster and ensure you modify the validity to a time slice in the very near future. Provide the time in current UTC. Specify the path for the data as: 
   /user/ambari-qa/falcon/demo/primary/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
  
 
 And enter /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR} for the stats and meta data locations.
Keep the retention period as 90 hours. 
 Set the target cluster as backupCluster and again set the validity start to the current time and the end time to an hour or two from now.
Then specify the data path for the target as: 
   /falcon/demo/bcp/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
  
 
 Set the statistics and meta data locations to /tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}
Enter the retention period as 90 hours for the target cluster as well.
Click  Next . 
    
    
 Verify all the entries you have filled out and then click  Save . 
    
 Define the cleansedEmailFeed using XML  
 Click on the  Edit  button over XML Preview area on the right hand side of the screen and replace the XML content with the XML document below: 
   <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="cleansedEmailFeed" description="Cleansed customer emails" xmlns="uri:falcon:feed:0.1">
    <tags>cleanse=cleaned</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <late-arrival cut-off="hours(4)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
            <retention limit="hours(90)" action="delete"/>
            <locations>
                <location type="data" path="/user/ambari-qa/falcon/demo/primary/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
                <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
                <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
            </locations>
        </cluster>
        <cluster name="backupCluster" type="target">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
            <retention limit="hours(90)" action="delete"/>
            <locations>
                <location type="data" path="/falcon/demo/bcp/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
                <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
                <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
            </locations>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/ambari-qa/falcon/demo/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="meta" path="/tmp/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>
    <ACL owner="ambari-qa" group="users" permission="0755"/>
    <schema location="/none" provider="/none"/>
</feed>
  
 
 NOTE : DO NOT copy the validity start and end time. Change it as per your time. 
 Click  Finish  on the top of the XML Preview area. Save the feed. 
    
 Define the cleanseEmailProcess using Wizard  
 Now let's define the  cleanseEmailProcess .
Again, to create a process entity click on the  Process  button on the top of the main page on the Falcon Web UI. 
 NOTE : If you want to create it from XML, skip this section, and move on to the next one. 
 Create this process with the following information 
   process name cleanseEmailProcess
  
 
 Tag cleanse with the value yes 
 Then assign the workflow the name: 
   emailCleanseWorkflow
  
 
 Select Pig as the execution engine and provide the following path: 
   /user/ambari-qa/falcon/demo/apps/pig/id.pig
  
 
 Accept the default values for Access Control List and click  Next . 
    
 For the properties, set the number of parallel processes to 1; this prevents a new instance from starting prior to the previous one completing.
Specify the order as first-in, first-out (FIFO) and the frequency as 1 hour.
Select the exp-backoff retry policy, then set the attempts to 3 and the delay to 3 minutes. Click  Next . 
    
 On the Clusters page ensure you modify the validity to a time slice which is in the very near future and then click  Next . 
 NOTE : Time should be specified as per the timezone selected. 
    
 Select the Input and Output Feeds as shown below and click Next.
Select rawEmailFeed as an input feed and cleansedEmailFeed as an output feed. 
    
 Verify all the values you have entered and then click  Save . 
    
    
 Define the cleanseEmailProcess using XML  
 Click on the  Edit  button over XML Preview area on the right hand side of the screen and replace the XML content with the XML document below: 
   <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
    <tags>cleanse=yes</tags>
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2016-06-05T05:00Z" end="2016-06-05T06:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <inputs>
        <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
    </inputs>
    <outputs>
        <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
    </outputs>
    <workflow name="emailCleanseWorkflow" version="pig-0.13.0" engine="pig" path="/user/ambari-qa/falcon/demo/apps/pig/id.pig"/>
    <retry policy="exp-backoff" delay="minutes(3)" attempts="3"/>
    <ACL owner="ambari-qa" group="users" permission="0755"/>
</process>
  
 
 NOTE : DO NOT copy the validity start and end time. Change it as per your time. 
 Click  Finish  on the top of the XML Preview area, verify all the values and save the process. 
    
 Run the feeds  
 From the Falcon Web UI home page search for the Feeds we created 
    
 Select the rawEmailFeed by clicking on the checkbox 
    
 Then click on the Schedule button on the top of the search results 
    
 Next run the  cleansedEmailFeed  in the same way 
    
 Run the processes  
 From the Falcon Web UI home page search for the Process we created 
    
 Select the  cleanseEmailProcess  by clicking on the checkbox 
    
 Then click on the Schedule button on the top of the search results 
    
 Next run the  rawEmailIngestProcess  in the same way 
    
 If you visit the Oozie process page, you can see the processes running. 
    
 Input and Output of the pipeline  
 Now that the feeds and processes are running, we can check the dataset being ingressed and the dataset egressed on HDFS. 
    
 Here is the data being ingressed 
    
 and here is the data being egressed from the pipeline 
    
 Summary  
 In this tutorial we defined a data pipeline with Apache Falcon to clean the raw data, removing sensitive information like credit card numbers, and to make it available to our marketing data science team for customer churn analysis. You may suspend the running feeds or processes if they are no longer required. 
						
					
Posted 07-06-2017 08:38 AM · 1 Kudo
					
							 Mirroring Datasets Between Hadoop Clusters with Apache Falcon 
 Introduction 
 Apache Falcon is a framework to simplify data pipeline processing and management on Hadoop clusters. 
 It provides data management services such as retention, replication across clusters, and archival. It makes it much simpler to onboard new workflows/pipelines, with support for late data handling and retry policies. It allows you to easily define relationships between various data and processing elements and to integrate with a metastore/catalog such as Hive/HCatalog. Finally, it also lets you capture lineage information for feeds and processes. 
 In this tutorial we are going to walk through the process of mirroring datasets between Hadoop clusters. 
 Prerequisites 
 
   Download Hortonworks Sandbox 2.5 
   Complete the Learning the Ropes of the Hortonworks Sandbox tutorial; you will need it for logging into Ambari as an administrator user. 
   Complete the Leveraging Apache Falcon with Your Hadoop Clusters tutorial to start the Falcon service, prepare HDFS directories for the Falcon cluster, and create the Falcon cluster entities. 
 
 Outline 
 
   1. Create ambari-qa user 
   2. Preparing HDFS Directories 
   3. Setting up the Mirroring Job 
   4. Running the Job 
   Summary 
 
 1. Create ambari-qa user  
 After creating the cluster entities, let us go back to Ambari as the  admin  user. Click on the admin menu drop-down and then  Manage Ambari : 
    
 Click the blue  Users  button in the bottom box as given below: 
    
 Click the  Create Local User  button at the top of the page. Enter  ambari-qa  as the user name and then set the password for it. Enter it again for confirmation and  Save  the user. 
    
 You can see the newly added  ambari-qa  user. Click on it to assign it a group so that it can access Ambari views.
Type  "views"  and select it in the  Local Group Membership  box, then click on the  tick  mark to add the  ambari-qa  user to the  "views"  group. 
    
 Now logout of Ambari from the  admin  user and login to Ambari as  ambari-qa  user. 
 2. Preparing HDFS Directories  
 Select the Files View and you can view the following default folders: 
    
 Navigate to  /user/ambari-qa  and create a new directory  falcon .
Click on the row of the  falcon  directory and then click on the  Permissions  button: 
    
 Add  Write  permission for both Group and Others and then click  Save . 
    
 Now create the directories  mirrorSrc  and  mirrorTgt  under  /user/ambari-qa/falcon  as the source and target of the mirroring job we are about to create. 
   
 Alternatively, you can create the directories and set the permissions from the SSH terminal. After creating the cluster entities, go back to the SSH terminal, switch the user to  root  and then to  ambari-qa , and run: 
   hadoop fs -mkdir /user/ambari-qa/falcon
hadoop fs -mkdir /user/ambari-qa/falcon/mirrorSrc
hadoop fs -mkdir /user/ambari-qa/falcon/mirrorTgt
  
 
 Now we need to set permissions to allow access. You must be logged in as the owner of the directory  /user/ambari-qa/falcon/ : 
   hadoop fs -chmod -R 777 /user/ambari-qa/falcon
  
 
 3. Setting up the Mirroring Job  
 To create the mirroring job, go back to the Falcon UI on your browser and click on the  Create  drop down. 
    
 Click  Mirror  from the drop-down menu, and you will see a page like this: 
    
 Provide a name of your choice. The name must be unique to the system. We named the Mirror Job  MirrorTest . 
 Ensure the File System mirror type is selected, then select the appropriate Source and Target and type in the appropriate paths. In our case the source cluster is  primaryCluster  and the HDFS path on that cluster is  /user/ambari-qa/falcon/mirrorSrc . 
 The target cluster is  backupCluster  and the HDFS path on that cluster is  /user/ambari-qa/falcon/mirrorTgt .
Also set the validity of the job to your current time, so that when you attempt to run the job in a few minutes, the job is still within the validity period. Keep the default values in Advanced Options and then click  Next . 
    
    
 Verify the summary information, then click  Save : 
    
    
 4. Running the Job  
 Before we can run the job, we need some data to test on HDFS.
Keep the login as  ambari-qa  and, from the Files View on the Ambari console, navigate to the directory  /user/ambari-qa/falcon/mirrorSrc . 
    
 Click  Upload  button and upload any file you want to use. 
    
 Once uploaded the file should appear in the directory. 
    
 Now navigate to the Falcon UI and search for the job we created. The name of the Mirror job we had created was  MirrorTest . 
    
 Select the  MirrorTest  job by clicking the checkbox and then click on  Schedule . 
    
 The state of the job should change from  SUBMITTED  to  RUNNING . 
    
 After a few minutes, use the HDFS View in the Ambari console to check the  /user/ambari-qa/falcon/mirrorTgt  directory and you should see that  your data is mirrored. 
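 If you prefer the terminal, an equivalent check is the following sketch (run from the SSH session used earlier):

  # List the mirror target recursively; the mirrored file(s) should appear here
  hdfs dfs -ls -R /user/ambari-qa/falcon/mirrorTgt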
    
 Summary  
 In this tutorial we walked through the process of mirroring the datasets between two cluster entities. 
						
					
Posted 03-15-2018 06:48 PM
					
 I am getting the following error while creating a cluster. Note that YARN shows as perfectly fine in Ambari. I would appreciate it if somebody can help: 
 2018-03-15 18:43:17,171 INFO  - [1241480588@qtp-1990385139-0 - d41607f2-0715-4016-8a10-952937969cd0:amber-qa:POST//entities/submit/cluster] ~ Retrying connect to server: sandbox-hdp.hortonworks.com/172.17.0.2:8050. Already tried 20 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) (Client:906) 
						
					
Posted 01-23-2018 01:25 PM · 1 Kudo
					
 After some time, with the issue showing up intermittently, I discovered it was an IP problem. Using VirtualBox fixed it because it uses a different default IP. The problem with the IP was caused by some restrictions set up on my computer by my company. 
						
					
Posted 04-27-2017 04:16 AM · 1 Kudo
					
							 hive.server2.transport.mode values are either binary or http.  hive.server2.authentication – Authentication mode, default NONE. Options are NONE (uses plain SASL), NOSASL, KERBEROS, LDAP, PAM and CUSTOM.  Read more at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-Authentication/SecurityConfiguration.  I believe Tableau is making you select the authentication mode and not the transport mode. 
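 For reference, both settings are plain hive-site.xml properties; an illustrative (not prescriptive) combination would be:

  <!-- Illustrative hive-site.xml values; pick the modes that match your deployment -->
  <property>
    <name>hive.server2.transport.mode</name>
    <value>binary</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>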
						
					
Posted 09-20-2017 05:50 PM
					
 ssh maria_dev@xxx.xxx.xxx.xxx -p 2222 
 Enter the password:  maria_dev 
						
					
Posted 04-10-2017 06:15 PM
					
							 @ssanthosh   Perfect, thanks! 
						
					