Mirroring Datasets Between Hadoop Clusters with Apache Falcon
Apache Falcon is a framework to simplify data pipeline processing and management on Hadoop clusters.
It provides data management services such as retention, replications across clusters, archival etc. It makes it much simpler to onboard new workflows/pipelines, with support for late data handling and retry policies. It allows you to easily define relationship between various data and processing elements and integrate with metastore/catalog such as Hive/HCatalog. Finally it also lets you capture lineage information for feeds and processes.
In this tutorial we are going walk the process of mirroring the datasets between Hadoop clusters.
After creating cluster entities, let us go back to Ambari as admin user. Click on admin menu drop down and then Manage Ambari:
Click the blue Users button in the bottom box as given below:
Click the Create Local User button at the top of the page. Enter ambari-qa as the user name and then set the password for it. Enter it again for confirmation and Save the user.
You can see the newly added ambari-qa user. Click on it to assign it a group so that it can access Ambari views.
Write "views" and select it in Local Group Membership box and then click on tick mark to add an ambari-qa user in the "views" group.
Now logout of Ambari from the admin user and login to Ambari as ambari-qa user.
2. Preparing HDFS Directories
Select the Files View and you can view the following default folders:
Navigate to /user/ambari-qa and create a new directory falcon
Click on the row of falcon directory and then click on Permissions button:
Add Write permission for both Group and Others and then click Save.
Now create the directories mirrorSrc and mirrorTgt under /user/ambari-qa/falcon as the source and target of the mirroring job we are about to create.
After creating cluster entities, let’s go back to the SSH terminal, switch the user to root and then to ambari-qa:
Now we need to set permissions to allow access. You must be logged in as the owner of the directory /user/ambari-qa/falcon/
hadoop fs -chmod -R 777 /user/ambari-qa/falcon
3. Setting up the Mirroring Job
To create the mirroring job, go back to the Falcon UI on your browser and click on the Create drop down.
Click Mirror from the drop down menu, you will see a page like this:
Provide a name of your choice. The name must be unique to the system. We named the Mirror Job MirrorTest.
Ensure the File System mirror type is selected, then select the appropriate Source and Target and type in the appropriate paths. In our case the source cluster is primaryCluster and that HDFS path on the cluster is /user/ambari-qa/falcon/mirrorSrc.
The target cluster is backupCluster and that HDFS path on the cluster is /user/ambari-qa/falcon/mirrorTgt.
Also set the validity of the job to your current time, so that when you attempt to run the job in a few minutes, the job is still within the validity period. Keep default values in Advanced Options and then Click Next.
Verify the summary information, then click Save:
4. Running the Job
Before we can run the job, we need some data to test on HDFS.
<!—Let’s give us permission to upload some data using the HDFS View in Ambari.
su - root
hadoop fs -chmod -R 775 /user/ambari-qa
Open Ambari from your browser at port 8080.
Then launch the HDFS view from the top right hand corner.
Keep login as ambari-qa and from the view on the Ambari console navigate to the directory /user/ambari-qa/falcon/mirrorSrc.
Click Upload button and upload any file you want to use.
Once uploaded the file should appear in the directory.
Now navigate to the Falcon UI and search for the job we created. The name of the Mirror job we had created was MirrorTest.
Select the MirrorTest job by clicking the checkbox and then click on Schedule.
The state of the job should change from SUBMITTED to RUNNING.
After a few minutes, use the HDFS View in the Ambari console to check the /user/ambari-qa/falcon/mirrorTgt directory and you should see that your data is mirrored.
In this tutorial we walked through the process of mirroring the datasets between two cluster entities.