How do we plan Falcon deployments for replication, mirroring and data pipeline on prod and DR clusters?

Solved
New Contributor

My understanding is that if replication or mirroring is required, Falcon is installed only on the destination cluster in standalone mode. For a data pipeline, Falcon is installed where the pipeline will be executed. Is my understanding correct? Also, what is Falcon Prism (distributed mode) used for? I can't find any reference to it. Any input will be appreciated.

1 ACCEPTED SOLUTION


Re: How do we plan Falcon deployments for replication, mirroring and data pipeline on prod and DR clusters?

Falcon can be installed on either the source or the destination cluster. By design it can also run as a standalone server that is not part of any cluster. There is no requirement that it be installed on both the source and the target cluster.
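Wherever the Falcon server runs, each participating cluster is described to it by a cluster entity whose interfaces tell Falcon where HDFS, YARN, and Oozie live on that cluster. A minimal sketch of such an entity (host names, versions, and paths below are illustrative assumptions, not values from this thread):

```xml
<!-- backup-cluster.xml: sketch of a Falcon cluster entity for the DR cluster.
     Endpoints and versions are placeholders; substitute your own. -->
<cluster colo="dr-datacenter" description="Backup/DR cluster" name="backupCluster"
         xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- "write" supplies the nameNode, "execute" the jobTracker/RM,
         "workflow" the Oozie endpoint used to run jobs on this cluster -->
    <interface type="readonly" endpoint="hftp://backup-nn:50070" version="2.7.1"/>
    <interface type="write"    endpoint="hdfs://backup-nn:8020"  version="2.7.1"/>
    <interface type="execute"  endpoint="backup-rm:8050"         version="2.7.1"/>
    <interface type="workflow" endpoint="http://backup-oozie:11000/oozie/" version="4.2.0"/>
    <interface type="messaging" endpoint="tcp://backup-mq:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/backupCluster/staging"/>
    <location name="temp"    path="/tmp"/>
    <location name="working" path="/apps/falcon/backupCluster/working"/>
  </locations>
</cluster>
</cluster-sketch-end-->
```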

In Falcon today, replication can be achieved in two ways:

1> Feed replication: By default this is a pull mechanism; the replication job runs on the target cluster. Falcon takes the jobTracker and nameNode properties from the target cluster entity's execute and write endpoints respectively, and submits the replication job to Oozie using the workflow endpoint specified in the target cluster entity. This ensures the replication job runs on the target cluster.
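Feed replication is driven by a feed entity that lists both clusters, with one marked as source and one as target; Falcon then pulls the data to the target. A sketch under assumed names and dates (the cluster names, paths, and validity windows are illustrative, not from this thread):

```xml
<!-- replicated-feed.xml: sketch of a replicated feed. The "target" cluster
     entry is what triggers the pull-based replication job on that cluster. -->
<feed name="replicatedFeed" description="Hourly feed replicated to DR"
      xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2016-01-01T00:00Z" end="2099-01-01T00:00Z"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
    <cluster name="backupCluster" type="target">
      <validity start="2016-01-01T00:00Z" end="2099-01-01T00:00Z"/>
      <!-- retention on the target can differ from the source -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/input/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="falcon" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```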

2> Mirroring using recipes: For recipes, the user can configure the job to run on either the source or the target cluster. In the Falcon mirroring UI there is a "Run job here" option button for both source and target. The default is "Run job here" on the target, so by default the replication job for HDFS mirroring or Hive mirroring runs on the target cluster.

The design decision to use a pull mechanism, i.e. to run the replication job on the target cluster, was made to avoid classpath issues when the clusters use different Hadoop versions, among other reasons. So running the job on the source cluster may not always work, but the user has the flexibility to do so.

For cases where one of the clusters involved in replication is a non-Hadoop store, say S3 or Azure, the replication job always runs on the Hadoop cluster.

Regarding Falcon Prism (distributed mode): Apache Falcon has this feature, and InMobi uses it in their production environment. This feature is not available in HDP.

4 REPLIES

Re: How do we plan Falcon deployments for replication, mirroring and data pipeline on prod and DR clusters?

New Contributor

@Anderw Ahn, @Balu I have an additional question/point on Mayank's question about cluster layout. I understand DR as definitely requiring Oozie to be configured in both locations, because distcp will run on the destination cluster and Hive replication will run on the source cluster. Isn't it also valid that a minimal Falcon install could be achieved by setting up Falcon *only* on the primary/source cluster? In this way, you define two clusters (primary, backup) and then simply schedule feeds and processes to run on the appropriate cluster; Falcon can schedule the job on Oozie either locally or remotely. Please confirm.

TL;DR: a single Falcon install can control two clusters, but it requires Oozie installed on both clusters.
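Under that layout, the one Falcon server would hold both cluster entities and schedule work against either cluster's Oozie. A sketch using the standard Falcon CLI (the entity file names and feed name are assumptions for illustration):

```shell
# Register both cluster entities with the single Falcon server
falcon entity -type cluster -submit -file primary-cluster.xml
falcon entity -type cluster -submit -file backup-cluster.xml

# Submit and schedule a feed that lists both clusters; Falcon dispatches
# each workflow to the Oozie endpoint named in the relevant cluster entity
falcon entity -type feed -submit -file replicated-feed.xml
falcon entity -type feed -schedule -name replicatedFeed
```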


Re: How do we plan Falcon deployments for replication, mirroring and data pipeline on prod and DR clusters?

Rising Star

@Sowmya Ramesh Very good and detailed answer, thank you.

Re: How do we plan Falcon deployments for replication, mirroring and data pipeline on prod and DR clusters?

Thanks Balu!
