Member since: 09-29-2015
Posts: 57
Kudos Received: 49
Solutions: 19
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1460 | 05-25-2017 06:03 PM
 | 1305 | 10-19-2016 10:17 PM
 | 1604 | 09-28-2016 08:41 PM
 | 950 | 09-21-2016 05:46 PM
 | 4494 | 09-06-2016 11:49 PM
01-27-2016
07:11 PM
Before creating the cluster entities, we need to create the directories on HDFS representing the cluster that we are going to define, namely primaryCluster in your case.

su - falcon
hadoop fs -mkdir /apps/falcon/primaryCluster

Then create directories called staging and working:

hadoop fs -mkdir /apps/falcon/primaryCluster/staging
hadoop fs -mkdir /apps/falcon/primaryCluster/working

Finally, set the proper permissions on the staging and working directories:

hadoop fs -chmod 777 /apps/falcon/primaryCluster/staging
hadoop fs -chmod 755 /apps/falcon/primaryCluster/working
hadoop fs -chown -R falcon /apps/falcon/*

You can refer to http://hortonworks.com/hadoop-tutorial/processing-data-pipeline-with-apache-falcon/ for more details.
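Once the directories exist, the cluster entity can reference them. Below is a minimal sketch of what the primaryCluster entity XML might look like; the host names, ports, and interface versions are placeholders and should be replaced with your cluster's actual endpoints.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<cluster name="primaryCluster" description="primary cluster" colo="primary-colo"
         xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- readonly endpoint used for reads (placeholder host/port) -->
    <interface type="readonly" endpoint="hftp://nn-host:50070" version="2.2.0"/>
    <!-- write endpoint: HDFS NameNode -->
    <interface type="write" endpoint="hdfs://nn-host:8020" version="2.2.0"/>
    <!-- execute endpoint: ResourceManager / JobTracker -->
    <interface type="execute" endpoint="rm-host:8050" version="2.2.0"/>
    <!-- workflow endpoint: Oozie server -->
    <interface type="workflow" endpoint="http://oozie-host:11000/oozie/" version="4.0.0"/>
    <!-- messaging endpoint: ActiveMQ used by Falcon for JMS notifications -->
    <interface type="messaging" endpoint="tcp://falcon-host:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <!-- the HDFS directories created above -->
    <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
    <location name="working" path="/apps/falcon/primaryCluster/working"/>
  </locations>
</cluster>
```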
10-28-2015
01:21 AM
2 Kudos
Falcon supports mirroring for both HDFS and Hive. The performance issue I mentioned above applies only to HDFS mirroring, and only if already-replicated data is not evicted. For Hive mirroring, Falcon saves the last successfully replicated event id in its data store, and the next replication job resumes from just past that event id. Falcon also cleans up the staging paths used for export after the job runs. Since DistCp only picks up the new data to be replicated, there is no such performance overhead for Hive mirroring. Just an FYI.
10-28-2015
01:06 AM
https://hortonworks.jira.com/browse/BUG-46884 has been created to track the UI issue.
10-28-2015
12:58 AM
For mirroring using recipes, you can pass the mirror job parameters from the command line. I will create a bug to track the fact that the mirroring UI has no way to include mirror job parameters. Thanks for bringing that up!
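As a rough sketch, submitting an HDFS mirroring recipe from the command line can look like the following; the property names and values shown are illustrative, and the exact flags may differ slightly across Falcon releases.

```bash
# Point the Falcon client at the directory holding hdfs-replication-template.xml
# and hdfs-replication.properties (falcon.recipe.path in the client configuration),
# then submit the recipe.
falcon recipe -name hdfs-replication -operation HDFS_REPLICATION

# Example hdfs-replication.properties entries (illustrative values):
#   falcon.recipe.name=hdfs-mirror-job
#   drSourceDir=/apps/data/source
#   drTargetDir=/apps/data/target
#   distcpMaxMaps=5
#   distcpMapBandwidth=100
```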
10-27-2015
11:21 PM
1 Kudo
@Sean Roberts: Done!
10-27-2015
10:50 PM
Thanks Balu!
10-27-2015
10:29 PM
5 Kudos
Falcon can be installed on the source or the destination cluster. By design it can also run as a standalone server that is not part of any cluster; there is no requirement that it be installed on both the source and target clusters.

In Falcon today, replication can be achieved in two ways:

1> Feed replication: By default this is a pull mechanism, i.e. the replication job runs on the target cluster. Falcon takes the jobTracker and nameNode properties from the target cluster entity's execute and write endpoints respectively, and submits the replication job to Oozie using the workflow endpoint specified in the target cluster entity. This ensures the replication job runs on the target cluster.

2> Mirroring using recipes: For recipes the user can configure the job to run on either the source or the target cluster. If you look at the mirroring Falcon UI there is an option button "Run job here" for both source and target. The default is "Run job here" on the target, so by default the replication job for HDFS mirroring or Hive mirroring runs on the target cluster.

The design decision to use a pull mechanism, i.e. to run the replication job on the target cluster, was made to avoid classpath issues when the clusters run different Hadoop versions, among other reasons. So running the job on the source cluster may not always work, but the user has the flexibility to do so. When one of the clusters involved in replication is a non-Hadoop cluster, say S3 or Azure, the replication job always runs on the Hadoop cluster.

Regarding Falcon prism/distributed mode: Apache Falcon has this feature and InMobi uses it in their prod environment. This feature is not available in HDP.
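To make 1> concrete, here is a minimal sketch of a feed entity with a source and a target cluster; because the second cluster is marked type="target", the replication coordinator is scheduled on it and pulls data from the source. The cluster names, paths, frequencies, and retention limits are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<feed name="rawEmailFeed" description="raw email data" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <!-- source cluster: where the data is produced -->
    <cluster name="primaryCluster" type="source">
      <validity start="2015-10-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <!-- target cluster: the pull-based replication job runs here -->
    <cluster name="backupCluster" type="target">
      <validity start="2015-10-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(36)" action="delete"/>
    </cluster>
  </clusters>
  <!-- feed replication requires dated partitions in the data path -->
  <locations>
    <location type="data" path="/apps/data/email/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="falcon" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```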
10-27-2015
08:23 PM
2 Kudos
If you are asking about using recipes for mirroring: hdfs-replication-template.xml and hive-replication-template.xml have the properties below set to ensure that only one instance runs at a time.

<parallel>1</parallel>
<!-- Dir replication needs to run only once to catch up -->
<order>LAST_ONLY</order>

In general, this can be controlled in Falcon using the <parallel> and <order> properties in the entity XML. Parallel decides how many replication instances can run concurrently at any given time, and order decides the execution order for pending replication instances: FIFO, LIFO or LAST_ONLY. If the replication job needs to run only once to catch up, setting the configs below in the entity XML will ensure this:

<parallel>1</parallel>
<order>LAST_ONLY</order>

Parallel maps to concurrency and order maps to execution in Oozie. Please refer to the Oozie documentation for more details.
10-27-2015
05:59 PM
2 Kudos
Today, replication in Falcon can be achieved in two ways:

1> Feed based replication: Falcon uses a pull based replication mechanism, meaning that in every target cluster, for a given source cluster, a coordinator is scheduled which pulls the data from the source cluster using DistCp. This requires the data locations being replicated to have dated partitions.

2> Using the concept of recipes, e.g. the HDFS Directory Replication Recipe.

Overview: This recipe replicates arbitrary directories on HDFS from one Hadoop cluster to another. It piggybacks on the replication solution in Falcon, which uses the DistCp tool.

Use cases:
* Copy directories between HDFS clusters without dated partitions
* Archive directories from HDFS to the cloud, e.g. S3 or Azure WASB

Limitations: As the data volume and number of files grow, this can get inefficient. The user should make sure data that has already been replicated is evicted, otherwise there will be performance issues.

For both of the above mechanisms, DistCp options can be passed as custom properties, which will be propagated to the DistCp tool (see the sketch after this list):
* maxMaps - the maximum number of maps used during replication
* mapBandwidth - the bandwidth in MB/s used by each mapper during replication
* overwrite - overwrite the destination during replication
* ignoreErrors - ignore failures without causing the job to fail
* skipChecksum - bypass checksum verification during replication
* removeDeletedFiles - delete files that exist in the destination but not in the source
* preserveBlockSize - preserve block size during replication
* preserveReplicationNumber - preserve the replication factor during replication
* preservePermission - preserve permissions during replication
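As an illustration, here is roughly how a few of these options could be passed as custom properties on a feed entity; the property names are the ones listed above, and the values are placeholders.

```xml
<!-- fragment of a feed entity definition; values are illustrative -->
<properties>
  <property name="maxMaps" value="8"/>
  <property name="mapBandwidth" value="100"/>
  <property name="removeDeletedFiles" value="true"/>
  <property name="preservePermission" value="true"/>
</properties>
```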
10-14-2015
06:38 PM
Yes, webHDFS is supported by Falcon.
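For example, the readonly interface of a cluster entity can point at a webHDFS endpoint; the host, port, and version below are placeholders.

```xml
<!-- readonly interface of a cluster entity using webHDFS (illustrative endpoint) -->
<interface type="readonly" endpoint="webhdfs://nn-host:50070" version="2.2.0"/>
```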