I am looking for some information to understand Falcon's role in replication scenarios. For example, Oozie has a distcp action that helps replicate datasets. If the same is done through Falcon instead of Oozie, what is the benefit? I understand that Falcon is more intuitive and easier to use through the Web UI, but assuming the Oozie jobs are well crafted, what extra functionality does Falcon provide for replication?
Also, what is the role of the Falcon service user, and what are the various directories required for the Falcon user, such as staging, temp, and working?
What value does Falcon add to Hive replication?
What is the role of Falcon interfaces?
Falcon provides capabilities not available in Oozie today, such as dependency management, simplified reprocessing of failed processes, and setting retention policies for datasets. Depending on the purpose and requirements of your replication, the capabilities in Oozie may be sufficient.
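To make the retention point concrete, here is a sketch of how a retention policy is declared in a Falcon feed entity. The feed name, paths, dates, and the 90-day limit are all illustrative, not taken from your setup:

```xml
<!-- Illustrative Falcon feed entity fragment; names, paths, and dates are
     assumptions for the example. -->
<feed name="rawClicksFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- Falcon deletes feed instances older than 90 days automatically -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/clicks/${YEAR}-${MONTH}-${DAY}"/>
  </locations>
  <ACL owner="falcon" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

With Oozie alone you would have to write and schedule your own cleanup workflow to get equivalent behavior; in Falcon it is a single declarative element on the feed.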
The Falcon user owns the Falcon process on HDP. The staging directory stores artifacts of processes and feeds, such as the feed/process definitions and job logs. The JAR files needed to run processes and feeds are copied to the working directory. Falcon uses the temp directory for intermediate processing of entities in HDFS. Falcon must have read/write/execute permission on these locations, and they MUST be created prior to submitting a cluster entity to Falcon.
Staging (777): parent directories require execute permissions so multiple users can write to this location.
Working/Temp (755): optional; if not specified, Falcon creates a subdirectory in the staging location.
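A minimal sketch of creating these locations on HDFS before submitting the cluster entity. The `/apps/falcon/primaryCluster` prefix is an assumption; substitute whatever paths your cluster entity will reference:

```shell
# Assumed base path; adjust to match the locations in your cluster entity.
hdfs dfs -mkdir -p /apps/falcon/primaryCluster/staging
hdfs dfs -mkdir -p /apps/falcon/primaryCluster/working
# Staging needs 777 so multiple users can write; working can stay 755.
hdfs dfs -chmod 777 /apps/falcon/primaryCluster/staging
hdfs dfs -chmod 755 /apps/falcon/primaryCluster/working
# The Falcon service user must own these directories.
hdfs dfs -chown -R falcon /apps/falcon/primaryCluster
```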
The Falcon interfaces provide the address of the various services on the cluster that are required to support replication on the cluster. The requirement to enter these manually will be eliminated in a future release of Falcon through tighter integration with Ambari.
For the read interface, specify the endpoint for Hadoop's HFTP protocol.
For the write interface, enter the value of fs.defaultFS. Falcon uses this interface to write system data to HDFS, and feeds referencing this cluster are written to HDFS using the same write interface.
For the execute interface, specify the interface for the job tracker; its endpoint is the value of mapreduce.jobtracker.address. Falcon uses this interface to submit processes as jobs to the JobTracker defined here.
For the workflow interface, specify the interface for the Oozie workflow engine, i.e., the value of $OOZIE_URL. Falcon uses this interface to schedule the processes referencing this cluster on Oozie.
For the registry interface, specify the interface for the metadata catalog, such as the Hive Metastore (or HCatalog). Falcon uses this interface to register/de-register partitions for a given database and table, and also uses this information to schedule data-availability events based on partitions in the workflow engine. Although the Hive Metastore supports both RPC and HTTP, Falcon ships with an implementation for RPC over Thrift.
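Putting the interfaces above together, a cluster entity looks roughly like the following sketch. The hostnames, ports, and version numbers are illustrative assumptions, not values from your environment:

```xml
<!-- Illustrative Falcon cluster entity; endpoints and versions are assumptions. -->
<cluster name="primaryCluster" colo="east" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- read: HFTP endpoint of the NameNode -->
    <interface type="readonly" endpoint="hftp://nn.example.com:50070" version="2.2.0"/>
    <!-- write: the value of fs.defaultFS -->
    <interface type="write" endpoint="hdfs://nn.example.com:8020" version="2.2.0"/>
    <!-- execute: the value of mapreduce.jobtracker.address -->
    <interface type="execute" endpoint="rm.example.com:8050" version="2.2.0"/>
    <!-- workflow: the value of $OOZIE_URL -->
    <interface type="workflow" endpoint="http://oozie.example.com:11000/oozie/" version="4.0.0"/>
    <!-- registry: Hive Metastore over Thrift -->
    <interface type="registry" endpoint="thrift://hms.example.com:9083" version="0.13.0"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
    <location name="working" path="/apps/falcon/primaryCluster/working"/>
  </locations>
</cluster>
```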
Falcon's retry capabilities make it especially valuable for Hive replication.
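As a sketch, a retry policy is a single element on a Falcon process entity; the process name, delay, and attempt count below are illustrative:

```xml
<!-- Illustrative retry policy on a Falcon process; other elements elided. -->
<process name="hiveReplicationProcess" xmlns="uri:falcon:process:0.1">
  <!-- clusters, inputs, outputs, and workflow elements omitted for brevity -->
  <!-- on failure, rerun the instance up to 3 times, 10 minutes apart -->
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
</process>
```

With Oozie alone, equivalent behavior requires wiring retry logic into the workflow yourself.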