Created on 03-13-2017 11:35 PM
The recipes framework, which supports HDFS and Hive mirroring, was added in the Apache Falcon 0.6.0 release as client-side logic. With the 0.10 release it moved to the server side and was renamed to server-side extensions as part of https://issues.apache.org/jira/browse/FALCON-1107.
Any new mirror job should be submitted and managed using Falcon extensions. Please refer to https://falcon.apache.org/restapi/ExtensionEnumeration.html for more details.
Supported DistCp options for HDFS mirroring in HDP 2.5:
An HDFS mirroring job can be scheduled using the extension as below:

falcon extension -submitAndSchedule -extensionName hdfs-mirroring -file sales-monthly.properties

Content of the sales-monthly.properties file:

jobName=sales-monthly
jobValidityStart=2016-06-30T00:00Z
jobValidityEnd=2099-12-31T11:59Z
jobFrequency=minutes(45)
jobTimezone=UTC
sourceCluster=primaryCluster
targetCluster=backupCluster
jobClusterName=primaryCluster
sourceDir=/user/ambari-qa/sales-monthly/input
targetDir=/user/ambari-qa/sales-monthly/output
removeDeletedFiles=true
skipChecksum=false
preservePermission=true
preserveUser=true
Refer to hdfs-mirroring-properties.json for the properties supported in HDFS mirroring.
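A malformed or incomplete properties file is a common cause of submission failures, so it can help to sanity-check the file before calling the Falcon CLI. The snippet below is a minimal sketch, not part of Falcon; the required-key list is taken from the sample properties above, and a real check should be extended with whatever keys your job needs.

```python
# Minimal sketch: sanity-check an hdfs-mirroring properties file before
# submitting it with `falcon extension -submitAndSchedule`.
# Not part of Falcon; the key list mirrors the sample properties above.

REQUIRED_KEYS = {
    "jobName", "jobValidityStart", "jobValidityEnd", "jobFrequency",
    "sourceCluster", "targetCluster", "jobClusterName",
    "sourceDir", "targetDir",
}

def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def missing_keys(props):
    """Return the required keys absent from the parsed properties."""
    return sorted(REQUIRED_KEYS - props.keys())

if __name__ == "__main__":
    sample = "jobName=sales-monthly\njobValidityStart=2016-06-30T00:00Z"
    print(missing_keys(parse_properties(sample)))
```

Running the check before submission surfaces missing keys immediately instead of after the Falcon server rejects the entity.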
Supported DistCp options for Hive mirroring in HDP 2.5:
A Hive mirroring job can be scheduled using the extension as below:

falcon extension -submitAndSchedule -extensionName hive-mirroring -file hive-sales-monthly.properties

Content of the hive-sales-monthly.properties file:

jobName=hive-sales-monthly
sourceCluster=primaryCluster
targetCluster=backupCluster
jobClusterName=primaryCluster
jobValidityStart=2016-07-19T00:02Z
jobValidityEnd=2018-05-25T11:02Z
jobFrequency=minutes(30)
jobRetryPolicy=periodic
jobRetryDelay=minutes(30)
jobRetryAttempts=3
distcpMaxMaps=1
distcpMapBandwidth=100
maxEvents=-1
replicationMaxMaps=5
sourceDatabases=default
sourceTables=*
sourceHiveServer2Uri=hive2://primary:10000
targetHiveServer2Uri=hive2://backup:10000
Refer to hive-mirroring-properties.json for the properties supported in Hive mirroring.
Created on 04-06-2017 05:39 AM
Thanks Sowmya for sharing the details.
Can you please also share:
How does the replication / DistCp job work? For example, does each mapper write to a temp directory on the source name-node and copy to the target once done?
What happens if a job replicating 100 files fails at the mapper end?
If a copier fails for some subset of its files, what happens? Does the directory become inconsistent?
Is the atomic feature supported in HDP 2.5? How is data inconsistency handled in case of job failure?
E.g. suppose a source directory contains 200 GB of files that have changed, and the replication job copying the data to the target fails after 100 GB has been written to the target directory. Will the target be rolled back to its previous state, or will the 100 GB remain written at the target?
Assumption: we have hundreds of files to be transferred, each file is relatively large (130 GB), the block size is 124 MB, and overwrite=true.
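To make the numbers in this scenario concrete, here is a quick back-of-the-envelope calculation under the stated assumptions (130 GB per file, 124 MB block size). Note that by default DistCp assigns whole files to mappers rather than individual blocks, so the unit of failure is a file, not a block.

```python
# Back-of-the-envelope numbers for the scenario above.
# Assumptions taken from the comment: 130 GB per file, 124 MB block size.
import math

GB = 1024 ** 3
MB = 1024 ** 2

file_size = 130 * GB
block_size = 124 * MB

# Number of HDFS blocks backing a single 130 GB file.
blocks_per_file = math.ceil(file_size / block_size)
print(blocks_per_file)  # 1074
```

So each 130 GB file spans roughly 1074 blocks, and with hundreds of such files a failed run can leave a large amount of partially copied data at the target unless an atomic-commit option is used.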