The Recipes framework, which provides HDFS and Hive mirroring, was added in the Apache Falcon 0.6.0 release as client-side logic. In the 0.10 release it was moved to the server side and renamed to server-side extensions as part of FALCON-1107 (https://issues.apache.org/jira/browse/FALCON-1107).

Any new mirror job should be submitted and managed through Falcon extensions. Please refer to https://falcon.apache.org/restapi/ExtensionEnumeration.html for more details.
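The available server-side extensions and their definitions can also be inspected from the Falcon CLI. The following is a minimal sketch assuming the extension CLI options documented for Falcon 0.10; verify the exact flags against the Falcon version shipped with your HDP 2.5 installation:

falcon extension -enumerate
falcon extension -definition -extensionName hdfs-mirroring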

Supported DistCp options for HDFS mirroring in HDP 2.5:

  • distcpMaxMaps
  • distcpMapBandwidth
  • overwrite
  • ignoreErrors
  • skipChecksum
  • removeDeletedFiles
  • preserveBlockSize
  • preserveReplicationNumber
  • preservePermission
  • preserveUser
  • preserveGroup
  • preserveChecksumType
  • preserveAcl
  • preserveXattr
  • preserveTimes

An HDFS mirroring job can be scheduled using the extension as shown below:

falcon extension -submitAndSchedule -extensionName hdfs-mirroring -file sales-monthly.properties

Content of sales-monthly.properties file:
jobName=sales-monthly
jobValidityStart=2016-06-30T00:00Z
jobValidityEnd=2099-12-31T11:59Z
jobFrequency=minutes(45)
jobTimezone=UTC
sourceCluster=primaryCluster
targetCluster=backupCluster
jobClusterName=primaryCluster
sourceDir=/user/ambari-qa/sales-monthly/input
targetDir=/user/ambari-qa/sales-monthly/output
removeDeletedFiles=true
skipChecksum=false
preservePermission=true
preserveUser=true

Refer to hdfs-mirroring-properties.json for the properties supported in HDFS mirroring.
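After submission, the job lifecycle can be managed through the same extension interface. The sketch below assumes the Falcon 0.10 extension CLI flags (-list, -instances, -suspend, -resume, -delete) and reuses the jobName from the properties above; confirm the flags against your Falcon version:

falcon extension -list -extensionName hdfs-mirroring
falcon extension -instances -jobName sales-monthly
falcon extension -suspend -jobName sales-monthly
falcon extension -resume -jobName sales-monthly
falcon extension -delete -jobName sales-monthly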

Supported DistCp options for Hive mirroring in HDP 2.5:

  • distcpMaxMaps
  • distcpMapBandwidth

A Hive mirroring job can be scheduled using the extension as shown below:

falcon extension -submitAndSchedule -extensionName hive-mirroring -file hive-sales-monthly.properties

Content of hive-sales-monthly.properties file:

jobName=hive-sales-monthly
sourceCluster=primaryCluster
targetCluster=backupCluster
jobClusterName=primaryCluster
jobValidityStart=2016-07-19T00:02Z
jobValidityEnd=2018-05-25T11:02Z
jobFrequency=minutes(30)
jobRetryPolicy=periodic
jobRetryDelay=minutes(30)
jobRetryAttempts=3
distcpMaxMaps=1
distcpMapBandwidth=100
maxEvents=-1
replicationMaxMaps=5
sourceDatabases=default
sourceTables=*
sourceHiveServer2Uri=hive2://primary:10000
targetHiveServer2Uri=hive2://backup:10000

Refer to hive-mirroring-properties.json for the properties supported in Hive mirroring.
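Optionally, the properties file can be validated before scheduling and the job's instances checked afterwards. This is a sketch under the same assumption about the Falcon 0.10 extension CLI flags (-validate, -instances); confirm they are available in your version:

falcon extension -validate -extensionName hive-mirroring -file hive-sales-monthly.properties
falcon extension -instances -jobName hive-sales-monthly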

Comments

Thanks Sowmya, for sharing this in detail.

Can you please also share the following:

How does the replication/DistCp job work? For example, does each mapper write to a temp directory on the source NameNode and then copy to the target once done?

What happens if a job replicating 100 files fails at the mapper end?

If a copier fails for some subset of its files, what will happen? Will the directory become inconsistent?

Is the atomic feature supported in HDP 2.5, and how is data inconsistency handled in case of a job failure?

E.g., if there are 200 GB of files in a source directory that has changed, and the replication job copying the data to the target fails after 100 GB has been written to the target directory, will the target be rolled back to its previous state, or will only the 100 GB remain written at the target?

Assumption: we have hundreds of files to be transferred, the file sizes are relatively large (130 GB), the block size is 124 MB, and overwrite = true.