Created 09-01-2016 06:19 PM
Hello,
I'm working with Falcon using the built-in HDFS mirroring capabilities and would like to enable two distcp options in the workflow XML: the -atomic and -strategy flags. Below is my Oozie workflow with these two options commented out, because passing them as plain arguments was unsuccessful. Is there a way to pass them in using a -D option, or would the FeedReplicator class need to be modified to support them?
<workflow-app xmlns='uri:oozie:workflow:0.3' name='falcon-dr-fs-workflow'>
    <start to='dr-replication'/>
    <!-- Replication action -->
    <action name="dr-replication">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property> <!-- hadoop 2 parameter -->
                    <name>oozie.launcher.mapreduce.job.user.classpath.first</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>oozie.launcher.mapred.job.priority</name>
                    <value>${jobPriority}</value>
                </property>
                <property>
                    <name>oozie.use.system.libpath</name>
                    <value>true</value>
                </property>
                <property>
                    <name>oozie.action.sharelib.for.java</name>
                    <value>distcp</value>
                </property>
                <property>
                    <name>oozie.launcher.oozie.libpath</name>
                    <value>${wf:conf("falcon.libpath")}</value>
                </property>
                <property>
                    <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
                    <value>${drSourceClusterFS},${drTargetClusterFS}</value>
                </property>
            </configuration>
            <main-class>org.apache.falcon.replication.FeedReplicator</main-class>
            <arg>-Dmapred.job.queue.name=${queueName}</arg>
            <arg>-Dmapred.job.priority=${jobPriority}</arg>
            <!--arg>-atomic</arg> <arg>-strategy</arg> <arg>dynamic</arg-->
            <arg>-maxMaps</arg>
            <arg>${distcpMaxMaps}</arg>
            <arg>-mapBandwidth</arg>
            <arg>${distcpMapBandwidth}</arg>
            <arg>-sourcePaths</arg>
            <arg>${drSourceDir}</arg>
            <arg>-targetPath</arg>
            <arg>${drTargetClusterFS}${drTargetDir}</arg>
            <arg>-falconFeedStorageType</arg>
            <arg>FILESYSTEM</arg>
            <arg>-availabilityFlag</arg>
            <arg>${availabilityFlag == 'NA' ? "NA" : availabilityFlag}</arg>
            <arg>-counterLogDir</arg>
            <arg>${logDir}/job-${nominalTime}/${srcClusterName == 'NA' ? '' : srcClusterName}</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message> Workflow action failed, error message[${wf:errorMessage(wf:lastErrorNode())}] </message>
    </kill>
    <end name="end"/>
</workflow-app>
Created 09-01-2016 06:55 PM
@Kyle Dunn: Falcon doesn't support those DistCP options and yes that would require a code change.
Created 09-01-2016 06:55 PM
Would you be able to provide an example of what this code change might be similar to in the existing FeedReplicator code?
Created 09-01-2016 11:16 PM
Alternatively, what are the limitations of out-of-stack version support for Falcon? The snapshot-based replication in Falcon 0.10 provides the functionality I'm ultimately looking for, but I am currently running on HDP 2.3 / 2.4.
Created 09-02-2016 05:17 AM
What limitations are we talking about here? Sorry, I don't understand your question.
If you are asking about the DistCp options supported in HDFS Mirroring, the options below are currently supported
The following additional options can be enabled with the workaround given below:
Modify the workflow hdfs-replication-workflow.xml as follows: after the distcpMapBandwidth argument, add the content below.
<arg>-overwrite</arg>
<arg>${overwrite}</arg>
<arg>-ignoreErrors</arg>
<arg>${ignoreErrors}</arg>
<arg>-skipChecksum</arg>
<arg>${skipChecksum}</arg>
<arg>-removeDeletedFiles</arg>
<arg>${removeDeletedFiles}</arg>
<arg>-preserveBlockSize</arg>
<arg>${preserveBlockSize}</arg>
<arg>-preserveReplicationNumber</arg>
<arg>${preserveReplicationNumber}</arg>
<arg>-preservePermission</arg>
<arg>${preservePermission}</arg>
Set the following options in hdfs-replication.properties:
overwrite=false
ignoreErrors=false
skipChecksum=false
removeDeletedFiles=true
preserveBlockSize=true
preserveReplicationNumber=true
preservePermission=true
These options work out of the box because FeedReplicator already supports them, so no code change is required. Thanks!
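Since the workaround depends on every ${...} placeholder in the added args having a matching key in the properties file, a quick check before resubmitting may save a failed run. Below is a minimal sketch in Python, not part of Falcon: the function names are hypothetical, the file contents are inlined as sample strings (in practice you would read your own workflow XML and properties file), and it only recognizes simple ${name} placeholders, not full EL expressions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical inlined samples; replace with the contents of your own files.
WORKFLOW_ARGS = """
<java>
  <arg>-overwrite</arg>
  <arg>${overwrite}</arg>
  <arg>-skipChecksum</arg>
  <arg>${skipChecksum}</arg>
  <arg>-preserveBlockSize</arg>
  <arg>${preserveBlockSize}</arg>
</java>
"""

PROPERTIES = """
overwrite=false
skipChecksum=false
"""

# Matches only simple placeholders such as ${overwrite}; EL expressions
# like ${a == 'NA' ? ...} or ${wf:conf("x")} are deliberately ignored.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.]*)\}")


def placeholders(xml_text):
    """Return the set of simple ${name} placeholders used in <arg> elements."""
    root = ET.fromstring(xml_text)
    names = set()
    for element in root.iter():
        # Strip any XML namespace so namespaced workflows also match.
        if element.tag.rsplit("}", 1)[-1] == "arg":
            names.update(PLACEHOLDER.findall(element.text or ""))
    return names


def property_keys(props_text):
    """Return the set of keys defined in key=value properties text."""
    keys = set()
    for line in props_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_properties(xml_text, props_text):
    """List placeholders used in the workflow args but not defined as properties."""
    return sorted(placeholders(xml_text) - property_keys(props_text))


print(missing_properties(WORKFLOW_ARGS, PROPERTIES))
```

With the sample strings above, preserveBlockSize is reported as missing because it is used in an arg but has no entry in the properties text.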
Created 11-14-2017 07:57 AM
Hi! After changing the preserveBlockSize and skipChecksum parameters on the target site, I do not see any change in the XML file for the task (after recreating the task):
[hdfs@target ~]$ hdfs dfs -ls /apps/falcon/extensions/hdfs-mirroring/retargets/runtime/hdfs-mirroring-workflow.xml
-rwxr-xr-x 2 hdfs users 4943 2017-11-13 22:39 /apps/falcon/extensions/hdfs-mirroring/retargets/runtime/hdfs-mirroring-workflow.xml <<< changed this file (on the target site)
[hdfs@target ~]$ t1.xml
[hdfs@target ~]$ grep -i preserveBlockSize hdfs-mirroring-workflow.xml
<arg>-preserveBlockSize</arg> <arg>${preserveBlockSize}</arg> <arg>-preserveBlockSize</arg> <arg>true</arg>
[hdfs@target ~]$ grep -i skipChecksum hdfs-mirroring-workflow.xml
<arg>-skipChecksum</arg> <arg>${skipChecksum}</arg> <arg>-skipChecksum</arg> <arg>true</arg>
[hdfs@target ~]$
Please help me. Where can I find the file hdfs-replication.properties?