Falcon HDFS mirror distcp options

Contributor

Hello,

I'm working with Falcon using the built-in HDFS mirroring capabilities and would like to enable two DistCp options in the workflow XML: the -atomic and -strategy flags. Below is my Oozie workflow with these two options commented out, as that approach was unsuccessful. Is there a way to pass these in using a -D option, or would the FeedReplicator class need to be modified to support this?

<workflow-app xmlns='uri:oozie:workflow:0.3' name='falcon-dr-fs-workflow'>
    <start to='dr-replication'/>
    <!-- Replication action -->
    <action name="dr-replication">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property> <!-- hadoop 2 parameter -->
                    <name>oozie.launcher.mapreduce.job.user.classpath.first</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>oozie.launcher.mapred.job.priority</name>
                    <value>${jobPriority}</value>
                </property>
                <property>
                    <name>oozie.use.system.libpath</name>
                    <value>true</value>
                </property>
                <property>
                    <name>oozie.action.sharelib.for.java</name>
                    <value>distcp</value>
                </property>
                <property>
                    <name>oozie.launcher.oozie.libpath</name>
                    <value>${wf:conf("falcon.libpath")}</value>
                </property>
                <property>
                    <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
                    <value>${drSourceClusterFS},${drTargetClusterFS}</value>
                </property>
            </configuration>
            <main-class>org.apache.falcon.replication.FeedReplicator</main-class>
            <arg>-Dmapred.job.queue.name=${queueName}</arg>
            <arg>-Dmapred.job.priority=${jobPriority}</arg>

            <!--arg>-atomic</arg>
            <arg>-strategy</arg>
            <arg>dynamic</arg-->

            <arg>-maxMaps</arg>
            <arg>${distcpMaxMaps}</arg>
            <arg>-mapBandwidth</arg>
            <arg>${distcpMapBandwidth}</arg>
            <arg>-sourcePaths</arg>
            <arg>${drSourceDir}</arg>
            <arg>-targetPath</arg>
            <arg>${drTargetClusterFS}${drTargetDir}</arg>
            <arg>-falconFeedStorageType</arg>
            <arg>FILESYSTEM</arg>
            <arg>-availabilityFlag</arg>
            <arg>${availabilityFlag == 'NA' ? "NA" : availabilityFlag}</arg>
            <arg>-counterLogDir</arg>
            <arg>${logDir}/job-${nominalTime}/${srcClusterName == 'NA' ? '' : srcClusterName}</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>
            Workflow action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]
        </message>
    </kill>
    <end name="end"/>
</workflow-app>
1 ACCEPTED SOLUTION

@Kyle Dunn: Falcon doesn't support those DistCp options and, yes, that would require a code change.
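
Roughly, the change would mean teaching FeedReplicator to accept the two extra flags and map them onto DistCpOptions before launching DistCp. Below is a minimal sketch of the idea, assuming Apache Commons CLI argument parsing and the Hadoop 2.x org.apache.hadoop.tools.DistCpOptions API; the class ReplicationOptionsSketch and its method layout are illustrative only, not the actual Falcon source.

// Illustrative sketch only -- not the actual org.apache.falcon.replication.FeedReplicator code.
import java.util.Arrays;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.GnuParser;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCpOptions;

public class ReplicationOptionsSketch {

    // Register the two extra flags next to the existing ones (-maxMaps, -mapBandwidth, ...).
    static Options buildOptions() {
        Options options = new Options();
        options.addOption(new Option("atomic", false, "commit data atomically via a temporary target directory"));
        options.addOption(new Option("strategy", true, "copy strategy: uniformsize or dynamic"));
        // ... the existing options (-sourcePaths, -targetPath, -falconFeedStorageType, ...) go here ...
        return options;
    }

    // Map the parsed flags onto DistCpOptions before handing off to the DistCp job.
    static DistCpOptions toDistCpOptions(CommandLine cmd, Path source, Path target) {
        DistCpOptions distcpOptions = new DistCpOptions(Arrays.asList(source), target);
        if (cmd.hasOption("atomic")) {
            distcpOptions.setAtomicCommit(true);
        }
        if (cmd.hasOption("strategy")) {
            distcpOptions.setCopyStrategy(cmd.getOptionValue("strategy")); // e.g. "dynamic"
        }
        return distcpOptions;
    }

    public static void main(String[] args) throws ParseException {
        CommandLine cmd = new GnuParser().parse(buildOptions(), args, true);
        DistCpOptions opts = toDistCpOptions(cmd,
                new Path("hdfs://source/apps/data"), new Path("hdfs://target/apps/data"));
        System.out.println("atomic=" + opts.shouldAtomicCommit()
                + ", strategy=" + opts.getCopyStrategy());
    }
}

With a change along these lines, the workflow side would just be uncommenting the -atomic / -strategy arg elements shown in the question.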

5 REPLIES

Contributor

Would you be able to provide an example of what this code change might look like, based on the existing FeedReplicator code?

Contributor

Alternatively, what are the limitations of out-of-stack version support for Falcon? The snapshot-based replication in Falcon 0.10 ultimately provides the functionality I'm looking for, but I'm currently running HDP 2.3 / 2.4.

What limitations are we talking about here? Sorry, I don't understand your question.

If you are asking about the DistCp options supported in HDFS mirroring, the options currently supported are:

  • maxMaps
  • mapBandwidth

The following additional options can be supported using the workaround described below:

  • overwrite
  • ignoreErrors
  • skipChecksum
  • removeDeletedFiles
  • preserveBlockSize
  • preserveReplicationNumber
  • preservePermission

Please modify the workflow file hdfs-replication-workflow.xml as shown below: after the distcpMapBandwidth argument, add the following content.

<arg>-overwrite</arg>
<arg>${overwrite}</arg>
<arg>-ignoreErrors</arg>
<arg>${ignoreErrors}</arg>
<arg>-skipChecksum</arg>
<arg>${skipChecksum}</arg>
<arg>-removeDeletedFiles</arg>
<arg>${removeDeletedFiles}</arg>
<arg>-preserveBlockSize</arg>
<arg>${preserveBlockSize}</arg>
<arg>-preserveReplicationNumber</arg>
<arg>${preserveReplicationNumber}</arg>
<arg>-preservePermission</arg>
<arg>${preservePermission}</arg>

Pass the options below in hdfs-replication.properties:

overwrite=false
ignoreErrors=false
skipChecksum=false
removeDeletedFiles=true
preserveBlockSize=true
preserveReplicationNumber=true
preservePermission=true

These will work out of the box, as FeedReplicator already has support for them, so no code change is required. Thanks!
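
For reference, the reason these particular flags can be passed straight through is that they map directly onto settings of Hadoop's DistCpOptions. Below is a rough sketch of that mapping, using the Hadoop 2.x org.apache.hadoop.tools.DistCpOptions API; the class WorkaroundOptionsSketch and its apply() helper are illustrative only, not the actual FeedReplicator code.

// Illustrative sketch only -- shows how boolean arguments like the ones above
// typically map onto Hadoop 2.x DistCpOptions; not the actual Falcon code.
import java.util.Arrays;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCpOptions;
import org.apache.hadoop.tools.DistCpOptions.FileAttribute;

public class WorkaroundOptionsSketch {

    static void apply(DistCpOptions options, boolean overwrite, boolean ignoreErrors,
                      boolean skipChecksum, boolean removeDeletedFiles,
                      boolean preserveBlockSize, boolean preserveReplicationNumber,
                      boolean preservePermission) {
        // DistCp needs either overwrite or update (sync) mode; -delete and
        // -skipcrccheck are only accepted together with update mode.
        if (overwrite) {
            options.setOverwrite(true);                   // -overwrite
        } else {
            options.setSyncFolder(true);                  // DistCp -update
        }
        if (ignoreErrors) {
            options.setIgnoreFailures(true);              // -ignoreErrors -> DistCp -i
        }
        if (skipChecksum) {
            options.setSkipCRC(true);                     // -skipChecksum (update mode only)
        }
        if (removeDeletedFiles) {
            options.setDeleteMissing(true);               // -removeDeletedFiles -> DistCp -delete
        }
        if (preserveBlockSize) {
            options.preserve(FileAttribute.BLOCKSIZE);    // -preserveBlockSize -> -pb
        }
        if (preserveReplicationNumber) {
            options.preserve(FileAttribute.REPLICATION);  // -preserveReplicationNumber -> -pr
        }
        if (preservePermission) {
            options.preserve(FileAttribute.PERMISSION);   // -preservePermission -> -pp
        }
    }

    public static void main(String[] args) {
        DistCpOptions options = new DistCpOptions(
                Arrays.asList(new Path("hdfs://source/apps/data")),
                new Path("hdfs://target/apps/data"));
        // Same values as the hdfs-replication.properties example above.
        apply(options, false, false, false, true, true, true, true);
        System.out.println(options);
    }
}

Note that DistCp only accepts -delete and -skipcrccheck in update (or overwrite) mode, which is why the sketch picks the copy mode first.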

Expert Contributor
Hi!

After changing the preserveBlockSize and skipChecksum parameters on the target site, I do not see any change in the workflow XML file for the task (after recreating the task):

[hdfs@target ~]$ hdfs dfs -ls /apps/falcon/extensions/hdfs-mirroring/retargets/runtime/hdfs-mirroring-workflow.xml
-rwxr-xr-x   2 hdfs users       4943 2017-11-13 22:39 /apps/falcon/extensions/hdfs-mirroring/retargets/runtime/hdfs-mirroring-workflow.xml        <<< changed this file (on the target site)
[hdfs@target ~]$
[hdfs@target ~]$ grep -i preserveBlockSize hdfs-mirroring-workflow.xml
            <arg>-preserveBlockSize</arg>
            <arg>${preserveBlockSize}</arg>
            <arg>-preserveBlockSize</arg> <arg>true</arg>
[hdfs@target ~]$
[hdfs@target ~]$ grep -i skipChecksum hdfs-mirroring-workflow.xml
            <arg>-skipChecksum</arg>
            <arg>${skipChecksum}</arg>
            <arg>-skipChecksum</arg> <arg>true</arg>
[hdfs@target ~]$

Please help me.

Where can I find the file hdfs-replication.properties?