Looking for a replacement for Hadoop to Hadoop copy (distcp) in NiFi? Can it be achieved? What would the topology look like?

New Contributor

Looking for a replacement for Hadoop to Hadoop copy (distcp) in NiFi? Can it be achieved?

The aim is to replicate data to another cluster in near real-time.

1 ACCEPTED SOLUTION


@PS

Yes, you can use the GetHDFS, ListHDFS, and FetchHDFS processors against the source cluster and the PutHDFS processor against the target.

GetHDFS: Monitors a user-specified directory in HDFS. Whenever a new file enters HDFS, it is copied into NiFi and deleted from HDFS. This processor is meant to move files from one location to another, not to copy data. It should also be scheduled to run on the Primary Node only when used in a cluster. To copy data out of HDFS and leave it intact, or to stream data from multiple nodes in the cluster, use the ListHDFS processor instead.

ListHDFS / FetchHDFS: ListHDFS monitors a user-specified directory in HDFS and emits a FlowFile containing the filename for each file that it encounters. It then persists this state across the entire NiFi cluster by way of a Distributed Cache. These FlowFiles can then be fanned out across the cluster and sent to the FetchHDFS Processor, which is responsible for fetching the actual content of those files and emitting FlowFiles that contain the content fetched from HDFS.
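
To make the pattern concrete, here is a rough sketch of how the processors might be wired together for cluster-to-cluster replication (the layout is illustrative, not taken from the original post):

    ListHDFS    -- runs on the Primary Node only, lists new files on the source cluster
        |
        | (success)
        v
    FetchHDFS   -- runs on every node, pulls the content of each listed file
        |
        | (success)
        v
    PutHDFS     -- writes the fetched content to the target cluster

Depending on the NiFi version, the connection between ListHDFS and FetchHDFS can be load-balanced across the cluster (or routed through a Remote Process Group on older releases) so the fetch work is spread over all nodes.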

PutHDFS: Writes the content of each incoming FlowFile to the specified location in HDFS.

With these processors you can set various conditions and parameters, such as minimum/maximum file age and a regex-based file filter on the source side, and the file owner and replication factor on the destination side (see the example configuration sketched below).

Deployment-wise, NiFi would essentially sit in a separate box between the two clusters.
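
As a hedged example, the key properties might be set roughly as follows. The property names come from the standard NiFi processor documentation, but exact names, defaults, and availability vary by NiFi version, and the directories, file paths, and user names below are purely illustrative:

    ListHDFS  (lists files on the SOURCE cluster)
      Hadoop Configuration Resources : /etc/source-cluster/core-site.xml,/etc/source-cluster/hdfs-site.xml
      Directory                      : /data/landing
      Recurse Subdirectories         : true
      File Filter                    : .*\.avro          (regex; only pick up matching files)
      Minimum File Age               : 30 sec            (skip files that are still being written)

    FetchHDFS  (fetches the listed content from the SOURCE cluster)
      Hadoop Configuration Resources : /etc/source-cluster/core-site.xml,/etc/source-cluster/hdfs-site.xml
      HDFS Filename                  : ${path}/${filename}

    PutHDFS  (writes to the TARGET cluster)
      Hadoop Configuration Resources : /etc/target-cluster/core-site.xml,/etc/target-cluster/hdfs-site.xml
      Directory                      : /data/replica/${path}
      Conflict Resolution Strategy   : replace
      Replication                    : 3
      Remote Owner                   : etl_user          (only applied if NiFi has HDFS superuser rights)

Note that only PutHDFS points at the target cluster's configuration files; ListHDFS and FetchHDFS share the source cluster's. If either cluster is Kerberized, the corresponding Kerberos principal/keytab (or credentials service) properties also need to be set on each processor.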

More Details:

https://nifi.apache.org/docs.html

https://nifi.apache.org/docs/nifi-docs/html/getting-started.html#what-processors-are-available

NiFi stores the data flowing through it locally, in addition to storing provenance (lineage) information. Take a look at the link below, under the "Content Repository" section, for the settings needed to limit the storage used by each; otherwise your disks may fill up fast.

https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
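
For reference, these are the kinds of nifi.properties settings the admin guide covers for capping that storage. The values shown are roughly the shipped defaults and should be sized for your own disks:

    # Content repository: limit how long and how much archived flow content is kept
    nifi.content.repository.archive.enabled=true
    nifi.content.repository.archive.max.retention.period=12 hours
    nifi.content.repository.archive.max.usage.percentage=50%

    # Provenance repository: limit how long and how much lineage data is kept
    nifi.provenance.repository.max.storage.time=24 hours
    nifi.provenance.repository.max.storage.size=1 GB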


3 REPLIES

New Contributor

Hello,

I am testing the solution and it's working well, but what about file deletes? How can we handle that action?

I thought about using GetHDFSEvents to capture unlink actions and pairing it with the DeleteHDFS processor, but it seems that GetHDFSEvents is not compatible with ADLS storage.
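
For a plain HDFS source, the idea would look roughly like this (property and attribute names are taken from memory of the GetHDFSEvents and DeleteHDFS documentation and should be double-checked against your NiFi version; the watched path is just an example):

    GetHDFSEvents   (watches the SOURCE cluster via HDFS inotify)
      HDFS Path to Watch        : /data/landing(/.*)?
      Event Types to Filter On  : unlink
        |
        | (success; the deleted path is carried in the hdfs.inotify.event.path attribute)
        v
    DeleteHDFS      (pointed at the TARGET cluster's configuration)
      Path                      : ${hdfs.inotify.event.path}

The source path may also need rewriting to the target directory layout first, e.g. with an UpdateAttribute processor in between.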

 

Since this post is quite old, is there any newer way to replicate data between two different clusters?

In particular, in a CDP Public Cloud environment (S3 to S3 or ADLS to ADLS)?

Super Mentor

@CristoE 

 

Since this question already has an accepted solution and is specific to a distcp replacement for HDFS, it would be much better to start an entirely new question in the community. You can always link to this question's solution as a reference in your new post. You would get more visibility that way, and we would not dilute the answer to this question with suggestions related to ADLS rather than HDFS.