I am always ingesting social data, IoT, and mobile streams into my HDFS cluster. Unfortunately, my cluster is an ephemeral, cloud-based Hortonworks HDP 2.6 Hadoop cluster, so I don't have a permanent store for my data. My processes run for a few weeks and then the cluster is destroyed.
I wanted a quick way to save all my ORC files.
Enter NiFi.
Backup
First, we list from a few top-level directories in HDFS to capture all the files and sub-directories we want to back up. Each list processor maintains a timestamp so it knows which files it has already processed; as new files are added, they are assimilated into the flow.
For massive data migration, we can run this on many nodes and use the Distributed Cache service to maintain the state.
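To make the pattern concrete outside of NiFi, here is a minimal Python sketch of the same list-then-fetch idea using the hdfs CLI. The HDFS root, local target, and state-file paths are assumptions for illustration; in the real flow, the ListHDFS processor keeps this timestamp state for you.

#!/usr/bin/env python3
"""Sketch of the list-then-fetch backup pattern using the hdfs CLI.
All paths below are assumptions for illustration."""
import os
import subprocess
from datetime import datetime

HDFS_ROOT = "/data/orc"           # assumed top-level HDFS directory to back up
LOCAL_ROOT = "/backup/orc"        # assumed local target directory
STATE_FILE = "/backup/.last_run"  # plays the role of the ListHDFS state timestamp

def last_run() -> datetime:
    """Read the timestamp of the previous run; start from the beginning if none."""
    try:
        with open(STATE_FILE) as f:
            return datetime.fromisoformat(f.read().strip())
    except FileNotFoundError:
        return datetime.min

def list_new_files(since: datetime):
    """Recursively list HDFS files modified after the previous run."""
    listing = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", HDFS_ROOT],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) < 8 or parts[0].startswith("d"):
            continue  # skip directories and the "Found N items" header
        modified = datetime.strptime(f"{parts[5]} {parts[6]}", "%Y-%m-%d %H:%M")
        if modified > since:
            yield parts[7]  # absolute HDFS path

def backup():
    since = last_run()
    for hdfs_path in list_new_files(since):
        # Preserve the HDFS sub-directory layout under the local root
        local_path = os.path.join(LOCAL_ROOT, hdfs_path.lstrip("/"))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        subprocess.run(["hdfs", "dfs", "-get", hdfs_path, local_path], check=True)
    with open(STATE_FILE, "w") as f:
        f.write(datetime.now().isoformat())

if __name__ == "__main__":
    backup()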
Restore
The restore flow is very simple: read from the local file system and write to HDFS. For writing to HDFS, I use /${path} as the directory so each file is written to the correct sub-directory for its file group. Easy, it's like rsync, but it's Tim Sync. Make sure you have your Hadoop configuration files set. If you are using Kerberos, make sure you set your principal and keytab, and be very careful about case sensitivity!
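For reference, the same restore direction looks roughly like the sketch below when done with the hdfs CLI instead of NiFi. The local directory is an assumption that mirrors the layout the backup produced, and the principal and keytab in the comment are placeholders.

#!/usr/bin/env python3
"""Sketch of the restore direction: read local files and write them back to
HDFS under their original sub-directories. LOCAL_ROOT is an assumed directory
that mirrors the original HDFS layout."""
import os
import subprocess

LOCAL_ROOT = "/backup/orc"  # assumed local backup directory

# With Kerberos, authenticate first (placeholder principal and keytab):
#   kinit -kt /etc/security/keytabs/nifi.keytab nifi@EXAMPLE.COM

for dirpath, _dirs, files in os.walk(LOCAL_ROOT):
    for name in files:
        local_path = os.path.join(dirpath, name)
        # Rebuild the original HDFS directory from the relative path,
        # the same idea as writing to /${path} in the NiFi flow
        hdfs_dir = os.path.normpath("/" + os.path.relpath(dirpath, LOCAL_ROOT))
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
        subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir], check=True)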
On the local side of the restore, it doesn't get any simpler: I use GetFile to read from the local file system. As part of that, the source files are deleted; I have a big USB 3.0 drive and want to keep them, so I copy them to a different directory for later storage. I should probably compress those. Once they get large enough, I may invest in a local RPI storage array running HDP 2.6 with some attached Western Digital PiDrives.
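For the compression I keep putting off, something as small as the sketch below would do; the archive directory and tarball name are just examples.

#!/usr/bin/env python3
"""Sketch of compressing the copied-aside files into a dated tarball."""
import tarfile
from datetime import date

ARCHIVE_DIR = "/media/usb/orc_archive"  # assumed copy-aside directory on the USB drive
TARBALL = f"/media/usb/orc_archive_{date.today()}.tar.gz"

with tarfile.open(TARBALL, "w:gz") as tar:
    # Keep everything under a single top-level folder inside the tarball
    tar.add(ARCHIVE_DIR, arcname="orc_archive")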
The Data Provenance record of one of the flowfiles shows the path and filename. This makes it very easy to move between, say, S3, on-premises HDFS, local file systems, cloud file systems, jump drives, or wherever. Your data is yours; take it with you.