Compare ListFTP contents to what is in Hive

I have a process that fetches files from an FTP location and processes them through HDFS -> Hive. At the end of the day I would like to confirm that every file on FTP was loaded into the Hive table. The Hive table has a filename field, so I can get the distinct list of files loaded that day using the SelectHiveQL processor. I tried getting the list of files on FTP from the ListFTP processor, but it just queues up 20 zero-byte flowfiles. I envisioned using ListFTP -> MergeContent to build a text file of all filenames in the FTP directory and then somehow comparing the results of SelectHiveQL with the ListFTP/MergeContent output, but MergeContent doesn't even run with the zero-byte files as input. Any suggestions on how to do this correctly?

1 ACCEPTED SOLUTION

Master Guru

@Margarita Uk

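Add a ReplaceText processor after ListFTP and before MergeContent, and configure it to replace the content of each flowfile with its filename. ListFTP only lists the files, so its flowfiles are zero bytes and the filename is carried as a flowfile attribute.

ReplaceText processor config: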

62708-replacetext.png

With the above config, the filename becomes the content of each flowfile.

Then use the MergeContent processor with the config below:

62709-mergecontent.png

Configure the MergeContent processor as per your requirements, set the Delimiter Strategy to Text, and set the Demarcator to a comma followed by a new line.

Output:

940630588913985,
940634934689001

In this test I had 2 flowfiles from a GenerateFlowFile processor, and I used ReplaceText to change the content of each flowfile to its own filename.

After the MergeContent processor, the merged flowfile contains the 2 filenames, separated by a comma and a new line.
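
If the merged listing ever needs to be read outside NiFi, it can be split back into individual filenames. The following is only a minimal Python sketch, not part of the flow above: the function name `split_merged_listing` is hypothetical, and the sample string assumes the comma-plus-newline demarcator described here.

```python
from typing import List


def split_merged_listing(merged: str) -> List[str]:
    """Split merged content (comma + newline demarcator) into filenames."""
    names = []
    for line in merged.splitlines():
        # Drop surrounding whitespace and the trailing comma left by the demarcator.
        name = line.strip().rstrip(",").strip()
        if name:
            names.append(name)
    return names


# Sample content taken from the output shown above.
merged_listing = "940630588913985,\n940634934689001"
print(split_merged_listing(merged_listing))
# prints ['940630588913985', '940634934689001']
```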

To compare, store the merged file of filenames in Hive, and then get the distinct filenames that were loaded into your Hive table.

Then compare the two sets of filenames by selecting the filenames that are

  1. present in the first Hive table (the merged filenames table) and
  2. not present in the actual Hive table that you loaded the data into.
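
The two conditions above amount to a set difference: everything listed on FTP minus everything already in Hive. As a rough illustration of that logic only, here is a small Python sketch. The function name `missing_from_hive` and the sample filenames are hypothetical, and in the actual flow the comparison would run as a query against the two Hive tables.

```python
from typing import Iterable, List


def missing_from_hive(ftp_files: Iterable[str], hive_files: Iterable[str]) -> List[str]:
    """Return filenames listed on FTP that do not appear in the Hive table."""
    # Sets remove duplicates, mirroring the distinct filename list from Hive.
    return sorted(set(ftp_files) - set(hive_files))


# Hypothetical sample data: one listed file was never loaded.
ftp_listing = ["a.csv", "b.csv", "c.csv"]
hive_loaded = ["a.csv", "c.csv", "c.csv"]
print(missing_from_hive(ftp_listing, hive_loaded))
# prints ['b.csv']
```

An empty result means every file listed on FTP was loaded.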

This worked great, thank you!