zip folder using nifi

Expert Contributor

I have a folder named in YYYYMMdd format that contains files named in YYYYMMddHHmm format.

Example:

/day/20171202/201712020000..201712022359

Can we zip the folder 20171202 and put the compressed archive back in the same location?

1 ACCEPTED SOLUTION

Master Guru
@Mark

You can do that with the GetHDFS, GetFTP, or GetSFTP processors by setting

Keep Source File

false //by default it is set to false.

Once the Get processor is configured this way, every file it picks up is deleted from the source directory.

GetHDFS Configs:-

43827-gethdfs.png

Then use the PutHDFS, PutFTP, or PutSFTP processors and change these properties:

Compression codec

BZIP

Directory

<same-directory-path-as-gethdfs-directory-info>

PutHDFS Configs:-

43828-puthdfs.png

Here the PutHDFS processor is configured with the same directory as the GetHDFS processor, and with Compression codec set to BZIP.

So the files are compressed as they are written back to the HDFS directory.

FLOW:-

GetHDFS (success relationship) //gets the files from the HDFS directory and deletes them from the source directory -->
PutHDFS //compresses the files and stores them in the same source directory.

If you want to merge the files, use the MergeContent processor before the PutHDFS processor.

Use the reference below to configure the MergeContent processor.

https://community.hortonworks.com/questions/149047/nifi-how-to-handle-with-mergecontent-processor.ht...


8 REPLIES 8

Expert Contributor

@Shu

I tried the above as you said.

What I am getting is /day/20171202/YYYYMMddHHmm.bz2

What I am looking for is /day/20171202.zip

Can you help me, please?

Master Guru
@Mark

Method1:-

Use the ExecuteProcess processor with the configs below:-

43863-executeprocess.png

Properties:-

Command

zip

Command Arguments

-rm /day/${now():format('yyyyMMdd')}.zip /day/${now():format('yyyyMMdd')}

I have configured the arguments above with Expression Language, but you can change them as per your requirements.

  • -r zips the source folder recursively, and -m deletes the original files after zipping.
  • If a directory becomes empty after removal of the files, the directory is also removed.
  • No deletions are done until zip has created the archive without error.
  • This is useful for conserving disk space, but is potentially dangerous because it removes all the input files.
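
As a rough sketch of what Method 1 runs, assuming the zip utility is on the PATH, the function below does the same thing from a shell. The name zip_and_remove_day and its two parameters are made up for illustration, and date +%Y%m%d stands in for the ${now():format('yyyyMMdd')} expression. Unlike the processor arguments above, it changes into the parent folder first so that archive entries are stored relative to it; that is a choice made for this sketch only.

```shell
#!/bin/bash
# Sketch only: zip a day folder and remove the originals, as `zip -rm` does.
# zip_and_remove_day is a hypothetical helper name, not part of NiFi.
zip_and_remove_day() {
  local base_dir="$1"   # parent folder, e.g. /day
  local day="$2"        # folder name, e.g. 20171202
  # Work inside base_dir so archive entries are relative to it.
  # -r recurses into the folder; -m deletes the inputs only after the
  # archive has been written without error.
  ( cd "$base_dir" && zip -rm "${day}.zip" "$day" )
}

# Usage (assumed paths): zip_and_remove_day /day "$(date +%Y%m%d)"
```

Try it on a scratch copy of the folder first, since the originals are gone once the archive is written.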

(or)

Method2:-

We can zip the folder with the ExecuteProcess processor and then use the ExecuteStreamCommand processor to delete the source directory.

Use the ExecuteProcess processor and configure it as below.

43861-executeprocess.png

Command

zip

Command Arguments

-r /day/${now():format('yyyyMMdd')}.zip /day/${now():format('yyyyMMdd')}

In this processor we use Expression Language with the zip command, passing the desired zip file name and the source folder path.

Then connect the ExecuteProcess success relationship to an ExecuteStreamCommand processor to delete the source directory.

Configs:-

43862-executestreamcommand.png

To remove the directory we use a simple shell script:

bash# cat del.sh
#!/bin/bash
# Delete the directory given as the first argument; do nothing if it is empty.
[ -n "$1" ] || exit 1
rm -rf -- "$1"

The script expects one argument, which we pass from the Command Arguments property as

/day/${now():format('yyyyMMdd')}

In this processor we remove the directory.

Make sure the nifi user has permission to delete these directories.
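
The two steps of Method 2 can be sketched as a single shell function, assuming zip is installed. zip_then_delete is a hypothetical name; the && plays the role of the success relationship, so the source folder is removed only when zip exits without error.

```shell
#!/bin/bash
# Sketch only: archive a folder, then delete it if and only if zip succeeded.
# zip_then_delete is a hypothetical helper name, not part of NiFi.
zip_then_delete() {
  local src_dir="$1"    # folder to archive, e.g. /day/20171202
  local archive="$2"    # archive to create, e.g. /day/20171202.zip
  # Refuse to continue on an empty or missing folder so rm -rf never
  # runs against an unintended path.
  [ -n "$src_dir" ] && [ -d "$src_dir" ] || return 1
  zip -r "$archive" "$src_dir" && rm -rf -- "$src_dir"
}

# Usage (assumed paths):
#   day="$(date +%Y%m%d)"
#   zip_then_delete "/day/${day}" "/day/${day}.zip"
```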

You can choose whichever method best fits your case.

Expert Contributor

@Shu

I tried the above.

I am getting an error

42972-execute-process.jpg

I have the directory and files, still I am getting this error.

Master Guru

@Mark,

I think you are using Windows, which does not include a zip utility by default. The zip utility is available on Linux, which is where I tested.

To resolve this, download

https://www.microsoft.com/en-us/download/details.aspx?id=17657 and run the .exe file.

In the ExecuteProcess processor, use

Command

C:\Program Files (x86)\Windows Resource Kits\Tools\compress.exe //path where compress.exe got installed

Command Arguments

C:\<input directory> C:\<output-directory.zip>

Configs:-

43888-executeprocess.png

So we create the zip archive in the ExecuteProcess processor.

In your case the input directory would be like

C:\day\${now():format('yyyyMMdd')}

Output Directory

C:\day\${now():format('yyyyMMdd')}.zip

Then use the ExecuteStreamCommand processor to delete the input directory (the source directory).

We need to create a .bat file that deletes the input directory in this processor.

cmd>remove_dir.bat

@RD /s/q %1

The script above takes one argument and deletes that directory; we pass our input directory as the argument.

What are /S and /Q?

RD [/S] [/Q] [drive:]path
/S      Removes all directories and files in the specified directory
        in addition to the directory itself.  Used to remove a directory
        tree.
/Q      Quiet mode, do not ask if ok to remove a directory tree with /S

Configs:-

Command Arguments

"C:\day\${now():format('yyyyMMdd')}"

Command Path

C:<delete-directory.bat file path>

For testing I tried the configs below:-

43889-executestreamcommand.png

In this processor we are deleting the input directory.

Master Guru

@Mark

The directory needs to be on the local file system, not in HDFS, for the zip command to work.

Make sure zip is installed on your node.

Command to check whether zip is installed:

#zip

42980-zip.png

If running zip shows output like the above, zip is installed on the node.

If it is not installed, run

#yum install zip
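
If you would rather keep using the zip command on data that lives in HDFS, one possible workaround (not part of the flow described in this thread) is to stage the folder on the local disk first. The sketch below assumes the hdfs client and zip are both on the PATH of the node; zip_hdfs_day and the temporary staging folder are hypothetical.

```shell
#!/bin/bash
# Sketch only: stage an HDFS folder locally, zip it, and put the archive back.
# zip_hdfs_day is a hypothetical helper name; adjust paths to your cluster.
zip_hdfs_day() {
  local hdfs_base="$1"  # parent folder in HDFS, e.g. /day
  local day="$2"        # folder name, e.g. 20171202
  local stage status=0
  command -v zip >/dev/null 2>&1 || { echo "zip is not installed" >&2; return 1; }
  stage="$(mktemp -d)" || return 1
  # Copy the folder out of HDFS, zip it locally, and upload the archive.
  # The HDFS source folder is removed only if every earlier step succeeded.
  {
    hdfs dfs -get "${hdfs_base}/${day}" "$stage/" &&
      ( cd "$stage" && zip -r "${day}.zip" "$day" ) &&
      hdfs dfs -put "$stage/${day}.zip" "${hdfs_base}/" &&
      hdfs dfs -rm -r "${hdfs_base}/${day}"
  } || status=$?
  # Always clean up the local staging folder.
  rm -rf -- "$stage"
  return "$status"
}

# Usage (assumed paths): zip_hdfs_day /day 20171202
```

This needs enough local disk space to hold one day of files plus the archive.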

If you want to zip files that are in HDFS, follow the steps below:-

Use the GetHDFS processor to pick up your files from HDFS, with the same configs as in my first answer.

Then use the MergeContent processor with

42982-merge.png

Every flowfile from the GetHDFS processor has a path attribute, so we use path as the Correlation Attribute Name in the MergeContent processor.

The processor waits for 1 minute and merges all flowfiles that have the same path attribute.

Change the Keep Path property as per your requirements.

Keep Path (default: false; allowable values: true, false)
If using the Zip or Tar Merge Format, specifies whether or not the FlowFiles' paths should be included in their entry names; if using other merge strategy, this value is ignored

You can change the other configs as per your requirements by following the reference below on configuring the MergeContent processor.

https://community.hortonworks.com/questions/149047/nifi-how-to-handle-with-mergecontent-processor.ht...

Then in the PutHDFS processor use the same configs as in my first answer, and change the property to

Compression codec

NONE

Because the zipping is done in the MergeContent processor itself, there is no need to compress again in the PutHDFS processor.

Expert Contributor

@Shu

Thanks for the detailed explanation.

I am using NiFi 1.1 in a Linux environment.

Also, the /day folder is in HDFS, which is on Linux.

I am also wondering why zip was not working.

New Contributor

@Shu


I also have the same requirement to zip a folder in HDFS directly. I am using the MergeContent processor with Merge Format ZIP, but I am not able to get a single zipped file after the MergeContent processor.