How can I append FlowFiles to an existing Parquet file instead of overwriting it when using the PutParquet processor?

Explorer

I have a process group as follows:

ListFile > FetchFile > MergeContent > ConvertCSVToAvro > PutParquet

On the first execution everything works fine: all 15 files in the directory are written into the Parquet file.

After that, if a new file is added to the directory, it is ingested, but the original Parquet file is overwritten.

What I want is to `append` the contents of the new file to the Parquet file, not 'overwrite' it.

Is there an approach or a processor that resolves this issue?

Super Guru
@Michael LY

Before the PutParquet processor, add an UpdateAttribute processor and define a new property:

filename as ${UUID()} //changing the filename to a UUID value

Every FlowFile also has an attribute named uuid, which is a unique identifier for that FlowFile. In this processor we set the filename to a newly generated UUID, so PutParquet no longer overwrites the existing files.
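To illustrate why a unique filename avoids the overwrite, here is a minimal Python sketch. It is not NiFi code: the function names are hypothetical, and the payloads are placeholder bytes rather than real Parquet data. It only assumes that writing to an existing path replaces the file, which is the behaviour described in the question.

```python
import uuid
from pathlib import Path


def write_with_fixed_name(directory: Path, payload: bytes,
                          filename: str = "data.parquet") -> Path:
    """Write to the same path every time; a second call replaces the first file."""
    target = directory / filename
    target.write_bytes(payload)  # an existing file at this path is overwritten
    return target


def write_with_uuid_name(directory: Path, payload: bytes) -> Path:
    """Write to a freshly generated UUID name, comparable to filename = ${UUID()}."""
    target = directory / str(uuid.uuid4())  # random type 4 UUID as the filename
    target.write_bytes(payload)
    return target
```

Each call to `write_with_uuid_name` leaves one more file in the directory, so nothing is lost, but the output is many files rather than one.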

Processor Configs:- [screenshot: 43659-update.png]

Flow:-

ListFile > FetchFile > MergeContent > ConvertCSVToAvro > UpdateAttribute > PutParquet

Explorer

@Shu

Thanks for the reply. However, using the UpdateAttribute processor will produce multiple Parquet files as output.

What I want to achieve is:

N files under directory > 1 .parquet file

Super Guru
@Michael LY

I don't think there is a processor that can merge Parquet files into one, but we can achieve this by using the PutHiveQL processor.

Flow:-

ListFile > FetchFile > MergeContent > ConvertCSVToAvro > UpdateAttribute > PutParquet (success relationship) > ReplaceText (success relationship) > PutHiveQL

PutParquet:-

Store the Parquet files in a temporary HDFS directory and create a Hive table on top of this temp directory.

Route the success relationship of the PutParquet processor to a ReplaceText processor.

ReplaceText Processor:-

Create another final (target) table in Hive.

Configs:- [screenshot: 43661-replace.png]

Set the Replacement Value property to

insert overwrite table <final-table-name> select * from <temp-table-name>

The insert overwrite statement above creates one Parquet file in the final table by selecting from the N files in the temp table.

Connect the success relationship of ReplaceText to the PutHiveQL processor.
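The stage-then-rewrite pattern behind this flow can be sketched in plain Python. This is only an analogy under stated assumptions: small CSV part files stand in for the Parquet files in the temp directory, a single CSV file stands in for the final table, and `compact_staging` is a hypothetical helper, not a NiFi or Hive API.

```python
import csv
import os
import tempfile
from pathlib import Path


def compact_staging(staging_dir: Path, final_file: Path) -> int:
    """Rewrite final_file from every part file in staging_dir; return the row count.

    Loosely mirrors: insert overwrite table <final> select * from <temp>
    """
    rows = []
    for part in sorted(staging_dir.glob("*.csv")):
        with part.open(newline="") as handle:
            rows.extend(csv.reader(handle))

    # Write the combined rows to a temporary file next to the target,
    # then swap it in so readers never see a half-written final file.
    final_file.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=final_file.parent, suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    os.replace(tmp_name, final_file)
    return len(rows)
```

Note that every run rebuilds the final file from everything in staging, so the staged part files have to be kept; if they are deleted, the next overwrite drops their rows from the final output.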

New Contributor

This is now possible with NiFi 1.10 using the new parquet reader and writer.

New Contributor

@SouperDude Can you explain how?

New Contributor

@SouperDude Can you tell me the detailed steps?
