Created 11-22-2017 07:07 AM
I have a process group as follows:
ListFile > FetchFile > MergeContent > ConvertCSVToAvro > PutParquet
On the first execution everything works fine: all 15 files in the directory are written into a Parquet file.
After that, if a new file is added to the directory, it is ingested, but the original Parquet file is overwritten.
What I want is to `append` the contents of the new file to the Parquet file, not 'overwrite' it.
Is there an approach or processor that resolves this issue?
Created on 11-22-2017 08:21 AM - edited 08-17-2019 10:37 PM
Before the PutParquet processor, add an UpdateAttribute processor with a new property:
filename as ${UUID()} //changing the filename to a randomly generated UUID value
Each FlowFile also carries its own attribute named uuid, which is a unique identifier for that FlowFile. In this processor we change the filename to a UUID, so every output file gets a distinct name and the existing files are not overwritten.
Processor Configs:-
Flow:-
ListFile > FetchFile > MergeContent > ConvertCSVToAvro > UpdateAttribute > PutParquet
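To illustrate why the fixed filename leads to an overwrite and a UUID filename does not, here is a minimal Python sketch. It is not NiFi code and writes plain text rather than Parquet; put_file, run_demo, and the name merged.parquet are hypothetical stand-ins, and the only assumption is that the writer targets directory/filename and replaces a file that already exists there.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Tuple


def put_file(directory: Path, filename: str, content: str) -> Path:
    """Hypothetical stand-in for a writer whose target path is directory/filename."""
    target = directory / filename
    target.write_text(content)  # a file with the same name is replaced
    return target


def run_demo() -> Tuple[int, int]:
    """Write two batches with a fixed name and with UUID names; return the file counts."""
    with tempfile.TemporaryDirectory() as tmp:
        fixed = Path(tmp) / "fixed"
        unique = Path(tmp) / "unique"
        fixed.mkdir()
        unique.mkdir()

        for batch in ("batch-1", "batch-2"):
            # Same filename on every run: the second batch replaces the first.
            put_file(fixed, "merged.parquet", batch)
            # Fresh random UUID on every run, similar in spirit to filename = ${UUID()}.
            put_file(unique, str(uuid.uuid4()), batch)

        return len(list(fixed.iterdir())), len(list(unique.iterdir()))


if __name__ == "__main__":
    print(run_demo())
```

The fixed-name directory ends up with one file holding only the latest batch, while the UUID directory keeps one file per batch.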
Created 11-22-2017 08:41 AM
Thanks for the reply. However, using the UpdateAttribute processor will produce multiple Parquet output files.
What I want to achieve is:
N files under directory > 1 .parquet file
Created on 11-22-2017 03:04 PM - edited 08-17-2019 10:36 PM
I don't think there is a processor that merges Parquet files into one, but we can achieve this by using the PutHiveQL processor.
Flow:-
ListFile > FetchFile > MergeContent > ConvertCSVToAvro > UpdateAttribute > PutParquet (success relationship) > ReplaceText (success) > PutHiveQL
PutParquet:-
Store the Parquet files in a temporary HDFS directory and create a Hive table on top of this temp directory.
Route the success relationship of the PutParquet processor to ReplaceText.
ReplaceText processor:-
Create another final (or target) table in Hive.
Configs:-
Replacement Value property as
insert overwrite table <final-table-name> select * from <temp-table-name>
The above insert overwrite statement rewrites the final table from the N files in the temp table; for small data this normally leaves one Parquet file in the final table.
Connect the success relation from ReplaceText to PutHiveQL processor.
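The staging-then-overwrite pattern above can be pictured with a small Python analogy. This is only a sketch on plain text files, not Hive or Parquet: insert_overwrite, run_demo, and the file names are hypothetical, and the function simply rebuilds one final file from everything currently in the temp directory, which is the effect the insert overwrite statement is used for here.

```python
import tempfile
from pathlib import Path
from typing import List


def insert_overwrite(temp_dir: Path, final_file: Path) -> List[str]:
    """Rebuild final_file from every file in temp_dir, replacing any previous version."""
    rows: List[str] = []
    for part in sorted(temp_dir.iterdir()):
        rows.extend(part.read_text().splitlines())
    final_file.write_text("\n".join(rows) + "\n")
    return rows


def run_demo() -> List[str]:
    """Simulate two ingestion runs and return the rows of the single final file."""
    with tempfile.TemporaryDirectory() as tmp:
        temp_dir = Path(tmp) / "temp_table"
        temp_dir.mkdir()
        final_file = Path(tmp) / "final_table.txt"

        # First run: one staged file, then rebuild the final file.
        (temp_dir / "part-a").write_text("row1\nrow2\n")
        insert_overwrite(temp_dir, final_file)

        # Second run: a new staged file arrives; the final file is rebuilt from both.
        (temp_dir / "part-b").write_text("row3\n")
        insert_overwrite(temp_dir, final_file)

        return final_file.read_text().splitlines()


if __name__ == "__main__":
    print(run_demo())
```

Because every run rereads the whole temp directory, newly staged rows show up in the final file as if they had been appended, while there is still only one final file.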
Created 12-04-2019 12:24 PM
This is now possible with NiFi 1.10 using the new parquet reader and writer.
Created 05-27-2020 03:27 AM
@SouperDude Can you explain how?
Created on 08-04-2020 01:21 AM - edited 08-04-2020 01:27 AM
@SouperDude Can you tell me the detailed steps?