
Nifi process big file using ConvertRecord processor

Contributor

Hi All,

I have a big JSON file (1M records). I need to replace a couple of fields in each record using custom logic. I used the ExecuteScript processor with a Groovy script but got an out-of-memory exception. I want to try the ConvertRecord processor instead.

I don't have a schema for the JSON, which is why I want to use ScriptedReader and ScriptedRecordSetWriter.

Questions:

What is the best practice for a use case like this: processing a big JSON file and making changes per record? Is ConvertRecord a good idea, or should I take a different approach?

Can you point me to an example of ScriptedReader and ScriptedRecordSetWriter using Groovy? (I found a lot of ExecuteScript examples, but no record-based Groovy ones.)

Also, is it possible to share a docs/blog link on the internals of the record-based approach? I want to understand at a deeper technical level why it is more robust than ExecuteScript (the file-based approach).

Thanks

Oleg.

4 REPLIES

Master Guru

Hello,

Here is a presentation with some more details about record processing:

https://www.slideshare.net/BryanBende/apache-nifi-record-processing

The main reason record processing achieves good performance is that it allows you to keep all your records together in a single flow file, rather than splitting them into a million flow files with one record per flow file.

It sounds like you were already keeping your 1M records together in a single flow file, so if you got out-of-memory exceptions, the most likely reason is that the entire content of the flow file was loaded into memory (either by your script or by the library it uses).

For large files you want to stream data in and out so that you never have the whole content in memory. This is how ConvertRecord works: it reads one record from the reader, passes that record to the writer, and repeats, so there is never more than one record in memory at a time.
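
To make that concrete, the per-record loop looks roughly like this (a simplified sketch against the RecordReader/RecordSetWriter interfaces, not the actual ConvertRecord source; the processor supplies a reader and writer wrapped around the flow file's input and output streams):

import org.apache.nifi.serialization.RecordReader
import org.apache.nifi.serialization.RecordSetWriter
import org.apache.nifi.serialization.record.Record

// Stream records one at a time from the reader to the writer.
void streamRecords(RecordReader reader, RecordSetWriter writer) {
    writer.beginRecordSet()
    Record record = reader.nextRecord()   // only this one record is in memory
    while (record != null) {
        writer.write(record)              // written out immediately
        record = reader.nextRecord()      // previous record can now be garbage collected
    }
    writer.finishRecordSet()              // writer emits any footer and reports the record count
}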

These are the best examples of Scripted record reader/writer:

https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-scripting-bundle/nifi-scripting-pro...

Contributor

Thanks Bryan for the explanation.

It is much clearer now. Just to get started: I need to read Avro content coming from the QueryDataBase processor, make changes to each record, and then convert it to CSV.

The RecordReader will read the output from the QueryDataBase processor and make the changes. Is there a simple example of how to read Avro (with a Groovy script) using ScriptedReader?

The output should be CSV. Is it possible to convert Avro to CSV using ConvertRecord, or does that also require writing code?

If it requires writing code, can you give me an example of how to convert Avro to CSV using Groovy (in the context of ScriptedRecordSetWriter)?

Thanks

Oleg

Master Guru

What types of changes do you need to make to the data?

You can use ConvertRecord with an AvroReader and a CsvWriter in order to convert from Avro to CSV.

You can use UpdateRecord before or after that to update the records.
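
Roughly, the setup would look like this (from memory, so double-check the exact service and property names in your NiFi version; /your_column is just a placeholder record path):

ConvertRecord
    Record Reader -> AvroReader (Schema Access Strategy: Use Embedded Avro Schema)
    Record Writer -> CSVRecordSetWriter

UpdateRecord (before or after ConvertRecord)
    Record Reader / Record Writer -> whatever format the data has at that point in the flow
    Replacement Value Strategy -> Literal Value or Record Path Value
    /your_column -> the replacement value for that field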

Contributor

Hi Bryan,

The end-to-end flow is:

read from DB (NiFi returns Avro format) -> MD5 selected columns -> convert to CSV -> put to S3

I wanted to do the MD5 with a Groovy script using ConvertRecord.

But I don't know how to start with the Groovy script, and the unit tests didn't help me. If possible, please share an example of how Groovy reads Avro data using ScriptedReader.
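
For the MD5 itself I would probably just use java.security.MessageDigest, something like this (only a sketch of the hashing step, not tested inside a reader/writer):

import java.security.MessageDigest

// MD5 one column value and return it as a hex string,
// e.g. md5Hex('hello') == '5d41402abc4b2a76b9719d911017c592'
String md5Hex(Object value) {
    MessageDigest.getInstance('MD5')
                 .digest(String.valueOf(value).getBytes('UTF-8'))
                 .encodeHex()
                 .toString()
}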

If my approach is wrong, please suggest a better way.

Thanks

Oleg.