Support Questions


NiFi PutParquet processor, how to enter JSON schema info?

Contributor

I'm trying to convert a local JSON file, picked up with the GetFile processor, into a Parquet file on HDFS. I'm following this guide:

 

https://medium.com/@abdelkrim.hadjidj/using-apache-nifi-for-json-to-parquet-conversion-945219d5caba

 

I chose a Record Reader of JsonTreeReader, but I don't see any of the schema properties (Schema Access Strategy, Schema Text) in the PutParquet processor. How do I get those to show up, or how do I enter them manually?

1 ACCEPTED SOLUTION

Super Mentor

@medloh 

The article you are using for reference is old and a bit out of date.
As part of the work that went into NIFI-3921, the schema properties were removed from the PutParquet processor. Before those changes, at the time of the article you referenced, you had to set the schema properties on the processor, and they had to match the schema properties set in the RecordReader. With the changes, the processor simply gets the schema from the reader, so it does not need to be configured a second time in the processor properties.

Also, at the time of that article there were no ParquetReader or ParquetRecordSetWriter controller services. Now that NiFi has a Parquet reader and writer, you can use the ConvertRecord processor to read a source FlowFile and convert it to Parquet within your dataflow, leaving you free to use whatever processor you want downstream to write out the Parquet content. You can think of PutParquet as a combination of ParquetRecordSetWriter and PutHDFS, with only the RecordReader being selectable.
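As a sketch, the ConvertRecord-based flow described above might look like the following (the HDFS target directory is an illustrative placeholder, not from this thread):

```text
GetFile                                        # picks up the local JSON file
   │
   ▼
ConvertRecord
   Record Reader : JsonTreeReader              # parses the incoming JSON
   Record Writer : ParquetRecordSetWriter      # serializes records as Parquet
   │
   ▼
PutHDFS
   Directory : /data/parquet                   # illustrative target path
```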

Hope this helps,

Matt


5 REPLIES


Contributor

Thanks Matt,

 

The author of that article recommended the same thing; I'll give it a try.

 

" Also, NiFi has now a Parquet Record that you can use outside of the PutParquet. I advise you to use ConvertRecord then PutHDFS directly. This is better."

 

I'm a little unsure of how and where I need to configure the schemas but hopefully I'll figure it out.

 

Super Mentor

@medloh 

The schema only needs to be defined in the RecordReader configured in the PutParquet processor.

In the case of the ConvertRecord processor, there are both a Record Reader and a Record Writer. The Record Writer can inherit the schema from the Record Reader, or define its own schema.
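For example, if the source JSON records looked like {"id": 1, "name": "alice"}, the Schema Text property of the reader could hold an Avro schema along these lines (the record and field names here are hypothetical):

```json
{
  "type": "record",
  "name": "example_record",
  "fields": [
    { "name": "id",   "type": "long" },
    { "name": "name", "type": ["null", "string"], "default": null }
  ]
}
```

The ["null", "string"] union with a null default makes a field optional, which matters when some input records omit it.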

Hope this helps,

Matt

Contributor

Thanks, 

 

I set the same Avro schema in both the reader and the writer before reading this, and that worked after I marked some fields as optional.

 

The last task is to gain a better understanding of how I can control the output filename. Maybe something like this:

 

https://community.cloudera.com/t5/Support-Questions/Nifi-date-filename/m-p/172305


Super Mentor

@medloh 

 

That is the correct solution here; the filename is always stored in a FlowFile attribute named "filename".
Using the UpdateAttribute processor is the easiest way to manipulate that FlowFile attribute.

You can use other attributes, static text, and even subjectless Expression Language functions like "now()" or "nextInt()" to create dynamic filenames for each FlowFile:
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
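For example, an UpdateAttribute processor could set the "filename" attribute to an expression such as the following (the date format pattern and the ".parquet" suffix are illustrative choices, not from the thread):

```text
${filename:substringBeforeLast('.')}-${now():format('yyyyMMdd-HHmmss')}.parquet
```

This keeps the original base name, appends a timestamp so each FlowFile gets a unique name, and adds a Parquet extension.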

Hope this helps,

Matt