Support Questions


NiFi PutParquet processor, how to enter JSON schema info?

Contributor

I'm trying to convert a local JSON file, picked up with the GetFile processor, into a Parquet file on HDFS. I'm following this guide:

 

https://medium.com/@abdelkrim.hadjidj/using-apache-nifi-for-json-to-parquet-conversion-945219d5caba

 

I chose a Record Reader of JsonTreeReader, but I don't see any of the schema properties (Schema Access Strategy, Schema Text) in the PutParquet processor. How do I get those to show up, or how do I enter them manually?

1 ACCEPTED SOLUTION

Super Mentor

@medloh 

The article you are using for reference is old and a bit out of date.
As part of the work that went into NIFI-3921, the schema properties were removed from the PutParquet processor. Before those changes, at the time of the article you referenced, you had to set the schema properties on the processor, and they had to match the schema properties set in the RecordReader. With the changes, the processor simply gets the schema from the reader, so it does not need to be configured a second time in the processor properties.

Also, at the time of that article there were no ParquetReader or ParquetRecordSetWriter controller services. Now that NiFi has a Parquet reader and writer, you can use the ConvertRecord processor to read a source FlowFile and convert it to Parquet within your dataflow, leaving you free to use whatever processor you want downstream to write out the Parquet content. You can think of PutParquet as a combination of ParquetRecordSetWriter and PutHDFS, with only the RecordReader being selectable.
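As a sketch, the ConvertRecord-based flow described above might look like the following (the HDFS target directory is an illustrative placeholder, not from this thread):

```text
GetFile                                        # picks up the local JSON file
   │
   ▼
ConvertRecord
   Record Reader : JsonTreeReader              # parses the incoming JSON
   Record Writer : ParquetRecordSetWriter      # serializes records as Parquet
   │
   ▼
PutHDFS
   Directory : /data/parquet                   # illustrative target path
```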

Hope this helps,

Matt


5 REPLIES


Contributor

Thanks Matt,

 

The author of that article recommended the same thing; I'll give it a try.

 

" Also, NiFi has now a Parquet Record that you can use outside of the PutParquet. I advise you to use ConvertRecord then PutHDFS directly. This is better."

 

I'm a little unsure of how and where I need to configure the schemas but hopefully I'll figure it out.

 

Super Mentor

@medloh 

The schema only needs to be defined in the RecordReader configured in the PutParquet processor.

In the case of the ConvertRecord processor, there are both a Record Reader and a Record Writer. The Record Writer can inherit the schema from the Record Reader, or define its own schema.
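For example, if the source JSON records looked like {"id": 1, "name": "alice"}, the Schema Text property of the reader could hold an Avro schema along these lines (the record and field names here are hypothetical):

```json
{
  "type": "record",
  "name": "example_record",
  "fields": [
    { "name": "id",   "type": "long" },
    { "name": "name", "type": ["null", "string"], "default": null }
  ]
}
```

The ["null", "string"] union with a null default makes a field optional, which matters when some input records omit it.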

Hope this helps,

Matt

Contributor

Thanks, 

 

I set the same Avro schema in both the reader and the writer before reading this, and that worked after I marked some fields as optional.

 

The last task is to gain a better understanding of how I can control the output filename. Maybe something like this:

 

https://community.cloudera.com/t5/Support-Questions/Nifi-date-filename/m-p/172305


Super Mentor

@medloh 

 

That is the correct solution here; the filename is always stored in a FlowFile attribute named "filename".
Using the UpdateAttribute processor is the easiest way to manipulate that FlowFile attribute.

You can use other attributes, static text, and even subjectless Expression Language functions like "now()" or "nextInt()" to create dynamic filenames for each FlowFile:
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
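For example, an UpdateAttribute processor could set the "filename" attribute to an expression such as the following (the date format pattern and the ".parquet" suffix are illustrative choices, not from the thread):

```text
${filename:substringBeforeLast('.')}-${now():format('yyyyMMdd-HHmmss')}.parquet
```

This keeps the original base name, appends a timestamp so each FlowFile gets a unique name, and adds a Parquet extension.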

Hope this helps,

Matt