With the release of Apache NiFi 1.10 (http://nifi.apache.org/download.html), you can configure a Parquet record reader (ParquetReader) to read incoming Parquet-formatted files.

 

Apache NiFi record processing lets you read and handle a set of data as a single unit. By supplying a schema that matches the incoming data, you can process records as if the data were a table, and easily convert between formats (CSV, JSON, Avro, Parquet, XML). Before record processing was added to Apache NiFi, you had to split the file line by line, perform the transformation, and then merge everything back together into a single output, which is extremely inefficient. With record processing, we can operate on a large unit of data in a single step, which is much faster.

 

Apache Parquet is a columnar storage format (https://parquet.apache.org/documentation/latest/), and the format includes the schema for the data that it stores. This is a great feature that we can use with Record processing in Apache NiFi to supply the schema automatically to the flow. 
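
As a rough illustration of where that schema lives: a Parquet file starts and ends with the 4-byte magic PAR1, and the 4 bytes just before the trailing magic give the length of the footer metadata, which holds the schema. The Python sketch below uses only the standard library to check that framing and report the footer length. It does not decode the schema itself (that needs a Parquet library), it assumes an unencrypted file, and the function name parquet_footer_length is made up for this example.

```python
import os
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of an unencrypted Parquet file


def parquet_footer_length(path):
    """Return the size in bytes of the footer metadata (which holds the schema).

    Hypothetical helper for illustration; it only checks the file framing and
    does not decode the Thrift-encoded metadata itself.
    """
    # Smallest possible layout: leading magic + 4-byte length + trailing magic.
    if os.path.getsize(path) < 12:
        raise ValueError("file too small to be a Parquet file")
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The footer length is a 4-byte little-endian unsigned integer.
    (length,) = struct.unpack("<I", tail[:4])
    return length
```

Because the schema travels inside the file in this way, a reader can discover the column names and types without you supplying a separate schema.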

 

Here is an example flow: 

img1.JPG

 

In this flow, I am reading my incoming .parquet files and passing them through a QueryRecord processor. The processor has been configured with a ParquetReader. I'm using the AvroRecordSetWriter for output, but you can also use a CSV, JSON, or XML record writer instead:

img2.JPG

 

QueryRecord is a special processor that lets you run SQL queries against your FlowFile; the output is a new FlowFile containing the result of the SQL query:

img3.JPG

 

The raw SQL code:

SELECT first_name, last_name, birth_date FROM FLOWFILE WHERE gender = 'M' AND birth_date LIKE '1965%'

The input in my Parquet file looks like this:

282390_edited.jpg

You can see it has rows for years other than 1965, both males and females, as well as other columns not listed in the SQL query.

 

After running it through my flow, I am left with the result of the SQL query, matching my search criteria (birth year = 1965 and only males), with the three columns selected (first_name, last_name, birth_date):

img5.JPG
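
To try the query's filter and projection outside NiFi, here is a minimal Python sketch that uses the standard-library sqlite3 module as a stand-in. QueryRecord is built on Apache Calcite rather than SQLite, so this only mirrors the logic of this particular query. The sample rows, the extra hire_date column, and the assumption that birth_date is stored as text are all invented for illustration.

```python
import sqlite3

# The same query as in the article; FLOWFILE is just a table name here.
QUERY = (
    "SELECT first_name, last_name, birth_date FROM FLOWFILE "
    "WHERE gender = 'M' AND birth_date LIKE '1965%'"
)

# Invented sample data: (first_name, last_name, birth_date, gender, hire_date).
SAMPLE_ROWS = [
    ("Alan", "Smith", "1965-03-14", "M", "1990-01-01"),
    ("Beth", "Jones", "1965-07-30", "F", "1991-05-12"),
    ("Carl", "Brown", "1958-11-02", "M", "1989-09-23"),
    ("Dave", "White", "1965-12-25", "M", "1993-02-17"),
]


def run_query(rows):
    """Load rows into an in-memory table named FLOWFILE and run QUERY on it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE FLOWFILE (first_name TEXT, last_name TEXT, "
            "birth_date TEXT, gender TEXT, hire_date TEXT)"
        )
        conn.executemany("INSERT INTO FLOWFILE VALUES (?, ?, ?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_query(SAMPLE_ROWS):
        print(row)
```

With these sample rows, only the two male rows with a 1965 birth date come back, each reduced to the three selected columns.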

Depending on your RecordWriter, you can format the output as JSON, CSV, XML or Avro, and carry on with further processing. 

Comments
Super Guru

Anyone wishing to work with these Parquet Readers in a previous version of NiFi should take a look at my post here:

 

https://community.cloudera.com/t5/Support-Questions/Can-I-put-the-NiFi-1-10-Parquet-Record-Reader-in...

Super Guru

@wengelbrecht do you know which version of Parquet this reader is supposed to support?

Contributor

@stevenmatison 

I'm not 100% sure, but looking at NiFi 1.11.0, I can see the following list of JAR files:

 

./nifi-parquet-nar-1.11.0.nar-unpacked/NAR-INF/bundled-dependencies/parquet-column-1.10.0.jar
./nifi-parquet-nar-1.11.0.nar-unpacked/NAR-INF/bundled-dependencies/parquet-format-2.4.0.jar
./nifi-parquet-nar-1.11.0.nar-unpacked/NAR-INF/bundled-dependencies/parquet-encoding-1.10.0.jar
./nifi-parquet-nar-1.11.0.nar-unpacked/NAR-INF/bundled-dependencies/parquet-common-1.10.0.jar
./nifi-parquet-nar-1.11.0.nar-unpacked/NAR-INF/bundled-dependencies/parquet-avro-1.10.0.jar
./nifi-parquet-nar-1.11.0.nar-unpacked/NAR-INF/bundled-dependencies/parquet-jackson-1.10.0.jar
./nifi-parquet-nar-1.11.0.nar-unpacked/NAR-INF/bundled-dependencies/parquet-hadoop-1.10.0.jar

 

Super Guru

@wengelbrecht thank you, that is exactly what I needed to see. I am having an issue with parquet-hadoop-1.10 and need to get a 1.12 version working in NiFi and Hive....