Created on 11-07-201909:16 AM - edited on 02-12-202006:14 AM by SumitraMenon
With the release of Apache NiFi 1.10 (http://nifi.apache.org/download.html), a new feature is available whereby you can configure a Parquet Record reader to read incoming Parquet formatted files.
Apache NiFi Record processing allows you to read and deal with a set of data as a single unit. By supplying a schema that matches the incoming data, you can perform record processing as if the data is a "table", and easily convert between formats (CSV, JSON, Avro, Parquet, XML). Before record processing was added to Apache NiFi, you would have had to split the file line by line, perform the transformation and then merge everything back together into a single output. This is extremely inefficient. With Record processing, we can do everything to a large unit of data as a single step which improves the speed by many factors.
Apache Parquet is a columnar storage format (https://parquet.apache.org/documentation/latest/), and the format includes the schema for the data that it stores. This is a great feature that we can use with Record processing in Apache NiFi to supply the schema automatically to the flow.
Here is an example flow:
In this flow, I am reading my incoming .parquet stored files, and passing that through my QueryRecord processor. The processor has been configured with a ParquetReader. I'm using the AvroRecordSetWriter for output, but you can use also CSV,JSON,XML record writer instead:
The QueryRecord is a special processor that allows you to run SQL queries against your Flowfile, where the output is a new Flowfile with the output of the SQL query:
The raw SQL code:
Select first_name, last_name, birth_date from FLOWFILE where gender = 'M' and birth_date like '1965%'
The input in my Parquet file looks like this:
You can see it has rows for years other than 1965, Males and Females, as well as other columns not listed in the SQL query.
Once running it through my flow, I am left with the result of the SQL query, matching my search criteria (birth year = 1965 and only Males), with the three columns selected (first_name, last_name, birth_year):
Depending on your RecordWriter, you can format the output as JSON, CSV, XML or Avro, and carry on with further processing.