I recently worked with a customer to demo HDF/NiFi DataFlows for their use cases. One use case involved converting delimited, row-formatted text files into the Parquet and ORC columnar formats. This can be done quickly with HDF/NiFi.
Here is an easy-to-follow DataFlow that converts row-formatted text files to Parquet and ORC. It may look straightforward; however, it requires some basic knowledge of Avro file formats and of the Kite API. This article explains the DataFlow in detail.
STEP 1: Create an Avro schema for the source data
My source data is tab-delimited and consists of 8 fields.
Notice in the DataFlow that, before converting the data to Parquet and ORC, the data is first converted to Avro. This is done so we have schema information for the data. Before converting the data to Avro, we need to create an Avro schema definition file for the source data. I stored mine in a file named sample.avsc.
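A minimal sketch of such a schema for 8 tab-delimited fields would look something like the following; the field names and types here are placeholders, and you should substitute the actual column names and types from your source data:

{
  "type": "record",
  "name": "Sample",
  "namespace": "com.example",
  "fields": [
    {"name": "field1", "type": "string"},
    {"name": "field2", "type": "string"},
    {"name": "field3", "type": "string"},
    {"name": "field4", "type": "long"},
    {"name": "field5", "type": "string"},
    {"name": "field6", "type": "string"},
    {"name": "field7", "type": "double"},
    {"name": "field8", "type": "string"}
  ]
}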
The DataFlow uses the ‘StoreInKiteDataset’ processor. Before we can use this processor to convert the Avro data to Parquet, we need a target dataset (a directory in HDFS along with its schema metadata) already created to store the data as Parquet. This is done by calling the Kite API.
Kite is an API for Hadoop that lets you easily define how your data is stored:
Works with file formats including CSV, JSON, Avro, and Parquet
Stores data in HDFS, in Hive tables, or on the local file system
Compresses data with Snappy (default), Deflate, Bzip2, and Lzo
Kite will handle how the data is stored. For example, if I wanted to store incoming CSV data into a Parquet formatted Hive table, I could use the Kite API to create a schema for my CSV data and then call the Kite API to create the Hive table for me. Kite also works with partitioned data and will automatically partition records when writing.
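A sketch of that CSV-to-Hive workflow with the Kite command-line tool might look like the following; the file names, dataset name, and the created_at partition field are all hypothetical and assume the source data has a matching column:

# infer an Avro schema from a sample of the CSV data
kite-dataset csv-schema users.csv --class User -o user.avsc

# define a partition strategy (assumes the schema has a created_at field)
kite-dataset partition-config created_at:year created_at:month --schema user.avsc -o partition.json

# create a Parquet-formatted, partitioned dataset; a bare dataset name
# like this one is created as a Hive table by default
kite-dataset create users --schema user.avsc --format parquet --partition-by partition.json

Once the dataset exists, records written to it are stored as Parquet and partitioned automatically according to the strategy.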
In this example I am writing the data into HDFS.
Call the Kite API to create a dataset directory in HDFS to store the Parquet data. The file sample.avsc contains my schema definition.
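A minimal sketch of that call with the Kite CLI, assuming a placeholder HDFS path of /tmp/sample_parquet:

# create an empty Parquet-formatted dataset in HDFS using the schema from step 1
./kite-dataset create dataset:hdfs:/tmp/sample_parquet --schema sample.avsc --format parquet

The ‘StoreInKiteDataset’ processor can then be pointed at the resulting dataset URI to write the Avro records out as Parquet.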