Created on 04-19-2016 09:40 PM - edited 08-17-2019 12:44 PM
Avro is a popular file format within the Big Data and streaming space. Avro has 3 important characteristics that make it a great fit for both Big Data and streaming applications.
Now that all of the pros of Avro have been called out there is a problem. In reality we rarely encounter Avro files while ingesting/streaming data. Why is this? Mostly because storing data in the Avro format requires defining the Avro schema up front. Defining the Avro schema up front requires some planning and knowledge of the Avro file format itself not everyone has. This is a tragedy given the many benefits Avro provides to us. This was the driving factor for me creating the “InferAvroSchema” processor within Apache NiFi. InferAvroSchema exists to help endusers who either don’t have the time or the knowledge to create Avro files. InferAvroSchema exists to overcome the initial creation complexity issues with Avro and allows Apache NiFi users to quickly take more common flat data files, like CSV, and transform them into Avro. So now that we have a little background lets get into the details about how we make this happen using Apache NiFi and InferAvroSchema. Before we start lets take a look at the end result to get the big picture.
Our end result is a workflow that takes loads a CSV file holding Weather specific data and converts it to an Avro file. The Weather.csv file is loaded using GetFile and then examined by the InferAvroSchema processor to determine the appropriate Avro schema. After the Avro schema is generated the ConvertCSVToAvro processor uses that schema to convert the CSV weather data to an Avro file. The resulting Avro file is ultimately written back to a local file on the NiFi instance machine.
While the CSV data certainly doesn’t have to come from a local file (could have been FTP, S3, HDFS, etc) that was the easiest to demonstrate here. There is a local CSV file on my Mac called “Weather.csv”. “Weather.csv” is loaded by the GetFile processor which places the complete contents of “Weather.csv” into a new NiFi FlowFile. Below is a snippet pf the contents of “Weather.csv” to provide context.
As you can see the CSV data contains a couple of different weather data points for a certain zip code. We want to take this data from its original CSV format and convert it to an Avro file. So where do we start? First we want to use the “InferAvroSchema” processor to help us make our Avro schema definition without having the manually define it ourselves since we are pretending we have no idea how to make Avro schemas by hand. InferAvroSchema can examine the contents of CSV or JSON data and provide for us a recommended Avro schema definition based on the data that it encounters in the incoming FlowFile content. InferAvroSchema provides a lot of flexibility via configurations so lets go through those and what they mean now.
The above screenshot shows the available properties for InferAvroSchema. At first look it can be a little daunting so lets breakdown what is happening here by stepping through each property.
At this point you will have a Avro schema that was automatically generated based on the raw incoming CSV data, congratulations! This doesn’t do us much good however. We still need to put that Avro schema to work and convert the original Weather.csv data into an Avro file. We do that in our next step with the ConvertCSVToAvro processor. The configuration for that processor is described below.
The properties for ConvertCSVToAvro are a little more straightfoward so we aren’t going to go through them one by one. I do want to point out the value for Record Schema however. If you will notice it has a value of ${inferred.avro.schema}. If you recall in the InferAvroSchema processor above we told it to write the resulting Avro Schema to the FlowFile attribute. So now we are able to access that value using the NiFi expression language here. The name of the FlowFile property will always be inferred.avro.schema.
At this point our Weather.csv data has successfully been converted to an Avro file and we can do whatever we desire with it. I chose to simply write the data back to another local file which will be named “Weather.csv.avro”. Here is a screenshot of that output.
Created on 04-24-2016 05:53 PM
This is a great article Jeremy, solving a real case many business face now. Is this project on public domain on git-hub ?
Thanks
John
Created on 07-19-2016 09:58 PM - edited 08-17-2019 12:43 PM
Great article. One thing I've found with the InferSchemaWithAvro and working with CSVs is that it only seems to infer the type based on the first line after the header.
The "Number Of Records To Analyze" property specifically states that it's only to be used with JSON content type. I'm having some difficulty right now thinking on a way to work around this. I've got a CSV column with "12345" on the first line and "12345.45" on the second line and it's inferring the type as LONG when I want it to be DOUBLE.
EDIT: It also doesn't work completely as expected with negative values. (STRING instead of numeric)
Created on 09-09-2016 08:08 AM
This is great article, I am also facing same issue in inferring the type for number column in CSV. It has DOUBLE value data but processor is inferring the type as LONG based on first row.
Any work around for this? except putting first row with double value in CSV. 🙂
Created on 04-13-2020 04:31 AM - edited 04-13-2020 04:32 AM
You have to part the content first as line by line utilizing Split Text Processor. Use regex to extricate values by utilizing Extract Text processor, it will results esteems as traits for the each stream record. Supplant content processor to supplant the traits as substance of the stream record LiteBlue.