Created on 06-29-2017 07:27 PM - edited 08-17-2019 12:12 PM
Apache NiFi has grown tremendously over the past 2 and a half years since it was open sourced. The community is continuously thinking of, implementing, and contributing amazing new features. The newly released version 1.2.0 of NiFi is no exception. One of the most exciting features of this new release is the QueryRecord Processor and the Record Reader and Record Writer components that go along with it. If you aren't familiar with those components, there's a blog post that explains how those work.
This new Processor, powered by Apache Calcite, allows users to write SQL SELECT statements to run over their data as it streams through the system. Each FlowFile in NiFi can be treated as if it were a database table named FLOWFILE. These SQL queries can be used to filter specific columns or fields from your data, rename those columns/fields, filter rows, perform calculations and aggregations on the data, route the data, or whatever else you may want to use SQL for. All from the comfy confines of the most widely known and used Domain Specific Language.
This is a big deal! Of course, there are already other platforms that allow you to run SQL over arbitrary data, though. So let's touch on how NiFi differs from other platforms that run SQL over arbitrary data outside of an RDBMS:
In order to get started, we need to add a QueryRecord Processor to the graph. Once we've added the Processor to the graph, we need to configure three things: the Controller Service to use for reading data, the service to use for writing the results, and one or more SQL queries. The first time that you set this all up, it may seem like there's a lot going on. But once you've done it a couple of times, it becomes pretty trivial. The rest of this post will be dedicated to setting everything up. First, we will configure the Record Reader Controller Service. We'll choose to create a new service:
We are given a handful of different types of services to choose from. For this example, we will use CSV data, so we will use a CSVReader:
We will come back and configure our new CSVReader service in a moment. For now, we will click the "Create" button and then choose to create a new Controller Service to write the results. We will write the results in JSON format:
We can again click the "Create" button to create the service. Now that we have created our Controller Services, we will click the "Go To" button on our reader service to jump to the Controller Service configuration page:
We can click the pencil in the right-hand column to configure our CSV Reader:
The CSV Reader gives us plenty of options to customize the reader to our format, as can be seen in the above image. For this example, we will leave most of the defaults, but we will change the "Skip Header Line" Property from the default value of "false" to "true" because our data will contain a header line that we don't want to process as an actual record.
In order to run SQL over our data and make sense of the columns in our data, we need to also configure the reader with a schema for the data. We have several options for doing this. We can use an attribute on the FlowFile that includes an Avro-formatted schema. Or we can use a Schema Registry to store our schema and access it by name or by identifier and version. Let's go ahead and use a local Schema Registry and add our schema to that Registry. To do so, we will change the "Schema Access Strategy" to "Use 'Schema Name' Property." This means that we want to look up the Schema to use by name. The name of the schema is specified by the "Schema Name" Property. The default value for that property is "${schema.name}", which means that we will just use the "schema.name" attribute to identify which schema we want to use. Instead of using an attribute, we could just type the name of the schema here. Doing so would mean that we would need a different CSV Reader for each schema that we want to read, though. By using an attribute, we can have a single CSV Reader that works for all schemas.
Next, we need to specify the Schema Registry to use. We click the value of the "Schema Registry" property and choose to "Create new service..." We will use the AvroSchemaRegistry. (Note that our incoming data is in CSV format and the output will be in JSON. That's okay, Avro in this sense only refers to the format of the schema provided. So we will provide a schema in the same way that we would if we were using Avro.) We will click the "Create" button and then click the "Go To" arrow that appears in the right-hand column in order to jump to the Schema Registry service (and click 'Yes' to save changes to our CSV Reader service).
This will take us back to our Controller Services configuration screen. It is important to note that from this screen, each Controller Service has a "Usage" icon on the left-hand side (it looks like a little book). That icon will take you to the documentation on how to usage that specific Controller Service. The documentation is fairly extensive. Under the "Description" heading, each of the Record Readers and Writers has an "Additional Details..." link that provides much more detailed information about how to use service and provides examples.
We will click the Edit ("pencil") icon next to the newly created AvroSchemaRegistry and go to the Properties tab. Notice that the service has no properties, so we click the New Property ("+") icon in the top-right corner. The name of the property is the name that we will use to refer to the schema. Let's call it "hello-world". For the value, we can just type or paste in the schema that we want to use using the Avro Schema syntax. For this example, we will use the following schema:
{ "name": "helloWorld", "namespace": "org.apache.nifi.blogs", "type": "record", "fields": [ { "name": "purchase_no", "type": "long" }, { "name": "customer_id", "type": "long" }, { "name": "item_id", "type": ["null", "long"] }, { "name": "item_name", "type": ["null", "string"] }, { "name": "price", "type": ["null", "double"] }, { "name": "quantity", "type": ["null", "int"] }, { "name": "total_price", "type": ["null", "double"] } ] }
Now we can click "OK" and apply our changes. Clicking Enable (the lightning bolt icon) enables the service. We can now also enable our CSV Reader.
Similarly, we need to configure our writer with a schema so that NiFi knows how we expect our data to look. If we click the Pencil icon next to our JSONRecordSetWriter, in the Properties tab, we can configure whether we want our JSON data to be pretty-printed or not and how we want to write out date and time fields. We also need to specify how to access the schema and how to convey the schema to down-stream processing. For the "Schema Write Strategy," since we are using JSON, we will just set the "schema.name" attribute, so we will leave the default value. If we were writing in Avro, for example, we would probably want to include the schema in the data itself. For the "Schema Access Strategy," we will use the "Schema Name" property, and set the "Schema Name" property to "${schema.name}" just as we did with the CSV Reader. We then select the same AvroSchemaRegistry service for the "Schema Registry" property.
Again, we click "Apply" and then click the Lightning icon to enable our Controller Service and click the Enable button. We can then click the "X" to close out this dialog.
Now comes the fun part! We can go back to configure our QueryRecord Processor. In the Properties tab, we can start writing our queries. For this example, let's take the following CSV data:
purchase_no, customer_id, item_id, item_name, price, quantity 10280, 40070, 1028, Box of pencils, 6.99, 2 10280, 40070, 4402, Stapler, 12.99, 1 12440, 28302, 1029, Box of ink pens, 8.99, 1 28340, 41028, 1028, Box of pencils, 6.99, 18 28340, 41028, 1029, Box of ink pens, 8.99, 18 28340, 41028, 2038, Printer paper, 14.99, 10 28340, 41028, 4018, Clear tape, 2.99, 10 28340, 41028, 3329, Tape dispenser, 14.99, 10 28340, 41028, 5192, Envelopes, 4.99, 45 28340, 41028, 3203, Laptop Computer, 978.88, 2 28340, 41028, 2937, 24\" Monitor, 329.98, 2 49102, 47208, 3204, Powerful Laptop Computer, 1680.99, 1
In our Properties tab, we can click the "Add Property" button to add a new property. Because we can add multiple SQL queries in a single Processor, we need a way to distinguish the results of each query and route the data appropriately. As such, the name of the property is the name of the Relationship that data matching the query should be routed to. We will create two queries. The first will be named "over.1000" and will include the purchase_no and customer_id fields of any purchase that cost more than $1,000.00 and will also include a new field named total_price that is the dollar amount for the entire purchase. Note that when entering a value for a property in NiFi, you can use Shift + Enter to insert a newline in your value:
SELECT purchase_no, customer_id, SUM(price * quantity) AS total_price FROM FLOWFILE GROUP BY purchase_no, customer_id HAVING SUM(price * quantity) > 1000
The second property will be named "largest.order" and will contain the purchase_no, customer_id, and total price of the most expensive single purchase (as defined by price times quantity) in the data:
SELECT purchase_no, customer_id, SUM(price * quantity) AS total_price FROM FLOWFILE GROUP BY purchase_no, customer_id ORDER BY total_price DESC LIMIT 1
Now we will wire our QueryRecord processor up in our graph so that we can use it. For this demo, we will simply use a GenerateFlowFile to feed it data. We will set the "Custom Text" property to the CSV that we have shown above. In the "Scheduling" tab, I'll configure the processor to run once every 10 seconds so that I don't flood the system with data. We need to add a "schema.name" attribute, so we will route the "success" relationship of GenerateFlowFile to an UpdateAttribute processor. To this processor, we will add a new Property named "schema.name" with a value of "hello-world" (to match the name of the schema that we added to the AvroSchemaRegistry service). We will route the "success" relationship to QueryRecord.
Next, we will create two UpdateAttribute Processors and connect the "over.1000" relationship to the first and the "largest.order" relationship to the other. This just gives us a simple place to hold the data so that we can view it. I will loop the "failure" relationship back to the QueryRecord processor so that if there is a problem evaluating the SQL, the data will remain in my flow. I'll also auto-terminate the "original" relationship because once I have evaluated the SQL, I don't care about the original data anymore. I'll also auto-terminate the "success" relationship of each terminal UpdateAttribute processor. When we start the QueryRecord, GenerateFlowFile, and the first UpdateAttribute processors, we end up with a FlowFile queued up before each UpdateAttribute processor:
If we right-click on the "over.1000" connection and click "List queue," we are able to see the FlowFiles in the queue:
Clicking the "information" icon on the left-hand-side of the FlowFile gives us the ability to view the FlowFile's attributes and download or view its contents. We can click the "View" button to view the content. Changing the "View as" drop-down at the top from "original" to "formatted" gives us a pretty-printed version of the JSON, and we quickly see the results that we want:
Note that we have a null value for the columns that are defined in our schema that were not part of our results. If we wanted to, We could certainly update our schema in order to avoid having these fields show up at all.
Viewing the FlowFiles of the "largest.order" connection shows us a single FlowFile also, with the content that we expect there as well:
Of course, if we have already run the data through the flow and want to go back and inspect that data and what happened to it, we can easily do that through NiFi's powerful Data Provenance feature.