Member since
09-29-2015
58
Posts
76
Kudos Received
17
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
2035 | 01-25-2017 09:19 PM | |
2892 | 11-02-2016 02:54 PM | |
2972 | 09-08-2016 01:36 AM | |
5052 | 08-09-2016 07:52 PM | |
1336 | 06-30-2016 06:09 PM |
12-21-2017
09:06 PM
3 Kudos
Apache NiFi provides a great depth of information about each component on the canvas. Stats are collected for the number of FlowFiles in, the number of FlowFiles out, how many bytes were read or written, how long it took, etc. However, these stats that are displayed here are all "stock metrics." These metrics are the same for each processor. There are times when we may want to expose other metrics for a custom processor. One way that we can do this in NiFi is through the use of counters. The counters are a lesser-known feature of NiFi, and until recently have been pretty limited in their usefulness. This is due to the fact that upon startup, NiFi sets all counters to 0 and as the system runs, these counters are monotonically increasing. They count how many times something has happened since NiFi started. To see these counters, we can go to the drop-down menu in the top-right corner of the NiFi UI and click "Counters." We then are provided with a list of counters that have been registered: There is a button on the right-hand side that allows us to reset a counter, but when given the number 4,172,721 we have no idea how long it took to process that many records. Has it taken 1 second or 1 month? We also don't know whether the records were added all at once or evenly over a long period of time. In NiFi 1.4.0, though, we introduced a new feature that adds these counters to the Status History. So now, when we right-click on a processor and choose "View status history", the drop-down list of metrics that we can view now contains any counters that have been registered by the processor. For example, the UpdateRecord processor that I have running on my laptop shows the following metrics: So we can see that after starting, the number of Records Processed (per 5 minutes) quickly ramped up to about 58 million, then steadied out at a little over 50 million records per 5 minutes (or 167,000 records per second). Note here that the value that is charted is not the "total value" of the counter, as would be displayed in the Counters menu. Rather, this is charting how the counter changed over time, or how much the counter increased in a 5-minute window. This feature gives us a nice way to see how our processor is performing over time. All of the record-oriented processors (such as PartitionRecord, UpdateRecord, SplitRecord, QueryRecord, etc.) emit these counters. This is important for those of us who are more interested in the number of Events per Second that we're processing than in the number of gigabytes per second. However, we can easily count anything we want. DetectDuplicate, for instance, counts the number of FlowFiles that it has routed to "duplicate" and the number of FlowFiles that it has routed to "non-duplicate" giving us a feel not just for the processor's performance but also for how the data is changing over time. We can answer questions such as "Are we seeing more duplicates now, or fewer?" Hopefully this will help to give you some more insight into how your flow is working and how to emit your own custom metrics. Cheers!
... View more
Labels:
09-14-2017
02:11 PM
1 Kudo
@Mitthu Wagh the 5-minute window is not currently configurable. However, NiFi does take a snapshot of these values (and several others) every 1 minute (by default) and when you right-click on a Processor, you can go to Status History. This will graph that value over time. By default it will grab those 5-minute statistics every minute for 24 hours. This can be changed by updating the "nifi.components.status.repository.buffer.size" and "nifi.components.status.snapshot.frequency" properties in nifi.properties. You could also get these values via a Reporting Task if you wanted to push them somewhere else such as Grafana for longer-term storage.
... View more
06-29-2017
07:27 PM
4 Kudos
Apache NiFi has grown tremendously over the past 2 and a half years since it was open sourced. The community is continuously thinking of, implementing, and contributing amazing new features. The newly released version 1.2.0 of NiFi is no exception. One of the most exciting features of this new release is the QueryRecord Processor and the Record Reader and Record Writer components that go along with it. If you aren't familiar with those components, there's a blog post that explains how those work. This new Processor, powered by Apache Calcite, allows users to write SQL SELECT statements to run over their data as it streams through the system. Each FlowFile in NiFi can be treated as if it were a database table named FLOWFILE. These SQL queries can be used to filter specific columns or fields from your data, rename those columns/fields, filter rows, perform calculations and aggregations on the data, route the data, or whatever else you may want to use SQL for. All from the comfy confines of the most widely known and used Domain Specific Language. This is a big deal! Of course, there are already other platforms that allow you to run SQL over arbitrary data, though. So let's touch on how NiFi differs from other platforms that run SQL over arbitrary data outside of an RDBMS:
Queries are run locally. There is no need to push the data to some external service, such as S3, in order to run queries over your data. There's also no need to pay for that cloud storage or the bandwidth, or to create temporary "staging tables" in a database. Queries are run inline. Since your data is already streaming through NiFi, it is very convenient to add a new QueryRecord Processor to your canvas. You are already streaming your data through NiFi, aren't you? If not, you may find Our Docs Page helpful to get started learning more about NiFi. Query data in any format. Write results in any data format. One of the goals of this effort was to allow data to be queried regardless of the format. In order to accomplish this, the Processor was designed to be configured with a "Record Reader" Controller Service and a "Record Writer" Controller Service. Out of the box, there are readers for CSV, JSON, Avro, and even log data. The results of the query can be written out in CSV, JSON, Avro, or free-form text (for example, a log format) using the NiFi Expression Language. If your data is in another format, you are free to write your own implementation of the Record Reader and/or Record Writer Controller Service. A simple implementation can wrap an existing library to return Record objects from an InputStream. That's all that is needed to run SQL over your own custom data format! Writing the results in your own custom data format is similarly easy. It's fast - very fast! Just how fast largely depends on the performance of your disks and the number of disks that you have available. The data must be read from disk, queried, and then the results written to disk. In most scenarios, reading of the data from disk is actually avoided due to operating system disk caching, though. We've tested the performance on a server with 32 cores and 64 GB RAM. We ran a continuous stream of JSON-formatted Provenance data through the Processor. To do this, we used theNiFi Provenance Reporting Task to send the Provenance data back to the NiFi instance. Because we wanted to stress the Processor, we then used a DuplicateFlowFile processor to create 200 copies of the data (this really just creates 200 pointers to the same piece of data; it doesn't copy the data itself). We used a SQL query to pull out any Provenance "SEND" event (a small percentage of the data, in this case). Using 12 concurrent tasks, we were able to see the query running at a consistent rate of about 1.2 GB per second on a single node - using less than half of the available CPU! Data Provenance keeps a detailed trail of what happened. One of the biggest differentiators between NiFi and other dataflow platforms is the detailed Data Provenance that NiFi records. If the data that you end up with is not what you expect, the Data Provenance feature makes it easy to see exactly what the data looked like at each point in the flow and pinpoint exactly what when wrong - as well as understand where the data come from and where the data went. When the flow has been updated, the data can be replayed with the click of a button and the new results can be verified. If the results are still not right, update your flow and replay again. HOW IT WORKS In order to get started, we need to add a QueryRecord Processor to the graph. Once we've added the Processor to the graph, we need to configure three things: the Controller Service to use for reading data, the service to use for writing the results, and one or more SQL queries. The first time that you set this all up, it may seem like there's a lot going on. But once you've done it a couple of times, it becomes pretty trivial. The rest of this post will be dedicated to setting everything up. First, we will configure the Record Reader Controller Service. We'll choose to create a new service: We are given a handful of different types of services to choose from. For this example, we will use CSV data, so we will use a CSVReader: We will come back and configure our new CSVReader service in a moment. For now, we will click the "Create" button and then choose to create a new Controller Service to write the results. We will write the results in JSON format: We can again click the "Create" button to create the service. Now that we have created our Controller Services, we will click the "Go To" button on our reader service to jump to the Controller Service configuration page: CONFIGURING THE READER We can click the pencil in the right-hand column to configure our CSV Reader: The CSV Reader gives us plenty of options to customize the reader to our format, as can be seen in the above image. For this example, we will leave most of the defaults, but we will change the "Skip Header Line" Property from the default value of "false" to "true" because our data will contain a header line that we don't want to process as an actual record. In order to run SQL over our data and make sense of the columns in our data, we need to also configure the reader with a schema for the data. We have several options for doing this. We can use an attribute on the FlowFile that includes an Avro-formatted schema. Or we can use a Schema Registry to store our schema and access it by name or by identifier and version. Let's go ahead and use a local Schema Registry and add our schema to that Registry. To do so, we will change the "Schema Access Strategy" to "Use 'Schema Name' Property." This means that we want to look up the Schema to use by name. The name of the schema is specified by the "Schema Name" Property. The default value for that property is "${schema.name}", which means that we will just use the "schema.name" attribute to identify which schema we want to use. Instead of using an attribute, we could just type the name of the schema here. Doing so would mean that we would need a different CSV Reader for each schema that we want to read, though. By using an attribute, we can have a single CSV Reader that works for all schemas. Next, we need to specify the Schema Registry to use. We click the value of the "Schema Registry" property and choose to "Create new service..." We will use the AvroSchemaRegistry. (Note that our incoming data is in CSV format and the output will be in JSON. That's okay, Avro in this sense only refers to the format of the schema provided. So we will provide a schema in the same way that we would if we were using Avro.) We will click the "Create" button and then click the "Go To" arrow that appears in the right-hand column in order to jump to the Schema Registry service (and click 'Yes' to save changes to our CSV Reader service). This will take us back to our Controller Services configuration screen. It is important to note that from this screen, each Controller Service has a "Usage" icon on the left-hand side (it looks like a little book). That icon will take you to the documentation on how to usage that specific Controller Service. The documentation is fairly extensive. Under the "Description" heading, each of the Record Readers and Writers has an "Additional Details..." link that provides much more detailed information about how to use service and provides examples. We will click the Edit ("pencil") icon next to the newly created AvroSchemaRegistry and go to the Properties tab. Notice that the service has no properties, so we click the New Property ("+") icon in the top-right corner. The name of the property is the name that we will use to refer to the schema. Let's call it "hello-world". For the value, we can just type or paste in the schema that we want to use using the Avro Schema syntax. For this example, we will use the following schema: {
"name": "helloWorld",
"namespace": "org.apache.nifi.blogs",
"type": "record",
"fields": [
{ "name": "purchase_no", "type": "long" },
{ "name": "customer_id", "type": "long" },
{ "name": "item_id", "type": ["null", "long"] },
{ "name": "item_name", "type": ["null", "string"] },
{ "name": "price", "type": ["null", "double"] },
{ "name": "quantity", "type": ["null", "int"] },
{ "name": "total_price", "type": ["null", "double"] }
]
}
Now we can click "OK" and apply our changes. Clicking Enable (the lightning bolt icon) enables the service. We can now also enable our CSV Reader. CONFIGURING THE WRITER
Similarly, we need to configure our writer with a schema so that NiFi knows how we expect our data to look. If we click the Pencil icon next to our JSONRecordSetWriter, in the Properties tab, we can configure whether we want our JSON data to be pretty-printed or not and how we want to write out date and time fields. We also need to specify how to access the schema and how to convey the schema to down-stream processing. For the "Schema Write Strategy," since we are using JSON, we will just set the "schema.name" attribute, so we will leave the default value. If we were writing in Avro, for example, we would probably want to include the schema in the data itself. For the "Schema Access Strategy," we will use the "Schema Name" property, and set the "Schema Name" property to "${schema.name}" just as we did with the CSV Reader. We then select the same AvroSchemaRegistry service for the "Schema Registry" property. Again, we click "Apply" and then click the Lightning icon to enable our Controller Service and click the Enable button. We can then click the "X" to close out this dialog. WRITE THE SQL Now comes the fun part! We can go back to configure our QueryRecord Processor. In the Properties tab, we can start writing our queries. For this example, let's take the following CSV data: purchase_no, customer_id, item_id, item_name, price, quantity
10280, 40070, 1028, Box of pencils, 6.99, 2
10280, 40070, 4402, Stapler, 12.99, 1
12440, 28302, 1029, Box of ink pens, 8.99, 1
28340, 41028, 1028, Box of pencils, 6.99, 18
28340, 41028, 1029, Box of ink pens, 8.99, 18
28340, 41028, 2038, Printer paper, 14.99, 10
28340, 41028, 4018, Clear tape, 2.99, 10
28340, 41028, 3329, Tape dispenser, 14.99, 10
28340, 41028, 5192, Envelopes, 4.99, 45
28340, 41028, 3203, Laptop Computer, 978.88, 2
28340, 41028, 2937, 24\" Monitor, 329.98, 2
49102, 47208, 3204, Powerful Laptop Computer, 1680.99, 1
In our Properties tab, we can click the "Add Property" button to add a new property. Because we can add multiple SQL queries in a single Processor, we need a way to distinguish the results of each query and route the data appropriately. As such, the name of the property is the name of the Relationship that data matching the query should be routed to. We will create two queries. The first will be named "over.1000" and will include the purchase_no and customer_id fields of any purchase that cost more than $1,000.00 and will also include a new field named total_price that is the dollar amount for the entire purchase. Note that when entering a value for a property in NiFi, you can use Shift + Enter to insert a newline in your value: SELECT purchase_no, customer_id, SUM(price * quantity) AS total_price
FROM FLOWFILE
GROUP BY purchase_no, customer_id
HAVING SUM(price * quantity) > 1000
The second property will be named "largest.order" and will contain the purchase_no, customer_id, and total price of the most expensive single purchase (as defined by price times quantity) in the data: SELECT purchase_no, customer_id, SUM(price * quantity) AS total_price
FROM FLOWFILE
GROUP BY purchase_no, customer_id
ORDER BY total_price DESC
LIMIT 1
Now we will wire our QueryRecord processor up in our graph so that we can use it. For this demo, we will simply use a GenerateFlowFile to feed it data. We will set the "Custom Text" property to the CSV that we have shown above. In the "Scheduling" tab, I'll configure the processor to run once every 10 seconds so that I don't flood the system with data. We need to add a "schema.name" attribute, so we will route the "success" relationship of GenerateFlowFile to an UpdateAttribute processor. To this processor, we will add a new Property named "schema.name" with a value of "hello-world" (to match the name of the schema that we added to the AvroSchemaRegistry service). We will route the "success" relationship to QueryRecord.
Next, we will create two UpdateAttribute Processors and connect the "over.1000" relationship to the first and the "largest.order" relationship to the other. This just gives us a simple place to hold the data so that we can view it. I will loop the "failure" relationship back to the QueryRecord processor so that if there is a problem evaluating the SQL, the data will remain in my flow. I'll also auto-terminate the "original" relationship because once I have evaluated the SQL, I don't care about the original data anymore. I'll also auto-terminate the "success" relationship of each terminal UpdateAttribute processor. When we start the QueryRecord, GenerateFlowFile, and the first UpdateAttribute processors, we end up with a FlowFile queued up before each UpdateAttribute processor: If we right-click on the "over.1000" connection and click "List queue," we are able to see the FlowFiles in the queue: Clicking the "information" icon on the left-hand-side of the FlowFile gives us the ability to view the FlowFile's attributes and download or view its contents. We can click the "View" button to view the content. Changing the "View as" drop-down at the top from "original" to "formatted" gives us a pretty-printed version of the JSON, and we quickly see the results that we want: Note that we have a null value for the columns that are defined in our schema that were not part of our results. If we wanted to, We could certainly update our schema in order to avoid having these fields show up at all. Viewing the FlowFiles of the "largest.order" connection shows us a single FlowFile also, with the content that we expect there as well: Of course, if we have already run the data through the flow and want to go back and inspect that data and what happened to it, we can easily do that through NiFi's powerful Data Provenance feature.
... View more
Labels:
06-13-2017
03:25 PM
@Oleksandr Solomko have you changed the default value of the "nifi.queue.swap.threshold" property in nifi.properties? If so, you may be running into NIFI-3897.
... View more
06-06-2017
03:28 PM
5 Kudos
Note that this is a re-post of the original article that was posted at https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi. INTRO - THE WHAT Apache NiFi is being used by many companies and organizations to power their data distribution needs. One of NiFi's strengths is that the framework is data agnostic. It doesn't care what type of data you are processing. There are processors for handling JSON, XML, CSV, Avro, images and video, and several other formats. There are also several general-purposes processors, such as RouteText and CompressContent. Data can be hundreds of bytes or many gigabytes. This makes NiFi a powerful tool for pulling data from external sources; routing, transforming, and aggregating it; and finally delivering it to its final destinations. Note that this While this ability to handle any arbitrary data is incredibly powerful, we often see users working with record-oriented data. That is, large volumes of small "records," or "messages," or "events." Everyone has their own way to talk about this data, but we all mean the same thing. Lots of really small, often structured, pieces of information. This data comes in many formats. Most commonly, we see CSV, JSON, Avro, and log data. While there are many tasks that NiFi makes easy, there are some common tasks that we can do better with. So in version 1.2.0 of NiFi, we released a new set of Processors and Controller Services, for working with record-oriented data. The new Processors are configured with a Record Reader and a Record Writer Controller Service. There are readers for JSON, CSV, Avro, and log data. There are writers for JSON, CSV, and Avro, as well as a writer that allows users to enter free-form text. This can be used to write data out in a log format, like it was read in, or any other custom textual format. THE WHY These new processors make building flows to handle this data simpler. It also means that we can build processors that accept any data format without having to worry about the parsing and serialization logic. Another big advantage of this approach is that we are able to keep FlowFiles larger, each consisting of multiple records, which results in far better performance. THE HOW - EXPLANATION In order to make sense of the data, Record Readers and Writers need to know the schema that is associated with the data. Some Readers (for example, the Avro Reader) allow the schema to be read from the data itself. The schema can also be included as a FlowFile attribute. Most of the time, though, it will be looked up by name from a Schema Registry. In this version of NiFi, two Schema Registry implementations exist: an Avro-based Schema Registry service and a client for an external Hortonworks Schema Registry. Configuring all of this can feel a bit daunting at first. But once you've done it once or twice, it becomes rather quick and easy to configure. Here, we will walk through how to set up a local Schema Registry service, configure a Record Reader and a Record Writer, and then start putting to use some very powerful processors. For this post, we will keep the flow simple. We'll ingest some CSV data from a file and then use the Record Readers and Writers to transform the data into JSON, which we will then write to a new directory. THE HOW - TUTORIAL In order to start reading and writing the data, as mentioned above, we are going to need a Record Reader service. We want to parse CSV data and turn that into JSON data. So to do this, we will need a CSV Reader and a JSON Writer. So we will start by clicking the "Settings" icon on our Operate palette. This will allow us to start configuring our Controller Services. When we click the 'Add' button in the top-right corner, we see a lot of options for Controller Services: Since we know that we want to read CSV data, we can type "CSV" into our Filter box in the top-right corner of the dialog, and this will narrow down our choices pretty well: We will choose to add the CSVReader service and then configure the Controller Service. In the Properties tab, we have a lot of different properties that we can set: Fortunately, most of these properties have default values that are good for most cases, but you can choose which delimiter character you want to use, if it's not a comma. You can choose whether or not to skip the first line, treating it as a header, etc. For my case, I will set the "Skip Header Line" property to "true" because my data contains a header line that I don't want to process as a record. The first properties are very important though. The "Schema Access Strategy" is used to instruct the reader on how to obtain the schema. By default, it is set to "Use String Fields From Header." Since we are also going to be writing the data, though, we will have to configure a schema anyway. So for this demo, we will change this strategy to "Use 'Schema Name' Property." This means that we are going to lookup the schema from a Schema Registry. As a result, we are now going to have to create our Schema Registry. If we click on the "Schema Registry" property, we can choose to "Create new service...": We will choose to create an AvroSchemaRegistry. It is important to note here that we are reading CSV data and writing JSON data - so why are we using an Avro Schema Registry? Because this Schema Registry allows us to convey the schema using the Apache Avro Schema format, but it does not imply anything about the format of the data being read. The Avro format is used because it is already a well-known way of storing data schemas. Once we've added our Avro Schema Registry, we can configure it and see in the Properties tab that it has no properties at all. We can add a schema by adding a new user-defined property (by clicking the 'Add' / 'Plus' button in the top-right corner). We will give our schema the name "demo-schema" by using this as the name of the property. We can then type or paste in our schema. For those unfamiliar with Avro schemas, it is a JSON formatted representation that has a syntax like the following: {
"name": "recordFormatName",
"namespace": "nifi.examples",
"type": "record",
"fields": [
{ "name": "id", "type": "int" },
{ "name": "firstName", "type": "string" },
{ "name": "lastName", "type": "string" },
{ "name": "email", "type": "string" },
{ "name": "gender", "type": "string" }
]
}
Here, we have a simple schema that is of type "record." This is typically the case, as we want multiple fields. We then specify all of the fields that we have. There is a field named "id" of type "int" and all other fields are of type "string." See the Avro Schema documentation for more information. We've now configured our schema! We can enable our Controller Services now. We can now add our JsonRecordSetWriter controller service as well. When we configure this service, we see some familiar options for indicating how to determine the schema. For the "Schema Access Strategy," we will again use the "Use 'Schema Name' Property," which is the default. Also note that the default value for the "Schema Name" property uses the Expression Language to reference an attribute named "schema.name". This provides a very nice flexibility, because now we can re-use our Record Readers and Writers and simply convey the schema by using an UpdateAttribute processor to specify the schema name. No need to keep creating Record Readers and Writers. We will set the "Schema Registry" property to the AvroSchemaRegistry that we just created and configured. Because this is a Record Writer instead of a Record Reader, we also have another interesting property: "Schema Write Strategy." Now that we have configured how to determine the data's schema, we need to tell the writer how to convey that schema to the next consumer of the data. The default option is to add the name of the schema as an attribute. We will accept the default. But we could also write the entire schema as a FlowFile attribute or use some strategies that are useful for interacting with the Hortonworks Schema Registry.
Now that we've configured everything, we can apply the settings and start our JsonRecordSetWriter as well. We've now got all of our Controller Services setup and enabled: Now for the fun part of building our flow! The above will probably take about 5 minutes, but it makes laying out the flow super easy! For our demo, we will have a GetFile processor bring data into our flow. We will use UpdateAttribute to add a "schema.name" attribute of "demo-schema" because that is the name of the schema that we configured in our Schema Registry: We will then use the ConvertRecord processor to convert the data into JSON. Finally, we want to write the data out using PutFile: We still need to configure our ConvertRecord processor, though. To do so, all that we need to configure in the Properties tab is the Record Reader and Writer that we have already configured: Now, starting the processors, we can see the data flowing through our system! Also, now that we have defined these readers and writers and the schema, we can easily create a JSON Reader, and an Avro Writer, for example. And adding additional processors to split the data up, query and route the data becomes very simple because we've already done the "hard" part. CONCLUSION In version 1.2.0 of Apache NiFi, we introduced a handful of new Controller Services and Processors that will make managing dataflows that process record-oriented data much easier. I fully expect that the next release of Apache NiFi will have several additional processors that build on this. Here, we have only scratched the surface of the power that this provides to us. We can delve more into how we are able to transform and route the data, split the data up, and migrate schemas. In the meantime, each of the Controller Services for reading and writing records and most of the new processors that take advantage of these services have fairly extensive documentation and examples. To see that information, you can right-click on a processor and click "Usage" or click the "Usage" icon on the left-hand side of your controller service in the configuration dialog. From there, clicking the "Additional Details..." link in the component usage will provide pretty significant documentation. For example, the JsonTreeReader provides a wealth of knowledge in its Additional Details.
... View more
Labels:
04-21-2017
03:57 PM
@John T What are your settings in the Scheduling tab of the processor? I.e., if you right-click the processor and go to Configure, then click the Scheduling tab, is the Scheduling Strategy set to "Timer Driven"? What is the "Run Schedule" set to?
... View more
01-25-2017
09:19 PM
1 Kudo
@Adda Fuentes it is simply an indication that a specific node seems to be responding more slowly than others. If you keep seeing that warning for a specific node, then it may be worth looking into why that specific node is taking longer than the others. If you are seeing good performance and the UI is responding within a reasonable amount of time, you can ignore the warning. However, since requests are replicated across the cluster, if any of the nodes starts to behave poorly, the responsiveness of the UI can suffer. This warning is intended to point you in the right direction to find out why the application responsiveness is degrading.
... View more
01-13-2017
10:06 PM
That looks like there is no process running with that PID. So NiFi did indeed stop. It's very strange that the logs show nothing of that nature, though...
... View more
01-13-2017
09:20 PM
@Ranjit S If you do a cat run/nifi.pid does it contain a valid PID? I.e., if you do "ps -ef | grep <pid>" does that show a process that is currently running?
... View more
01-13-2017
08:47 PM
I agree - that would be useful. Any chance that there are permissions issues at play here? The user running NiFi will need write permissions to the $NIFI_HOME directory and all subdirectories.
... View more