Member since
01-07-2019
220
Posts
23
Kudos Received
30
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 12010 | 08-19-2021 05:45 AM |
| | 3155 | 08-04-2021 05:59 AM |
| | 1543 | 07-22-2021 08:09 AM |
| | 6043 | 07-22-2021 08:01 AM |
| | 5496 | 07-22-2021 07:32 AM |
01-26-2020
04:11 AM
Here are the standard steps for debugging scripts that fail in NiFi:
1. Make sure the script works in general (run it standalone, outside NiFi).
2. Make sure your test runs on the same machine, with the same user and permissions.
Usually this suffices; if it still fails after that, we can dig deeper into this kind of problem.
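Step 1 is easiest when the script's core logic is kept separate from NiFi objects, so it can be run and tested standalone. A minimal sketch (the `transform` function and its behavior are purely illustrative, not from the original question):

```python
# Hypothetical transform logic, kept free of NiFi session/flowfile objects
# so it can be tested outside NiFi first (step 1 above).
def transform(text):
    """Stand-in for whatever the script does to the flowfile content."""
    return text.upper()

# Verify the logic works in general before wiring it into ExecuteScript.
assert transform("hello") == "HELLO"
```

Only once this passes would you move to step 2 and test in the NiFi environment itself.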
01-26-2020
04:06 AM
1 Kudo
Though many of these kinds of fields allow regular expressions, the documentation for this one does not mention it. I would try a regex, but it probably will not work because the field expects comma-separated input. From here you would need to get creative. First of all, you could definitely add a filter based on a regex afterwards (in RouteText). If that does not perform sufficiently, you could perhaps try matching ` BT ` (a space, then BT, then a space), but this is obviously a shortcut that will miss some results. (Also, I did not test it.) In this particular case, BT preceded or followed by a space may capture most cases. --- So far the regular recommendations; if you really want to go all the way, you could probably create your own GetTwitter, but I would start with the two-step filter and see if it works for you.
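To make the two-step idea concrete, here is the RouteText-style regex stage simulated in plain Python. The sample tweets are invented for illustration; the pattern is the ` BT ` shortcut from the post, including its known gap:

```python
import re

# The " BT " shortcut: BT surrounded by whitespace on both sides.
pattern = re.compile(r"\sBT\s")

tweets = [
    "the BT broadband outage",  # matches: " BT " present
    "Great BTS concert",        # no match: BT not followed by a space
    "switching to BT",          # missed: BT at end of line (the shortcut's known gap)
]
matched = [t for t in tweets if pattern.search(t)]
# matched == ["the BT broadband outage"]
```

A second pattern allowing BT at the start or end of the line would recover some of the missed cases, at the cost of more false positives.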
01-20-2020
03:25 PM
It seems that you are looking to simulate batch processing in NiFi. In general this is not a good idea; consider simply processing the files as they come in. If that is not possible, perhaps you want to do some scheduling based on a trigger. Tools like Oozie are made for this; they can trigger Spark, for instance (possibly even NiFi). If that doesn't work and you want a pure NiFi solution, it might be possible to set up a waiting processor that eventually gives up, but this will be more of a hack than a solution.
01-20-2020
10:08 AM
The UpdateAttribute processor is for updating an individual flowfile. It seems like you want a more global count. This kind of counting over time is usually called a window operation, and it is not something NiFi is really designed for; in general you would use something like Spark for complex operations. In your specific case you could try something 'ugly', like letting NiFi execute a SQL statement that increments a field by 1 each time. However, this obviously will not scale, and I am not sure a correct outcome is guaranteed if two updates are initiated in parallel (that may be more of a DB question).
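A minimal sketch of the 'ugly' approach, using SQLite to stand in for the database (table and column names are illustrative, not from the original question):

```python
import sqlite3

# Illustrative counter table; in NiFi each flowfile would fire one UPDATE
# (e.g. via a PutSQL processor) against a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('flowfiles', 0)")

# A single "value = value + 1" UPDATE is atomic as one statement, which is
# why the increment is done in SQL rather than read-then-write in the flow.
for _ in range(3):
    conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'flowfiles'")
conn.commit()

count = conn.execute(
    "SELECT value FROM counters WHERE name = 'flowfiles'"
).fetchone()[0]
```

Whether parallel updates stay correct under load depends on the target database's isolation and locking behavior, which is the DB question mentioned above.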
01-20-2020
10:01 AM
The first thing that comes to mind is the ExtractText processor. It allows you to extract (multiple) parts of the text with a regular expression and put them into attributes: org.apache.nifi.processors.standard.ExtractText
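Roughly, ExtractText applies a user-supplied regex to the flowfile content and exposes the capture groups as attributes. A plain-Python sketch of that idea (the content, pattern, and attribute names are invented for illustration):

```python
import re

# Illustrative flowfile content and extraction pattern.
content = "order=1234;status=shipped"
match = re.search(r"order=(\d+);status=(\w+)", content)

# ExtractText would expose the numbered capture groups as attributes
# (named after the property you configure, e.g. "myattr.1", "myattr.2").
attributes = {
    "order.id": match.group(1),
    "order.status": match.group(2),
}
```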
01-20-2020
09:44 AM
Are you running a recent version of HDF? This thread suggests it may be an issue that has been resolved: http://apache-nifi.1125220.n5.nabble.com/Hive-w-Kerberos-Authentication-starts-failing-after-a-week-td23887.html
12-30-2019
09:42 AM
What to ask before making a Data Flow

When taking on a new responsibility for designing and maintaining data flows, what are the main questions one should ask to ensure a good outcome? Here I list the key questions for the important topics, along with an illustration of what typically goes wrong if the questions are not asked.

The most important points if you are under pressure to deliver

Location
The questions: Where is the data, where should it go (and where can I process it)? And of course: do I have the required access?
The nightmare: Data is spread across multiple systems, and one of these may not even be identified. After you finally figure out which tables you need, you try to start and don't have access. When you finally get the data, you either don't have a compliant place to put it, or you are missing a tool. Finally you have the data, but it is unclear how to get it written to the target. In the end a 3-day job takes 6 weeks.

Context
The questions: What is the data, and who understands the source/target?
The nightmare: You want to supply revenue data during business hours. First of all you get access to multiple tables, each containing various numbers which might be the revenue. After figuring out which one is the revenue, it turns out you have transactions from multiple time zones, in and out of summer time, which needs to be solved before moving the data into the target application. Finally it turns out the target application requires fields not to be NULL, and you have no idea what will happen if you use a wrong default.

Process
The questions: Who writes the specifications, and who accepts the results? How do you deal with requirements that change (or, as it may be phrased, requirements you "did not understand correctly")? How do you escalate if you are not put in circumstances where you can succeed?
The nightmare: The requirements are not completely clear. You build something and get feedback that you need to change one thing. After this, you need to change another thing. It is unclear whether these are refinements (from your perspective) or fixes (from their perspective), but when the deadline is not met, it is clear where the finger will be pointed.

The most important points if you want things to go right

Complexity
The questions: What exactly should the output be, and what exactly needs to be done?
The nightmare: You build a data flow in NiFi; near the end, the request comes to join two parts of the flow together, or to do some complex windowing. Based on this kind of requirement you should have considered something like Spark; perhaps you need to redo some of the work to keep the flow logical, and introduce Kafka as a buffer in between.

Supplier commitment
The questions: Who supplies the data? What is the SLA? Will I be informed if the structure changes, and will those changes be available for testing? Is the data supplier responsible for data quality?
The nightmare: You don't get a commitment, and suddenly your consumers start seeing wrong results. It turns out a column definition was changed and you were not informed. After this you get a message that one of the smaller sources will be down for 12 hours, and you need it to enrich your main source. So now you will be breaking the service level agreement with your consumers for a reason they may not want to understand.

Non-functionals
The questions: How big is the data, what is the expected throughput, and what is the required latency?
The nightmare: You design and test a flow with 10 messages per second, with buffers to cushion the volatility. You end up receiving 10,000 messages per second; for this you may even need a bigger budget. After your throughput (and budget) has been increased significantly, it turns out the buffers are too big and your latency SLA is not met. Now you can go back to request an even larger compute capability.
Of course there are other things to ask, such as requirements to work with specific (legacy) tooling, exact responsibilities per topic or security guidelines to abide by. But typically these are the things I consider to be the most critical and specific to working with data.
12-28-2019
02:15 AM
1 Kudo
The first thing that comes to mind: instead of selecting $.data.*, try something like $.data.file_attachment or $.data.file_attachment.*. Does this bring you (closer) to the answer? If there are still simple things you want to change in the text, you could use this workaround: in the UpdateAttribute processor, use something like replaceAll. Hope this helps, but I am also curious whether other things are relevant here.
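The difference between the two JsonPath expressions can be shown with plain dict access; the payload shape below is assumed for illustration, not taken from the original question:

```python
import json

# Assumed payload shape, for illustration only.
payload = json.loads('{"data": {"id": 1, "file_attachment": "report.pdf"}}')

# $.data.* yields every value under "data":
all_values = list(payload["data"].values())      # [1, "report.pdf"]

# $.data.file_attachment targets just the one field:
attachment = payload["data"]["file_attachment"]  # "report.pdf"
```

This is why the wildcard pulls in sibling fields you may not want, while the explicit path isolates the attachment.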
12-27-2019
02:13 AM
Perhaps I missed it, but what is the exact error that you see? And just in case, what command do you use, and have you successfully run similar commands before? Also, you mentioned Oozie; does that mean you can run the command outside Oozie? And with the same user?
12-25-2019
08:51 AM
For access to the versions intended for production, you would indeed need to be a customer. However, for a quick look there are some more accessible options, for instance the HDP sandbox or the CDP free trial. These can be used by anyone without being a customer, but may not contain the latest version.