Member since: 01-07-2019
Posts: 220
Kudos Received: 23
Solutions: 30

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5050 | 08-19-2021 05:45 AM
 | 1811 | 08-04-2021 05:59 AM
 | 879 | 07-22-2021 08:09 AM
 | 3696 | 07-22-2021 08:01 AM
 | 3436 | 07-22-2021 07:32 AM
07-25-2020
01:52 AM
There will be multiple form factors available in the future; for now, I will assume you have an environment that contains one Data Hub with NiFi and one Data Hub with both Impala and Kudu. (The answer still works if all of these are on the same Data Hub.)
Prerequisites
Data Hubs with NiFi and Impala+Kudu
Permission to access these (e.g. add a processor, create table)
Know your Workload User Name (CDP Management Console > Your name (bottom left) > Profile)
You should have set your Workload Password in the same location
Steps to write data from NiFi to Kudu in CDP Public Cloud
Unless mentioned otherwise, I have kept everything to its default settings.
In Kudu Data Hub Cluster:
Gather the FQDNs of the Kudu masters and the ports they use. Go to the Data Hub > Click Kudu Master > Click Masters.
Combine the RPC addresses together in the following format: host1:port,host2:port,host3:port
Example:
```master1fqdn.abc:7051,master2fqdn.abc:7051,master3fqdn.abc:7051```
In HUE on Impala/Kudu Data Hub Cluster:
Run the following query to create the Kudu table (the little triangle makes it run):
`CREATE TABLE default.micro (id BIGINT, name STRING, PRIMARY KEY(id)) STORED AS KUDU;`
In NiFi GUI:
Ensure that you have some data in NiFi that fits in the table. Use the `GenerateFlowFile` processor.
In Properties, configure the Custom Text to contain the following two lines (copy carefully, or use Shift+Enter for the newline):
id,name
1,dennis
Select the PutKudu processor and configure it as follows:
- Settings
- Automatically terminate relationships: Tick both success and failure
- Scheduling
- Run Schedule: 1 sec
- Properties
- Kudu Masters: The combined list we created earlier
- Table Name: impala::default.micro
- Kerberos Principal: your Workload User Name (see prerequisites above)
- Kerberos Password: your Workload Password
- Record Reader: Create new service>CSV Reader
- Kudu Operation Type: UPSERT
Right-click the NiFi canvas > Configure > click the little cogwheel of CSV Reader, set the following property and then apply: Treat First Line as Header: True
Click the little lightning bolt of CSV Reader > Enable
On the canvas connect your GenerateFlowFile processor to your PutKudu processor and start the flow.
You should now be able to select your table through HUE and see that a single record has been added: `select * from default.micro`
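If you prefer to verify from a script rather than HUE, a minimal sketch with the impyla client could look like the following. The coordinator hostname, port, and TLS/LDAP settings are assumptions about a typical Data Hub setup, not part of the steps above; adjust them to your environment.

```python
from impala.dbapi import connect

# Assumed connection details for the Impala coordinator on the Data Hub.
# The workload user name and password are the same credentials used for PutKudu above.
conn = connect(
    host="impala-coordinator.your-datahub.example",  # hypothetical hostname
    port=21050,                                      # default Impala client port; may differ
    use_ssl=True,
    auth_mechanism="LDAP",
    user="your_workload_user",
    password="your_workload_password",
)

cursor = conn.cursor()
cursor.execute("SELECT * FROM default.micro")
print(cursor.fetchall())  # expect a single row: (1, 'dennis')
```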
These are the minimal steps; a more extensive explanation can be found in the Cloudera Documentation.
Potential refinements
A `DESCRIBE EXTENDED` on the table also exposes the master hostnames. Using those also worked for me, but I am unsure how safe it is not to define the ports explicitly.
07-23-2020
01:49 PM
1 Kudo
The Cloudera Data Platform (CDP) comes with a wide variety of tools that move data; these are the same in any cloud as well as on-premises. Though there is no formal decision tree, I will summarize the key considerations from my personal perspective. In short, it can be visualized like this:

Steps for finding the right tool to move your data
1. Staying within Hive and SQL queries suffice? > Hive, otherwise
2. No complex operations (e.g. joins)? > NiFi, otherwise
3. Batch? > Spark, otherwise
4. Already have Kafka Streams/Spark Streaming in use? > Kafka Streams/Spark Streaming, otherwise
5. Flink

Some notes:
- If you can use either NiFi or a more complex solution, use NiFi.
- Use Flink as your streaming engine unless there is a good reason not to; it is the latest generation of streaming engines. Currently, I do not recommend using Flink for batch processing yet, but that will likely change soon.
- I did not include tools like Impala, HBase/Phoenix, or Druid, as their main purpose is accessing data.
- This is a basic decision tree; it should cover most situations, but do not hesitate to deviate if your situation asks for it.

Also see my related article: Choose the right place to store your data

Full Disclosure & Disclaimer: I am an employee of Cloudera, but this is not part of the formal documentation of the Cloudera Data Platform. It is purely based on my own experience of advising people in their choice of tooling.
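To make the steps above concrete, here is a minimal sketch of the same decision tree in Python. The function name and boolean flags are my own illustration, not part of any CDP tooling.

```python
def pick_data_movement_tool(stays_in_hive_and_sql_suffices: bool,
                            needs_complex_operations: bool,
                            is_batch: bool,
                            already_uses_kafka_or_spark_streaming: bool) -> str:
    """Illustrative encoding of the decision steps described above."""
    if stays_in_hive_and_sql_suffices:
        return "Hive"
    if not needs_complex_operations:          # no joins or other complex operations
        return "NiFi"
    if is_batch:
        return "Spark"
    if already_uses_kafka_or_spark_streaming:
        return "Kafka Streams / Spark Streaming"
    return "Flink"

# Example: streaming data that needs joins, with no existing Kafka Streams/Spark Streaming
print(pick_data_movement_tool(False, True, False, False))  # -> "Flink"
```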
06-29-2020
01:10 PM
3 Kudos
Yesterday evening, in a moment of inspiration, I wondered if one could build a game in NiFi (a tool for moving data in real time). Of course, building a game is quite unnecessary, but it is far from pointless, as it can really show how flexible a tool can be with a bit of creativity. A few years ago I played ‘World's first game of chess… against PySpark’, but given that NiFi is meant for simpler logic, I decided to start with something different.
Building Tic Tac Toe in NiFi
My goal was to build something that allowed people across the globe to play in real time, using a board (no code). Fortunately, the graphical user interface of NiFi was made with real-time interaction between multiple users in mind, so I had a good foundation for this. Furthermore, I wanted NiFi to determine the winner automatically. For this, I leveraged its capabilities in applying logic/business rules. Of course, NiFi wants to apply the logic to something, so to ensure that the winner would be determined in a timely manner, I introduced a heartbeat that checks 10 times per second whether someone has made the winning move.
Let's get started
How did it go?
I was hoping I could say it only took 30 minutes, but admittedly I had to think a bit about how to make the evaluation work and how to represent things in a visually appealing way. So, ultimately it took me about 2 hours, which is still not bad for creating an interactive multiplayer game in a tool that is actually meant for moving data. I am quite happy with how easy the game turned out to be to play. However, please note that I only built the logic for multiplayer Tic Tac Toe. NiFi actually has enough capabilities that it would not be hard to make it decide its own moves, at various levels of playing strength, for a simple game like Tic Tac Toe.
(Screenshots: Make your move / Automatically evaluate the winner)
What will come next?!
First of all, I will think about more cool stuff to do with tools for moving data, perhaps with NiFi or with more complex engines like Flink. I hope I have shown that NiFi is indeed very flexible and enables great speed of development.
Secondly, I invite you to think of something to do with NiFi or other open-source tooling. Good luck, and do share the results!
Well played!
----
NiFi Template (right-click to save): Tic_Tac_Toe.xml
01-26-2020
04:11 AM
Here are the standard steps for debugging scripts that fail in NiFi:
1. Make sure the script works in general (outside NiFi).
2. Make sure that in your test you are on the same machine, with the same rights, and so on.
Usually this suffices; if it still fails, we can dig deeper into this kind of problem.
01-26-2020
04:06 AM
1 Kudo
Though many of these kinds of fields allow regular expressions, the documentation for this one does not mention it. I would try a regex, but it probably will not work because the field expects comma-separated input. From here you would need to get creative. First of all, you could definitely add a filter based on a regex afterwards (in RouteText). If that does not perform sufficiently, you could perhaps try filtering on ' BT ' (a space, then BT, then a space), but this is obviously a shortcut that will miss some results (also, I did not test it). In this particular case, BT preceded or followed by a space may capture most cases. --- So far the regular recommendations; if you really want to go all the way, you could probably create your own GetTwitter processor, but I would start with the two-step filter and see if it works for you.
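As a rough illustration of the follow-up regex filter (my own example, not tested against the Twitter feed), a pattern matching 'BT' preceded or followed by a space would look like this in Python; RouteText can route on a similar regular expression.

```python
import re

# 'BT' preceded or followed by a space, i.e. the shortcut described above.
pattern = re.compile(r"(?:\sBT|BT\s)")

tweets = [
    "BT outage reported in London",           # matched: 'BT ' at the start
    "switching from BT to another provider",  # matched: ' BT '
    "BT's customer service was helpful",      # missed: no space directly around BT
]
for tweet in tweets:
    print(bool(pattern.search(tweet)), tweet)
```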
01-20-2020
03:25 PM
It seems that you are looking to simulate batch processing in NiFi. In general this is not a good idea; consider just processing the files as they come in. If that is not possible, perhaps you want to do some scheduling based on a trigger. Tools like Oozie are made for this; they can trigger Spark, for instance (possibly even NiFi). If that doesn't work and you want a pure NiFi solution, it might be possible to set up a Wait processor that eventually gives up, but this will be more of a hack than a solution.
01-20-2020
10:08 AM
The UpdateAttribute processor is for updating an individual flow file. It seems like you want a more global count. This kind of counting over time is usually called a window operation, and it is not something NiFi is really designed for; in general you would use something like Spark for complex operations. In your specific case you could try something 'ugly', like letting NiFi execute a SQL statement that increments a field by 1 each time. However, this obviously will not scale, and I am not sure a correct outcome is guaranteed if two updates are initiated in parallel (that may be more of a DB question).
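To make the 'ugly' workaround concrete, the idea is a statement along these lines, which NiFi could issue against your database through a processor such as PutSQL. This is only a sketch: the counter table, column names, and the use of SQLite are my own illustration.

```python
import sqlite3

# Hypothetical counter table; in practice this would live in your own database
# and NiFi would issue the UPDATE statement through a processor such as PutSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('records_seen', 0)")

# The per-flow-file increment described above.
conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'records_seen'")
conn.commit()

print(conn.execute(
    "SELECT value FROM counters WHERE name = 'records_seen'"
).fetchone()[0])  # -> 1
```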
01-20-2020
10:01 AM
The first thing that comes to mind is the ExtractText processor. It allows you to get (multiple) parts from the text and put them into attributes: org.apache.nifi.processors.standard.ExtractText
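For a rough idea of what ExtractText does: you add user-defined properties whose values are regular expressions, and the capture groups end up as flow file attributes. Conceptually it is similar to the sketch below; the property name `order.id`, the sample text, and the regex are my own illustration.

```python
import re

flowfile_content = "order_id=12345; customer=acme; total=99.50"

# In ExtractText you would add a dynamic property, e.g. 'order.id', with a regex
# as its value; capture groups become attributes such as 'order.id.1'.
attributes = {}
match = re.search(r"order_id=(\d+)", flowfile_content)
if match:
    attributes["order.id.1"] = match.group(1)

print(attributes)  # {'order.id.1': '12345'}
```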
01-20-2020
09:44 AM
Are you running a recent version of HDF? This thread suggests it may be an issue that has been resolved: http://apache-nifi.1125220.n5.nabble.com/Hive-w-Kerberos-Authentication-starts-failing-after-a-week-td23887.html
12-30-2019
09:42 AM
What to ask before making a Data Flow

When taking on a new responsibility for designing and maintaining data flows, what are the main questions one should ask to ensure a good outcome? Here I list the key questions for the important topics, as well as an illustration of what typically goes wrong if the questions are not asked.

The most important points if you are under pressure to deliver

Location
The questions: Where is the data, where should it go (and where can I process it)? And of course: do I have the required access?
The nightmare: Data is spread across multiple systems, and one of these may not even be identified. After you finally figure out which tables you need, you try to start and don't have access. When you finally get the data, you either don't have a compliant place to put it or you are missing a tool. Finally you have the data, but it is unclear how to get it written to the target. In the end a 3-day job takes 6 weeks.

Context
The questions: What is the data, and who understands the source/target?
The nightmare: You want to supply revenue data during business hours. First of all, you get access to multiple tables, each containing various numbers which might be the revenue. After figuring out which one is the revenue, it turns out you have transactions from multiple time zones, in and out of summer time, which needs to be solved before moving the data into the target application. Finally, it turns out the target application needs fields not to be NULL, and you have no idea what will happen if you use a wrong default.

Process
The questions: Who makes the specifications and accepts the results? How to deal with the situation that the requirements change (or, as it may be phrased, that you did not understand them correctly)? How to escalate if you are not put in circumstances where you can succeed?
The nightmare: The requirements are not completely clear. You make something and get feedback that you need to change one thing. After this, you need to change another thing. It is unclear whether these are refinements (from your perspective) or fixes (from their perspective); however, when the deadline is not met, it is clear where the finger will be pointed.

The most important points if you want things to go right

Complexity
The questions: What exactly should be the output, and what exactly needs to be done?
The nightmare: You build a data flow in NiFi; near the end, the request comes to join two parts of the flow together, or to do some complex windowing. Based on this kind of requirement you should have considered something like Spark; perhaps you need to redo some of the work to keep the flow logical, and introduce Kafka as a buffer in between.

Supplier commitment
The questions: Who supplies the data? What is the SLA? Will I be informed if the structure changes? Will these changes be available for testing? Is the data supplier responsible for data quality?
The nightmare: You don't get a commitment, and suddenly your consumers start seeing wrong results. It turns out a column definition was changed and you were not informed. After this, you get a message that one of the smaller sources will be down for 12 hours, and you need it to enrich your main source. So now you will be breaking the service level agreement to your consumers for a reason they may not want to understand.

Nonfunctionals
The questions: How big is the data, what is the expected throughput, and what is the required latency?
The nightmare: You design and test a flow with 10 messages per second, and buffers to cushion the volatility. You end up receiving 10,000 messages per second. For this you may even need a bigger budget. After your throughput (budget) has been increased significantly, it turns out the buffers are too big and your latency SLA is not met. Now you can go back to request an even larger compute capability.

Of course there are other things to ask, such as requirements to work with specific (legacy) tooling, exact responsibilities per topic, or security guidelines to abide by. But typically these are the things I consider to be the most critical and specific to working with data.