Member since: 01-07-2019
Posts: 220
Kudos Received: 23
Solutions: 30
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11999 | 08-19-2021 05:45 AM |
| | 3152 | 08-04-2021 05:59 AM |
| | 1543 | 07-22-2021 08:09 AM |
| | 6034 | 07-22-2021 08:01 AM |
| | 5494 | 07-22-2021 07:32 AM |
08-13-2020
04:44 AM
This message is labeled NiFi, so I assume you have NiFi available? In that case, look for the right processor for the job; something like ExecuteSQL may be a good starting point. ---- If your question is purely about how to make Python and MariaDB interact, this may not be the best place to ask it.
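If the Python-MariaDB side is what you are after, here is a minimal sketch using the `mariadb` connector package; the host, credentials, and table name are placeholders, not anything from this thread:
```python
# Minimal sketch: query MariaDB from Python with the mariadb connector.
# All connection details and the table name below are placeholders.
import mariadb

conn = mariadb.connect(
    host="localhost",
    port=3306,
    user="myuser",
    password="mypassword",
    database="mydb",
)
cur = conn.cursor()
cur.execute("SELECT id, name FROM mytable")
for row in cur:
    print(row)
conn.close()
```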
08-13-2020
04:33 AM
1 Kudo
NiFi is not really designed to work with 'context'. If you have a simple record, there are many operations you can do, but if you are working with potentially complex files, and thus complex operations, you will likely rather process them with something like Spark or Python.
08-13-2020
04:31 AM
1 Kudo
Assuming your flowfile contains multiple records, this should be achievable with the UpdateRecord processor. Note that the expression language has a UUID function, which may be helpful inside it.
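As a hedged sketch (the record path `/id` is a placeholder for whatever field you want to populate), UpdateRecord could be configured like this:
```
Replacement Value Strategy: Literal Value
/id: ${UUID()}
```
With the Literal Value strategy, the expression is evaluated for each record, so every record should receive its own UUID.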
08-12-2020
02:16 PM
I have not tested it as such, but 10 minutes is a very long time for NiFi; I could imagine that the 'memory' of a load balancer is quite short, to avoid overhead. I would just drop in many files in a short period of time and see what happens. If you set the load balancer to round robin, I would expect the files to go to both nodes. Note that you can also use GenerateFlowFile if you don't want to drop files all the time, as in the sketch below.
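A quick sketch of such a test setup (these settings are my suggestion, not from the thread):
```
GenerateFlowFile
  Run Schedule: 0 sec      (emit flowfiles continuously)
  Batch Size: 100          (many flowfiles per scheduled run)
Connection to the next processor
  Load Balance Strategy: Round robin
```
With round robin set on the connection, the flowfiles should spread across both nodes almost immediately.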
08-12-2020
01:58 PM
The ListSFTP processor does not actually do anything with the file; it just builds a list of the files that exist. Typically this would then feed into a GetSFTP processor. In the GetSFTP processor you can configure whether the original should be deleted; by default this would indeed happen. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.GetSFTP/index.html
07-29-2020
02:27 PM
Thanks, I will think about refining the distinction between Kudu and Druid. Currently I would not want to include the fact that Flink has state as 'storage', but regarding Flink SQL, I may actually make another post later about the ways to interact with/access different kinds of data. (As someone also noticed, Impala is also not here, because it is not a store in itself but works with stored data.)
07-28-2020
11:33 AM
3 Kudos
The Cloudera Data Platform (CDP) comes with many places to store your data, and it can be challenging to know which one to use. Though there is no formal decision tree, I hereby share the key considerations from my personal perspective. They can be visualized like this:

Explanation of each path

a. Have large bulky files that do not need to be queried > File and object storage

The exact kind of storage to be used will mostly be defined by your environment: in a classical cluster HDFS is available, in the public cloud each provider's object store will be leveraged, and on-premises Ozone will serve as the object store.

b. Have a table, either from large bulky files or a set of messages > Hive for scale or Kudu for interaction

If you want to work with a table and need to store it as such, it is clear you want to store your data as a table, even if this may force you to think about how to implement the ingest in a sensible way. Kudu is great for fast insights, where Hive tables (which in turn can be of different formats) can offer unlimited scale. Note that Hive tables (registered in the Hive Metastore) can be accessed via different means, including the Hive engine and the Impala engine.

c. Do your table records stream in, but you only need pre-aggregates > Druid

Druid is able to aggregate data upon ingestion.

d. Are you working with messages or small files > Kafka for latency or HBase for retention

Kafka and HBase are both great places to put 'many tiny things', for instance individual transactions. Kafka offers great throughput and latency, but despite commonly used marketing messages, it is not a database and does not scale well for historical data. If you want to serve data granularly for a longer period of time, HBase is a great fit.

Some notes:

- When working in the cloud, it is often desirable to work with object stores where possible to keep costs down. The good news is that CDP Public Cloud comes with cloud-native capabilities; as such, several storage solutions, such as Hive, actually store the data in cloud object stores.
- It is possible that more than one road applies to your data. For instance, a message may require very low latency in the first few days, but also needs to be retained for several years. In such cases, it often makes sense to store a subset of the data in two places.
- I did not include other solutions that could store data, such as Solr, or in-application state. The reason is that the primary function of these is not storage, but search and processing respectively. I also did not include Impala, as it is an engine; Hive is only on this chart to represent its storage capabilities.
- This is a basic decision tree; it should cover most situations, but do not hesitate to deviate if your situation asks for this.

Also, see my related article: Find the right tool to move your data

Full Disclosure & Disclaimer: I am an employee of Cloudera, but this is not part of the formal documentation of the Cloudera Data Platform. It is purely based on my own experience of advising people in their choice of tooling.
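As a purely illustrative sketch (my own simplification of the paths above, not a formal Cloudera decision tree), the logic could be encoded roughly like this:
```python
# Toy sketch of the storage decision paths a-d described above.
# The yes/no inputs are simplifications of the questions in the article.
def choose_storage(bulky_files_no_queries: bool, need_table: bool,
                   fast_interaction: bool, pre_aggregates_only: bool,
                   need_long_retention: bool) -> str:
    if bulky_files_no_queries:                       # path a
        return "File/object storage (HDFS, cloud object store, or Ozone)"
    if need_table:
        if pre_aggregates_only:                      # path c
            return "Druid"
        return "Kudu" if fast_interaction else "Hive tables"  # path b
    # path d: messages or small files
    return "HBase" if need_long_retention else "Kafka"

print(choose_storage(False, True, True, False, False))  # -> Kudu
```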
07-25-2020
01:52 AM
There will be multiple form factors available in the future; for now, I will assume you have an environment that contains 1 Data Hub with NiFi, and 1 Data Hub with both Impala and Kudu. (The answer still works if all are on the same Data Hub.)
Prerequisites
Data Hubs with NiFi and Impala+Kudu
Permission to access these (e.g. add a processor, create table)
Know your Workload User Name (CDP Management Console > Your name (bottom left) > Profile)
You should have set your Workload Password in the same location
Steps to write data from NiFi to Kudu in CDP Public Cloud
Unless mentioned otherwise, I have kept everything to its default settings.
In Kudu Data Hub Cluster:
Gather the FQDNs of the Kudu masters and the ports used. Go to the Data Hub > click Kudu Master > click Masters.
Combine the RPC addresses together in the following format: host1:port,host2:port,host3:port
Example:
```master1fqdn.abc:7051,master2fqdn.abc:7051,master3fqdn.abc:7051```
In HUE on Impala/Kudu Data Hub Cluster:
Run the following query to create the Kudu table (the little triangle makes it run):
`CREATE TABLE default.micro (id BIGINT, name STRING, PRIMARY KEY(id)) STORED AS KUDU;`
In NiFi GUI:
Ensure that you have some data in NiFi that fits in the table. Use the `GenerateFlowFile` processor.
In Properties, configure the Custom Text to contain the following (copy carefully, or use Shift+Enter for the newline):
```
id,name
1,dennis
```
Select the PutKudu processor and configure it as follows:
- Settings
  - Automatically terminate relationships: tick both success and failure
- Scheduling
  - Run Schedule: 1 sec
- Properties
  - Kudu Masters: the combined list we created earlier
  - Table Name: impala::default.micro
  - Kerberos Principal: your Workload User Name (see prerequisites above)
  - Kerberos Password: your Workload Password
  - Record Reader: Create new service > CSV Reader
  - Kudu Operation Type: UPSERT
Right-click the NiFi canvas > Configure > the little cogwheel of CSV Reader, set the following property and then apply: Treat First Line as Header: True
Click the little lightning bolt of CSV Reader > Enable
On the canvas connect your GenerateFlowFile processor to your PutKudu processor and start the flow.
You should now be able to query your table through HUE and see that a single record has been added: `select * from default.micro`
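If the flow has run at least once, the result should look roughly like this (an illustration based on the sample record above, not captured output):
```
+----+--------+
| id | name   |
+----+--------+
| 1  | dennis |
+----+--------+
```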
These are the minimal steps; a more extensive explanation can be found in the Cloudera Documentation.
Potential refinements
A `DESCRIBE EXTENDED` also exposes the hostnames. For me this also worked, but I am unsure how safe it is to not explicitly define the ports.
07-23-2020
01:49 PM
1 Kudo
The Cloudera Data Platform (CDP) comes with a wide variety of tools that move data; these are the same in any cloud as well as on-premises. Though there is no formal decision tree, I will summarize the key considerations from my personal perspective. In short, it can be visualized like this:

Steps for finding the right tool to move your data

1. Staying within Hive and SQL queries suffice? > Hive, otherwise
2. No complex operations (e.g. joins)? > NiFi, otherwise
3. Batch? > Spark, otherwise
4. Already have Kafka Streams/Spark Streaming in use? > Kafka Streams/Spark Streaming, otherwise
5. Flink

Some notes:

- If you can use NiFi or a more complex solution, use NiFi.
- Use Flink as your streaming engine unless there is a good reason not to; it is the latest generation of streaming engines. Currently, I do not recommend using Flink for batch processing yet, but that will likely soon change.
- I did not include tools like Impala, HBase/Phoenix, and Druid, as their main purpose is accessing data.
- This is a basic decision tree; it should cover most situations, but do not hesitate to deviate if your situation asks for this.

Also see my related article: Choose the right place to store your data

Full Disclosure & Disclaimer: I am an employee of Cloudera, but this is not part of the formal documentation of the Cloudera Data Platform. It is purely based on my own experience of advising people in their choice of tooling.
06-29-2020
01:10 PM
3 Kudos
Yesterday evening, in a moment of inspiration, I wondered if one could build a game in NiFi (a tool for moving data in real time). Of course, building a game is quite unnecessary, but it is far from pointless, as it can really show how flexible a tool can be with a bit of creativity. A few years ago I played 'World's first game of chess… against PySpark', but given that NiFi is meant for simpler logic, I decided to start with something different.
Building Tic Tac Toe in NiFi
My goals were to build something that allowed people across the globe to play in real time, by using a board (no code). Fortunately, the graphical user interface of NiFi was made with real-time interaction of multiple users in mind, so I had a good foundation. Furthermore, I wanted NiFi to determine the winner automatically. For this, I leveraged its capabilities in applying logic/business rules. Of course, NiFi wants to apply the logic to something, so to ensure that the winner would be determined in a timely manner, I introduced a heartbeat that checks 10x per second whether someone made the winning move.
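As a rough sketch of how such a heartbeat could be wired up (my guess at a setup; the actual template may differ):
```
GenerateFlowFile ("heartbeat")
  Run Schedule: 100 ms    (fires 10x per second)
  -> feeds the logic that checks whether a winning move was made
```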
Let's get started
How did it go?
I was hoping I could say it only took 30 minutes, but admittedly I had to think a bit about how to make the evaluation work and represent things in a visually appealing way. So ultimately it took me about 2 hours, which is still not bad for creating an interactive multiplayer game in a tool that is actually meant for moving data. I am quite happy with how easy it turns out to be to play the game. However, please note that I only built the logic for multiplayer Tic Tac Toe. NiFi actually has enough capabilities that it would not be hard to make it decide its own moves, at various levels of playing strength, for a simple game like Tic Tac Toe.
(Screenshots: 'Make your move' and 'Automatically evaluate the winner')
What will come next?!
First of all, I will think about more cool stuff to do with tools for moving data, perhaps with NiFi or more complex engines like Flink. I hope I have shown that NiFi is indeed very flexible and enables great speed of development.
Secondly, I invite you to think of something to do with NiFi or other open-source tooling. Good luck, and do share the results!
Well played!
----
NiFi Template (right-click to save): Tic_Tac_Toe.xml