Member since
02-01-2022
285
Posts
103
Kudos Received
60
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1181 | 05-15-2025 05:45 AM |
| | 5119 | 06-12-2024 06:43 AM |
| | 8115 | 04-12-2024 06:05 AM |
| | 5995 | 12-07-2023 04:50 AM |
| | 3297 | 12-05-2023 06:22 AM |
06-13-2022
06:38 AM
1 Kudo
I have done something similar here when I needed to deliver jar files to all nodes. It really is a case of "this is not how things are done", but I did not have access to the nodes' file systems without doing this in a flow. That said, it works great! The first processor creates a flowfile on every node (even when I don't know how many nodes there are); the flow then checks whether the file exists and, if it is not found, gets the file and writes it to the file system.
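For anyone who wants the "check, and only fetch if not found" step spelled out, here is a minimal Python sketch of that logic outside NiFi. It is only an illustration under assumptions: the function name `ensure_file`, the file name `driver.jar`, and the temporary directories are all hypothetical and not part of the flow described above.

```python
import shutil
import tempfile
from pathlib import Path


def ensure_file(source: Path, target_dir: Path) -> bool:
    """Copy `source` into `target_dir` unless a file of that name is already there.

    Returns True when a copy was made, False when the file already existed.
    """
    target = target_dir / source.name
    if target.exists():
        return False  # already delivered, nothing to do
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)
    return True


# Hypothetical demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    jar = Path(tmp) / "driver.jar"
    jar.write_bytes(b"not a real jar")
    lib_dir = Path(tmp) / "node" / "lib"
    print(ensure_file(jar, lib_dir))  # first run copies the file
    print(ensure_file(jar, lib_dir))  # second run finds it and skips
```

Because the check comes first, running the same step repeatedly on a node is harmless, which is what makes it safe to trigger on every node.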
06-07-2022
05:20 AM
Fun with Python: you are going to need to resolve all dependencies. I am not familiar with the last error, but it's definitely saying psycopg2 is not found.
06-01-2022
10:17 AM
1 Kudo
Yes, not available before 1.16. Definitely a great new feature!!
06-01-2022
07:48 AM
1 Kudo
@leandrolinof I believe you are looking for a brand new NiFi feature in 1.16 which allows you to control failure and retry. From the release notes:

"Framework Level Retry now supported. For many years users build flows in various ways to make retries happen for a configured number of attempts. Now this is easily and cleanly configured in the UI/API and simplifies the user experience and flow design considerably! To those waiting for years for this thank you for your patience."

Reference: https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.16.0

You can find more about what's new in NiFi 1.16 in the video below:
https://www.youtube.com/watch?v=8G6niPKntTc
Mark also shows a bit of the new retry mechanism around 11:50.
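To make "retries for a configured number of attempts" concrete, here is a rough Python sketch of the general idea. It is not NiFi's implementation or API; `run_with_retry`, `max_retries`, and the flaky task are hypothetical names used only for illustration.

```python
def run_with_retry(task, max_retries):
    """Call `task` until it succeeds or the configured retries are used up.

    Returns (outcome, attempts), where outcome is "success" or "failure".
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            task()
            return "success", attempts
        except Exception:
            # The first call is not a retry, so allow max_retries further calls.
            if attempts > max_retries:
                return "failure", attempts


# Hypothetical task that fails twice before succeeding
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")


print(run_with_retry(flaky, max_retries=5))
```

Before 1.16 this kind of counting loop had to be built out of processors in the flow itself; the point of the new feature is that you no longer have to.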
05-31-2022
05:48 AM
@dfdf As the error suggests, you need to install the MySQL connector. I believe this link will get you there: https://docs.cloudera.com/csa/1.3.0/installation/topics/csa-ssb-configuring-mysql.html#ariaid-title3
05-24-2022
06:02 AM
1 Kudo
@FediMannoubi Below is a basic approach to solve this.

Assuming both postgres tables are populated with rows per your example, your NiFi flow first needs to get the CSV (there are various ways to do that). Once the contents of the CSV are in a flowfile (I use the GenerateFlowFile processor), you can use a RecordReader-based processor to read the CSV. This allows you to write SQL against the flowfile with QueryRecord to get a single value. For example:

SELECT city_name FROM FLOWFILE

Next, you need to get the city_name value into an attribute; I use EvaluateJsonPath. After that comes an ExecuteSQL processor with an associated DBCP connection pool to postgres. In ExecuteSQL your query is:

SELECT city_id FROM CITY WHERE city_name='${city_name}'

Note the single quotes around ${city_name}: the attribute value is substituted into the query as text, so a string value has to be quoted.

At the end of this flow you will have the city_name from the CSV and the city_id from postgres. You can now combine them or use them further downstream to suit your needs. INSERT is done similarly: once you have the data in flowfiles or attributes, you use the same ExecuteSQL and write an insert instead.

My test flow looks like this, but forgive the end, as I did not actually have a postgres database set up. You can find this sample flow [here]. I hope this gets you pointed in the right direction for reading CSV and querying data from a database.
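If it helps to see the same lookup outside NiFi, here is a small Python sketch of the logic: read city_name from the CSV, then query city_id with it. Everything here is an assumption for illustration. sqlite3 stands in for postgres so the snippet runs on its own, and the sample rows, the `person_name` column, and the function name `lookup_city_id` are made up.

```python
import csv
import io
import sqlite3


def lookup_city_id(csv_text, conn):
    """Read city_name from the first CSV row and look up its city_id."""
    row = next(csv.DictReader(io.StringIO(csv_text)))
    city_name = row["city_name"]
    # Bind the value as a parameter instead of pasting it into the SQL text.
    cur = conn.execute(
        "SELECT city_id FROM city WHERE city_name = ?", (city_name,)
    )
    match = cur.fetchone()
    return city_name, (match[0] if match else None)


# Hypothetical sample data; an in-memory sqlite3 database stands in for postgres
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (city_id INTEGER PRIMARY KEY, city_name TEXT)")
conn.execute("INSERT INTO city (city_id, city_name) VALUES (1, 'Paris')")
conn.execute("INSERT INTO city (city_id, city_name) VALUES (2, 'Tunis')")

csv_text = "person_name,city_name\nAlice,Tunis\n"
print(lookup_city_id(csv_text, conn))  # ('Tunis', 2)
```

The parameter binding in the sketch does the job that quoting the value does in a hand-written query: the city name is treated as a string value, not as part of the SQL.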
05-17-2022
12:35 PM
Nice one sir!
05-17-2022
12:34 PM
@joshtheflame CDP Private Cloud Base, for on-prem, can be deployed on OpenShift Kubernetes. CDP Public Cloud, in AWS, Azure, or GCP, is fully Kubernetes-deployed on the respective cloud Kubernetes platforms. CDP is Hybrid and Multi-Cloud capable as well. Check out CDP Private Cloud Base: https://docs.cloudera.com/data-warehouse/1.3.1/openshift-environments/topics/dw-private-cloud-openshift-environments-overview.html and CDP Public Cloud: https://docs.cloudera.com/cdp/latest/overview/topics/cdp-overview.html
04-28-2022
05:01 AM
Do not think of the number of concurrent tasks (concurrency) and the run schedule for a processor as relating to request/response timing. With InvokeHTTP specifically, the request/response time can be anything from almost instant to as long as the other end takes to respond. The number of concurrent tasks is used to get more instances of that processor running at once, usually to help drain a huge queue of flowfiles (1000s, 10000s, 1000000s, etc). Run Schedule controls how often the processor is scheduled to run; how long one instance stays active once triggered (able to process more than 1 flowfile in sequence) is the separate Run Duration setting. Hope this helps, Steven
04-26-2022
06:24 AM
2 Kudos
@jonay__reyes I think by default you will see the result you are expecting; however, the expected limit of 5 concurrent connections may be a challenge. Let's address your questions first:

Does this translate to simply using 1 InvokeHTTP processor configured to 5 "Concurrent Tasks" and that's it?
One processor with 5 concurrent tasks gives you what is in effect 5 instance copies, and each of them can run more than 5 requests if there are ample flowfiles queued up. So, NO. For your use case, I would recommend that you set it to 1 and control the number of flowfiles upstream.

Will the processor wait for the remote endpoint's request before sending the next one?
YES, if concurrent tasks is set to 1. NO, if it is set higher (2+): the tasks will execute in parallel.

How does the "Run Schedule" works together with the previous settings? (if I had, e.g.: 1 sec).
Run Schedule sets how often the processor is scheduled to run. How long a triggered task keeps working before a new one is needed is the separate Run Duration setting. If the request/response times are low, a longer Run Duration lets you push more data through each task without scheduling a separate one for each flowfile. If the request/response time is high, these settings can help with long executions. Experiment carefully here.

I've been proposed with splitting the incoming queue and put 5 InvokeHTTP processors in parallel, each one attending 1/5 of the incoming flowfiles (I'd do the pre-partitioning before with some RouteOnAttribute trick), but I think it's exactly the same outcome as the 1. above. Is it?
Correct, there is no reason to do this. Avoid duplicating processors.

For concurrent tasks and run schedule adjustments, you should always experiment in small increments: change one setting at a time, evaluate, and repeat until you find the right balance. I suspect that you will not need 5 long-running request/responses in parallel, and that even with default settings your queued flowfiles will execute fast enough to appear "simultaneous".
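As a rough analogy for the concurrent tasks setting, the Python sketch below drains a queue with a fixed number of worker threads and records how many simulated requests were in flight at once. This is not how NiFi is implemented; `drain_queue`, `fake_request`, and the sleep that stands in for the remote response time are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def drain_queue(flowfiles, concurrent_tasks, request_seconds=0.01):
    """Process every item with at most `concurrent_tasks` requests in flight.

    Returns (results, peak), where peak is the highest number of simulated
    requests that were running at the same moment.
    """
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def fake_request(item):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(request_seconds)  # stand-in for the remote response time
        with lock:
            in_flight -= 1
        return f"response-for-{item}"

    with ThreadPoolExecutor(max_workers=concurrent_tasks) as pool:
        results = list(pool.map(fake_request, flowfiles))
    return results, peak


# One task: strictly one request at a time, each waits for the previous one
print(drain_queue(range(10), concurrent_tasks=1)[1])
# Five tasks: up to five requests overlap, but every item is still processed
print(drain_queue(range(10), concurrent_tasks=5)[1])
```

With a single worker each request finishes before the next starts, which matches the "YES if set to 1" answer above; with more workers the requests overlap, and the worker count is the upper bound on how many run at once.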