Created 09-06-2024 01:42 PM
Hi all, I'm a newbie. I want to read a local .csv file that contains only an ID column, then use that list of IDs in a dynamic query to get data from an Oracle table. My flow is as below:
I configured the processors as follows:
+ ReplaceText(1): replaces "\n" with ","
+ ReplaceText(2): removes the last empty row
+ ExtractText: assigns the ID list to an attribute named list_id
+ UpdateAttribute: specifies the name and value of the property. The list contains 10 IDs
+ ExecuteSQL:
There are no tasks In for UpdateAttribute; the flow stops at ExtractText. The Details tab in the data provenance view is below, and it says File Size is 0 bytes.
However it doesn't return any error, so I don't know what to fix. Could you please help?
Thank you
Created 09-07-2024 07:58 AM
Hi @xtd ,
Welcome to the community. You definitely look like a newbie :), but don't worry, this is common when you start with NiFi; with time and experience you will get better for sure.
There are many tutorials out there to help you get started, and I would recommend going through them to understand how NiFi works and what the best practices are. One crucial concept that every beginner needs to be familiar with early on is covered in a series of YouTube videos on NiFi Anti-Patterns by Mark Payne, one of the creators of NiFi, which I recommend you watch.
I can see many issues with this flow. Above all, it's not doing things the NiFi way. Here are some notes on the main issues:
1- When you work with formats like CSV, JSON, XML, etc., you need to take advantage of NiFi's out-of-the-box capabilities to parse the data rather than doing it with ReplaceText, which can be inefficient.
2- It's not recommended to use ExtractText to extract the whole content of a flowfile into an attribute when you can't control how big the content can be. You will hear this a lot: attributes are stored in the JVM heap, which is limited and expensive, so large content there can cause problems.
3- When passing parameters to your SQL query through flowfile attributes, the N placeholder in sql.args.N.value stands for the parameter index in the query. For example, if you have the following insert statement:
insert into Table1 (col1, col2) values (?,?)
then the processor expects two sets of sql.args attributes, one for the first argument (N=1) and one for the second (N=2), as in sql.args.1.value, sql.args.2.value, etc. (note that N starts at 1).
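As a minimal sketch, the flowfile attributes for the insert above might look like this (the values are made up; the type is the numeric JDBC type code, e.g. 4 for INTEGER and 12 for VARCHAR):
sql.args.1.type  = 4
sql.args.1.value = 100
sql.args.2.type  = 12
sql.args.2.value = some text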
The way I would design this flow really depends on what you are trying to do after the ExecuteSQL. If you pass in the list of IDs correctly, ExecuteSQL (depending on its configuration) will return a single flowfile with the list of records matching those IDs in Avro format, so what are you going to do with that result? My guess, based on similar scenarios, is that you probably want to process each record on its own, whether for data enrichment, an API call, etc. Based on that, here is how I would design this flow:
1- Get the file from the source (GetFile, GetSFTP, etc.).
2- Use SplitRecord to get every record into its own flowfile and convert it into a format that is easy to parse and extract attributes from, using methods other than regex in ExtractText; see the example sketch below.
I'm converting to JSON because I can extract the ID using a JSON path, as we will see in the next step.
You can use the default settings for the reader and writer services.
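A rough sketch of that setup, assuming your CSV has a header row with a single ID column (the sample values are made up, and the reader should pick its field names from the CSV header):
SplitRecord:
  Record Reader     = CSVReader
  Record Writer     = JsonRecordSetWriter
  Records Per Split = 1
Input flowfile content:        Each output flowfile content (roughly):
  ID                             [{"ID":101}]
  101
  102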
3- Use EvaluateJsonPath to extract the ID into an attribute. This way we know that the value we are going to extract is a single integer, and we won't be as concerned about storing large content in the heap.
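For instance, assuming each flowfile now contains something like [{"ID":101}], the EvaluateJsonPath configuration could be (the attribute name id is just my choice):
EvaluateJsonPath:
  Destination = flowfile-attribute
  id          = $[0].ID     (dynamic property; use $.ID if the writer emits a single object rather than an array)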
4- Fetch the record corresponding to each ID from SQL. I know you probably think this will run a SQL transaction per ID instead of one select for all, and maybe that is the pitfall of this design, but those transactions are going to be short and simple and will probably perform well, especially when you have a cluster where you can load balance, or you can increase the number of concurrent tasks on the processor itself.
As I mentioned above, ExecuteSQL will give you the result in Avro format, which you can't do much parsing with unless you convert it to another parsable format (JSON, CSV, XML, etc.) using ConvertRecord. However, you can avoid the need for ConvertRecord by using ExecuteSQLRecord instead of ExecuteSQL, where you can set the record writer to whatever format you want before sending the result to the success relationship. Let's assume I want to convert the result back to CSV; the configuration would be:
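A rough sketch of that configuration, assuming an Oracle table called MY_TABLE with an ID column and the id attribute extracted in step 3 (table and column names are just placeholders):
ExecuteSQLRecord:
  Database Connection Pooling Service = your Oracle DBCPConnectionPool
  SQL select query                    = select * from MY_TABLE where ID = ?
  Record Writer                       = CSVRecordSetWriter
with an UpdateAttribute before it setting the parameter attributes:
  sql.args.1.type  = 4            (JDBC type code for INTEGER)
  sql.args.1.value = ${id}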
Once you get the records in the intended format, you can do whatever is needed afterward.
I think you have enough to get going. Let me know if you have more questions.
If you find this helpful, please accept the solution.
Thanks
Created 10-16-2024 02:59 PM
@SAMSAL thank you so much for your detailed answer, I really appreciate it (and sorry for not being able to reply earlier).