I understand I need to use ExtractText but it does not output anything to executeSQL processor. How do I do that?
I have not added anything additional in the config of the ExtractText processor but mapped both matched and unmatched queue to the executSQL processor.
WHat do I pass in SQL select query in executeSQL processor?
You don`t need to Extract the text .
Your flow should look like this
GenerateTableFetch --> ExecuteSQL -->ConvertAvroToJson(optional) --> StoreResults
Where Your GenerateTableFetch conf would be:
And the ExecuteSQL is:
You would get a Json/Avro Output Format:
Then you can work on you Json file to extract it
you can do this in 2 methods
1. by using GenerateTableFetch Processor
2. by using QueryDatabaseTable Processor
GenerateTableFetch processor gives incremental sql statements based on partition size
In my configuration i have given PartitionSize 10000 so the output of the table fetch processor will be
SELECT * FROM Table_Name ORDER BY Incremental_column_name LIMIT 10000
Drag and drop an executesql processor and don't mention any sql statements in it, as this processor executes all the sql statements that got generated by GenerateTableFetch processor.
In nifi every executesql,querydatabase processor results default format is avro.
QueryDatabaseTable Processor does the both two steps(generating table fetch and executing sql) and stores the state as GenerateTableFetch processor does and as output we will have data in Avro format(default in NiFi)
As a result from both the methods above we will have Avro format data and having not more than 10000(we mentioned in Processor) records in them.
We cannot extract text from Avro format so we need to convert avro format to Json.
in new version of NiFi there is Convert record processor which will convert Avro format data to CSV directly but in my version i dont have this processor.
This processor will converts all the avro to json data once we get json data then you can store the json data directly to HDFS... etc.
If you are thinking to store the data as text format not as Json then you need to Split the json array as individual records
Split Array using SplitJson Processor:-
We are having array of json message so use splitjson processor and property for JsonPath Expression as
It will splits the json array to individual records so each flowfile will have one record data in it.
Input to SplitJson:-
Once we splitted all the json array then use
EvaluateJsonPath processor configs:-
use the Destination property to flowfile-attribute,then add the list required of attributes that you need to be extracted as attributes from the content.
in this screenshot i am extracting all the contents as attributes to the flow file, but you can extract only the required content values as attributes to the flow file.
Use ReplaceText Processor:-
in this processor keep all the attributes in it to extract only the values of the attributes. we are having all the listed attributes below as flowfile attributes so i am extracting the values from flowfile.
change the Replacement Strategy property to AlwaysReplace.
Output from this processor:-
Right now we dont have any json data, if you want to merge the contents of flow files then use
MergeContent processor :-
Configure this processor to how many flowfiles you want to merge, increase Minimum Group Size to some number Default it is 0 B then this processor will wait to reach Minimum group size and merges all the contents flowfiles into one.
MergeContent causes memory issues follow instructions in the below link, there are lot of questions on community you can refer to them before configuring this processor
* i have reached max attachements for this answer,i'm unable to attach the flow screenshot,i'm going to attach flow ss in comments*.
Hope this Helps...!
This flow demonstrates how to do things with both method1 and method2. in place of update attributes in the screenshot you can use PutHDFS,PutS3..etc.
Keep in mind don't connect all the available relations(like success,failure,matched,unmatched,etc) to the next processor, take the above screenshot as reference just connect the required relations to next processor and for other relations that are not shown in this screenshot Auto terminate them in processor Settings.
If you don't care about Avro format and want to go directly to JSON you can use the ExecuteSQLRecord processor where you can specify the output format.