Created on 09-06-2020 06:44 AM - edited 09-06-2020 06:50 AM
I am using the GetHTTP processor to fetch an HTML page from a given URL.
I want to grab the innerHTML text of a particular class selector. I tried to do this with GetHTMLElement, but the processor emits N flowFiles. The selector I am passing occurs only once in the HTML, yet I get the result N times. Does anyone know why?
In the screenshot below, the first 10 entries are unique and the rest are redundant.
My requirement is to fetch a web page's HTML, extract the required innerHTML text by passing a CSS selector or XPath, then build a JSON document and insert it into MongoDB.
I am facing an issue while extracting the innerHTML text by passing a CSS selector. Please help me solve this.
How can I achieve this? I have searched a lot but didn't find a proper solution. I would really appreciate your help.
Created 09-30-2020 01:19 AM
Hi @Manoj90 ,
Please terminate the "original" relationship on GetHTMLElement. The reason you are getting so many flowFiles is as follows:
On the first run, the GetHTMLElement processor sends 10 flowFiles to the "success" relationship. Because the operation succeeded, the input flowFile (10.79 KB) is also routed to the "original" relationship. With that relationship connected back to the processor, the same file that was the input to GetHTMLElement arrives as input again. The processor processes it again, sending another 10 flowFiles to "success" and the input flowFile to "original".
This loop continues indefinitely until you run into memory issues.
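To make the loop concrete, here is a rough Python sketch. It is only a toy model, not NiFi code: the function `run_flow`, the `max_runs` cap, and the file name `page.html` are made up for illustration, and the count of 10 matches is taken from the flow described above.

```python
from collections import deque


def run_flow(loop_original, max_runs=5, matches=10):
    """Toy model of GetHTMLElement. Returns (processor runs, flowFiles on success)."""
    queue = deque(["page.html"])  # hypothetical input queue of the processor
    success = []
    runs = 0
    # max_runs stands in for "until memory runs out" so the sketch terminates
    while queue and runs < max_runs:
        flowfile = queue.popleft()
        runs += 1
        # one output flowFile per matched element goes to "success"
        success.extend(f"{flowfile}#match{i}" for i in range(matches))
        if loop_original:
            # "original" is connected back to the same processor
            queue.append(flowfile)
        # otherwise "original" is auto-terminated and the input is dropped
    return runs, len(success)


print(run_flow(loop_original=True))   # (5, 50): keeps growing with every run
print(run_flow(loop_original=False))  # (1, 10): one pass, 10 results
```

With the loop in place, every run adds another 10 duplicates. With "original" terminated, the processor runs once and stops.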
Created 09-30-2020 05:03 AM
@Manoj90 In addition to @PVVK's point, you need to be careful when routing relationships back to the originating processor. During development I like to use an output port that I call End of Line, or EOL1, EOL2, EOL3 as I need more of them in larger flows. This lets me see whether anything goes to failure, retry, etc. Later, once I am certain the flow works as I need, I either auto-terminate these routes or route them out of my process group to an event notification system.
It looks like this:
If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post.
Thanks,
Steven