
NiFi: How to extract HTML elements innerHTML text from a html flow file. (Web scraping using Apache NiFi)

New Contributor

I am using the GetHTTP processor to fetch an HTML page from a given URL.

 

[Screenshots: Screen Shot 2020-09-06 at 5.36.04 PM.png, Screen Shot 2020-09-06 at 5.49.46 PM.png]

I want to grab the innerHTML text of the element matched by a particular class selector. I tried to do this with the GetHTMLElement processor, but it produces N flowFiles. The selector I am passing occurs only once in the HTML, yet I get the result N times. Does anyone know why?

In the screenshots below, the first 10 entries are unique and the remaining ones are redundant.

 

[Screenshots: Screen Shot 2020-09-06 at 5.36.29 PM.png, Screen Shot 2020-09-06 at 5.48.03 PM.png, Screen Shot 2020-09-06 at 5.37.12 PM.png, Screen Shot 2020-09-06 at 5.37.24 PM.png]

 

My requirement is to fetch a web page's HTML, extract the required innerHTML text by passing a CSS selector or XPath, build a JSON document from it, and insert that into MongoDB.
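For reference, the extraction step in this requirement can be sketched outside NiFi in plain Python. This is only an illustration of the intended result, not how GetHTMLElement works internally. The class name `title`, the sample HTML, and the helper `ClassTextExtractor` are all hypothetical. It collects the text content of elements carrying the class (tags inside the element are dropped) and wraps it in JSON. The MongoDB insert is left out because it needs a database driver.

```python
import json
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collects the text inside every element that carries a given CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # nesting depth inside a matching element (0 = outside)
        self.buffer = []    # text fragments of the element currently being read
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


# Hypothetical sample page; in the flow this would be the fetched HTML.
html = ('<html><body><h1 class="title main">Hello <b>NiFi</b></h1>'
        '<p class="other">skip<br>me</p></body></html>')

parser = ClassTextExtractor("title")
parser.feed(html)

# JSON document that a later step could insert into MongoDB.
document = json.dumps({"selector": ".title", "values": parser.results})
print(document)
```

A selector that matches once yields exactly one entry in `values`, which is the behaviour expected from the NiFi flow as well.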

 

I am facing an issue while extracting the innerHTML text with a CSS selector. How can I achieve this? I have searched a lot but did not find a proper solution. I would really appreciate your help.

2 REPLIES

Re: NiFi: How to extract HTML elements innerHTML text from a html flow file. (Web scraping using Apache NiFi)

Contributor

Hi @Manoj90 ,

Please terminate the "original" relationship on GetHTMLElement. The reason you are getting so many flowFiles is as follows:

On the first run, the GetHTMLElement processor sends 10 flowFiles to the "success" relationship. Because the operation succeeded, the input flowFile (10.79 KB) is also routed to the "original" relationship. The same file is then available as input to GetHTMLElement again, so the processor runs once more, sending another 10 flowFiles to "success" and the input flowFile back to "original".

This loop continues indefinitely until you run into memory issues.
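The loop described above can be illustrated with a small toy model in Python. This is not NiFi code. It assumes the "original" relationship is connected back to the processor's input, and the function name `simulate`, the 10 matches per run, and the cap of 5 runs are hypothetical choices that keep the example finite.

```python
from collections import deque


def simulate(loop_original_back, matches_per_run=10, max_runs=5):
    """Toy model of one processor with 'success' and 'original' outputs.

    Returns (runs, total flowFiles in success, unique flowFiles in success).
    """
    queue = deque(["page.html"])   # input queue holding the fetched page
    success = []
    runs = 0
    while queue and runs < max_runs:
        flowfile = queue.popleft()
        runs += 1
        # Every run extracts the same matches from the same input file.
        success.extend(f"match-{i}" for i in range(matches_per_run))
        if loop_original_back:
            # 'original' feeds the input queue again, so the run repeats.
            queue.append(flowfile)
        # Otherwise 'original' is terminated and the input file is dropped.
    return runs, len(success), len(set(success))


print(simulate(loop_original_back=True))   # (5, 50, 10)
print(simulate(loop_original_back=False))  # (1, 10, 10)
```

With the loop in place only 10 of the 50 results are unique, which matches the pattern in the question: the first 10 entries are unique and the rest are repeats. Without the run cap, the looped case would never stop.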

Re: NiFi: How to extract HTML elements innerHTML text from a html flow file. (Web scraping using Apache NiFi)

@Manoj90 In addition to @PVVK's point, you need to be careful when routing relationships back to the originating processor. During development I like to use an output port that I call End of Line, or EOL1, EOL2, EOL3 when I need more of them in larger flows. This lets me evaluate whether something goes to fail, retry, etc. Later, once I am certain the flow is working as I need, I either auto-terminate these routes or route them out of my process group to an Event Notification system.

 

It looks like this:

 

[Image: Using an output port to hold unneeded routes during testing]

 

 

If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post.

 

Thanks,


Steven

 


 


