I wanted to be able to grab every image from a page, since I have some web sites whose images I want to back up. So I added a processor that uses JSoup to do just that.
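The core of the JSoup extraction looks something like the sketch below. This is not the processor's actual source, just a minimal standalone illustration of pulling image URLs out of HTML with JSoup; the class and method names are my own.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ImageUrlExtractor {

    // Extract absolute image URLs from an HTML string,
    // resolving relative src attributes against baseUri.
    static List<String> extractImageUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {
            // "abs:src" asks JSoup to resolve the URL against the base URI
            String src = img.attr("abs:src");
            if (!src.isEmpty()) {
                urls.add(src);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<img src='/photos/cat.jpg'>"
                + "<img src='https://example.com/dog.png'>"
                + "<img alt='no source here'>"
                + "</body></html>";
        for (String url : extractImageUrls(html, "https://example.com")) {
            System.out.println(url);
        }
    }
}
```

In the real processor the same selection logic would run against the FlowFile content instead of a hard-coded string.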
Once you download the NAR from GitHub, deploy it to your /usr/hdf/current/nifi/lib directory, and restart Apache NiFi, you will have a new processor: ImageProcessor, listed at version 1.6.0.
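The deployment amounts to something like the following; the NAR filename here is a placeholder, so substitute whatever the GitHub release is actually named:

```shell
# Copy the downloaded NAR into NiFi's lib directory (filename is a placeholder)
sudo cp nifi-imageprocessor-nar-1.6.0.nar /usr/hdf/current/nifi/lib/

# Restart NiFi so it discovers the new processor on startup
sudo /usr/hdf/current/nifi/bin/nifi.sh restart
```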
You can examine and test the Java source code if you wish.
Here is an example flow: grab all the images from a pixabay URL, filter out the empty images, then split into individual image URLs. We pull out each tag and download those images. If an image is not blank or too small, I route it to TensorFlow to run Inception on it. I extract the image metadata and then send everything to my production cluster, which stores the image in an object store and the metadata in a Hive table.
Our routing to filter out small and blank images
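That routing step can be sketched as a user-defined property on a standard RouteOnAttribute processor. The attribute names `image.width` and `image.height` are assumptions about what the upstream extraction step emits; `fileSize` is a NiFi core attribute:

```
# RouteOnAttribute user-defined property (NiFi Expression Language)
# Matches only FlowFiles whose image is non-trivially sized
usable.images: ${fileSize:gt(1024):and(${image.width:gt(50)}):and(${image.height:gt(50)})}
```

FlowFiles that do not match fall through to the unmatched relationship, which can be auto-terminated to drop the blanks.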
A pretty basic flow to process. I use my custom Attribute Cleaner to clean up the names and make all the attribute names compliant with Apache Avro naming rules.
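Avro names must match `[A-Za-z_][A-Za-z0-9_]*`, so cleaning boils down to replacing illegal characters and guarding the first character. This is a guess at the kind of logic the Attribute Cleaner applies, not its actual code:

```java
public class AvroNameCleaner {

    // Make a string a legal Avro name: [A-Za-z_][A-Za-z0-9_]*
    static String cleanName(String name) {
        // Replace every character outside [A-Za-z0-9_] with an underscore
        String cleaned = name.replaceAll("[^A-Za-z0-9_]", "_");
        // Avro names cannot be empty or start with a digit
        if (cleaned.isEmpty() || Character.isDigit(cleaned.charAt(0))) {
            cleaned = "_" + cleaned;
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(cleanName("image.width"));  // image_width
        System.out.println(cleanName("2nd-value"));    // _2nd_value
    }
}
```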
Some of the useful metadata is pulled from the image itself. Note the Height and Width, which are very useful for filtering.
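Height and width can be read with plain Java before any heavier processing. A minimal sketch using the standard `javax.imageio` API (not the actual metadata extraction the flow uses):

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ImageDimensions {

    // Decode the image and return {width, height} in pixels
    static int[] dimensions(InputStream in) throws IOException {
        BufferedImage img = ImageIO.read(in);
        return new int[]{img.getWidth(), img.getHeight()};
    }

    public static void main(String[] args) throws IOException {
        // Build a synthetic 640x480 PNG in memory to demonstrate
        BufferedImage sample = new BufferedImage(640, 480, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ImageIO.write(sample, "png", baos);

        int[] dims = dimensions(new ByteArrayInputStream(baos.toByteArray()));
        System.out.println(dims[0] + "x" + dims[1]); // prints 640x480
    }
}
```

In a NiFi context the InputStream would come from the FlowFile content, and the dimensions would be written back as attributes for the routing step.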