I’m constantly amazed by how much I can do with Apache NiFi in so few steps. I often challenge myself by saying, “Self, I bet you couldn’t do X with NiFi.” My confidence was shaken yesterday on a long flight back from Peru to Atlanta when I realized I couldn’t perform OCR-type tasks with NiFi as it stands today. Perturbed by this fact, I set out to come up with a solution. Ultimately this led me to create a NiFi Tesseract processor for performing OCR tasks natively from within Apache NiFi. It wasn’t until I was finished that I realized how useful this processor could be. The NiFi Tesseract processor would give me the ability to read anything from handwritten doctors’ notes from healthcare systems to scanned images of children’s books.
In fact, I chose to demonstrate the latter by showing how to use Apache NiFi to perform OCR on an excerpt from Dr. Seuss’s “Cat in the Hat” and then feed the resulting text from the NiFi Tesseract processor to the Mac OS X “say” command, which reads the output aloud. I have included a screen recording that shows Apache NiFi reading in a page from Cat in the Hat and then reading the results aloud.
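To make the pipeline concrete outside of NiFi, here is a minimal Python sketch of the same two steps: OCR an image, then speak the text. It is not the processor's code. It assumes the `tesseract` command-line tool is installed and on the PATH, and that it runs on a Mac where `say` is available. The function names are made up for illustration.

```python
import shutil
import subprocess


def ocr_command(image_path):
    # `tesseract <image> stdout` prints the recognized text to standard output.
    return ["tesseract", image_path, "stdout"]


def ocr_and_speak(image_path):
    # Assumption: both tools are on PATH; fail early with a clear message if not.
    for tool in ("tesseract", "say"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")

    # Step 1: run OCR and capture the recognized text.
    result = subprocess.run(
        ocr_command(image_path), capture_output=True, text=True, check=True
    )
    text = result.stdout.strip()

    # Step 2: hand the text to `say` on standard input so it is read aloud.
    if text:
        subprocess.run(["say"], input=text, text=True, check=True)
    return text
```

Calling `ocr_and_speak("page.png")` (a hypothetical file name) would do in one script what the NiFi flow does with separate processors, minus everything NiFi adds around routing and monitoring.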