One of the search use cases that I’ve been introduced to requires the
ability to index text in images, such as scanned text in PNG files. I set out
to figure out how to do this with Solr. I came across a couple of pretty good
blog posts, but as usual, you have to put together what you learn from multiple
sources before you can get things to work correctly (or at least that’s what
usually happens for me). So I thought I would write up the steps I took to
get it to work.
I used HDP Sandbox 2.3.
Install dependencies - this gives you support for processing PNGs, JPEGs, and TIFFs
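As a rough sketch, this step might look like the following on a CentOS-based sandbox. The package names and the helper name install_image_deps are assumptions for illustration, not necessarily the exact set used here, so adjust them for your distribution.

```shell
# Assumed development packages for PNG, JPEG, and TIFF support (CentOS naming).
IMAGE_DEV_PACKAGES="libpng-devel libjpeg-devel libtiff-devel zlib-devel"

# Hypothetical helper: installs the image libraries Leptonica links against.
# Must be run as root. The variable is left unquoted on purpose so that each
# package name is passed to yum as a separate argument.
install_image_deps() {
    yum install -y $IMAGE_DEV_PACKAGES
}
```

Run install_image_deps as root before building Leptonica, so the build picks up the image libraries.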
Ensure the proper variables and paths are set - this is necessary so that when building Leptonica, the build can find the dependencies that you installed earlier. If the paths are not correct, you will get "Unsupported image type" errors when running the tesseract command-line client.
Also, when installing Tesseract, you will place the language data in the TESSDATA_PREFIX directory.
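A minimal sketch of the environment setup, assuming everything was installed under /usr/local (the PREFIX_DIR variable and all of the paths below are assumptions, so substitute your own locations):

```shell
# Assumption: image libraries, Leptonica, and Tesseract live under /usr/local.
PREFIX_DIR="${PREFIX_DIR:-/usr/local}"

# Let the linker and pkg-config find the libraries installed under the prefix.
export LD_LIBRARY_PATH="$PREFIX_DIR/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PKG_CONFIG_PATH="$PREFIX_DIR/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"

# Read by the Tesseract 3.x configure script to locate the Leptonica headers.
export LIBLEPT_HEADERSDIR="$PREFIX_DIR/include"

# Tesseract 3.x expects this to be the parent directory of "tessdata".
export TESSDATA_PREFIX="$PREFIX_DIR/share/"
```

With these values, the language files would go in /usr/local/share/tessdata. Put the exports in your shell profile if you want them to survive a new login.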
[root@sandbox tesseract-ocr]# /usr/local/bin/tesseract ~/OM_1.jpg ~/OM_out
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
[root@sandbox tesseract-ocr]# cat ~/OM_out.txt
‘ '"I“ " "' ./lrast. Shortly before the classes started I was visiting a.certain public school, a school set in a typically Englishcountryside, which on the June clay of my visit was wonder-fully beauliful. The Head Master—-no less typical than hisschool and the country-side—pointed out the charms ofboth, and his pride came out in the ?nal remark which he madebeforehe left me. He explained that he had a class to takein'I'heocritus. Then (with a. buoyant gesture); “ Can you, conceive anything more delightful than a class in Theocritus,on such a day and in such a place?"
If you have text in your out file, then you’ve done it correctly!
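If you want to script that check, here is a small sketch. The function name ocr_output_ok is made up for illustration; it only confirms the file contains at least one letter or digit, not that the OCR is accurate.

```shell
# Hypothetical helper: succeeds when the file contains at least one
# alphanumeric character, fails if it is missing, empty, or whitespace only.
ocr_output_ok() {
    grep -q '[[:alnum:]]' "$1" 2>/dev/null
}
```

For example: ocr_output_ok ~/OM_out.txt && echo "OCR produced text"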
Try it again
Use another PNG or other supported file type. Be sure to use the same request handler params, but provide a new, unique literal.id.
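Here is a sketch of what such a request could look like with curl. The Solr URL, the collection name ocr_test, and the helper names are assumptions, and the params follow the standard Solr Cell example (uprefix=attr_, fmap.content=attr_content), which may differ from the ones you configured.

```shell
# Assumptions: Solr base URL and collection name are placeholders.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
COLLECTION="${COLLECTION:-ocr_test}"

# Build the extract handler URL for a given unique document id.
extract_url() {
    printf '%s/%s/update/extract?literal.id=%s&uprefix=attr_&fmap.content=attr_content&commit=true' \
        "$SOLR_URL" "$COLLECTION" "$1"
}

# Post an image file to Solr under the given id.
# Needs a running Solr with the extracting request handler enabled.
index_image() {
    curl -s "$(extract_url "$1")" -F "myfile=@$2"
}
```

For example: index_image doc2 ~/OM_2.png (the id and file name are placeholders). Only literal.id changes between documents; the rest of the params stay the same.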
Note that attr_content is a dynamic field, and it cannot be highlighted. If you figure out how to add an indexed and stored field to hold the image text, let me know :)