One of the search use cases I was recently introduced to requires indexing text inside images, such as scanned text in PNG files. I set out to figure out how to do this with Solr. I came across a couple of pretty good blog posts, but as usual, I had to piece together what I learned from several sources before things worked correctly (or at least that's what usually happens for me). So I thought I would write up the steps I took to get it working.
I used HDP Sandbox 2.3.
Install dependencies - this gives you support for processing PNG, JPEG, and TIFF images.
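As a quick sanity check after installing, here is a minimal Python 3 sketch that asks the system whether the image libraries can be found. The short names `png`, `jpeg`, and `tiff` are my assumption about what the installed packages provide, and `find_image_libs` is just an illustrative helper. Note that this only looks for the shared libraries, not the development headers that the leptonica build also needs.

```python
from ctypes.util import find_library

# Assumed short names of the shared libraries (libpng, libjpeg, libtiff).
IMAGE_LIBS = ("png", "jpeg", "tiff")


def find_image_libs(names=IMAGE_LIBS):
    """Map each library name to what the system finds, or None if missing."""
    return {name: find_library(name) for name in names}


if __name__ == "__main__":
    for name, found in find_image_libs().items():
        print("lib{}: {}".format(name, found or "NOT FOUND"))
```

If any of the three comes back as NOT FOUND, I would fix that before building leptonica.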
Ensure the proper variables and paths are set – This is necessary so that the leptonica build can find the dependencies you installed earlier. If the paths are not correct, you will get "Unsupported image type" errors when running the tesseract command-line client.
Also, when installing tesseract, you will place the language data under the TESSDATA_PREFIX directory.
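To confirm the language data ended up somewhere tesseract can see it, something like the following Python 3 sketch can help. It is only a sketch: `find_traineddata` is a hypothetical helper, and I am assuming English data in a file named `eng.traineddata`. Depending on the tesseract version, TESSDATA_PREFIX is treated either as the tessdata directory itself or as its parent, so the sketch looks in both places.

```python
import os


def find_traineddata(lang="eng", env=None):
    """Return the path to <lang>.traineddata under TESSDATA_PREFIX, or None."""
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return None
    filename = lang + ".traineddata"
    # Check the prefix itself first, then a tessdata subdirectory below it.
    candidates = (
        os.path.join(prefix, filename),
        os.path.join(prefix, "tessdata", filename),
    )
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    print(find_traineddata() or "eng.traineddata not found under TESSDATA_PREFIX")
```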
Test Tesseract – Use the image in this blog post. You'll notice that this is where I started. The 'hard' part was getting the leptonica build right, and the problem there was making sure the correct dependencies were installed and available on the path defined above. If this doesn't work, there's no sense moving on to Solr.
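If you would rather script this check than type it by hand, here is a rough Python 3 sketch that wraps the command-line client. It assumes `tesseract` is on your PATH and uses the basic `tesseract <image> <outputbase>` form, which writes the recognized text to `<outputbase>.txt`. The function names and the default output name `out` are mine, purely for illustration.

```python
import subprocess
import sys


def tesseract_command(image_path, output_base):
    """Build the argument list for `tesseract <image> <outputbase>`."""
    return ["tesseract", image_path, output_base]


def ocr_image(image_path, output_base="out"):
    """Run tesseract and return the text it wrote to <output_base>.txt."""
    # check_call raises CalledProcessError if tesseract exits with an error.
    subprocess.check_call(tesseract_command(image_path, output_base))
    with open(output_base + ".txt") as f:
        return f.read()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of the test image as the first argument.
    print(ocr_image(sys.argv[1]))
```

Save it under any name you like and pass the path of the test image as the only argument. If the build is wrong, this is where I would expect the unsupported image type error to show up.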