Created on 09-21-2015 05:32 PM - edited 09-16-2022 01:32 AM
One of the search use cases I’ve been introduced to requires the ability to index text contained in images, such as scanned text in PNG files. I set out to figure out how to do this with Solr. I came across a couple of pretty good blog posts, but as usual, you have to put together what you learn from multiple sources before things work correctly (or at least that’s what usually happens for me). So I thought I would write up the steps I took to get it working.
I used HDP Sandbox 2.3.
yum install autoconf automake libtool
yum install libpng-devel
yum install libjpeg-devel
yum install libtiff-devel
yum install zlib-devel
wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
Also, when installing Tesseract, you will place the language data in the tessdata directory under TESSDATA_PREFIX.
[root@sandbox tesseract-ocr]# cat ~/.profile
export TESSDATA_PREFIX='/usr/local/share/'
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/lib64
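Download the leptonica-1.69.tar.gz source archive into the current directory before running the next step.
tar xvf leptonica-1.69.tar.gz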
cd leptonica-1.69
./configure
make
sudo make install
tar xvf tesseract-ocr-3.02.02.tar.gz
cd tesseract-ocr
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
tar xzf tesseract-ocr-3.02.eng.tar.gz
cp tesseract-ocr/tessdata/* /usr/local/share/tessdata
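Before running Tesseract, it can help to confirm that the language data landed where TESSDATA_PREFIX points. The following is a minimal sketch, not part of the original steps. It assumes the English package provides a file named eng.traineddata, and the function name has_tessdata is made up for this example.

```shell
#!/bin/sh
# has_tessdata PREFIX [LANG]
# Succeeds if PREFIX/tessdata/LANG.traineddata exists and is non-empty.
# The function name is hypothetical; the file name eng.traineddata is an
# assumption about the contents of the English language package.
has_tessdata() {
    prefix="${1%/}"      # drop a trailing slash, e.g. /usr/local/share/
    lang="${2:-eng}"     # default to English when no language is given
    [ -s "$prefix/tessdata/$lang.traineddata" ]
}
```

For example: has_tessdata "$TESSDATA_PREFIX" eng && echo "language data found"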
[root@sandbox tesseract-ocr]# /usr/local/bin/tesseract ~/OM_1.jpg ~/OM_out
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
[root@sandbox tesseract-ocr]# cat ~/OM_out.txt
‘ '"I“ " "' ./
lrast. Shortly before the classes started I was visiting a.
certain public school, a school set in a typically English
countryside, which on the June clay of my visit was wonder-
fully beauliful. The Head Master—-no less typical than his
school and the country-side—pointed out the charms of
both, and his pride came out in the ?nal remark which he made
beforehe left me. He explained that he had a class to take
in'I'heocritus. Then (with a. buoyant gesture); “ Can you
, conceive anything more delightful than a class in Theocritus,
on such a day and in such a place?"
If you have text in your output file, then you’ve done it correctly!
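If you have more than one scan to check, the same command can be run in a loop. This is a rough sketch under a few assumptions: the images are .png or .jpg files in one directory, English is the language (the -l eng option), and the TESSERACT variable and the function names out_base and ocr_dir are invented for this example.

```shell
#!/bin/sh
# out_base IMAGE: print the output base name to hand to Tesseract.
# Tesseract appends .txt itself, so ~/OM_1.jpg maps to ~/OM_1.
out_base() {
    printf '%s\n' "${1%.*}"
}

# ocr_dir DIR: run Tesseract on every .png and .jpg file in DIR.
# TESSERACT may be set to override the binary; it defaults to the
# install path used in the command above.
ocr_dir() {
    for img in "$1"/*.png "$1"/*.jpg; do
        [ -f "$img" ] || continue    # skip glob patterns that matched nothing
        "${TESSERACT:-/usr/local/bin/tesseract}" "$img" "$(out_base "$img")" -l eng
    done
}
```

Calling ocr_dir ~/scans would write one .txt file next to each image in ~/scans.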
https://wiki.apache.org/solr/ExtractingRequestHandler
cd /opt/lucidworks-hdpsearch/solr/bin/
./solr -e dih
Go back to the blog post or to the RequestHandler page for the proper update/extract command syntax.
literal.id=d1&uprefix=attr_&fmap.content=attr_content&commit=true
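Understanding all the parameters is a process of its own, but the literal.id is the unique id for the document. The uprefix=attr_ parameter prefixes any extracted field that is not defined in the schema with attr_, and fmap.content=attr_content maps the extracted body text to the attr_content field. For more information on this command, start by reviewing https://wiki.apache.org/solr/ExtractingRequestHandler and then the Solr documentation.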
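Put together with curl, the request might look like the sketch below. It follows the multipart upload form shown on the ExtractingRequestHandler wiki page. The host and the tika core name are taken from the select URL used in this article and are assumptions about your setup, and the function names and the form field name myfile are arbitrary.

```shell
#!/bin/sh
# Base URL of the core. Host and core name are assumptions; adjust as needed.
SOLR_CORE_URL="${SOLR_CORE_URL:-http://sandbox.hortonworks.com:8983/solr/tika}"

# extract_url ID: build the update/extract URL for a document whose unique id is ID.
# ID is inserted as-is, so it should not contain characters that need URL encoding.
extract_url() {
    printf '%s/update/extract?literal.id=%s&uprefix=attr_&fmap.content=attr_content&commit=true\n' \
        "$SOLR_CORE_URL" "$1"
}

# index_image ID FILE: post FILE to Solr as a multipart form upload.
index_image() {
    curl "$(extract_url "$1")" -F "myfile=@$2"
}
```

For example, index_image d1 ~/OM_1.png would send the image with literal.id=d1.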
http://sandbox.hortonworks.com:8983/solr/tika/select?q=attr_content%3Aexplained&wt=json&indent=true
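The same query can be run from the command line. A small sketch, assuming the search term is a single plain word so that only the colon needs URL encoding (%3A); the function name select_url is made up, and the host and core default to the ones in the URL above.

```shell
#!/bin/sh
# select_url TERM: build a select URL that searches attr_content for TERM.
# TERM is assumed to be one plain word with no characters that need encoding.
select_url() {
    printf '%s/select?q=attr_content%%3A%s&wt=json&indent=true\n' \
        "${SOLR_CORE_URL:-http://sandbox.hortonworks.com:8983/solr/tika}" "$1"
}
```

For example: curl "$(select_url explained)"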
Use another PNG or other supported file type. Be sure to use the same request handler parameters, except provide a new unique literal.id.
Note that attr_content is a dynamic field, and it cannot be highlighted. If you figure out how to add an indexed and stored field to hold the image text, let me know 🙂
Created on 09-21-2015 05:44 PM
great article.
Created on 10-29-2015 10:01 AM
This needs to be an official blog