One of the search use cases I've been introduced to requires the ability to index text from images, such as scanned text in PNG files. I set out to figure out how to do this with SOLR. I came across a couple of pretty good blog posts, but, as usual, you have to put together what you learn from multiple sources before you can get things to work correctly (or at least that's what usually happens for me). So I thought I would write up the steps I took to get it working.

I used HDP Sandbox 2.3.

Step-by-step guide

  1. Install dependencies - these provide support for processing PNGs, JPEGs, and TIFFs

    yum install autoconf automake libtool

    yum install libpng-devel

    yum install libjpeg-devel

    yum install libtiff-devel

    yum install zlib-devel

  2. Download Leptonica, an image processing library

    wget http://www.leptonica.org/source/leptonica-1.69.tar.gz

  3. Download Tesseract, an Optical Character Recognition engine

    wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz

  4. Ensure proper variables and paths are set – This is necessary so that when building Leptonica, the build can find the dependencies you installed earlier. If these paths are not correct, you will get "Unsupported image type" errors when running the Tesseract command-line client.

    Also, when installing Tesseract, you will place the language data in the TESSDATA_PREFIX directory.

    [root@sandbox tesseract-ocr]# cat ~/.profile

    export TESSDATA_PREFIX='/usr/local/share/'

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/lib64
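Before building, it can help to confirm that the image libraries are actually resolvable by the dynamic loader. A quick sanity-check sketch (not part of the original steps) using Python's standard library:

```python
from ctypes.util import find_library

# Names of the shared libraries the Leptonica build links against.
# find_library returns the library's soname if the loader can resolve it,
# or None if the matching -devel package/path is missing.
def check_libraries(names):
    return {name: find_library(name) for name in names}

if __name__ == "__main__":
    for name, found in check_libraries(["png", "jpeg", "tiff", "z"]).items():
        status = found if found else "NOT FOUND - expect 'Unsupported image type' errors"
        print(f"{name}: {status}")
```

If any of these come back as not found, revisit step 1 and the LD_LIBRARY_PATH above before building.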

  5. Build Leptonica

    tar xvf leptonica-1.69.tar.gz

    cd leptonica-1.69

    ./configure

    make

    sudo make install

  6. Build Tesseract

    tar xvf tesseract-ocr-3.02.02.tar.gz

    cd tesseract-ocr

    ./autogen.sh

    ./configure

    make

    sudo make install

    sudo ldconfig

  7. Download the Tesseract language pack(s) and place them in the TESSDATA_PREFIX directory defined above

    wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz

    tar xzf tesseract-ocr-3.02.eng.tar.gz

    cp tesseract-ocr/tessdata/* /usr/local/share/tessdata
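A small illustrative sketch (not from the original post) of where Tesseract expects a language's data file to end up, given the TESSDATA_PREFIX set in step 4:

```python
import os

def traineddata_path(tessdata_prefix, lang):
    """Path where tesseract looks for a language's data file,
    given the TESSDATA_PREFIX exported in ~/.profile."""
    return os.path.join(tessdata_prefix, "tessdata", lang + ".traineddata")

# With TESSDATA_PREFIX='/usr/local/share/', English data must end up at:
print(traineddata_path("/usr/local/share/", "eng"))
# /usr/local/share/tessdata/eng.traineddata
```

If `eng.traineddata` is not at that exact path, Tesseract will fail to find the language.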

  8. Test Tesseract – Use the image from the blog post below. You'll notice that this is where I started. The 'hard' part of this was getting the Leptonica build correct, and the problem there was ensuring that I had the correct dependencies installed and that they were available on the path defined above. If this doesn't work, there's no sense moving on to SOLR.

    http://blog.thedigitalgroup.com/vijaym/2015/07/17/using-solr-and-tikaocr-to-search-text-inside-an-im...

    [root@sandbox tesseract-ocr]# /usr/local/bin/tesseract ~/OM_1.jpg ~/OM_out

    Tesseract Open Source OCR Engine v3.02.02 with Leptonica

    [root@sandbox tesseract-ocr]# cat ~/OM_out.txt

    ‘ '"I“ " "' ./

    lrast. Shortly before the classes started I was visiting a.

    certain public school, a school set in a typically English

    countryside, which on the June clay of my visit was wonder-

    fully beauliful. The Head Master—-no less typical than his

    school and the country-side—pointed out the charms of

    both, and his pride came out in the ?nal remark which he made

    beforehe left me. He explained that he had a class to take

    in'I'heocritus. Then (with a. buoyant gesture); “ Can you

    , conceive anything more delightful than a class in Theocritus,

    on such a day and in such a place?"

    If you have text in your output file, then you've done it correctly!
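The same invocation can be wrapped from a script. A sketch (not part of the original article), assuming the binary path and file locations from the step above; note that tesseract appends .txt to the output base itself:

```python
import subprocess

def tesseract_cmd(image_path, out_base, binary="/usr/local/bin/tesseract"):
    # tesseract writes its result to out_base + ".txt"
    return [binary, image_path, out_base]

def ocr_image(image_path, out_base):
    # Run the OCR engine and return the recognized text.
    subprocess.run(tesseract_cmd(image_path, out_base), check=True)
    with open(out_base + ".txt") as f:
        return f.read()
```

For example, `ocr_image("/root/OM_1.jpg", "/root/OM_out")` reproduces the command-line run shown above.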

  9. Start the Solr sample – This sample is configured with the ExtractingRequestHandler for processing with Tika

    https://wiki.apache.org/solr/ExtractingRequestHandler

    cd /opt/lucidworks-hdpsearch/solr/bin/

    ./solr -e dih

  10. Use SOLR Admin to upload the image

    Go back to the blog post or to the RequestHandler page for the proper update/extract command syntax.

    1. From the SOLR admin UI, select the tika core.
    2. Click Documents.
    3. In the Request-Handler (qt) field, enter /update/extract
    4. In the Document Type drop-down, select File Upload.
    5. Choose the PNG file.
    6. In the Extracting Req. Handler Params box, enter the following:

    literal.id=d1&uprefix=attr_&fmap.content=attr_content&commit=true

    Understanding all of the parameters is a separate exercise, but literal.id is the unique ID for the document. For more information on this command, start by reviewing https://wiki.apache.org/solr/ExtractingRequestHandler and then the SOLR documentation.
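The Admin UI steps above amount to an HTTP POST against /update/extract. As a rough sketch (not part of the original article; the host and core name are taken from the sandbox query URL later in this post, and the content type assumes a PNG), the same upload can be done with Python's standard library:

```python
import urllib.parse
import urllib.request

SOLR = "http://sandbox.hortonworks.com:8983/solr/tika"

def extract_url(doc_id):
    # Same parameters entered in the Extracting Req. Handler Params box.
    params = {
        "literal.id": doc_id,
        "uprefix": "attr_",
        "fmap.content": "attr_content",
        "commit": "true",
    }
    return SOLR + "/update/extract?" + urllib.parse.urlencode(params)

def index_image(doc_id, image_path):
    # POST the raw image bytes; Tika/Tesseract extract the text server-side.
    with open(image_path, "rb") as f:
        req = urllib.request.Request(
            extract_url(doc_id),
            data=f.read(),
            headers={"Content-Type": "image/png"},
        )
    return urllib.request.urlopen(req)
```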

  11. Run a query
    1. From the SOLR admin UI, select the tika core.
    2. Click Query.
    3. In the q field, type attr_content:explained
    4. Execute the query.

    http://sandbox.hortonworks.com:8983/solr/tika/select?q=attr_content%3Aexplained&wt=json&indent=true
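The same select query can also be built programmatically. A small sketch using only the standard library, with the host and core taken from the URL above:

```python
import urllib.parse

def select_url(query, core_url="http://sandbox.hortonworks.com:8983/solr/tika"):
    # Mirrors the admin UI query: q, JSON response writer, indented output.
    params = {"q": query, "wt": "json", "indent": "true"}
    return core_url + "/select?" + urllib.parse.urlencode(params)

print(select_url("attr_content:explained"))
```

urlencode takes care of escaping the colon in the field query, producing the same URL the admin UI generates.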

  12. Try it again

    Use another PNG or another supported file type. Be sure to use the same Request Handler Params, but provide a new, unique literal.id.

    Note that attr_content is a dynamic field and cannot be highlighted. If you figure out how to add an indexed and stored field to hold the image text, let me know 🙂
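One possible (untested) approach for the stored-field question: declare an explicit indexed and stored field in the core's schema.xml and point Tika's extracted content at it via fmap.content. The field name ocr_text below is made up for illustration:

```xml
<!-- Hypothetical addition to the tika core's schema.xml: an explicit
     indexed + stored field to hold the OCR'd text. -->
<field name="ocr_text" type="text_general" indexed="true" stored="true"/>
```

Then pass fmap.content=ocr_text instead of fmap.content=attr_content in the Extracting Req. Handler Params, so the extracted text lands in a stored field.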

Comments
Super Collaborator

great article.

Master Mentor

This needs to be an official blog

Version history
Last update: 09-16-2022 01:32 AM