Extract data from scanned images and PDFs

Please let us know how to extract data from scanned images and PDFs.



You can use Apache Tika and Tess4J. Tika works well for PDFs that were exported from Word or other document formats, because those files carry an embedded text layer. For scanned images, Tess4J, a Java wrapper around the Tesseract OCR engine, gives better text extraction output.
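As a rough sketch of how the two tools could be combined: try the parser first and fall back to OCR when it returns almost no text. The `Extractor` interface, the `MIN_CHARS` threshold, and the stand-in lambdas in `main` are assumptions made for illustration; they are not part of Tika or Tess4J.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractionFallback {

    /** Hypothetical adapter type that wraps any text extractor (Tika, Tess4J, ...). */
    @FunctionalInterface
    public interface Extractor {
        String extract(Path file) throws Exception;
    }

    // Assumed threshold: fewer characters than this suggests the file has no text layer.
    public static final int MIN_CHARS = 20;

    /** Returns true when the parser produced little or no usable text. */
    public static boolean looksEmpty(String text) {
        return text == null || text.trim().length() < MIN_CHARS;
    }

    /** Tries the parser first and falls back to OCR if the result looks empty. */
    public static String extractText(Path file, Extractor parser, Extractor ocr) throws Exception {
        String parsed = parser.extract(file);
        if (!looksEmpty(parsed)) {
            return parsed;
        }
        return ocr.extract(file);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in extractors so the sketch runs without extra libraries.
        Extractor parser = f -> "";                    // simulates a scan with no text layer
        Extractor ocr = f -> "text recognized by OCR"; // simulates the OCR result
        System.out.println(extractText(Paths.get("scan.pdf"), parser, ocr));
    }
}
```

With both libraries on the classpath, the stand-ins could be replaced by `f -> new Tika().parseToString(f.toFile())` for the parser and `f -> new Tesseract().doOCR(f.toFile())` for OCR. Tess4J also needs the Tesseract language data to be installed, and the 20-character cutoff should be tuned against your own documents.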

Once you know how to extract the text, the choice of pipeline depends on how many PDFs arrive and at what ingest rate: you can use either NiFi (which needs a custom processor) or a MapReduce job that calls your code in parallel.
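Before committing to NiFi or MapReduce, it can help to check on a single machine that the extraction code is safe to call in parallel. The sketch below uses a plain Java thread pool as a local stand-in for that parallelism; it is not a NiFi processor or a MapReduce job. The class name, the `extractOne` function, and the file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelExtraction {

    /** Runs extractOne on every file using a fixed pool; results keep the input order. */
    public static List<String> extractAll(List<String> files,
                                          Function<String, String> extractOne,
                                          int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String file : files) {
                Callable<String> task = () -> extractOne.apply(file);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // blocks until that file is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in extractor; a real one would call your Tika or Tess4J code.
        List<String> files = Arrays.asList("a.pdf", "b.pdf", "c.png");
        List<String> texts = extractAll(files, name -> "text of " + name, 2);
        System.out.println(texts);
    }
}
```

If the ingest rate outgrows one machine, the same per-file function is what a custom NiFi processor or a MapReduce mapper would call.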

Thanks @Ravi Mutyala. If possible, could you please share any documentation or a link that shows how to achieve this?