Extract data from scanned image and pdf

New Contributor

Please let us know how to extract data from scanned image and pdf.

2 REPLIES

Re: Extract data from scanned image and pdf

Guru

You can use Apache Tika and Tess4J. Tika works well for PDFs that were exported from Word or other documents, since those carry an embedded text layer. For scanned images, Tess4J, a Java wrapper around the Tesseract OCR engine, gives better text extraction output.
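As a rough illustration of that split, here is a minimal Python sketch of the routing decision only. It is not Tika or Tess4J code: `extract_text`, `IMAGE_EXTENSIONS`, and `min_chars` are hypothetical names, and the two extractor arguments are placeholders you would back with your own wrappers (for example around Tika's `Tika.parseToString` and Tess4J's `Tesseract.doOCR` on the Java side). The "fall back to OCR when the text layer is nearly empty" rule is an assumption added here, not something stated above.

```python
from typing import Callable

# Hypothetical list of file types that are always treated as scanned images.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff")


def extract_text(
    path: str,
    text_layer_extractor: Callable[[str], str],
    ocr_extractor: Callable[[str], str],
    min_chars: int = 20,
) -> str:
    """Pick an extractor for one file.

    text_layer_extractor: placeholder for a Tika-style parser that reads
        the embedded text of an exported PDF.
    ocr_extractor: placeholder for a Tesseract-style OCR call.
    min_chars: assumed threshold below which a PDF is treated as scanned.
    """
    # Image files have no text layer, so go straight to OCR.
    if path.lower().endswith(IMAGE_EXTENSIONS):
        return ocr_extractor(path)

    # For PDFs, try the embedded text first.
    text = text_layer_extractor(path)
    if len(text.strip()) >= min_chars:
        return text

    # Little or no embedded text: assume a scanned PDF and use OCR.
    return ocr_extractor(path)
```

The threshold is a heuristic, so it is worth tuning against a few of your own documents before relying on it.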

Once you know how to extract the text, you can choose between NiFi (which needs a custom processor) and a MapReduce job that calls your code in parallel, depending on how many PDFs land and at what ingest rate.
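To show the "call your code in parallel" idea on a small scale, here is a hedged single-machine sketch using Python's standard `concurrent.futures`. It is only a stand-in for a NiFi processor or a MapReduce job, not an implementation of either. `extract_batch` and `extract_one` are hypothetical names; `extract_one` is whatever per-file extraction function you already have.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional, Tuple

Result = Tuple[Optional[str], Optional[str]]  # (text, error message)


def extract_batch(
    paths: Iterable[str],
    extract_one: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, Result]:
    """Run extract_one over many files concurrently.

    Returns a dict mapping each path to (text, None) on success
    or (None, error message) on failure, in input order.
    """
    path_list = list(paths)

    def safe_extract(path: str) -> Result:
        # Catch per-file errors so one bad document does not stop the batch.
        try:
            return extract_one(path), None
        except Exception as exc:
            return None, str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as path_list.
        results = list(pool.map(safe_extract, path_list))

    return dict(zip(path_list, results))
```

Threads are used here on the assumption that the real work happens in an external OCR process or service. At higher ingest rates the same per-file function would be what a cluster job or a custom processor calls.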

Re: Extract data from scanned image and pdf

New Contributor

Thanks @Ravi Mutyala. If possible, could you please share any documentation or a link that shows how to achieve this?