Code Repositories
Cloudera Employee
Repo Description

This project works around the inefficiency of processing many small files in Hadoop. It also enables processing and analysis of binary documents in Hadoop by integrating Apache Tika into a MapReduce job.
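
To make the small-files point concrete, here is a rough sketch in plain Java (no Hadoop dependency). It assumes the usual default where every non-empty input file yields at least one input split, and compares that with greedily packing files into combined splits, in the spirit of Hadoop's CombineFileInputFormat. The class name, method names, and the 128 MB limit are hypothetical choices for illustration; whether the repo uses this exact technique is an assumption, not something stated in the post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not taken from the tika-hadoop-mapreduce repo.
final class SmallFileSplits {

    // One split per small file: the number of map tasks equals the number of files.
    static int splitsPerFile(long[] fileSizes) {
        return fileSizes.length;
    }

    // Greedily pack files, in order, into combined splits of at most maxSplitBytes.
    // In this simplified model a file larger than the limit gets a split of its own.
    static List<List<Long>> packIntoSplits(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] files = new long[1000];
        Arrays.fill(files, mb); // 1000 files of 1 MB each

        System.out.println(splitsPerFile(files));                   // prints 1000
        System.out.println(packIntoSplits(files, 128 * mb).size()); // prints 8
    }
}
```

In this toy model, 1000 one-megabyte files need 1000 map tasks when each file is its own split, but only 8 when packed into splits of up to 128 MB, which is where the per-task startup overhead is saved.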

Repo Info
GitHub repo URL: https://github.com/ppruski/tika-hadoop-mapreduce
GitHub account name: ppruski
Repo name: tika-hadoop-mapreduce
Comments
Cloudera Employee

Hi Piotr, great idea to share this repo. Could you expand or edit the post to add a brief description of the use case? If I don't know what Tika is, what would cause this post to show up in my search? "Processing and analysis of binary documents" is vaguer than describing what Tika would help with: OCR, full-text extraction, text scanning, recognition, imagery, etc.

Just my thoughts... Thanks!