This project addresses the inefficiency of processing many small files in Hadoop. It also enables processing and analysis of binary documents in Hadoop by integrating Apache Tika into a MapReduce job.
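To give a rough sense of why many small files are costly, here is a minimal sketch in plain Java (no Hadoop dependency). It assumes the common case where each small file becomes its own input split, and so its own map task, and compares that with greedily packing files into combined splits, which is the general idea behind Hadoop's `CombineFileInputFormat`. The class name, method names, file sizes, and the 128 MiB limit are illustrative assumptions, and this is not necessarily how the project itself groups files.

```java
import java.util.Arrays;

public class SmallFilesSketch {

    // Assumption: every small file yields its own input split,
    // so the number of map tasks equals the number of files.
    public static int splitsPerFile(long[] fileSizes) {
        return fileSizes.length;
    }

    // Greedily pack files, in order, into combined splits that hold
    // at most maxSplitBytes each. A simplified model, not Hadoop's code.
    public static int combinedSplits(long[] fileSizes, long maxSplitBytes) {
        int splits = 0;
        long current = 0;
        for (long size : fileSizes) {
            // Close the current split if this file would overflow it.
            if (current > 0 && current + size > maxSplitBytes) {
                splits++;
                current = 0;
            }
            current += size;
        }
        // Count the last, partially filled split.
        if (current > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] sizes = new long[10_000];
        Arrays.fill(sizes, 100L * 1024);        // hypothetical: 10,000 files of 100 KiB
        long maxSplit = 128L * 1024 * 1024;     // hypothetical: 128 MiB per combined split

        System.out.println(splitsPerFile(sizes));            // prints 10000
        System.out.println(combinedSplits(sizes, maxSplit)); // prints 8
    }
}
```

Under these assumed numbers, the same data needs 8 map tasks instead of 10,000, which is where most of the per-task startup overhead goes away.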
Hi Piotr, great idea to share this repo. Could you expand or edit this post to add a brief description of the use case? If I don't know what Tika is, what would lead me to find this post in a search? Essentially, "processing and analysis of binary documents" is vaguer than describing how Tika would assist with OCR, full-text extraction, text scanning, recognition, imagery, etc.