Code Repositories

Repo Description

This project helps overcome the inefficiency of processing many small files in Hadoop. It also enables processing and analysis of binary documents in Hadoop by integrating Apache Tika into a MapReduce job.
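To illustrate the small-files problem in general terms, here is a minimal sketch that greedily groups many small files into larger batches so that one task handles several files instead of one. It is not taken from the repository. The class name `SmallFilePacker`, the method `pack`, the file names, and the sizes are all hypothetical, and it uses only the Java standard library. Hadoop's own `CombineFileInputFormat` serves a similar purpose, but whether this project uses it is not stated here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not code from the tika-hadoop-mapreduce repo.
final class SmallFilePacker {

    // Greedily groups files (name -> size in bytes) into batches whose total
    // size stays at or below maxBatchBytes. A single file larger than the
    // limit still gets a batch of its own.
    static List<List<String>> pack(Map<String, Long> fileSizes, long maxBatchBytes) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;

        for (Map.Entry<String, Long> entry : fileSizes.entrySet()) {
            long size = entry.getValue();
            // Close the current batch if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxBatchBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(entry.getKey());
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Made-up file names and sizes, in insertion order.
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("a.pdf", 40L);
        files.put("b.docx", 50L);
        files.put("c.xls", 30L);
        files.put("d.ppt", 80L);
        files.put("e.txt", 10L);

        // Five small files end up in three batches with a 100-byte limit.
        System.out.println(pack(files, 100L));
    }
}
```

In a real job, each batch would correspond to one input split, and the mapper would hand each file's bytes to a parser such as Tika.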

Repo Info
GitHub repo URL: https://github.com/ppruski/tika-hadoop-mapreduce
GitHub account name: ppruski
Repo name: tika-hadoop-mapreduce
Comments
Cloudera Employee

Hi Piotr, great idea to share this repo. I'm wondering whether you could expand or edit this post with a brief description of the use case. If I don't know what Tika is, what would cause this post to show up in my search? "Processing and analysis of binary documents" is vaguer than describing what Tika would help with: OCR, full-text, text scan, recognition, imagery, etc.

Just my thoughts... Thanks!

Version history
Last update: 11-22-2015 06:39 PM