Code Repositories
Repo Description

This project works around the inefficiency of processing many small files in Hadoop. It also integrates Apache Tika into a MapReduce job, so that binary documents can be processed and analyzed in Hadoop.
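
For readers unfamiliar with the small-files issue: Hadoop's file-based input formats typically create at least one input split, and therefore one map task, per file, so thousands of tiny documents turn into thousands of short-lived tasks. One common remedy is to group several files into a single split, which is the idea behind Hadoop's CombineFileInputFormat. The Python sketch below only illustrates that grouping step. It is not taken from the repo, and the function name, file names, and sizes are hypothetical.

```python
from typing import Iterable, List, Tuple


def pack_into_splits(
    files: Iterable[Tuple[str, int]], max_split_bytes: int
) -> List[List[str]]:
    """Greedily group (name, size) pairs into splits of at most max_split_bytes.

    Hypothetical helper for illustration only. A file larger than the limit
    is placed in a split of its own rather than being divided.
    """
    splits: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, size in files:
        # Close the current split if adding this file would exceed the limit.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


if __name__ == "__main__":
    # Made-up file names and sizes (in bytes).
    sample = [("a.pdf", 40), ("b.doc", 30), ("c.ppt", 50), ("d.pdf", 10)]
    print(pack_into_splits(sample, max_split_bytes=100))
```

With the sample input, four files end up in two groups instead of four separate tasks. In a real job, each group would be handed to one mapper, which could then pass every document in the group to Tika for text extraction.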

Repo Info
GitHub repo URL: https://github.com/ppruski/tika-hadoop-mapreduce
GitHub account name: ppruski
Repo name: tika-hadoop-mapreduce
Comments
Cloudera Employee

Hi Piotr, great idea to share this repo. Would it be possible to expand or edit this post with a brief description of the use case? For example, if I don't know what Tika is, what would make this post show up in my search? "Processing and analysis of binary documents" is vaguer than describing what Tika would help with: OCR, full-text extraction, text scanning, recognition, imagery, etc.

Just my thoughts... Thanks!

Version history
Revision #: 1 of 1
Last update: 11-22-2015 06:39 PM