what is the best way to process word documents?

New Contributor

I have a large number of Word documents in one folder. What is the best way to run analytics on them? Should I add all the documents to a Hadoop-supported zip format and push that to HDFS? New documents are added to this folder continuously, so HDFS will need to be updated with the new data. Is it possible to use Spark on it?

Re: what is the best way to process word documents?

You could store your data in HDFS and use SolrCloud together with the Hadoop job jar to index the documents. The content, and possibly some metadata (title, author, date created, ...), would be indexed in Solr and could be queried afterwards, either with Solr directly or from Spark (https://github.com/LucidWorks/spark-solr, which requires SolrCloud).

Regarding Solr: You can index documents that are stored in HDFS by using:

hadoop jar /opt/lucidworks-hdpsearch/job/lucidworks-hadoop-job-2.0.3.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -Dlww.jaas.file=jaas.conf -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper --collection MyCollection -i hdfs://hortoncluster/staging/* -of com.lucidworks.hadoop.io.LWMapRedOutputFormat --zkConnect horton01.example.com:2181,horton02.example.com:2181,horton03.example.com:2181/solr

What sort of analytics do you want to run?

Re: what is the best way to process word documents?

New Contributor

@Jonas Straub

Thanks for your response. These documents follow a predefined template and contain information such as customer name, cost, etc.

Re: what is the best way to process word documents?

What sort of predefined template is this? Something like a Word design template?

If you have a predefined template, you might be able to map the template fields (customer, cost, etc.) to Solr fields and analyze them later. I am not sure how Solr handles your templates, so you could try indexing a couple of files with a single Solr instance.

Word documents (at least the newer .docx ones) are just zipped XML files, so it may be worth unzipping one and taking a closer look at the plain XML (unzip something.docx, then open word/document.xml) to see how your template is mapped and saved in the Word document format.
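To make the unzip-and-inspect idea concrete, here is a minimal Python sketch that reads word/document.xml from a .docx file and returns the text of each paragraph. It uses only the standard library. The function name docx_paragraphs is made up for illustration, and the sketch assumes plain body paragraphs; text inside tables comes back paragraph by paragraph, and headers or footers live in other XML parts that are not read here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml in .docx files
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_file):
    """Return the text of each non-empty paragraph in a .docx file.

    docx_file may be a path or a binary file-like object.
    (Hypothetical helper, not part of any library.)
    """
    # A .docx file is a zip archive; the main body is word/document.xml.
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        text = "".join(t.text or "" for t in p.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Calling docx_paragraphs("something.docx") on one of your documents should show whether fields such as customer name and cost end up on predictable lines that a later job could parse.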

Re: what is the best way to process word documents?

New Contributor

@Jonas Straub

By template, I mean an MS Word document that has a predefined format. I am not sure how Solr can help here. What I was thinking was to add all the documents to a Hadoop-supported zip format and push that to HDFS. I can then write some MapReduce jobs for it and do the analytics. But how do I push newly added documents to HDFS?

Re: what is the best way to process word documents?

You can push new documents to HDFS by using "hdfs dfs -copyFromLocal <source path> <target path>". You don't have to zip the files.
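Since the folder keeps growing, one way to avoid re-copying everything is to remember which files have already been uploaded. The Python sketch below does that with a local manifest file and calls the same "hdfs dfs -copyFromLocal" command for each new document. The helper names, the manifest file, and the *.docx filter are assumptions for illustration; the sketch also assumes the hdfs client is on the PATH and that file names are never reused.

```python
import subprocess
from pathlib import Path


def new_documents(local_dir, manifest_path):
    """List .docx files in local_dir that are not yet recorded in the manifest."""
    manifest = Path(manifest_path)
    done = set(manifest.read_text().splitlines()) if manifest.exists() else set()
    return sorted(p for p in Path(local_dir).glob("*.docx") if p.name not in done)


def copy_command(local_file, hdfs_dir):
    """Build the hdfs CLI call that copies one local file into hdfs_dir."""
    return ["hdfs", "dfs", "-copyFromLocal", str(local_file), hdfs_dir]


def sync(local_dir, hdfs_dir, manifest_path, run=subprocess.run):
    """Copy each new document to HDFS and record its name in the manifest.

    The run parameter exists so the command runner can be swapped out
    (for example in a dry run); by default it is subprocess.run.
    """
    uploaded = []
    for doc in new_documents(local_dir, manifest_path):
        # check=True raises if the copy fails, so the file is not recorded.
        run(copy_command(doc, hdfs_dir), check=True)
        with open(manifest_path, "a") as fh:
            fh.write(doc.name + "\n")
        uploaded.append(doc.name)
    return uploaded
```

Running something like sync("/data/word-docs", "/staging", "uploaded.txt") from cron would pick up only the documents added since the last run; the paths here are placeholders.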

This might also be helpful: https://community.hortonworks.com/questions/6389/solr-how-to-index-rich-text-document-put-in-hdfs.ht...

Re: what is the best way to process word documents?

Mentor

@Sumit Agarwal are you still having issues with this? Can you accept the best answer or provide your own solution?