I have a large number of Word documents in one folder. What is the best way to run analytics on them? Should I add all the documents to a Hadoop-supported zip format and push them to HDFS? This folder keeps getting updated with new documents, so HDFS will need to be updated with the new data as well. Is there any possibility to use Spark on it?
You could store your data in HDFS and use SolrCloud together with the Hadoop Job Jar to index the documents. The content and maybe some metadata (title, author, date created, ...) would be indexed in Solr and could be queried afterwards, either with Solr directly or by using Spark (https://github.com/LucidWorks/spark-solr; this requires a SolrCloud).
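For the Spark side, something like this untested sketch could work once the collection is indexed. It assumes a SolrCloud collection named "MyCollection" with an indexed "author" field, and the ZooKeeper address is a placeholder; the exact read API depends on your spark-solr version (older 2.x releases for Spark 1.x go through sqlContext instead of SparkSession):

import org.apache.spark.sql.SparkSession

object SolrAnalytics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("solr-analytics").getOrCreate()

    // Read the indexed documents into a DataFrame via the spark-solr connector
    val docs = spark.read
      .format("solr")
      .option("zkhost", "horton01.example.com:2181/solr") // SolrCloud ZooKeeper connect string
      .option("collection", "MyCollection")
      .load()

    // Example: count documents per author once the metadata is indexed
    docs.groupBy("author").count().show()

    spark.stop()
  }
}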
Regarding Solr: You can index documents that are stored in HDFS by using:
hadoop jar /opt/lucidworks-hdpsearch/job/lucidworks-hadoop-job-2.0.3.jar com.lucidworks.hadoop.ingest.IngestJob \
  -Dlww.commit.on.close=true -Dlww.jaas.file=jaas.conf \
  -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper \
  --collection MyCollection \
  -i hdfs://hortoncluster/staging/* \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  --zkConnect horton01.example.com:2181,horton02.example.com:2181,horton03.example.com:2181/solr
What sort of analytics do you want to run?
Thanks for your response. These documents follow a pre-defined template and contain information like customer name, cost, etc.
What sort of pre-defined template is this? Like a Word design template?
I mean, if you have a predefined template you might be able to map the template fields (customer, cost, etc.) to Solr fields and analyze them later. I'm not sure how Solr handles your templates; you could try indexing a couple of files with a single Solr instance first.
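As a rough idea of what that mapping could look like, here is a sketch using SolrJ against a single Solr instance. The field names ("customer", "cost") and the regexes are pure assumptions about your template, and the HttpSolrClient.Builder shown here is the newer SolrJ API (older versions construct HttpSolrClient directly):

import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument

object TemplateIndexer {
  def indexDoc(text: String, id: String): Unit = {
    // Standalone Solr instance; URL and core name are placeholders
    val client = new HttpSolrClient.Builder("http://localhost:8983/solr/MyCollection").build()

    // Naive extraction: assumes the template renders lines like "Customer Name: ACME"
    val customer = "Customer Name:\\s*(.+)".r.findFirstMatchIn(text).map(_.group(1))
    val cost     = "Cost:\\s*([0-9.]+)".r.findFirstMatchIn(text).map(_.group(1))

    val doc = new SolrInputDocument()
    doc.addField("id", id)
    customer.foreach(doc.addField("customer", _))
    cost.foreach(doc.addField("cost", _))

    client.add(doc)
    client.commit()
    client.close()
  }
}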
Word documents (at least the newer .docx files) are just zipped XML, so it might be worth unzipping one and taking a closer look at the plain XML file (unzip something.docx, then open word/document.xml) to see how your template is mapped/saved in the Word document format.
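A quick way to peek at that XML without unzipping by hand, using plain JDK zip handling ("something.docx" is a placeholder):

import java.util.zip.ZipFile
import scala.io.Source

object DocxPeek {
  def main(args: Array[String]): Unit = {
    val zip = new ZipFile("something.docx")
    val entry = zip.getEntry("word/document.xml") // the main document part of a .docx
    val xml = Source.fromInputStream(zip.getInputStream(entry), "UTF-8").mkString
    println(xml.take(2000)) // the first couple thousand chars usually show how fields are laid out
    zip.close()
  }
}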
By template, I mean an MS Word document with a pre-defined format. I am not sure how Solr can help here. What I was thinking: add all documents to a Hadoop-supported zip format and push it to HDFS. I can then write some MapReduce jobs for the analytics. But how do I push newly added documents to HDFS?
You can push new documents to HDFS by using "hdfs dfs -copyFromLocal <source path> <target path>"; you don't have to zip the files.
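If you want to automate that for newly arriving files, the Hadoop FileSystem API does the same thing programmatically. A minimal sketch (the paths are placeholders; you could run it from cron or a directory watcher whenever new documents show up):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object PushToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration() // picks up core-site.xml/hdfs-site.xml from the classpath
    val fs = FileSystem.get(conf)

    val local  = new Path("/data/word-docs")             // local folder with new .docx files
    val target = new Path("hdfs://hortoncluster/staging")

    // delSrc = false keeps the local copies; overwrite = true re-copies files that already exist
    fs.copyFromLocalFile(false, true, local, target)
    fs.close()
  }
}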
This might also be helpful https://community.hortonworks.com/questions/6389/solr-how-to-index-rich-text-document-put-in-hdfs.ht...