
How to perform ETL on Word Documents using Hadoop?


I need to know how to perform ETL on Word documents that contain unstructured data. I've looked through articles on Hadoop but couldn't find a way to process Word documents with it. The sources I found only covered processing text files that contain structured data, so they didn't help much.


I also found that I could use Apache Solr and Hadoop HDFS for this. Still, I couldn't find any resource that explains the exact process for performing ETL on unstructured Word documents with those tools either.


If anyone knows how to do this, please guide me through the process. Even a pointer to a reference would be a huge help.
