Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Error sending *.docx file from HDFS to SOLR


Hello, I have written a map-only Java program (v1) that sends files from a folder on HDFS to Solr. It works for every file type except *.docx.

I also modified this program to run as a plain Java program without Hadoop (v2). It reads files from an ext3 filesystem and sends them to Solr.

With v2 I was able to send my *.docx file to Solr, but if I put the same file in HDFS and run the MapReduce program (v1), I can't index it. I get the following error:

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@42e20459
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1909)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:739)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:169)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.solr.servlet.ProxyUserFilter.doFilter(ProxyUserFilter.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.solr.servlet.SolrHadoopAuthenticationFilter$2.doFilter(SolrHadoopAuthenticationFilter.java:140)
    at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:384)
    at org.apache.solr.servlet.SolrHadoopAuthenticationFilter.doFilter(SolrHadoopAuthenticationFilter.java:145)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.solr.servlet.HostnameFilter.doFilter(HostnameFilter.java:86)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@42e20459
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
    ... 30 more
Caused by: org.apache.poi.openxml4j.exceptions.InvalidOperationException: Can't open the specified file: '/var/lib/solr/apache-tika-242864397475446795.tmp'
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:103)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:207)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:70)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 33 more

 

Also, the folder /var/lib/solr/ contains no file named apache-tika-242864397475446795.tmp.

Any suggestions?

 

 

1 ACCEPTED SOLUTION


Resolved. 

 

I was using WholeFileInputFormat and was calling value.getBytes() where I should have called value.copyBytes().
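For anyone hitting the same thing, here is a rough sketch of why the two calls differ, assuming value is a Hadoop BytesWritable (the usual value type for a WholeFileInputFormat, though the post does not say so). getBytes() returns the backing array, which can be longer than the real data; only the first getLength() bytes are valid. copyBytes() returns a copy of exactly that length. A .docx is a ZIP container, so the extra trailing bytes are a likely reason the parser rejected the file while more tolerant formats went through. The demo below uses plain Java only: the exactBytes helper and the simulated padded buffer are illustrative names, not part of the original program or of Hadoop.

```java
import java.util.Arrays;

final class ExactBytesDemo {

    // Hypothetical helper: returns a copy holding only the first validLength
    // bytes of a buffer. With a BytesWritable this would be called as
    // exactBytes(value.getBytes(), value.getLength()), which should be
    // equivalent to value.copyBytes().
    static byte[] exactBytes(byte[] backing, int validLength) {
        if (validLength < 0 || validLength > backing.length) {
            throw new IllegalArgumentException("validLength out of range: " + validLength);
        }
        return Arrays.copyOf(backing, validLength);
    }

    public static void main(String[] args) {
        // Stand-in for the real file contents.
        byte[] fileContents = {'P', 'K', 3, 4, 42};

        // Simulate an over-allocated backing array: the real bytes followed
        // by zero padding, similar to what getBytes() may hand back.
        byte[] backing = Arrays.copyOf(fileContents, fileContents.length * 3 / 2);

        byte[] exact = exactBytes(backing, fileContents.length);

        System.out.println("backing array length: " + backing.length);
        System.out.println("trimmed length:       " + exact.length);
        System.out.println("matches original:     " + Arrays.equals(exact, fileContents));
    }
}
```

Sending the trimmed array (or the result of copyBytes()) to Solr means the uploaded stream should be byte-for-byte the file that sits in HDFS.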

 

 
