Contributor
Posts: 25
Registered: ‎10-11-2013
Accepted Solution

Error sending *.docx file from HDFS to SOLR

Hello, I have written a map-only Java program (v1) that sends files from a folder on HDFS to Solr. It works fine for all files except *.docx.

I also modified this program to run as a plain Java program without Hadoop (v2). It reads files from an ext3 filesystem and sends them to Solr.

 

Using program v2 I was able to send my *.docx file to Solr, but if I put the same file in HDFS and run the MapReduce program (v1), I can't index it. I get this error:

 

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@42e20459
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1909)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:739)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:169)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.solr.servlet.ProxyUserFilter.doFilter(ProxyUserFilter.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.solr.servlet.SolrHadoopAuthenticationFilter$2.doFilter(SolrHadoopAuthenticationFilter.java:140)
    at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:384)
    at org.apache.solr.servlet.SolrHadoopAuthenticationFilter.doFilter(SolrHadoopAuthenticationFilter.java:145)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.solr.servlet.HostnameFilter.doFilter(HostnameFilter.java:86)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@42e20459
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
    ... 30 more
Caused by: org.apache.poi.openxml4j.exceptions.InvalidOperationException: Can't open the specified file: '/var/lib/solr/apache-tika-242864397475446795.tmp'
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:103)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:207)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:70)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 33 more

 

Also, there is no file apache-tika-242864397475446795.tmp in the folder /var/lib/solr/.

Any suggestions?

 

 


Re: Error sending *.docx file from HDFS to SOLR

Resolved. 

 

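I was using WholeFileInputFormat and called value.getBytes() instead of value.copyBytes(). getBytes() returns the BytesWritable's backing array, which can be longer than the valid data (only the first getLength() bytes are meaningful), so the extra trailing bytes were most likely what made the *.docx (a zip archive) unreadable on the Solr side.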
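
For anyone hitting the same thing, here is a minimal sketch of why the two calls differ. To keep it runnable without Hadoop on the classpath, it uses a hypothetical stand-in class, GrowableBytes, which is not part of Hadoop; it only imitates the relevant behaviour of org.apache.hadoop.io.BytesWritable under the assumption that the backing array is over-allocated (to 1.5x here) when it has to grow. The real growth policy may differ between Hadoop versions.

```java
import java.util.Arrays;

public class BackingArrayDemo {

    /**
     * Hypothetical stand-in for org.apache.hadoop.io.BytesWritable (NOT the real class).
     * Assumption: like BytesWritable, it over-allocates its backing array when growing.
     */
    public static final class GrowableBytes {
        private byte[] buffer = new byte[0];
        private int length = 0;

        /** Stores a copy of src, growing the backing array to 1.5x the size if needed. */
        public void set(byte[] src) {
            if (src.length > buffer.length) {
                buffer = new byte[src.length * 3 / 2];
            }
            System.arraycopy(src, 0, buffer, 0, src.length);
            length = src.length;
        }

        /** Returns the backing array itself; only the first getLength() bytes are valid. */
        public byte[] getBytes() {
            return buffer;
        }

        /** Number of valid bytes in the backing array. */
        public int getLength() {
            return length;
        }

        /** Returns a fresh array holding exactly the valid bytes. */
        public byte[] copyBytes() {
            return Arrays.copyOf(buffer, length);
        }
    }

    public static void main(String[] args) {
        // First four bytes of any zip-based file such as *.docx ("PK\3\4").
        byte[] fileContents = {0x50, 0x4B, 0x03, 0x04};

        GrowableBytes value = new GrowableBytes();
        value.set(fileContents);

        // getBytes() exposes padding after the valid data; copyBytes() does not.
        System.out.println("valid length:       " + value.getLength());
        System.out.println("getBytes().length:  " + value.getBytes().length);
        System.out.println("copyBytes().length: " + value.copyBytes().length);
    }
}
```

With the real BytesWritable, either value.copyBytes() or Arrays.copyOf(value.getBytes(), value.getLength()) should give the exact file contents to send to Solr.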

 

 
