I am using the /update/extract request handler to push documents into Solr. I am getting this error with certain types of documents, and those documents end up being ignored by Solr.
I have discovered that these files are emails (.msg) with zip files containing unsupported documents (I'm assuming). Is there a way to have Solr ignore the zip file rather than ignoring the entire file itself?
This question has a "nifi-processor" tag. Which NiFi processor are you using? Also, which processor(s) are you using to get the email messages? I suspect you should be able to use RouteOnAttribute or RouteOnContent to send emails with ZIP attachments to some other relationship, while those without attachments go directly to PutSolrContentStream (or whatever you're using to push data to Solr). Perhaps the branch with ZIP attachments can use processor(s) to remove the ZIP attachment, retain the email message, and route back to the "main" branch to retry the "put".
I'm using the PutSolrContentStream processor. Solr only fails on certain extension types (mdb, for example). When an email or a zip file contains an mdb file, the entire document fails to get pushed to Solr. Is there a way to have Solr index the email or zip file and ignore only the unsupported extensions rather than ignoring the entire document?
I believe this is a known issue with .zip archives and the Solr ExtractingRequestHandler (aka Solr Cell): https://issues.apache.org/jira/browse/SOLR-2416. The short version of the story is that Tika in this case is not configured to parse the .zip recursively.
One of the other suggestions for NiFi processing may be worth exploring in this case.
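If pre-processing outside Solr is an option, a rough sketch of the idea is to rebuild the archive without the entries that break extraction before it is sent on. This is only an illustration in Python using the standard zipfile module, not something Solr or NiFi provides: the function name strip_unsupported and the UNSUPPORTED_EXTENSIONS set are made up for this example, the extension list is an assumption based on the .mdb case mentioned in the comments, and it handles plain .zip payloads only (attachments inside a .msg would need a separate mail parser first).

```python
import io
import zipfile

# Assumption: extensions known to make extraction fail (here only .mdb,
# taken from the comments above). Adjust to whatever fails in your setup.
UNSUPPORTED_EXTENSIONS = {".mdb"}


def strip_unsupported(zip_bytes, unsupported=UNSUPPORTED_EXTENSIONS):
    """Return (new_zip_bytes, skipped_names) with unsupported entries removed.

    Hypothetical helper: copies every entry of the input archive into a new
    in-memory archive, except entries whose name ends with one of the
    unsupported extensions (compared case-insensitively).
    """
    skipped = []
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            name = info.filename.lower()
            if any(name.endswith(ext) for ext in unsupported):
                skipped.append(info.filename)
                continue
            # Copy the entry unchanged, keeping its original metadata.
            dst.writestr(info, src.read(info))
    return out_buf.getvalue(), skipped
```

The cleaned bytes could then be posted to /update/extract in place of the original file, and the skipped names logged so it is clear what was left out of the index.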
I have tried sending documents using Solr's REST API and I got the exact same error. The problem isn't with zip files. If a zip file contains PDF or Word documents, for example, the zip is indexed fine. However, if the zip file contains an mdb file, Solr fails to index it. Is it possible to have Solr ignore only the unsupported extensions rather than ignoring the entire document or file?