Created 12-20-2016 12:55 PM
I am using the PutSolrContentStream processor to push emails (.msg) into my SolrCloud cluster. I have set the Content Stream Path property to "/update/extract" so that the Tika parser extracts fields from the .msg file. All the fields associated with the emails are extracted (e.g. From, To, CC, Subject), with the exception of the body of the email.
How can I have the processor push the body of the email as well? I am able to extract both the content of the email and the metadata programmatically using the SolrNet library. How can I do the same using the PutSolrContentStream processor?
Created 12-20-2016 02:54 PM
Pretend you didn't have NiFi and all you had was Solr. How would you do it?
NiFi is not doing anything special here; it streams the content of the flow file (your .msg file) to Solr's /update/extract handler, and that handler does the extraction. This is the same as running a curl command from a terminal to post a .msg file to Solr.
Reading Solr's documentation for the request handler (https://wiki.apache.org/solr/ExtractingRequestHandler), it says...
You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to the Solr field called "text", which is indexed but not stored. This is done via the default map rule in the /update/extract handler in solrconfig.xml and can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@tutorial.html"
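For anyone who wants to script this rather than type the curl command, here is a minimal sketch in Python that builds the same request URL. The function name build_extract_url and its argument names are hypothetical, made up for illustration; only the request parameters themselves (literal.id, uprefix, fmap.content, commit) come from the curl example above. In NiFi the same parameters would presumably be supplied as user-defined properties on the processor, but check the processor documentation before relying on that.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, content_field="attr_content",
                      unknown_prefix="attr_", commit=True):
    """Build an /update/extract URL that maps Tika's "content" to a chosen field.

    Hypothetical helper: the parameter names mirror the curl example from the
    Solr wiki; the function and argument names are illustrative only.
    """
    params = {
        "literal.id": doc_id,           # unique key for the document
        "uprefix": unknown_prefix,      # prefix for fields not in the schema
        "fmap.content": content_field,  # remap Tika's "content" field
        "commit": "true" if commit else "false",
    }
    return f"{base_url.rstrip('/')}/update/extract?{urlencode(params)}"


url = build_extract_url("http://localhost:8983/solr", "doc1")
print(url)
```

Whether the body is actually returned in query results still depends on the target field (attr_content here) being defined as stored in your schema.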
Created 12-20-2016 04:01 PM
Thank you, Bryan. I tried searching on the body of the email and I got results. I was under the impression that we can't search on fields that are not stored.
Created 12-20-2016 04:09 PM
There are two concepts, "indexed" and "stored"...
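To make the distinction concrete, here is a toy model in Python. It is not how Solr or Lucene is implemented; ToyIndex and its schema format are invented purely for illustration. It only shows the idea that an indexed field can be searched while a field that is not stored never comes back in the results.

```python
from collections import defaultdict


class ToyIndex:
    """Toy illustration of indexed vs. stored fields (not Solr's real design)."""

    def __init__(self, schema):
        # schema maps field name -> {"indexed": bool, "stored": bool}
        self.schema = schema
        self.inverted = defaultdict(set)  # (field, term) -> set of doc ids
        self.stored = {}                  # doc id -> stored field values

    def add(self, doc_id, doc):
        self.stored[doc_id] = {}
        for field, value in doc.items():
            opts = self.schema[field]
            if opts["indexed"]:
                # Indexed: terms become searchable.
                for term in value.lower().split():
                    self.inverted[(field, term)].add(doc_id)
            if opts["stored"]:
                # Stored: the original value can be returned with a hit.
                self.stored[doc_id][field] = value

    def search(self, field, term):
        ids = sorted(self.inverted.get((field, term.lower()), set()))
        return [{"id": i, **self.stored[i]} for i in ids]


schema = {
    "subject": {"indexed": True, "stored": True},
    "text": {"indexed": True, "stored": False},  # searchable, never returned
}
index = ToyIndex(schema)
index.add("doc1", {"subject": "Quarterly report",
                   "text": "please review the attached figures"})

hits = index.search("text", "figures")
print(hits)
```

The search on "text" finds doc1, but the returned document contains only the id and the stored "subject" field, which matches the behavior described in the thread: the email body is searchable but not visible.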
Created 12-20-2016 04:20 PM
Can I update the _text_ field so that it is both stored and indexed? Also, is it possible to update the values and rename the fields using the PutSolrContentStream processor? I want to be able to store the location of the file I'm pulling from HDFS in a field in Solr.