PutSolrContentStream extract Email Fields

Expert Contributor

I am using the PutSolrContentStream processor to push emails (.msg files) into my SolrCloud cluster. I have set the Content Stream Path property to "/update/extract" so that the Tika parser extracts fields from each .msg file. All the fields associated with the emails are extracted (e.g. From, To, CC, Subject), with the exception of the body of the email.

How can I have the processor push the body of the email as well? I am able to extract both the content and the metadata of the email programmatically using the SolrNet library. How can I do the same with the PutSolrContentStream processor?

Accepted Solutions

Re: PutSolrContentStream extract Email Fields

Pretend you didn't have NiFi and all you had was Solr. How would you do it?

NiFi is not doing anything special here. It streams the content of the flow file (your .msg file) to Solr's /update/extract handler, which does the extraction. This is the same as running a curl command from a terminal to post a .msg file to Solr.

Reading Solr's documentation for the request handler (https://wiki.apache.org/solr/ExtractingRequestHandler), it says...

You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to the Solr field called "text", which is indexed but not stored. This is done via the default map rule in the /update/extract handler in solrconfig.xml and can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@tutorial.html"
  • The uprefix=attr_ param causes all generated fields that aren't defined in the schema to be prefixed with attr_ (which is a dynamic field that is stored).
  • The fmap.content=attr_content param overrides the default fmap.content=text causing the content to be added to the attr_content field instead.
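
As a rough illustration of the same request outside NiFi, here is a minimal Python sketch that builds (but does not send) a POST to the extract handler with the parameters from the curl example. The function name, the raw-body upload style, and the default content type are assumptions made for illustration; the curl command above uses a multipart form upload instead, and the host, core path, and document id should be adjusted to your setup.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base, doc_id, payload,
                          content_field="attr_content",
                          content_type="application/octet-stream"):
    """Build a POST request for Solr's /update/extract handler.

    Hypothetical helper: the parameter names come from the curl example,
    everything else (function name, defaults) is illustrative.
    """
    params = {
        "literal.id": doc_id,           # unique key for the document
        "uprefix": "attr_",             # prefix for fields not in the schema
        "fmap.content": content_field,  # map Tika's "content" to this field
        "commit": "true",               # commit right after indexing
    }
    url = f"{solr_base.rstrip('/')}/update/extract?{urlencode(params)}"
    # The file bytes are sent as the raw request body (an assumption here).
    return Request(url, data=payload,
                   headers={"Content-Type": content_type}, method="POST")


# Placeholder bytes stand in for the contents of a real .msg file.
req = build_extract_request("http://localhost:8983/solr", "doc1", b"example bytes")
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to a running Solr instance. The point of the sketch is only that the body of the email ends up in whichever field `fmap.content` names, so that field needs to be stored for the body to come back in results.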


Re: PutSolrContentStream extract Email Fields

Expert Contributor

Thank you, Bryan. I tried searching on the body of the email and I got results. I was under the impression that we can't search on fields that are not stored.


Re: PutSolrContentStream extract Email Fields

There are two concepts, "indexed" and "stored"...

  • If a field is only indexed and not stored, then you can search on it but can't get the original value of the field back for search results.
  • If a field is only stored and not indexed, then you can't search on it, but you can use it as a return field for search results.
  • If a field is both indexed and stored, then it can be searched and can also be used as a return field in search results.
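
The three cases above can be mimicked with a toy in-memory model. This is only a sketch of the idea, not how Solr is implemented; the class names, field names, and sample document are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDef:
    indexed: bool  # terms go into the search index
    stored: bool   # original value is kept for retrieval


class ToyIndex:
    """Toy model of indexed vs. stored fields (not Solr's implementation)."""

    def __init__(self, schema):
        self.schema = schema
        self.inverted = {}  # (field, term) -> set of doc ids
        self.stored = {}    # doc id -> {field: original value}

    def add(self, doc_id, doc):
        self.stored[doc_id] = {}
        for field, value in doc.items():
            definition = self.schema[field]
            if definition.indexed:
                for term in value.lower().split():
                    self.inverted.setdefault((field, term), set()).add(doc_id)
            if definition.stored:
                self.stored[doc_id][field] = value

    def search(self, field, term):
        """Return (doc id, stored fields) for documents matching the term."""
        ids = self.inverted.get((field, term.lower()), set())
        return [(doc_id, self.stored[doc_id]) for doc_id in sorted(ids)]


schema = {
    "subject": FieldDef(indexed=True, stored=True),
    "text": FieldDef(indexed=True, stored=False),   # searchable, not returned
    "path": FieldDef(indexed=False, stored=True),   # returned, not searchable
}
index = ToyIndex(schema)
index.add("doc1", {
    "subject": "Quarterly report",
    "text": "please review the attached numbers",
    "path": "/data/mail/report.msg",
})

hits = index.search("text", "review")             # matches on the body text
no_hits = index.search("path", "/data/mail/report.msg")  # path is not indexed
print(hits, no_hits)
```

Searching on `text` finds the document, but the returned fields contain only `subject` and `path`, which matches what happens when the email body is indexed but not stored.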

Re: PutSolrContentStream extract Email Fields

Expert Contributor

Can I update the _text_ field so that it is both stored and indexed? Also, is it possible to update the values and rename the fields using the PutSolrContentStream processor? I want to be able to store the location of the file I'm pulling from HDFS in a field in Solr.