extractHBaseCells command does not retain XML tags

Hi,

I am inserting an XML document into an HBase column family and indexing it into Solr. One of the Solr fields holds the complete XML, and the other fields hold values extracted from the XML. However, the XML tags are missing from the indexed value.

I read the value out as a string. When writing to HBase I set the character encoding to UTF-8, and I do the same in my Java code. I need to display the actualMessage field (one of the fields) in the Solr results. It is displayed, but without the XML tags or attribute values. Can you help?

 

{
  extractHBaseCells {
    mappings : [
      {
        inputColumn : "messages:*"
        outputField : "actualMessage"
        type : string
        source : value
      }
    ]
  }
}

 

java {
  imports : "import java.io.*;import javax.xml.parsers.*;import org.w3c.dom.*;"
  code: """
    String s = null;
    byte[] b = null;
    DocumentBuilderFactory docFactory = null;
    DocumentBuilder docBuilder = null;
    Document document = null;
    InputStream is = null;
    try {
      s = (String) record.get("actualMessage").get(0);
      b = s.getBytes("UTF-8");

 

 

1 ACCEPTED SOLUTION

If indeed the data in HBase contains the XML tags, then it sounds like your tokenizer/analyzer chain in Solr schema.xml is stripping info away, i.e. schema.xml isn't configured to do what you want it to do.

You could confirm that the morphline is doing what it's supposed to do by adding some debug log message like this to your morphline:

logInfo { format : "my record: {}", args : ["@{}"] }
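
For context, here is a minimal sketch of where that logInfo command could sit: directly after extractHBaseCells, so the logged record shows the actualMessage value as it comes out of HBase, before any later command touches it. The morphline id and the importCommands patterns are assumptions for illustration, not taken from the question; the mapping is the one posted above.

```
morphlines : [
  {
    id : morphline1                                        # hypothetical id
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]   # assumed import patterns

    commands : [
      # Copy the HBase cell value into the actualMessage field (mapping from the question)
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "messages:*"
              outputField : "actualMessage"
              type : string
              source : value
            }
          ]
        }
      }

      # Log the whole record at INFO level so the raw field value can be inspected
      { logInfo { format : "my record: {}", args : ["@{}"] } }
    ]
  }
]
```

If the logged actualMessage still contains the tags at this point, the extraction step is passing the XML through intact, and the remaining places to look are any later morphline commands and the Solr field type.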

Also see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters and https://cwiki.apache.org/confluence/display/solr/Field+Types+Included+with+Solr

Wolfgang.

Thanks, mate. It worked. Thanks a lot for all your help with this.