extractHBaseCells command does not retain XML tags

Champion Alumni

Hi,

I am inserting an XML document into an HBase column family and indexing it into Solr. One of the Solr fields holds the complete XML, and the other fields hold values extracted from the XML. However, the XML tags are missing from the indexed value.

I read the value out as a string. When writing to HBase I set the character encoding to UTF-8, and I do the same in my Java code. I need to display the actualMessage field in the Solr results (it is one of the fields). It is displayed, but without the XML tags or attribute values. Can you help?

{
  extractHBaseCells {
    mappings : [
      {
        inputColumn : "messages:*"
        outputField : "actualMessage"
        type : string
        source : value
      }
    ]
  }
}

 

java {
  imports : "import java.io.*;import javax.xml.parsers.*;import org.w3c.dom.*;"
  code: """
    String s = null;
    byte[] b = null;
    DocumentBuilderFactory docFactory = null;
    DocumentBuilder docBuilder = null;
    Document document = null;
    InputStream is = null;
    try {
      s = (String) record.get("actualMessage").get(0);
      b = s.getBytes("UTF-8");

 

 

1 ACCEPTED SOLUTION


Re: Extracthbase cell command does not retain xml tags

Expert Contributor
If indeed the data in HBase contains the XML tags, then it sounds like your tokenizer/analyzer chain in Solr schema.xml is stripping info away, i.e. schema.xml isn't configured to do what you want it to do.

You could confirm that the morphline is doing what it's supposed to do by adding some debug log message like this to your morphline:

logInfo { format : "my record: {}", args : ["@{}"] }
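
For illustration, here is a rough sketch of where that logInfo command could sit, assuming the extractHBaseCells mapping from the question and a commands list around it. The wording of the log message and the placeholder comment for the remaining commands are assumptions, not taken from the original morphline. If the logged actualMessage value still shows the XML tags at this point, the morphline is passing them through and the Solr side is the place to look.

```
commands : [
  {
    extractHBaseCells {
      mappings : [
        {
          inputColumn : "messages:*"
          outputField : "actualMessage"
          type : string
          source : value
        }
      ]
    }
  }

  # Debug step: "@{}" expands to the whole record, so every field is logged
  { logInfo { format : "record after extractHBaseCells: {}", args : ["@{}"] } }

  # ... the remaining commands (for example the java command) would follow here
]
```

The message is only visible if the logger used by the morphline is set to INFO or a more verbose level.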

Also see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters and https://cwiki.apache.org/confluence/display/solr/Field+Types+Included+with+Solr

Wolfgang.

Re: Extracthbase cell command does not retain xml tags

Champion Alumni

Thanks, mate. It worked. Thanks a lot for all your help with this.