Morphline removes xml tags from field

Expert Contributor

Hi, I use morphline to parse incoming XML and store it in Solr. The problem is that morphline removes all tags. I need to store a subtree of the incoming XML in Solr.

 

Example:

 <ecol:body>
            <out:StatusMessage xmlns:out="http://lol.ru/coordinate/v5/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
               <out:ResponseDate xsi:nil="true"/>
               <out:PlanDate xsi:nil="true"/>
               <out:StatusCode>1040</out:StatusCode>
               <out:Responsible>
                  <out:LastName>XXXX</out:LastName>
                  <out:FirstName>YY</out:FirstName>
               </out:Responsible>
               <out:Note/>
               <out:ServiceNumber>123123123</out:ServiceNumber>
            </out:StatusMessage>
         </ecol:body>

 A part of my morphline config:

 return
                                 <entry>
                                  {$entry/attr:ssoId}
                                  {$entry/attr:applicationId}
                                  {$entry/../../ecol:body}
                                </entry>

 The value for <ecol:body> comes out as: 1040 \n XXXX \n YY 12312312123, with ALL tags removed. I want to keep the tags. Is there any way to do that?


Re: Morphline removes xml tags from field

Expert Contributor

You need to change your xquery command to wrap your XML output into yet another XML element (e.g. "record").

For example, in order to generate a morphline record with a "myFoo" field that contains "foo", as well as a "myBar" field that contains "bar", your xquery command should be formulated such that it outputs an XML fragment like this:

<record>
<myFoo>foo</myFoo>
<myBar>bar</myBar>
</record>

 

Re: Morphline removes xml tags from field

Expert Contributor

Cool, thanks! I'll try this evening.

 

 

Re: Morphline removes xml tags from field

Expert Contributor

Hi, I've tried this:

 return
                      
                                 <entry>
                                  {$entry/attr:ssoId}
                                  {$entry/attr:applicationId}
                                  <body>{$entry/../../ecol:body}</body>
                                </entry>

 And that:

 return
                                <record>
                                 <entry>
                                  {$entry/attr:ssoId}
                                  {$entry/attr:applicationId}
                                  <body>{$entry/../../ecol:body}</body>
                                </entry>
                                </record>

 Neither helps. It looks like I don't understand the idea or how it works.

<body>{$entry/../../ecol:body}</body>

is extracted, but ALL tags under <ecol:body/> are still removed.

What am I doing wrong?

Re: Morphline removes xml tags from field

Expert Contributor

The result is the same: I do get the contents of the expression {$entry/../../body}, but all tags are removed.

Imagine I had

<body>
<inner>inner text </inner>
</body>

I get "inner text" as the result.

I want to get <inner>inner text </inner>, with the tag kept.

Re: Morphline removes xml tags from field

Expert Contributor (accepted solution)
That's the expected behavior per the doc at kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html#xquery: "The XPath string value of the attribute or child is filled into the record field". The "XPath string value" is the concatenation of the *text nodes*, and it does not include the element names or attribute names, per www.w3.org/TR/xquery-operators/#func-string

Also, Solr/Lucene wouldn't know what to do with those tag names anyway. A Lucene/Solr field holds a primitive type such as a string; it doesn't work with nested structures.

P.S. If you really need this, I believe saxon (and hence the xquery command) has an extension function to emit the serialization of an XML document into a string, but unfortunately that extension function is probably not available in the free "Saxon-HE" version that we ship with kite-morphlines-saxon: www.saxonica.com/documentation9.4-demo/html/extensions/functions/serialize.html

Alternatively, you could write your own custom morphline command that implements whatever xquery serialization logic you like, of course. The code would be a copy and paste of the existing xquery command except for adjusting this part: github.com/kite-sdk/kite/blob/master/kite-morphlines/kite-morphlines-saxon/src/main/java/org/kitesdk/morphline/saxon/XQueryBuilder.java#L196-L198
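
To make the difference between the "string value" and a serialization concrete, here is a minimal sketch using only the JDK's built-in XPath and Transformer APIs. It is not the morphline xquery command and does not use Saxon; the class and method names are made up for illustration, and it uses the unprefixed <body><inner> sample from this thread so that no namespace handling is needed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Kite, Saxon, or Solr.
public class StringValueVsSerialization {

    // XPath string value: the descendant text nodes concatenated, markup dropped.
    static String stringValue(String xml, String expr) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Serialization: the first matched node written back out as markup.
    static String serialize(String xml, String expr) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODE);
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body><inner>inner text </inner></body>";
        // Text only: this is what ends up in the record field today.
        System.out.println(stringValue(xml, "/body"));
        // Markup kept: this is what the question is asking for.
        System.out.println(serialize(xml, "/body/inner"));
    }
}
```

A custom command along these lines would put the serialized string into the record field, so Solr would still receive a plain string value, just one that happens to contain markup.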

Re: Morphline removes xml tags from field

Expert Contributor

I took

org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

as an example.

My custom interceptor takes the event body and stores it in an event header.

SolrSink then picks up this header by default and sends it to Solr for indexing.

It works.

NB: the Solr schema.xml must have a matching field declaration.
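
For anyone who wants to try the same thing, the core of the idea can be sketched roughly as below. This is not my actual interceptor and it does not use the Flume classes: SimpleEvent is a hypothetical stand-in for a Flume event so the snippet runs on a plain JDK, the header name "rawXml" is made up, and UTF-8 is an assumption about the body encoding. In a real deployment the same logic would go into the intercept method of a class implementing org.apache.flume.interceptor.Interceptor, with a nested Builder like the one in UUIDInterceptor.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; SimpleEvent only mimics the headers-plus-body shape of a Flume event.
public class BodyToHeaderSketch {

    static final class SimpleEvent {
        final Map<String, String> headers = new HashMap<>();
        final byte[] body;

        SimpleEvent(byte[] body) {
            this.body = body;
        }
    }

    // Copies the raw body, decoded as UTF-8 (an assumption), into the named header.
    static SimpleEvent copyBodyToHeader(SimpleEvent event, String headerName) {
        event.headers.put(headerName, new String(event.body, StandardCharsets.UTF_8));
        return event;
    }

    public static void main(String[] args) {
        String xml = "<body><inner>inner text </inner></body>";
        SimpleEvent event = new SimpleEvent(xml.getBytes(StandardCharsets.UTF_8));
        // "rawXml" is a made-up name; it would have to match a field in schema.xml.
        copyBodyToHeader(event, "rawXml");
        System.out.println(event.headers.get("rawXml"));
    }
}
```

Because the body is copied before any xquery step runs, the tags are still there when the header value reaches Solr.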