Expert Contributor
Posts: 162
Registered: ‎07-29-2013
Accepted Solution

Morphline removes xml tags from field

Hi, I use a morphline to parse incoming XML and store it in Solr. The problem is that the morphline removes all tags. I need to store a subtree of the incoming XML in Solr.

 

Example:

 <ecol:body>
            <out:StatusMessage xmlns:out="http://lol.ru/coordinate/v5/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
               <out:ResponseDate xsi:nil="true"/>
               <out:PlanDate xsi:nil="true"/>
               <out:StatusCode>1040</out:StatusCode>
               <out:Responsible>
                  <out:LastName>XXXX</out:LastName>
                  <out:FirstName>YY</out:FirstName>
               </out:Responsible>
               <out:Note/>
               <out:ServiceNumber>123123123</out:ServiceNumber>
            </out:StatusMessage>
         </ecol:body>

 A part of my morphline config:

 return
                                 <entry>
                                  {$entry/attr:ssoId}
                                  {$entry/attr:applicationId}
                                  {$entry/../../ecol:body}
                                </entry>

 The value for <ecol:body> is: 1040 \n XXXX \n YY 12312312123, and ALL tags are removed. I want to keep the tags. Is there any way to do that?

Cloudera Employee
Posts: 145
Registered: ‎08-21-2013

Re: Morphline removes xml tags from field

You need to change your xquery command to wrap your XML output into yet another XML element (e.g. “record”).

 

For example, in order to generate a morphline record with a "myFoo" field that contains "foo", as well as a "myBar" field that contains "bar", your xquery command should be formulated such that it outputs an XML fragment like this:

 

<record>

<myFoo>foo</myFoo>

<myBar>bar</myBar>

</record>

 

Expert Contributor
Posts: 162
Registered: ‎07-29-2013

Re: Morphline removes xml tags from field

Cool, thanks! I'll try this evening.

 

 

Expert Contributor
Posts: 162
Registered: ‎07-29-2013

Re: Morphline removes xml tags from field

Hi, I've tried this:

 return
                      
                                 <entry>
                                  {$entry/attr:ssoId}
                                  {$entry/attr:applicationId}
                                  <body>{$entry/../../ecol:body}</body>
                                </entry>

 And that:

 return
                                <record>
                                 <entry>
                                  {$entry/attr:ssoId}
                                  {$entry/attr:applicationId}
                                  <body>{$entry/../../ecol:body}</body>
                                </entry>
                                </record>

 Nothing helps. It looks like I don't understand the idea or how it works.

<body>{$entry/../../ecol:body}</body>

is extracted, but ALL tags under <ecol:body/> are still removed.

What am I doing wrong?

Expert Contributor
Posts: 162
Registered: ‎07-29-2013

Re: Morphline removes xml tags from field

The result is the same: I do get the contents of the expression {$entry/../../body}, but all tags are removed.

Imagine I had

<body>

<inner>inner text </inner>

</body>

I get "inner text" as a result.

I want to get <inner>inner text </inner>, with the tag kept.

Cloudera Employee
Posts: 145
Registered: ‎08-21-2013

Re: Morphline removes xml tags from field

That's the expected behavior per the doc at kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html#xquery: "The XPath string value of the attribute or child is filled into the record field". The "XPath string value" is the concatenation of the *text nodes*; it does not include the element names or attribute names, per www.w3.org/TR/xquery-operators/#func-string
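
To make the difference between a string value and a serialization concrete, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. It is an illustration only, not the morphline xquery command or Saxon, and the sample document is the <body>/<inner> example from earlier in this thread.

```python
import xml.etree.ElementTree as ET

# Sample document taken from the <body>/<inner> example in this thread.
doc = ET.fromstring("<body><inner>inner text </inner></body>")

# String value: the concatenation of all descendant text nodes, no markup.
string_value = "".join(doc.itertext())

# Serialization: the markup of the child element, tags included.
inner = doc.find("inner")
serialized = ET.tostring(inner, encoding="unicode")

print(repr(string_value))
print(serialized)
```

The first value is what a string-value based extraction yields; the second is what the original question is asking for.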

Also, Solr/Lucene wouldn't know what to do with those tag names anyway. A Lucene/Solr field holds a primitive type such as a string; it doesn't work with nested structures.

P.S. If you really need this, I believe saxon (and hence the xquery command) has an extension function to emit the serialization of an XML document into a string, but unfortunately that extension function is probably not available in the free "Saxon-HE" version that we ship with kite-morphlines-saxon: www.saxonica.com/documentation9.4-demo/html/extensions/functions/serialize.html

Alternatively, you could write your own custom morphline command that implements whatever xquery serialization logic you like, of course. The code would be a copy n' paste of the existing xquery command except for adjusting this part: github.com/kite-sdk/kite/blob/master/kite-morphlines/kite-morphlines-saxon/src/main/java/org/kitesdk/morphline/saxon/XQueryBuilder.java#L196-L198
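
The serialization idea itself can be sketched in a few lines. The following is a hedged illustration in Python's standard library, not the XQueryBuilder code and not a Saxon API: the helper name inner_markup is hypothetical, and the sample input is the <body>/<inner> example from this thread. The point is that once the subtree has been turned into a plain string, a string-value extraction of the field returns the markup.

```python
import xml.etree.ElementTree as ET


def inner_markup(elem):
    """Hypothetical helper: return an element's content as a markup string."""
    parts = [elem.text or ""]
    for child in elem:
        # ET.tostring serializes the child element, including its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


source = ET.fromstring("<body><inner>inner text </inner></body>")

# Build an output record whose "body" field holds the markup as plain text.
record = ET.Element("record")
field = ET.SubElement(record, "body")
field.text = inner_markup(source)

# The string value of the field now contains the tags.
value = "".join(field.itertext())

# In the serialized record the markup is escaped, so it stays a flat string.
record_xml = ET.tostring(record, encoding="unicode")

print(value)
print(record_xml)
```

Because the field content is an ordinary string, it also fits the flat string field that Solr expects.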
Expert Contributor
Posts: 162
Registered: ‎07-29-2013

Re: Morphline removes xml tags from field

I took org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder as an example.

My custom interceptor takes the event body and stores it in an event header.

Then SolrSink picks up this header by default and sends it to Solr for indexing.

It works.

NB: the Solr schema.xml should have a matching field declaration.

 
