Created on 01-26-2014 07:26 AM - edited 09-16-2022 01:52 AM
I'm having trouble getting XSLT working with Morphlines. I'm using the XSLT as per the documentation
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html#/xslt
but it does not seem to pass on the elements and attributes as per this description:
"...For each item in the query result sequence, the morphline command converts the item to a record and pipes that record to the next morphline command. For an attribute node the attribute's XPath string value is filled into the record field named after the attribute name. For an element node the attributes and children of the element are treated as follows: The XPath string value of the attribute or child is filled into the record field named after the child's name..."
Below are a simple XML file, my morphline script, and the output of running it in TRACE mode with the standalone test program.
The same problem also happens in my full Cloudera Search deployment.
When you look at "output record:" in the logInfo output, you can see that only the Child1 element is passed through, but none of its attributes, nor the sub-child Child1_1 element.
If the above description of how XSLT is processed within Morphlines is true, then why aren't all elements and attributes being passed through?
cat test.xml
<RootNode>
<Child1 A="a" B="B">
<Child1_1 C="c"/>
</Child1>
</RootNode>
cat /tmp/morphlog.txt
1479 [main] TRACE com.cloudera.cdk.morphline.saxon.XSLTBuilder$XSLT - beforeProcess: {_attachment_body=[java.io.BufferedInputStream@5de3eba1]}
1529 [main] TRACE com.cloudera.cdk.morphline.saxon.XSLTBuilder$XSLT - XSLT input document: <RootNode>
<Child1 A="a" B="B">
<Child1_1 C="c"/>
</Child1>
</RootNode>
1553 [main] TRACE com.cloudera.cdk.morphline.stdlib.GenerateUUIDBuilder$GenerateUUID - beforeProcess: {Child1=[
]}
1556 [main] TRACE com.cloudera.cdk.morphline.stdlib.LogInfoBuilder$LogInfo - beforeProcess: {Child1=[
], id=[c7a9acc1-a82b-4fa1-b491-603d5f570401]}
1556 [main] INFO com.cloudera.cdk.morphline.stdlib.LogInfoBuilder$LogInfo - output record: [{Child1=[
], id=[c7a9acc1-a82b-4fa1-b491-603d5f570401]}]
cat morphlines.conf
morphlines : [
  {
    id : morphtest
    importCommands : ["com.cloudera.**"]
    commands : [
      {
        xslt {
          fragments : [
            {
              fragmentPath : "/"
              queryString : """
                <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >
                  <xsl:template match="@*|node()">
                    <xsl:copy>
                      <xsl:apply-templates select="@*|node()"/>
                    </xsl:copy>
                  </xsl:template>
                </xsl:stylesheet>
              """
            }
          ]
        }
      }
      { generateUUID { field : id } }
      { logInfo { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
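For what it's worth, the quoted description can be read literally, and a small Java sketch of that reading reproduces a record like the one logged above. This is not the actual Morphlines implementation: `RecordSketch`, `toRecord`, and `parse` are made-up names, the sketch assumes the item converted to a record is the RootNode element, and it uses DOM `getTextContent()` as a stand-in for the XPath string value. Under those assumptions Child1's value is only the whitespace between its tags, because attribute values and the empty Child1_1 element contribute no text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RecordSketch {

    // Hypothetical helper: follows the quoted description, not the real Morphlines code.
    // Each attribute and each child element of the item becomes one record field.
    public static Map<String, List<String>> toRecord(Element item) {
        Map<String, List<String>> record = new LinkedHashMap<>();
        NamedNodeMap attrs = item.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            record.computeIfAbsent(attr.getNodeName(), k -> new ArrayList<>())
                  .add(attr.getNodeValue());
        }
        NodeList children = item.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                // getTextContent() concatenates all descendant text nodes,
                // which matches the XPath string value of an element.
                record.computeIfAbsent(child.getNodeName(), k -> new ArrayList<>())
                      .add(child.getTextContent());
            }
        }
        return record;
    }

    // Parses an XML string and returns its root element.
    public static Element parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<RootNode>\n<Child1 A=\"a\" B=\"B\">\n<Child1_1 C=\"c\"/>\n</Child1>\n</RootNode>";
        Map<String, List<String>> record = toRecord(parse(xml));
        System.out.println(record.keySet());                               // [Child1]
        System.out.println(record.get("Child1").get(0).trim().isEmpty());  // true
    }
}
```

If this reading is right, the attributes A, B, and C never show up because only the attributes of the result item itself (RootNode, which has none) are turned into fields.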
Created 01-27-2014 03:13 PM
Created 01-27-2014 02:07 PM
Since the sample XSLT was not working for me, I tried crafting a more customized XSLT to extract attributes as elements. While I have a little more success with this, there appears to be a bug in Morphlines' handling of the XSLT results. As the log output below shows, the attribute values of each child element are merged together instead of being treated as distinct items (see Child1_1=[cde] and Child1_2=[xbb]).
At this point, XML processing appears to be broken in Morphlines 😞
cat morphlines.conf
morphlines : [
  {
    id : morphtest
    importCommands : ["com.cloudera.**"]
    commands : [
      {
        xslt {
          fragments : [
            {
              fragmentPath : "/"
              queryString : """
                <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0" >
                  <!-- <xsl:output method="xml" indent="yes" /> -->
                  <xsl:template match="//*[name() != 'RootNode']">
                    <xsl:element name="{name()}">
                      <xsl:for-each select="@*">
                        <xsl:element name="{name()}">
                          <xsl:value-of select="."/>
                        </xsl:element>
                      </xsl:for-each>
                      <xsl:apply-templates select="*|text()"/>
                    </xsl:element>
                  </xsl:template>
                </xsl:stylesheet>
              """
            }
          ]
        }
      }
      { generateUUID { field : id } }
      { logInfo { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
cat test2.xml
<?xml version="1.0"?>
<RootNode>
<Child1 A="a" B="B">
<Child1_1 C="c" D="d" E="e" />
<Child1_2 X="x" B="bb" />
</Child1>
</RootNode>
cat /tmp/morphlog.txt
1964 [main] TRACE com.cloudera.cdk.morphline.saxon.XSLTBuilder$XSLT - XSLT input document:
<RootNode>
<Child1 A="a" B="B">
<Child1_1 C="c" D="d" E="e"/>
<Child1_2 X="x" B="bb"/>
</Child1>
</RootNode>
1981 [main] TRACE com.cloudera.cdk.morphline.stdlib.GenerateUUIDBuilder$GenerateUUID - beforeProcess: {A=[a], B=[B], Child1_1=[cde], Child1_2=[xbb]}
1983 [main] TRACE com.cloudera.cdk.morphline.stdlib.LogInfoBuilder$LogInfo - beforeProcess: {A=[a], B=[B], Child1_1=[cde], Child1_2=[xbb], id=[28827c81-28ca-41b1-9411-b8697d765ac5]}
1984 [main] INFO com.cloudera.cdk.morphline.stdlib.LogInfoBuilder$LogInfo - output record: [{A=[a], B=[B], Child1_1=[cde], Child1_2=[xbb], id=[28827c81-28ca-41b1-9411-b8697d765ac5]}]
Notice that when I run the same XML and XSL through the standalone Saxon processor, it correctly turns the attributes into elements. This leads me to think that the attribute values are being merged at the point where the Morphlines pipeline converts the Saxon output into a record.
java -cp /opt/saxon/SaxonHE9-5-1-2/saxon9he.jar net.sf.saxon.Transform -s:test2.xml -xsl:test2.xsl
<?xml version="1.0" encoding="UTF-8"?>
<Child1><A>a</A><B>B</B>
<Child1_1><C>c</C><D>d</D><E>e</E></Child1_1>
<Child1_2><X>x</X><B>bb</B></Child1_2>
</Child1>
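A note on where the merged values could come from: in XPath, the string value of an element is the concatenation of all its descendant text nodes. The sketch below uses only the JDK's `javax.xml.xpath` (not Morphlines or Saxon; the class and method names are illustrative) to evaluate `string()` on the Saxon output above. It yields the same merged values as the Morphlines record, which suggests the merging may follow from the documented "XPath string value" conversion rather than from the transform itself.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class StringValueDemo {

    // Returns the XPath string value of the first node selected by `path`.
    public static String stringValue(String xml, String path) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("string(" + path + ")",
                new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Same content as the Saxon output shown above.
        String xml = "<Child1><A>a</A><B>B</B>\n"
                + "<Child1_1><C>c</C><D>d</D><E>e</E></Child1_1>\n"
                + "<Child1_2><X>x</X><B>bb</B></Child1_2>\n"
                + "</Child1>";
        System.out.println(stringValue(xml, "/Child1/Child1_1"));   // cde
        System.out.println(stringValue(xml, "/Child1/Child1_2"));   // xbb
        System.out.println(stringValue(xml, "/Child1/Child1_1/D")); // d
    }
}
```

Only leaf elements such as D keep a distinct value here, which is one reason a flat output structure tends to map more cleanly onto record fields.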
Created 01-27-2014 02:46 PM
Created 01-27-2014 02:59 PM
Created 01-27-2014 04:27 PM
Created 01-28-2014 06:52 AM
I've taken a different approach and created an XSLT that extracts the attributes from each element in the hierarchy and transforms them into a single-level collection of elements. This seems to be giving me the results I expect now.
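The stylesheet itself is not shown in this thread, so the following is only a guess at what such a flattening transform might look like, wrapped in a small Java harness so it can be run with the JDK's built-in XSLT 1.0 processor. Everything here is an assumption for illustration: the `record` wrapper element, the `<owner>_<attribute>` naming scheme, and the `FlattenSketch` class. It has not been tested inside Morphlines.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FlattenSketch {

    // Hypothetical stylesheet: emits one flat element per attribute in the
    // document, named after the owning element and the attribute.
    public static final String XSL =
          "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
        + "<xsl:template match=\"/\">"
        + "<record>"
        + "<xsl:for-each select=\"//@*\">"
        + "<xsl:element name=\"{name(..)}_{name()}\">"
        + "<xsl:value-of select=\".\"/>"
        + "</xsl:element>"
        + "</xsl:for-each>"
        + "</record>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML and returns the result as text.
    public static String flatten(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<RootNode>\n<Child1 A=\"a\" B=\"B\">\n"
                + "<Child1_1 C=\"c\" D=\"d\" E=\"e\" />\n"
                + "<Child1_2 X=\"x\" B=\"bb\" />\n"
                + "</Child1>\n</RootNode>";
        // Prints a single <record> element whose children are all leaves,
        // for example <Child1_A>a</Child1_A> and <Child1_2_B>bb</Child1_2_B>.
        System.out.println(flatten(xml));
    }
}
```

Because every child of `record` is a leaf element, each one has a distinct string value, and prefixing the owner's name keeps Child1's B separate from Child1_2's B. Whether Morphlines turns these children into fields exactly this way is an assumption based on the description quoted in the first post.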
Created 01-28-2014 10:56 AM