Community Articles

VidyaSargur · ‎08-16-2023

Hi Everyone,

I ran into a situation where I have to the following XML Input:

<workspace id="1234">
<documents totalCount="2">
<document id="5555" title="document1" revision="1">
<file name="file1.pdf" type="pdf"/>
<attributes>
	<attribute name="custAtt1" value="v1"/>
	<attribute name="custAtt2" value="v2"/>
	<attribute name="custAtt3" value=""/>
</attributes>
</document>
<document id="6666" title="document2" revision="0">
	<file name="file2.xlsx" type="xlsx"/>
	<attributes>
	<attribute name="custAtt1" value="v1"/>
	<attribute name="custAtt2" value="v2"/>
	<attribute name="custAtt3" value="v3"/>
	</attributes>
</document>
</documents>
</workspace>

Here, each /workspace/documents/document record needed to be split and transformed into JSON, where:

Each document object needs to have the workspace ID it belongs to.
Each document attribute value will be assigned a key in the format of document_[AttributeName].
Each document file attribute value will be assigned a key in the format of file_[AttributeName].
Each document custom attribute value will be assigned a key in the format custom_[AttributeName]. Empty custom attributes should be ignored.

For example, the first document record will look like the following in JSON:

{
	"workspace_id": "1234",
	"document_id": "5555",
	"document_title": "document1",
	"document_revision": "1",
	"file_name": "file1.pdf",
	"file_type": "pdf",
	"custom_custAtt1": "v1",
	"custom_custAtt2": "v2"
}

Traditional Approach:

The first approach that came to mind is to use the traditional processors of SplitXML, ConvertRecord and finally use some JoltTransformJson to flatten the JSON and re-map values to the proper keys. However, the jolt is not going to be that straightforward given the complexity of the XML which makes the result of the ConvertRecord processor hard to predict.

EvaluateXQuery to the Rescue:

After researching Xquery syntax, the EvaluateXQuery processor, and some testing, I found that it can be used as a once processor to split, convert, and transform the input XML to the required JSON format.

XQuery in general can support multiple data types of output: XML, HTML, and Text. We can utilize the Text data type (Output:Method property in the EvaluateXQuery processor) to produce any format we like, which is JSON in our case. In Xquery syntax, you can use XPath to access elements\attributes values in an XML structure, and it has a lot of built-in functions like "string-join" and "concat" besides variable declaration, all of which can help in the required transformation.

The beauty of EvaluateXQuery is that it will return new flowfile content (or attributes, depending on the Destination property setting) for each query result. For example, if we use a for loop to iterate through the different XML child elements and use a "return" statement in the iteration body, every return will result in a newly created flowfile (or nth attribute) in the Matched relationship. This serves as the needed split operation.

Without further due, here is how the EvaluateXQuery is configured in my case:

The XQueryValue is a dynamic property which contains the following Xquery:

let $workspaceId := workspace/@id
for $doc in /workspace/documents/document
	
  let $workspace_id  := concat('"workspace_id":"',$workspaceId,'"')
    
  let $doc_attr  := string-join( for $x in $doc/@* where $x!=''
                                   return
                                     concat(',"document_',name($x),'":"',data($x),'"')
                               ,'')
   
   
  let $doc_file  := string-join( for $x in $doc/file/@*
                                   return
                                     concat(',"file_',name($x),'":"',data($x),'"')
                              ,'') 
                              
                                    
  let $doc_custom:= string-join( for $x in $doc/attributes/attribute where $x/@value!=''
                                   return
                                     concat(',"custom_', $x/@name,'":"',$x/@value,'"')
                               ,'')
                                        
                                    
  let $doc_json:= string-join(                             
                               ("{",$workspace_id,$doc_attr,$doc_file,$doc_custom,"}")
                              ,'')
    
  return  $doc_json

Performance Analysis:

Despite the fact that the EvaluateXQuery in this case compensated for split, convertrecord, and Jolt, I had to make sure that it was still performant in comparison. To do that, I created the following scenarios to test against the same input (all executed on the primary node):

SplitXML -> ConvertRecord -> FlattenJson -> LogMessage
EvaluateXQuery -> LogMessage

The first scenario, of course, is much simplified just to see if EvaluateXQuery still performs better.

After doing stress testing (back pressure was applied), the average Lineage Duration at the end of each flow came out as follows:

First Scenario (split, convert, and transform):

Second Scenario (EvaluateXQuery):

We can see that Max/Mean Values for the EvaluateQuery are much better than traditional split, convert, and transform processors - 00:00:05.811 / 00:00:00.066 vs. 00:00:30.258 / 00:00:00.360

Conclusion:

Using the EvaluateXQuery simplified our data flow and gained us better performance when using the split, convertrecord, and transform processors. However, one needs to be careful when using such a processor, as it doesn’t guarantee valid JSON output, since the output method is set to text. Understanding the schema of your input XML and using XQuery functions like replace to accommodate for any invalid characters can help you avoid failures in the downstream processors.

Thanks for reading.

Please let me know if you have any comments or feedback.

Samer Alisaleh

Cloudera Community

Community Articles

How to split and transform complex XML with EvaluateXQuery Processor

Apache NiFi