I have a small XML file from which I am extracting XPath values into attributes with EvaluateXPath 1.7.0. The XML is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<sports-content path-id="general/l.sportsml.com.general/advisory/xt.pa.json.20180730123447-heartbeat">
<sports-metadata doc-id="xt.pa.json.20180730123447-heartbeat" date-time="2018-07-30T12:34:47+00:00" language="en-US" document-class="advisory" fixture-key="heartbeat" fixture-name="Heartbeat">
<sports-title/>
<sports-content-codes>
<sports-content-code code-type="publisher" code-key="padatasports.com" code-name="The Press Association"/>
<sports-content-code code-type="distributor" code-key="xmlteam.com" code-name="XML Team Solutions, Inc."/>
<sports-content-code code-type="sport" code-key="15000000" code-name="General"/>
<sports-content-code code-type="league" code-key="l.sportsml.com.general" code-source="xmlteam.com" code-name="General"/>
</sports-content-codes>
</sports-metadata>
</sports-content>
The 3 attributes' XPath expressions are: string(/sports-content/sports-metadata/@date-time), string(/sports-content/sports-metadata/@doc-id), string(/sports-content/@path-id)
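To check the expressions outside the flow, here is a minimal sketch that evaluates them with the JDK's built-in `javax.xml.xpath` API. The class name `XPathAttributeSketch`, the `evaluate` helper, and the trimmed copy of the document are assumptions made for illustration; EvaluateXPath itself may use a different XPath engine internally.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathAttributeSketch {
    // Trimmed copy of the document above: only the attributes the expressions read.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<sports-content path-id=\"general/l.sportsml.com.general/advisory/xt.pa.json.20180730123447-heartbeat\">\n"
        + "<sports-metadata doc-id=\"xt.pa.json.20180730123447-heartbeat\" date-time=\"2018-07-30T12:34:47+00:00\"/>\n"
        + "</sports-content>";

    // Hypothetical helper: parse the text and return the expression's string value.
    static String evaluate(String xml, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String[] expressions = {
            "string(/sports-content/sports-metadata/@date-time)",
            "string(/sports-content/sports-metadata/@doc-id)",
            "string(/sports-content/@path-id)"
        };
        for (String expression : expressions) {
            System.out.println(expression + " = " + evaluate(XML, expression));
        }
    }
}
```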
The attributes are extracted properly, but I am getting this error:
javax.xml.xpath.XPathExpressionException: Failure converting a node of class javax.xml.transform.sax.SAXSource: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog
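For reference, the JDK's default XML parser reports this kind of error at line 1, column 1 when non-whitespace characters precede the first `<`. The sketch below reproduces that situation in isolation; `PrologErrorSketch` and `parseError` are hypothetical names, and nothing here claims that this is what happens inside the flow.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class PrologErrorSketch {
    // Hypothetical helper: returns null if the text parses, otherwise the parser's complaint.
    static String parseError(String xml) throws Exception {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed input: no error expected.
        System.out.println(parseError("<sports-content/>"));
        // Stray text before the root element: the parser rejects it.
        System.out.println(parseError("junk<sports-content/>"));
    }
}
```

Running the same check on the exact bytes of a failing flowfile would show whether the content itself is at fault or whether the error originates from something else being parsed.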
Thinking that there might indeed be unseen characters at the beginning or end of the file, I looked at it in a hex viewer and found a control character at the end. I added a ReplaceText processor with
(?s)^[^\<]*(<.*</sports-content>).*$
being replaced by $1
This worked to remove the control character but it did not get rid of the error message. As I said, I am able to get the attribute values I need, but I do not want to have my log filled with error messages, one for each flowfile.
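The ReplaceText pattern above can be checked on its own with `java.util.regex`. This is a sketch under the assumption that the processor applies the pattern to the whole content with Java regex semantics; `StripAroundRootSketch` and `strip` are hypothetical names, and the sample string with a leading U+FEFF and a trailing U+0003 is made up for the test.

```java
import java.util.regex.Pattern;

public class StripAroundRootSketch {
    // Same pattern as in the ReplaceText configuration; (?s) lets '.' match line breaks.
    static final Pattern WRAPPER =
        Pattern.compile("(?s)^[^\\<]*(<.*</sports-content>).*$");

    // Keeps everything from the first '<' through the last </sports-content>.
    // Content without a closing </sports-content> tag is returned unchanged.
    static String strip(String content) {
        return WRAPPER.matcher(content).replaceAll("$1");
    }

    public static void main(String[] args) {
        String dirty = "\uFEFF<?xml version=\"1.0\"?>\n<sports-content>\n</sports-content>\u0003";
        String clean = strip(dirty);
        System.out.println(clean.length() + " of " + dirty.length() + " characters kept");
    }
}
```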