We have a Hive External table where a couple of columns have XML data in them. The XML structure is valid but some of the records contain "?<" or "?>" in the data of an element. As you can see, the XML itself is valid as the value is quoted but Hive xpath is unhappy. Trying to figure out if there's a better way to escape out the data elements or have xpath ignore the extra "<".
Example:
<Root><data1><data2 id="4" value="622" name="CoID" /><data2 id="9999" value="Company XYZ" name="Company Name" /><data2 id="9999" value="2222345=0000?<" name="OfferId" /></data1></Root>
Query:
select
xpath_string(xml_column, '//Root [@name="OfferId"]/@value')
from table;
Here is the Error that we receive:
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 284; The value of attribute "value" associated with an element type "Root" must not contain the '<' character.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at org.apache.hadoop.hive.ql.udf.xml.UDFXPathUtil.eval(UDFXPathUtil.java:90)
... 35 more