Support Questions

Find answers, ask questions, and share your expertise

Hive xpath: XML in a column where data contains XML brackets

avatar

We have a Hive External table where a couple of columns have XML data in them. The XML structure is valid but some of the records contain "?<" or "?>" in the data of an element. As you can see, the XML itself is valid as the value is quoted but Hive xpath is unhappy. Trying to figure out if there's a better way to escape out the data elements or have xpath ignore the extra "<".

Example:

<Root><data1><data2 id="4" value="622" name="CoID" /><data2 id="9999" value="Company XYZ" name="Company Name" /><data2 id="9999" value="2222345=0000?<" name="OfferId" /></data1></Root>

Query:

select
xpath_string(xml_column, '//Root [@name="OfferId"]/@value')
from table;

Here is the Error that we receive:

Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 284; The value of attribute "value" associated with an element type "Root" must not contain the '<' character.
  at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
  at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
  at org.apache.hadoop.hive.ql.udf.xml.UDFXPathUtil.eval(UDFXPathUtil.java:90)
  ... 35 more
1 ACCEPTED SOLUTION

avatar

I figured out that if I use regex_replace, I can work around the data issue.

select
xpath_string(regex_replace(xml_column, '\\?\<', ''), '//Root [@name="OfferId"]/@value')
from table;

View solution in original post

1 REPLY 1

avatar

I figured out that if I use regex_replace, I can work around the data issue.

select
xpath_string(regex_replace(xml_column, '\\?\<', ''), '//Root [@name="OfferId"]/@value')
from table;