- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Hive xpath: XML in a column where data contains XML brackets
- Labels:
-
Apache Hive
Created ‎01-02-2019 10:57 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
We have a Hive External table where a couple of columns have XML data in them. The XML structure is valid but some of the records contain "?<" or "?>" in the data of an element. As you can see, the XML itself is valid as the value is quoted but Hive xpath is unhappy. Trying to figure out if there's a better way to escape out the data elements or have xpath ignore the extra "<".
Example:
<Root><data1><data2 id="4" value="622" name="CoID" /><data2 id="9999" value="Company XYZ" name="Company Name" /><data2 id="9999" value="2222345=0000?<" name="OfferId" /></data1></Root>
Query:
select xpath_string(xml_column, '//Root [@name="OfferId"]/@value') from table;
Here is the Error that we receive:
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 284; The value of attribute "value" associated with an element type "Root" must not contain the '<' character.   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)   at org.apache.hadoop.hive.ql.udf.xml.UDFXPathUtil.eval(UDFXPathUtil.java:90)   ... 35 more
Created ‎01-03-2019 03:04 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I figured out that if I use regex_replace, I can work around the data issue.
select xpath_string(regex_replace(xml_column, '\\?\<', ''), '//Root [@name="OfferId"]/@value') from table;
Created ‎01-03-2019 03:04 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I figured out that if I use regex_replace, I can work around the data issue.
select xpath_string(regex_replace(xml_column, '\\?\<', ''), '//Root [@name="OfferId"]/@value') from table;
