<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Hive xpath: XML in a column where data contains XML brackets in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Hive-xpath-XML-in-a-column-where-data-contains-XML-brackets/m-p/241407#M203210</link>
    <description>&lt;P&gt;We have a Hive External table where a couple of columns have XML data in them.  The XML structure is valid but some of the records contain "?&amp;lt;" or "?&amp;gt;" in the data of an element.  As you can see, the XML itself is valid as the value is quoted but Hive xpath is unhappy.  Trying to figure out if there's a better way to escape out the data elements or have xpath ignore the extra "&amp;lt;". &lt;/P&gt;&lt;P&gt;Example:&lt;/P&gt;&lt;PRE&gt;&amp;lt;Root&amp;gt;&amp;lt;data1&amp;gt;&amp;lt;data2 id="4" value="622" name="CoID" /&amp;gt;&amp;lt;data2 id="9999" value="Company XYZ" name="Company Name" /&amp;gt;&amp;lt;data2 id="9999" value="2222345=0000?&amp;lt;" name="OfferId" /&amp;gt;&amp;lt;/data1&amp;gt;&amp;lt;/Root&amp;gt;&lt;/PRE&gt;&lt;P&gt;Query:&lt;/P&gt;&lt;PRE&gt;select
xpath_string(xml_column, '//Root [@name="OfferId"]/@value')
from table;&lt;/PRE&gt;&lt;P&gt;Here is the Error that we receive:&lt;/P&gt;&lt;PRE&gt;Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 284; The value of attribute "value" associated with an element type "Root" must not contain the '&amp;lt;' character.
  at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
  at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
  at org.apache.hadoop.hive.ql.udf.xml.UDFXPathUtil.eval(UDFXPathUtil.java:90)
  ... 35 more&lt;/PRE&gt;</description>
    <pubDate>Thu, 03 Jan 2019 06:57:26 GMT</pubDate>
    <dc:creator>jason_zondor</dc:creator>
    <dc:date>2019-01-03T06:57:26Z</dc:date>
    <item>
      <title>Hive xpath: XML in a column where data contains XML brackets</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hive-xpath-XML-in-a-column-where-data-contains-XML-brackets/m-p/241407#M203210</link>
      <description>&lt;P&gt;We have a Hive External table where a couple of columns have XML data in them.  The XML structure is valid but some of the records contain "?&amp;lt;" or "?&amp;gt;" in the data of an element.  As you can see, the XML itself is valid as the value is quoted but Hive xpath is unhappy.  Trying to figure out if there's a better way to escape out the data elements or have xpath ignore the extra "&amp;lt;". &lt;/P&gt;&lt;P&gt;Example:&lt;/P&gt;&lt;PRE&gt;&amp;lt;Root&amp;gt;&amp;lt;data1&amp;gt;&amp;lt;data2 id="4" value="622" name="CoID" /&amp;gt;&amp;lt;data2 id="9999" value="Company XYZ" name="Company Name" /&amp;gt;&amp;lt;data2 id="9999" value="2222345=0000?&amp;lt;" name="OfferId" /&amp;gt;&amp;lt;/data1&amp;gt;&amp;lt;/Root&amp;gt;&lt;/PRE&gt;&lt;P&gt;Query:&lt;/P&gt;&lt;PRE&gt;select
xpath_string(xml_column, '//Root [@name="OfferId"]/@value')
from table;&lt;/PRE&gt;&lt;P&gt;Here is the Error that we receive:&lt;/P&gt;&lt;PRE&gt;Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 284; The value of attribute "value" associated with an element type "Root" must not contain the '&amp;lt;' character.
  at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
  at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
  at org.apache.hadoop.hive.ql.udf.xml.UDFXPathUtil.eval(UDFXPathUtil.java:90)
  ... 35 more&lt;/PRE&gt;</description>
      <pubDate>Thu, 03 Jan 2019 06:57:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hive-xpath-XML-in-a-column-where-data-contains-XML-brackets/m-p/241407#M203210</guid>
      <dc:creator>jason_zondor</dc:creator>
      <dc:date>2019-01-03T06:57:26Z</dc:date>
    </item>
    <item>
      <title>Re: Hive xpath: XML in a column where data contains XML brackets</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Hive-xpath-XML-in-a-column-where-data-contains-XML-brackets/m-p/241408#M203211</link>
      <description>&lt;P&gt;I figured out that if I use regex_replace, I can work around the data issue. &lt;/P&gt;&lt;PRE&gt;select
xpath_string(regex_replace(xml_column, '\\?\&amp;lt;', ''), '//Root [@name="OfferId"]/@value')
from table;&lt;/PRE&gt;</description>
      <pubDate>Thu, 03 Jan 2019 23:04:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Hive-xpath-XML-in-a-column-where-data-contains-XML-brackets/m-p/241408#M203211</guid>
      <dc:creator>jason_zondor</dc:creator>
      <dc:date>2019-01-03T23:04:10Z</dc:date>
    </item>
  </channel>
</rss>

