<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Hive REGEXP_REPLACE seems to cut large string containing XML in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/hive-REGEXP-REPLACE-seems-to-cuts-large-string-containing/m-p/235718#M197531</link>
    <description>&lt;P&gt;Hi, I have a large number of XML files stored in HBase; the files contain binary data such as PDF, Word, etc.&lt;/P&gt;&lt;P&gt;The column contents holds the content of the XML file.&lt;/P&gt;&lt;P&gt;I want to replace the binary value in the XML tag DokumentFilIndhold with the value "Content Removed":&lt;/P&gt;&lt;PRE&gt; REGEXP_REPLACE(contents,"(?s)&amp;lt;ns0:DokumentFilIndhold[^&amp;gt;]*&amp;gt;.*?&amp;lt;/ns0:DokumentFilIndhold&amp;gt;", "Content Removed")&lt;/PRE&gt;&lt;P&gt;The regular expression seems to work exactly as expected when I test it with &lt;A href="https://regexr.com/" target="_blank"&gt;https://regexr.com/&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;But when I run the query on my data, it cuts off the contents, so the result is no longer a valid XML file.&lt;/P&gt;&lt;P&gt;Does the function REGEXP_REPLACE have some limitation, or is it my expression that's wrong? The value is up to 65000 characters.&lt;/P&gt;&lt;P&gt;It is urgent for me to find a solution, so any ideas will be very well received.&lt;/P&gt;</description>
    <pubDate>Sat, 10 Nov 2018 18:41:28 GMT</pubDate>
    <dc:creator>simon_jespersen</dc:creator>
    <dc:date>2018-11-10T18:41:28Z</dc:date>
  </channel>
</rss>

