Hive REGEXP_REPLACE seems to cut off large strings containing XML

Hi, I have a large number of XML files stored in HBase. The files contain embedded binary data such as PDF and Word documents.

The contents column holds the content of the XML file.

I want to replace the binary value in the XML tag DokumentFilIndhold with the value "Content Removed":

 REGEXP_REPLACE(contents,"(?s)<ns0:DokumentFilIndhold[^>]*>.*?</ns0:DokumentFilIndhold>", "Content Removed")

The regular expression seems to work exactly as expected when I test it at https://regexr.com/

But when I run the query on my data, it cuts off the contents, so the result is no longer a valid XML file.

Does the function REGEXP_REPLACE have some limitation, or is my expression wrong? The value is up to 65,000 characters long.
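One way to narrow this down is to run the same expression outside Hive. The sketch below assumes that Hive evaluates REGEXP_REPLACE with java.util.regex (regexr.com tests against a different regex engine), so plain Java should behave the same way as the query. The class name RegexReplaceCheck, the namespace URI urn:example, and the surrounding elements Dokument, Titel, and Status are made up for the test; only DokumentFilIndhold and the pattern come from the query above.

```java
import java.util.regex.Pattern;

public class RegexReplaceCheck {
    // The pattern from the query above, written as a Java string literal.
    static final Pattern CONTENT = Pattern.compile(
        "(?s)<ns0:DokumentFilIndhold[^>]*>.*?</ns0:DokumentFilIndhold>");

    // Replace every DokumentFilIndhold element (tags included) with the placeholder.
    static String strip(String xml) {
        return CONTENT.matcher(xml).replaceAll("Content Removed");
    }

    public static void main(String[] args) {
        // Synthetic payload: 65,000 letters with line breaks, standing in for base64 data.
        StringBuilder payload = new StringBuilder();
        for (int i = 0; i < 65000; i++) {
            payload.append((char) ('A' + i % 26));
            if (i % 76 == 75) {
                payload.append('\n');
            }
        }
        // Hypothetical document structure, only for this test.
        String xml = "<ns0:Dokument xmlns:ns0=\"urn:example\">"
            + "<ns0:Titel>test</ns0:Titel>"
            + "<ns0:DokumentFilIndhold>" + payload + "</ns0:DokumentFilIndhold>"
            + "<ns0:Status>ok</ns0:Status>"
            + "</ns0:Dokument>";

        String out = strip(xml);
        System.out.println("input length:  " + xml.length());
        System.out.println("output length: " + out.length());
        System.out.println(out);
    }
}
```

If the text after the replaced element survives here but not in the Hive result, the expression itself is probably not what truncates the value. Note that the pattern replaces the opening and closing DokumentFilIndhold tags along with their content.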

It is urgent for me to find a solution, so any ideas will be very well received.