Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

hive REGEXP_REPLACE seems to cuts large string containing xml

avatar
Expert Contributor

Hi have a large numbers of xml files stores in hbase, the files containing binary data like pdf. word etc.

The column contents holds content of the xml file.

I want to replace the binary value from the xml tag DokumentFilIndhold with the value "Content Removed"

 REGEXP_REPLACE(contents,"(?s)<ns0:DokumentFilIndhold[^>]*>.*?</ns0:DokumentFilIndhold>", "Content Removed")

The regular expression seems to work exactly as expected when i test it with https://regexr.com/

But when i run the query on my data it cuts of the contents. So its no longer a valid xml file.

Does the function REGEXP_REPLACE have some limitations or is it my expression that's wrong the value is up to 65000 chars.

Its Urgent for me to find a solution, so any idea will be very well recieved.

1 REPLY 1

avatar

Only a partial answer but in general I do not think REGEX_REPLACE cuts large strings. It will be hard to figure this out in more detail unless you can share a reproducible example.

 

Here is what i tested just now:

1. Create a table that contains a string of 60000+ characters (lorem ipsum)

2. Create a new table by selecting the regex replace of that string (i replaced every a with b)

3. Counting the length of the field in the new table

 

---

 

As said, it may well be that you are using a very specific string or regex that together create this problem, it would be interesting to see if this could be reduced to a minimal example. -- Also keep in mind that though they are very similar, there are many ways a regex itself can be parsed, perhaps the test you did is simply slightly different than the implementation in Hive.


- Dennis Jaheruddin

If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'.