Normally Hadoop cannot merge lines together, because the underlying tools all use a LineRecordReader that treats every text line as a record. However, Hadoop can use different line readers as well, and doing the merge there is more scalable than Perl or Python scripts. In this case I used a modified QuotationLineReader. The project has to copy a lot of the code of the standard TextInputFormat to make this change. There are some more usage tips in the README.
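To make the idea concrete, here is a minimal sketch of the merging rule in plain Python. It is not the Hadoop code from the project, only an illustration, and it assumes the modified reader keeps reading while a double-quoted field is still open. The function names and the sample data are made up for this example.

```python
def naive_records(text):
    """Split on every newline, the way a plain line-based reader would."""
    return text.split("\n")


def quote_aware_records(text, quote='"'):
    """Merge physical lines until the quote characters are balanced.

    Assumption: a record is complete once it contains an even number of
    quote characters. Doubled quotes ("") used as escapes keep the count
    even, so they do not break this rule.
    """
    records = []
    pending = None  # partial record whose quoted field is still open
    for line in text.split("\n"):
        current = line if pending is None else pending + "\n" + line
        if current.count(quote) % 2 == 1:
            pending = current  # odd count: still inside a quoted field
        else:
            records.append(current)
            pending = None
    if pending is not None:
        records.append(pending)  # unterminated quote: emit what was read
    return records


# Hypothetical input: the second row has a newline inside a quoted field.
sample = 'id,comment\n1,"first line\nsecond line"\n2,plain'

print(len(naive_records(sample)))        # the quoted row is cut in two
print(len(quote_aware_records(sample)))  # the quoted row stays whole
```

The plain split cuts the quoted row into two records, while the quote-aware version keeps it as one. A custom reader inside Hadoop would apply the same kind of rule while reading each input split, which is where the scalability comes from.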
I don't think so. The normal TextInputFormat, which is the basis for "stored as TEXT", turns every line (a string followed by a newline character) into a record or row. So the row would already be broken before you could even use "regex_replace". If you have a way, let me know though 🙂