Hadoop normally cannot merge several lines into one record, because the underlying tools all use a LineRecordReader that treats every line of text as a separate record. Hadoop can, however, use other line readers as well, which scales better than merging the lines with Perl or Python scripts. In this case I used a modified line reader, QuotationLineReader. To make this change, the project has to copy much of the code of the standard TextInputFormat. The README has some more usage tips.
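
To illustrate the idea of merging lines that belong to one quoted field, here is a minimal sketch in plain Java with no Hadoop dependency. It is not the project's QuotationLineReader: the class name `QuoteAwareRecords`, the method `readRecords`, and the choice of the double quote as the quotation character are all assumptions made for this example. It also ignores input split boundaries, which a real Hadoop record reader has to handle.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: shows only the line-merging idea, not the project's code.
final class QuoteAwareRecords {

    // Assumption: fields are quoted with a double quote, and a literal quote
    // inside a field is written as two quotes (""), so quote parity still works.
    private static final char QUOTE = '"';

    /** Reads physical lines and joins them while a quotation is still open. */
    static List<String> readRecords(BufferedReader in) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuote = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (insideQuote) {
                current.append('\n'); // keep the line break embedded in the field
            }
            current.append(line);
            // Every quote character flips the open/closed state.
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == QUOTE) {
                    insideQuote = !insideQuote;
                }
            }
            if (!insideQuote) {
                // The quotation is closed, so the record is complete.
                records.add(current.toString());
                current.setLength(0);
            }
        }
        // Emit a trailing record whose quotation was never closed.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String input = "1,\"first line\nsecond line\",x\n2,plain,y\n";
        BufferedReader reader = new BufferedReader(new StringReader(input));
        for (String record : readRecords(reader)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

With this input, the first two physical lines end up in one record because the quotation opened on the first line is only closed on the second. A plain line-per-record reader would split them.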