Code Repositories

Repo Description

Normally Hadoop cannot merge lines together, because the underlying tools all use a LineRecordReader that treats every text line as a record. However, Hadoop can use other line readers as well, and doing the merge inside Hadoop is more scalable than running Perl or Python scripts over the files. In this case I used a modified QuotationLineReader. The project has to copy a lot of the code of the standard TextInputFormat to make this change. Some more usage tips are in the README.
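To make the idea concrete, here is a minimal sketch of quote-aware line merging in plain Java. It is not the code from the repository and uses no Hadoop classes. The class name QuoteAwareMerger, the method names, and the rule "a record stays open while a quotation is open" are assumptions made for illustration. In Hadoop, logic like this would sit inside a custom record reader instead of a standalone method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the repository's implementation.
class QuoteAwareMerger {

    // True if the text holds an odd number of quote characters, meaning the
    // line flips the "inside a quotation" state. Doubled quotes used as an
    // escape ("") count twice and therefore leave the state unchanged.
    static boolean flipsQuoteState(String text, char quote) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == quote) {
                count++;
            }
        }
        return count % 2 == 1;
    }

    // Merges physical lines into logical records. A newline that falls inside
    // an open quotation is swapped for `replacement` instead of ending the record.
    static List<String> mergeLines(List<String> lines, char quote, String replacement) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;
        for (String line : lines) {
            if (inQuote) {
                current.append(replacement); // stands in for the removed newline
            }
            current.append(line);
            if (flipsQuoteState(line, quote)) {
                inQuote = !inQuote;
            }
            if (!inQuote) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        if (inQuote) {
            // Input ended inside a quotation: emit what was collected so far.
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1,\"first line",
                "second line\",x",
                "2,\"single\",y");
        for (String record : mergeLines(lines, '"', " ")) {
            System.out.println(record);
        }
    }
}
```

With the three sample lines above, the first two are joined into one record and the third is passed through unchanged.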

Repo Info
Github Repo URL https://github.com/benleon/NewLineRemover
Github account name benleon
Repo name NewLineRemover
Comments

@Benjamin Leonhardi

Hive's regexp_replace can also be used.


I don't think so. The normal TextInputFormat, which is the basis for "STORED AS TEXTFILE", turns every line (a string followed by a newline character) into a record or row. So the record would already be broken apart before you could even apply "regexp_replace". If you have a way, let me know though 🙂
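A small sketch of that argument in plain Java, with no Hadoop or Hive involved. The class LineSplitDemo and both methods are hypothetical stand-ins: splitIntoRecords mimics a line-based reader, and replacePerRecord mimics a per-row function such as regexp_replace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of why a per-row replace cannot rejoin lines.
class LineSplitDemo {

    // Mimics a line-based reader: every newline ends a record.
    static List<String> splitIntoRecords(String fileContent) {
        List<String> records = new ArrayList<>();
        for (String line : fileContent.split("\n")) {
            records.add(line);
        }
        return records;
    }

    // Mimics a per-row function: it only ever sees one record at a time.
    static List<String> replacePerRecord(List<String> records, String regex, String replacement) {
        List<String> result = new ArrayList<>();
        for (String record : records) {
            result.add(record.replaceAll(regex, replacement));
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "1,\"first line\nsecond line\",x\n2,\"single\",y\n";
        List<String> records = splitIntoRecords(content);
        // The newline is already gone from each record, so there is nothing
        // left for the pattern to match and the row count cannot shrink.
        List<String> replaced = replacePerRecord(records, "\n", " ");
        System.out.println(replaced.size()); // prints 3
    }
}
```

The quoted field is still spread over two rows after the replace, because the split happened before the function ever ran.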