Code Repositories

Repo Description

Normally Hadoop cannot merge lines together, because the underlying tools all use a LineRecordReader, which treats every text line as a record. However, Hadoop can use different line readers as well, and this scales better than Perl or Python scripts. In this case I used a modified QuotationLineReader. The project has to copy a lot of the code of the standard TextInputFormat to make this change. The README has some more usage tips.
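
To make the idea concrete, here is a minimal sketch in plain Java of quote-aware record merging. It is not the code from the repository and uses no Hadoop classes: the class name QuoteAwareMerger and its methods are made up for illustration. It assumes that a record may only end on a line where every double quote has been closed, and that an embedded newline should become a single space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the QuotationLineReader from the repo.
class QuoteAwareMerger {

    // Counts the double-quote characters in one physical line.
    static int countQuotes(String line) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') n++;
        }
        return n;
    }

    // Merges physical lines into logical records. A record ends only when
    // all double quotes seen so far are closed (even count). Each embedded
    // newline is replaced by a single space (an assumption of this sketch).
    static List<String> merge(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        for (String line : lines) {
            if (current.length() > 0) current.append(' ');
            current.append(line);
            // An odd number of quotes on a line flips the open/closed state.
            if (countQuotes(line) % 2 != 0) insideQuotes = !insideQuotes;
            if (!insideQuotes) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        // Emit a trailing unterminated record rather than dropping it.
        if (current.length() > 0) records.add(current.toString());
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "1,\"first line",
            "second line\",x",
            "2,\"plain\",y");
        for (String record : merge(lines)) {
            System.out.println(record);
        }
    }
}
```

With the three sample lines in main, the first two physical lines end up in one record and the third stays on its own. A real record reader additionally has to deal with input split boundaries, which this sketch ignores.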

Repo Info
GitHub repo URL: https://github.com/benleon/NewLineRemover
GitHub account name: benleon
Repo name: NewLineRemover
Comments

@Benjamin Leonhardi

Hive's regexp_replace can also be used.

I don't think so. The normal TextInputFormat, which is the basis for "STORED AS TEXTFILE", turns every line (a string followed by a newline character) into a record or row. So the record would already be broken apart before you could even apply "regexp_replace". If you have a way, let me know though 🙂
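
A small sketch of why the order matters, in plain Java with no Hive or Hadoop involved. The class LineSplitDemo is hypothetical and only mimics the behavior described above: rows are cut at every newline first, and the regex replacement then runs on each row separately, so it never sees a newline to replace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of line-based record splitting followed by a
// per-row regex replacement; it does not call Hive or Hadoop.
class LineSplitDemo {

    static List<String> replacePerRow(String fileContent) {
        List<String> rows = new ArrayList<>();
        // Step 1: every physical line becomes its own row.
        for (String row : fileContent.split("\n")) {
            // Step 2: the replacement runs per row, but no row contains
            // a newline any more, so nothing is replaced.
            rows.add(row.replaceAll("\\n", " "));
        }
        return rows;
    }

    public static void main(String[] args) {
        String content = "1,\"a\nb\",x\n2,\"c\",y";
        for (String row : replacePerRow(content)) {
            System.out.println(row);
        }
    }
}
```

The quoted field that spans two lines still comes out as two separate rows, which is the problem the custom reader in the repo is meant to address.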