How can we read many small files and have each record be defined by an arbitrary length?

I have many small files that I want to process as one large file (this sounds like a job for a sequence file?). I do not want to read the files line by line; instead, I want each record to be defined by an arbitrary length, and I also want to track which file each record came from.

Example:

file1.txt

01234567890123456789

012345

file2.txt

01234

01234567

arbitrary length: 10

key -> value

file1.txt -> 0123456789

file1.txt -> 0123456789

file1.txt -> 012345

file2.txt -> 0123401234

file2.txt -> 567
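
The mapping in the example can be sketched in plain Python as follows. This is only an illustration of the intended record splitting, not a Hadoop or Spark implementation. The function name `fixed_length_records` and the in-memory `files` list are hypothetical, and the sketch assumes that line breaks are not part of the data (which is what the expected output above implies) and that records never span two files.

```python
from typing import Iterable, Iterator, Tuple


def fixed_length_records(
    files: Iterable[Tuple[str, str]], record_length: int
) -> Iterator[Tuple[str, str]]:
    """Yield (file name, record) pairs; only the last record of a file may be short."""
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    for name, content in files:
        # Assumption: line breaks are not data, so drop them before chunking.
        data = "".join(content.splitlines())
        # Step through the file in record_length-sized slices.
        for start in range(0, len(data), record_length):
            yield name, data[start:start + record_length]


# Hypothetical stand-in for the two files from the example.
files = [
    ("file1.txt", "01234567890123456789\n012345\n"),
    ("file2.txt", "01234\n01234567\n"),
]

for name, record in fixed_length_records(files, 10):
    print(f"{name} -> {record}")
```

With the two sample files and a length of 10, this prints the same five key -> value pairs listed above. On a cluster, the same per-file logic could probably be applied to (path, content) pairs such as those returned by Spark's `SparkContext.wholeTextFiles`, but that wiring is an assumption and is not shown here.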