
Line Number in TextInput Format






I am new to Hadoop and have started working on a project where we have to ignore a few lines of the text input (for example, the first 10 lines, which form the header). I searched a lot on the internet but could not find an input format that exposes the line number. Could you please let me know the best way to find the line number and ignore those lines in the mapper?





Re: Line Number in TextInput Format

Given the parallel nature of MR execution, a mapper cannot know the line number of the line it is currently processing unless the line number is itself stored with the data.

What you do get, though, is the byte offset within the file at which each line starts (this is the key that TextInputFormat passes to the mapper), which you can then use to approximate the line number.
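
To make the offset idea concrete, here is a minimal plain-Java sketch with no Hadoop dependency. The class and method names (OffsetDemo, lineStartOffsets, estimateLineIndex) are made up for illustration. It computes the byte offset at which each line starts, which is the kind of value the mapper receives as its key, and then estimates a zero-based line index by assuming an average line length. The estimate is only exact when every line has the same length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: illustrates byte offsets of lines.
class OffsetDemo {

    // Returns the byte offset at which each line of the data starts.
    static List<Long> lineStartOffsets(byte[] data) {
        List<Long> offsets = new ArrayList<>();
        if (data.length > 0) {
            offsets.add(0L); // the first line always starts at offset 0
        }
        // A new line starts right after each '\n', except a trailing one.
        for (int i = 0; i < data.length - 1; i++) {
            if (data[i] == '\n') {
                offsets.add((long) (i + 1));
            }
        }
        return offsets;
    }

    // Rough zero-based line index, assuming a known average line length
    // in bytes (including the newline). This is an approximation only.
    static long estimateLineIndex(long byteOffset, double averageLineBytes) {
        return (long) (byteOffset / averageLineBytes);
    }
}
```

With variable-length lines the estimate drifts, so this is not a reliable way to decide which lines belong to a header.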

If line numbers are absolutely important to you, consider embedding them into the data via some preprocessing on the file.
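
One possible shape for such a preprocessing step, sketched in plain Java: read the file sequentially before it is loaded and prefix each line with its 1-based number and a tab. The class name LineNumberer and the tab-separated layout are assumptions for illustration, not something Hadoop prescribes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical preprocessing helper: embeds the line number in each record.
class LineNumberer {

    // Copies every line from 'in' to 'out' as "<lineNumber>\t<line>\n".
    static void numberLines(BufferedReader in, Writer out) {
        try {
            String line;
            long lineNumber = 1; // 1-based, as a text editor would show it
            while ((line = in.readLine()) != null) {
                out.write(lineNumber + "\t" + line + "\n");
                lineNumber++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The mapper can then split each value on the first tab and drop any record whose number is 10 or less, regardless of which split it came from.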

If the only requirement is to skip the first few lines of every file read, you can achieve this by writing a custom line reader (for example, a RecordReader that wraps LineRecordReader) that skips 10 lines if the FileSplit's start offset is 0 (indicating it is the first split of the file, so it carries the header).
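
The skipping logic itself is small. Below is a plain-Java sketch of it, deliberately written without Hadoop classes so it can be run on its own; HeaderSkipper and readSplit are hypothetical names. In a real job the same check would sit inside your custom reader, with the start offset coming from FileSplit.getStart(). The sketch assumes the whole header fits inside the first split, which should hold for 10 ordinary lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a custom record reader's header handling.
class HeaderSkipper {

    // Returns the lines of one split, dropping 'headerLines' lines
    // only when the split starts at byte offset 0 of the file.
    static List<String> readSplit(BufferedReader in, long splitStart, int headerLines) {
        List<String> records = new ArrayList<>();
        try {
            if (splitStart == 0) {
                // First split of the file: consume and discard the header.
                for (int i = 0; i < headerLines; i++) {
                    if (in.readLine() == null) {
                        break; // file is shorter than the header
                    }
                }
            }
            String line;
            while ((line = in.readLine()) != null) {
                records.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

Splits with a non-zero start offset pass through untouched, so no line counting across mappers is needed.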