How does Hadoop handle a line record that spans a block boundary?

Hello Forum, I have read the following statement in http://www.dummies.com/programming/big-data/hadoop/input-splits-in-hadoops-mapreduce/

"In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record".

Is this statement true? Thanks

Re: How does Hadoop handle a line record that spans a block boundary?

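Yes @Saravanan Selvam, in effect, although the wording in that article is a little loose. An input split is only a logical byte range (file, start offset, length) calculated by InputFormat.getSplits; it does not cut records, and no broken record is ever written to a new split. The record reader takes care of the boundary. For plain text files, the LineRecordReader used by TextInputFormat reads past the end of its split to finish the last line, fetching the remaining bytes from the next block (possibly from another node). The reader for every split except the first skips everything up to the first line break, because the previous split's reader has already consumed that line. So each line is processed exactly once. It also depends on the file format and compression codec. A file compressed with a non-splittable codec such as gzip is processed as a single split. SequenceFiles can be record compressed or block compressed, and they contain sync markers that let a reader find the next record boundary from an arbitrary offset.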
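
To make the boundary rule concrete, here is a minimal Python sketch that simulates it on an in-memory byte string. It is not Hadoop's code: the function names `read_split` and `read_all` are made up for illustration, and the model assumes uncompressed text with `\n` line breaks and fixed-size splits.

```python
from typing import List


def read_split(data: bytes, start: int, end: int) -> List[bytes]:
    """Return the lines owned by the split [start, end) (simplified model)."""
    pos = start
    if start != 0:
        # Not the first split: skip up to and including the first line break.
        # The previous split's reader is responsible for that line.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1

    records = []
    # Keep reading while the line STARTS at or before the split end.
    # The last line may run past `end` into the next split's bytes.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = len(data) if nl == -1 else nl
        records.append(data[pos:line_end])
        pos = line_end + 1
    return records


def read_all(data: bytes, split_size: int) -> List[List[bytes]]:
    """Cut data into fixed-size logical splits and read each independently."""
    result = []
    for start in range(0, len(data), split_size):
        end = min(start + split_size, len(data))
        result.append(read_split(data, start, end))
    return result
```

For example, `read_all(b"alpha\nbravo\ncharlie\ndelta\n", 8)` gives `[[b"alpha", b"bravo"], [b"charlie"], [b"delta"], []]`: the first reader finishes "bravo" even though it crosses byte 8, and the second reader skips the tail of "bravo" before it starts emitting lines.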

I came across a brief and clear explanation of the same kind of question. Please check it out.

https://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-split-across-block-boun...

Hope it Helps!!