When a file is stored in HDFS it is stored as chunks/blocks. These blocks in turn determine how the file is processed in HDFS.
Consider a block size of 128 MB and a file of 500 MB that I'm about to ingest. Say I'm copying the file from an external source into HDFS using the `hadoop fs -put` command, and the file is in a format that is not splittable. Will the data still be stored in blocks, or will the block size be increased to match the file size? What is the relation between splitting, blocks and performance? Correct me in your explanation if my understanding of any of the above is wrong.
Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage. (For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.)
So if your file is 500 MB in size, it is stored as 4 blocks (3 blocks of 128 MB and 1 block of 116 MB).
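The block arithmetic can be sketched as follows (a toy illustration, not HDFS code; `block_layout` is just a helper name I made up):

```python
# Illustration of how HDFS lays out a file across fixed-size blocks.
# Only the last block may be smaller than the block size; no block is
# ever enlarged to fit the file.

def block_layout(file_size, block_size):
    """Return the list of block sizes HDFS would use for a file."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])

MB = 1024 * 1024
sizes = block_layout(500 * MB, 128 * MB)
print([s // MB for s in sizes])  # [128, 128, 128, 116]
```

Note the last block occupies only 116 MB of disk, per the quote above about small blocks not consuming a full block's worth of storage.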
The main benefit of having block abstraction for a distributed filesystem is that a file can be larger than any single disk in the network.
@Vani Thanks, and I do understand that. But my question is completely different; maybe I haven't explained it clearly. If we are not storing the file in a Hadoop-native format that supports splitting, then it ends up as a performance bottleneck. Under such conditions how is the file stored, assuming that the file is not split but still has multiple blocks? Does it mean that mappers are executed on the blocks regardless of the number of splits? If so, why is there a bottleneck?
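To make it concrete, here is what I understand the split logic to be (a toy model, not Hadoop's actual `FileInputFormat` implementation): the file is still stored as many blocks either way, but for a non-splittable file MapReduce can only create one input split, so a single mapper has to read every block.

```python
# Toy model of input-split calculation (NOT Hadoop's real FileInputFormat):
# splittable  -> one split per block -> one mapper per block, in parallel
# unsplittable -> one split spanning the whole file -> a single mapper

def input_splits(file_size, block_size, splittable):
    if not splittable:
        return [(0, file_size)]          # one split covering all blocks
    splits, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

MB = 1024 * 1024
print(len(input_splits(500 * MB, 128 * MB, splittable=True)))   # 4 mappers
print(len(input_splits(500 * MB, 128 * MB, splittable=False)))  # 1 mapper
```

Is this the right picture, i.e. the bottleneck is that the single mapper loses all the block-level parallelism?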