Where does Hadoop store its data?
HDFS is Hadoop's storage layer: it stores very large files across a cluster of commodity hardware. It is designed for a small number of very large files rather than a huge number of small files, and it stores data reliably even when hardware fails.

In HDFS, files are broken into blocks that are distributed across the cluster according to the replication factor. The default replication factor is 3, so each block is stored 3 times. With the default placement policy, the first replica is written to the local datanode (or a random datanode if the client is outside the cluster); the second replica goes to a datanode in a different rack; and the third goes to a different datanode in that same remote rack. This ensures the data survives the loss of an entire rack while keeping cross-rack traffic low.

The Namenode holds all the metadata: the blocks that make up each file, the locations of their replicas, the replication factor, and so on. The Datanodes store the actual data and perform block creation, deletion, and replication as instructed by the Namenode.
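The block-splitting arithmetic above can be sketched as follows. This is an illustrative calculation only, assuming the default block size (128 MB, `dfs.blocksize`) and the default replication factor (3, `dfs.replication`); the function names are mine, not Hadoop APIs:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the HDFS default (dfs.blocksize)
REPLICATION = 3                  # default replication factor (dfs.replication)

def block_count(file_size_bytes: int) -> int:
    """Number of HDFS blocks a file of the given size occupies."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

def total_block_copies(file_size_bytes: int) -> int:
    """Total block copies stored cluster-wide, counting all replicas."""
    return block_count(file_size_bytes) * REPLICATION

one_gib = 1024 * 1024 * 1024
print(block_count(one_gib))         # 8 blocks
print(total_block_copies(one_gib))  # 24 block copies across the cluster
```

Note that a file smaller than one block still occupies one block entry in the Namenode's metadata, which is why HDFS prefers few large files over many small ones.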
HDFS data is stored on the datanodes' local file systems. You can configure the list of storage directories with dfs.datanode.data.dir in hdfs-site.xml:
dfs.datanode.data.dir - Determines where on the local filesystem an HDFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
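For example, a minimal hdfs-site.xml fragment with a comma-delimited directory list might look like this (the /data1 and /data2 paths are placeholders for your own disk mount points):

```xml
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <!-- Comma-delimited list; each directory typically sits on a separate disk -->
    <value>/data1/hdfs/dn,/data2/hdfs/dn</value>
  </property>
</configuration>
```

Spreading the directories across separate physical devices lets the datanode stripe block storage over multiple disks.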
These are spam accounts, by the way. Look at all the "answers" from the other users for every question, and they all link back to dataflair's website.