In HDFS, is it possible to distinguish primary data files from replicated data files?
For example, suppose I have three DataNodes running on three machines with the default replication factor of three. Then I copy over a 3 GB file and it gets split among the three nodes at 1 GB each. But every node also contains the replicated data of the other two nodes. So each DataNode will hold 3 GB worth of data: 1 GB of its primary data and 2 GB (1 + 1) of replicated data.
In this scenario, is it possible to identify which files constitute primary data for a node and which files represent replicated data?
As far as I know, HDFS makes no distinction between primary replicas and secondary replicas. There are just a certain number of replicas of each block. The NameNode maps block IDs to their locations, and no location is more important than another.
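To make that concrete, here is a toy Python model of the idea, not the real Hadoop API. The block IDs, node names, and the `blocks_per_node` helper are all hypothetical. It only illustrates that a block-to-locations map carries no "primary" flag, so every node's copy looks the same.

```python
from collections import defaultdict

# Toy stand-in for the NameNode's block map (hypothetical names):
# block ID -> set of DataNodes holding a replica. A set has no order,
# so no replica is marked as "first" or "primary".
block_locations = {
    "blk_0001": {"dn1", "dn2", "dn3"},
    "blk_0002": {"dn1", "dn2", "dn3"},
    "blk_0003": {"dn1", "dn2", "dn3"},
}

def blocks_per_node(locations):
    """Invert the map: DataNode -> sorted list of block IDs it stores."""
    per_node = defaultdict(list)
    for block_id, nodes in locations.items():
        for node in nodes:
            per_node[node].append(block_id)
    return {node: sorted(blocks) for node, blocks in per_node.items()}

per_node = blocks_per_node(block_locations)
for node in sorted(per_node):
    # Every node reports the same blocks; nothing tells the copies apart.
    print(node, per_node[node])
```

On a real cluster, `hdfs fsck /path/to/file -files -blocks -locations` (with a placeholder path) should list where each block's replicas live, and I would expect it to show the same thing: a plain list of locations per block.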