In HDFS, is it possible to distinguish primary data files from replicated data files?
For example, suppose I have three datanodes running on three machines with the default replication factor of three. I then copy over a 3 GB file, and it gets split among the three nodes at 1 GB each. But every node also contains the replicated data of the other two nodes. So each datanode will hold 3 GB worth of data: 1 GB of its primary data and 2 GB (1 + 1) of replicated data.
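To make the arithmetic concrete, here is a rough sketch of the numbers I have in mind. The variable names are only for illustration, and the even split across nodes is my assumption, not necessarily how HDFS actually places blocks.

```python
# Back-of-the-envelope numbers for the scenario above.
# This reflects my assumed mental model, not confirmed HDFS behaviour.
file_size_gb = 3        # size of the file copied into HDFS
num_datanodes = 3       # one datanode per machine
replication_factor = 3  # HDFS default

# Every byte is stored replication_factor times across the cluster.
total_stored_gb = file_size_gb * replication_factor

# Assuming the data is spread evenly, each datanode holds an equal share.
per_node_gb = total_stored_gb / num_datanodes

# What I am calling "primary" data: the file divided evenly across the nodes.
primary_per_node_gb = file_size_gb / num_datanodes

# The remainder on each node would be copies of the other nodes' data.
replicated_per_node_gb = per_node_gb - primary_per_node_gb

print(total_stored_gb, per_node_gb, primary_per_node_gb, replicated_per_node_gb)
```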
So in this scenario, is it possible to identify which files constitute primary data for a node and which files represent replicated data?
I hope the question is not confusing.
Appreciate the insights.