How can I verify whether there is any orphaned or abandoned data on a DataNode?
From the example below, /hadoop/sde is at 96%, so before running the intra-DataNode balancer I wanted to verify that the data on this disk is not actually orphaned, and that the 96% usage is simply skew caused by massive file deletions or by the addition of new DataNode disks.
/dev/sde 1.1T 1.1T 50G 96% /hadoop/sde
/dev/sdn 1.1T 762G 357G 69% /hadoop/sdn
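One quick sanity check (the paths below are examples; your DataNode data directory may differ) is to compare what the OS reports as used against what the HDFS block files on that disk actually occupy:

```shell
# OS view of the disk
df -h /hadoop/sde

# Space actually consumed under the DataNode data directory on that disk
# (the dfs/dn layout is an example; check dfs.datanode.data.dir in hdfs-site.xml)
du -sh /hadoop/sde/dfs/dn

# If du of the DataNode directory is far below df's "Used" column, something
# outside HDFS (logs, leftover files, other processes) is consuming the disk.
```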
Steps that I performed:
1. Picked a random file on that node and ran hdfs fsck -blockId, which showed only 3 live replicas.
2. Also looked at files older than 300 days and still saw 3 live replicas.
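For reference, the checks above can be run along these lines (the block ID, subdirectory layout, and HDFS path are placeholders):

```shell
# Pick a block file directly from the suspect disk
# (finalized/subdir layout is an example; it varies by block count)
ls /hadoop/sde/dfs/dn/current/*/current/finalized/subdir0/subdir0/ | head -5

# Ask the NameNode whether a given block is still live and replicated
hdfs fsck -blockId blk_1073741825

# Cross-check a known directory: list files with their blocks and locations
hdfs fsck /user/example -files -blocks -locations | head -20
```

A block file present on disk but unknown to the NameNode (fsck reports it as not belonging to any file) is a candidate for orphaned data.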
Is there any other way to verify this?
One of the common reasons for DataNodes becoming unbalanced is ingestion/data load.
The first copy of the data is always stored on the DataNode from which you are loading data into HDFS (if that host runs a DataNode). The second and third copies are stored on the rest of the DataNodes in a round-robin fashion. For balancing disks within a node, you can make the DataNode choose volumes by available space instead of round-robin by setting the "DataNode Volume Choosing Policy" appropriately.
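As a concrete example, that policy can be set in hdfs-site.xml; the property names below are from stock Apache HDFS, while the threshold and fraction values are illustrative, not recommendations:

```xml
<!-- hdfs-site.xml: make each DataNode prefer volumes with more free space -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<property>
  <!-- volumes within this many bytes of each other count as balanced (illustrative: 10 GB) -->
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
  <value>10737418240</value>
</property>
<property>
  <!-- fraction of new block allocations sent to lower-used volumes (range 0.5 to 1.0) -->
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
  <value>0.75</value>
</property>
```

Note that this only influences where new blocks land; it does not move existing blocks between disks, which is what the intra-DataNode disk balancer is for.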
How many nodes do you have in this cluster?
How do you push data into HDFS?