One DataNode went down because of a temporary network disconnection and came back online about 30 to 40 minutes later. During the outage, we observed that the NameNode was busy and unresponsive, and many nodes reported incoming and outgoing traffic of more than 800 Mbps.
We didn't have any jobs running at the time. I understand that HDFS was busy copying the under-replicated blocks to restore the replication factor, but this significantly degraded the whole cluster. Is this normal?
Our replication factor is 3. The cluster has 16 nodes, each holding about 8 TB of data.
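
For a rough sense of scale, here is a back-of-the-envelope sketch of how much data the cluster would need to re-replicate for one lost node. It rests on assumptions that I have not verified: decimal terabytes, the copy work spread evenly over the 15 remaining nodes, and each node sustaining the 800 Mbps we observed. The function name and its parameters are only for illustration.

```python
def rereplication_estimate(node_data_tb, total_nodes, per_node_mbps):
    """Estimate per-node copy volume (GB) and duration (minutes)
    for re-replicating the blocks of one lost DataNode."""
    surviving_nodes = total_nodes - 1
    # Every block on the lost node needs one new copy elsewhere.
    bytes_to_copy = node_data_tb * 1e12  # assumes decimal TB
    # Assumption: the work is spread evenly across surviving nodes.
    bytes_per_node = bytes_to_copy / surviving_nodes
    # Convert megabits per second to bytes per second.
    bytes_per_sec = per_node_mbps * 1e6 / 8
    seconds = bytes_per_node / bytes_per_sec
    return bytes_per_node / 1e9, seconds / 60


gb_per_node, minutes = rereplication_estimate(8, 16, 800)
print(f"{gb_per_node:.0f} GB per node, about {minutes:.0f} minutes")
```

Under these assumptions each surviving node would have to receive roughly 533 GB, which takes close to 90 minutes at 800 Mbps, so the re-replication would still have been in progress when the node came back.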