We have been seeing a high rate of disk failures on our datanodes recently.
Some of the failures are logical (the file system on a disk becomes read-only) and some are physical (the hardware error LEDs on the servers blink red).
To quantify that: in the last two weeks we have had 4 disk failures, two of which were hardware errors. Our cluster has 9 datanodes, each with 14 disks of 6 TiB. They run CentOS 7.2.1511.
We have had 5 physical hardware failures in the 7 months since the cluster first went live.
What I would like to know is: is that number of failures acceptable? Does the rate match your experience?
PS: Our file system block size is 4096 bytes. Is that too small for disks of this size? Thanks in advance...
Hi @Sedat Kestepe,
HDD reliability is measured as AFR (annualized failure rate): the probability that a hard disk will fail in a given year. AFR varies with manufacturer, model number, and operating conditions. Here is one publicly available report about disk AFRs:
There is one obvious bad batch in that report with a 33% AFR; for the rest, AFR varies from 0 to 3%. Your hardware vendor should be able to provide you with the expected AFR for your drives. If your observed failure rate is higher than expected, you may have a bad batch of hardware and should check with your vendor. Burn-in testing can help weed out bad hardware early on.
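To put a number on it, here is a rough back-of-the-envelope estimate of your observed AFR from the figures in your question (a sketch, assuming all 9 x 14 = 126 disks were in service for the full 7 months, and counting only the 5 physical failures):

```python
# Rough observed-AFR estimate for the cluster described above.
# Assumption (from the question): 9 datanodes x 14 disks = 126 disks,
# all in service for the full 7 months, with 5 physical failures.

disks = 9 * 14                        # 126 disks in the cluster
failures = 5                          # physical failures observed
months = 7                            # service period so far

disk_years = disks * months / 12      # ~73.5 disk-years of service
observed_afr = failures / disk_years  # annualize: failures per disk-year

print(f"{disk_years:.1f} disk-years, observed AFR ~ {observed_afr:.1%}")
# -> 73.5 disk-years, observed AFR ~ 6.8%
```

An observed AFR of roughly 6.8% is well above the typical 0-3% range, which supports checking with your vendor about a possible bad batch.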