(Expected or acceptable) (number or rate) of disk failures

Rising Star

We have been experiencing a high rate of disk failures on our datanodes recently.

Some of the failures are logical (the file system on some disks becomes read-only) and some are physical (hardware error LEDs blink red on the servers). Both kinds have become frequent recently.

By frequent I mean that in the last two weeks we have had 4 disk errors, two of which were hardware errors. Our cluster has 9 datanodes, each with 14 disks of 6 TiB. They run CentOS 7.2.1511.

We have had 5 physical hardware problems in the 7 months since the cluster was first started.

What I would like to know is whether this number of failures is acceptable. Does the number / rate match your experience?

PS: We have a file system block size of 4096 bytes. Is this too small for disks of this size? Thanks in advance...


Rising Star

Does anyone have any comments on this?

Hi @Sedat Kestepe,

HDD reliability is measured as AFR (annualized failure rate) - the probability a hard disk will fail in a given year. AFR varies with manufacturer, model number and operating conditions. Here is one publicly available report about disk AFRs:

There is one obvious bad batch in that report with 33% AFR. For the rest it varies from 0 - 3%. Your hardware vendor should be able to provide you with the expected AFR. If your observed failure rate is higher than expected you may have a bad batch of hardware and should check with your vendor. Burn-in testing can help weed out bad hardware early on.
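To compare against a vendor's figure, the observed AFR can be roughly estimated as the number of failures divided by the disk-years of operation. The sketch below uses the numbers from the question (9 datanodes, 14 disks each, 5 hardware problems in 7 months). It assumes each hardware problem was a single failed disk and that all disks were in service for the whole period; the function name is made up for illustration.

```python
def annualized_failure_rate(failures: int, disks: int, months: float) -> float:
    """Observed AFR = failures / disk-years of operation."""
    disk_years = disks * (months / 12.0)
    return failures / disk_years


# Figures from the question. Treating each hardware problem as exactly
# one failed disk is an assumption, not something the post confirms.
DATANODES = 9
DISKS_PER_NODE = 14
total_disks = DATANODES * DISKS_PER_NODE  # 126 disks

afr = annualized_failure_rate(failures=5, disks=total_disks, months=7)
print(f"Observed AFR: {afr:.1%}")  # prints "Observed AFR: 6.8%"
```

Under those assumptions the estimate lands above the 0 - 3% range seen for most models in the report, though with only 5 failures the sample is small and the figure is a rough indication at best.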

Rising Star

@Arpit Agarwal have you faced (or heard of) the logical errors I mentioned? In those cases we unmount and remount the problematic partition, and then we can carry on.