Does the service stop if a disk I/O error occurs during operation, or does it keep running?

New Contributor


HDFS, YARN, Impala, and Kudu are currently in service.

If an I/O error occurs on one disk in one server during operation,
does the service go down?
Can I just leave it running until I replace the disk?

1 ACCEPTED SOLUTION

Super Collaborator

@flashone 

If a disk error is detected, HDFS can mark the affected volume as failed and stop using it, provided the DataNode is configured to tolerate failed volumes; otherwise the DataNode itself shuts down. HDFS is designed to handle both cases gracefully. If you have replication set up correctly, the data remains accessible from replicas on other nodes, and the NameNode re-replicates the affected blocks to restore the replication factor, which temporarily adds load on the other nodes.

The HDFS service itself will usually stay operational as long as other healthy disks and nodes are available.
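
Whether a DataNode keeps running after a volume failure depends on dfs.datanode.failed.volumes.tolerated in hdfs-site.xml. The Python sketch below is a simplified model of that rule, not Hadoop's actual code: the function names and the sample configuration are made up for illustration, and it only handles non-negative values. Check the value your own deployment uses, since a cluster manager may set something other than the Apache Hadoop default.

```python
import xml.etree.ElementTree as ET

# Hypothetical hdfs-site.xml fragment; the value 1 is an example, not a recommendation.
SAMPLE_HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>1</value>
  </property>
</configuration>
"""

def read_property(xml_text, name, default=None):
    """Return the value of a Hadoop-style <property> entry, or default if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return default

def datanode_keeps_running(failed_volumes, tolerated):
    """Simplified rule: the DataNode stays up while failures do not exceed the tolerated count."""
    if failed_volumes < 0 or tolerated < 0:
        raise ValueError("this sketch only handles non-negative values")
    return failed_volumes <= tolerated

# Apache Hadoop's documented default is 0: any volume failure stops the DataNode.
tolerated = int(read_property(SAMPLE_HDFS_SITE,
                              "dfs.datanode.failed.volumes.tolerated",
                              default="0"))
print(datanode_keeps_running(failed_volumes=1, tolerated=tolerated))
print(datanode_keeps_running(failed_volumes=1, tolerated=0))
```

With the sample value of 1, a single failed disk leaves the DataNode running; with the default of 0, the same failure takes the DataNode out of service and its blocks are served from replicas on other nodes.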

++++++++++++++

YARN NodeManagers handle disk failures by marking disks as unhealthy, provided the disk health checker is enabled. When a disk fails, the NodeManager excludes it from the list of usable directories.

The NodeManager service itself will continue running as long as other disks are healthy.
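
The health checker is controlled by yarn.nodemanager.disk-health-checker.enable, and yarn.nodemanager.disk-health-checker.min-healthy-disks sets the minimum fraction of directories that must stay healthy (0.25 in Apache Hadoop's defaults). When the healthy fraction drops below that value, the node is reported as unhealthy and stops receiving new containers, even though the NodeManager process is still running. A rough Python model of that check, with a hypothetical function name and example directory counts:

```python
def node_is_healthy(total_dirs, failed_dirs, min_healthy_fraction=0.25):
    """Return True while the share of healthy directories meets the minimum fraction."""
    if total_dirs <= 0 or not 0 <= failed_dirs <= total_dirs:
        raise ValueError("directory counts are inconsistent")
    healthy_fraction = (total_dirs - failed_dirs) / total_dirs
    return healthy_fraction >= min_healthy_fraction

# Example: a node with 8 local directories, one per disk (hypothetical layout).
for failed in (1, 6, 7):
    print(failed, node_is_healthy(total_dirs=8, failed_dirs=failed))
```

In this model a single failed disk out of eight leaves the node healthy by a wide margin, which matches the single-disk case asked about here.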

++++++++++++++++

If Impala detects a disk I/O error, it will stop using that disk. The Impala Daemon will continue running, but queries that rely on data stored on the failed disk might fail until data can be accessed from another replica or node.

+++++++++++++

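Kudu tablet servers monitor disk health, and if a disk holding a data directory fails, Kudu can mark it as failed and continue operating if there are other healthy disks. A failed disk that holds the write-ahead log (WAL) or metadata directory typically takes the tablet server down instead. If the failure impacts multiple disks or replicas, it can lead to data availability issues.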

++++++++++++

You can usually keep the services running if only a single disk fails and if replication is properly configured. However, it’s best to replace the failed disk promptly to avoid further risk. In HDFS and Kudu especially, losing additional disks could risk data loss or availability issues.
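
The reason further failures matter differs slightly between the two systems. An HDFS block stays readable while at least one replica survives, whereas a Kudu tablet relies on Raft consensus and needs a majority of its replicas to stay available. The sketch below illustrates this for a replication factor of 3; the function names are hypothetical, and the model ignores any re-replication that completes between failures.

```python
def hdfs_block_readable(replicas, lost):
    """An HDFS block can still be read while at least one replica is left."""
    return replicas - lost >= 1

def kudu_tablet_available(replicas, lost):
    """A Raft-replicated Kudu tablet needs a strict majority of replicas alive."""
    return (replicas - lost) > replicas // 2

# Compare both systems as more replicas of the same block or tablet are lost.
for lost in range(4):
    print(lost, hdfs_block_readable(3, lost), kudu_tablet_available(3, lost))
```

Losing one replica is survivable in both systems, but a second loss before recovery makes the Kudu tablet unavailable and leaves the HDFS block with a single copy.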


3 REPLIES

Community Manager

@flashone, Welcome to our community! To help you get the best possible answer, I have tagged in our experts @Kartik_Agarwal @RAGHUY who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager



New Contributor

Thanks ~ Good Answer~