How to make data read faster in HDFS

Explorer

For reasons we have not yet identified, reads served through a DataNode can be very slow. Besides troubleshooting the root cause of the slowness, are there alternative ways (e.g. different input channels) with the same semantics that could make reads faster? Thanks.

1 ACCEPTED SOLUTION

Contributor

First, you need to find the root cause of the slow reads: is it a network issue or a slow disk? Identify the DataNode that serves the data, then check its metrics to help debug the issue.

In the meantime, if the read is a positional read, i.e. it is called through the API read(long, byte[], int, int), you can enable hedged reads in DFSClient by setting the configuration "dfs.client.hedged.read.threadpool.size" to a non-zero number. With hedged reads, if the reader decides the first DataNode it is reading from is slow, it starts reading the same data from another DataNode (there are usually 3 replicas) before the first attempt finishes, and uses whichever response arrives first.
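
To make the idea concrete, here is a minimal sketch of the hedging pattern itself, written with plain java.util.concurrent rather than the HDFS client. It is not the DFSClient implementation: the class name HedgedReadSketch, the replica() helper, and the simulated latencies are all hypothetical, and the replica list is assumed to be non-empty. In a real cluster you only set the configuration; alongside "dfs.client.hedged.read.threadpool.size", the related setting "dfs.client.hedged.read.threshold.millis" controls how long the client waits before starting a hedged read.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class HedgedReadSketch {

    /** Hypothetical stand-in for a positional read served by one replica. */
    public static Callable<byte[]> replica(String name, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis); // simulated DataNode latency
            return name.getBytes(StandardCharsets.UTF_8);
        };
    }

    /**
     * Starts the read on the first replica. Each time thresholdMillis passes
     * without any result, starts the same read on the next replica. Returns
     * the first result that completes. A failed read surfaces as an
     * ExecutionException; a real client would retry instead.
     */
    public static byte[] hedgedRead(List<Callable<byte[]>> replicas, long thresholdMillis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        ExecutorCompletionService<byte[]> done = new ExecutorCompletionService<>(pool);
        try {
            int next = 0;
            done.submit(replicas.get(next++));
            while (true) {
                // Wait up to the threshold while replicas remain; otherwise block.
                Future<byte[]> finished = next < replicas.size()
                        ? done.poll(thresholdMillis, TimeUnit.MILLISECONDS)
                        : done.take();
                if (finished != null) {
                    return finished.get();
                }
                done.submit(replicas.get(next++)); // hedge: try another replica
            }
        } finally {
            pool.shutdownNow(); // interrupt reads that lost the race
        }
    }

    public static void main(String[] args) throws Exception {
        // The first replica is slow, so the hedged read on the second one wins.
        List<Callable<byte[]>> replicas = Arrays.asList(
                replica("dn1", 2000), replica("dn2", 50), replica("dn3", 50));
        byte[] data = hedgedRead(replicas, 200);
        System.out.println(new String(data, StandardCharsets.UTF_8));
    }
}
```

The trade-off is extra load: every hedged read is a second request for the same bytes, so a lower threshold reduces tail latency at the cost of more duplicate work on the DataNodes.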

2 REPLIES

Explorer

Thank you @Jing Zhao, hedged reads are actually quite useful as a result of multiplexing.