Support Questions

xzhou · ‎06-02-2016

For reasons we have not yet identified, reads served by a DataNode can be very slow. Besides troubleshooting the root cause of the slowness, are there alternative ways to read the data (e.g. different input channels) that keep the same semantics but are potentially faster? Thanks.

jing · ‎06-03-2016

First, you may need to figure out the root cause of the slow reads: is it a network issue or a slow disk? You can identify the DataNode that serves the data and then check its metrics to help debug the issue.

In the meantime, if the read is a positional read ("pread"), i.e., it is called through the API read(long, byte[], int, int), you can enable hedged reads in DFSClient by setting the configuration "dfs.client.hedged.read.threadpool.size" to a non-zero number. With hedged reads, if the reader thinks the first DataNode it is reading from is slow, it starts reading the same data from another DataNode (there are usually 3 replicas) before the first attempt finishes, and uses whichever response arrives first.
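
To make the hedging idea concrete, here is a minimal sketch in plain Java. It only simulates the behaviour with in-memory "replicas" and is not the DFSClient implementation; the class HedgedReadSketch, the method hedgedRead, and the 50 ms threshold are all hypothetical names and values chosen for illustration. The thresholdMillis parameter plays roughly the role that, as far as I know, "dfs.client.hedged.read.threshold.millis" plays in the real client.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Illustrative simulation of hedged reads; hypothetical code, not part of Hadoop. */
final class HedgedReadSketch {

    /**
     * Starts the read on the first replica. If no answer arrives within
     * thresholdMillis, starts the same read on the next replica as well.
     * Returns the first read that completes successfully.
     */
    static byte[] hedgedRead(List<Callable<byte[]>> replicas, long thresholdMillis) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("no replicas");
        }
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        CompletionService<byte[]> done = new ExecutorCompletionService<>(pool);
        try {
            int started = 0;
            int failed = 0;
            done.submit(replicas.get(started++));
            while (true) {
                // While spare replicas remain, wait only up to the threshold;
                // once every replica has been started, block until one finishes.
                Future<byte[]> f = (started < replicas.size())
                        ? done.poll(thresholdMillis, TimeUnit.MILLISECONDS)
                        : done.take();
                if (f == null) {
                    // Threshold expired with no answer: hedge on another replica.
                    done.submit(replicas.get(started++));
                    continue;
                }
                try {
                    return f.get(); // first successful read wins
                } catch (ExecutionException e) {
                    failed++;
                    if (failed == replicas.size()) {
                        throw new IllegalStateException("all replicas failed", e.getCause());
                    }
                    if (failed == started) {
                        // Nothing is in flight any more: try the next replica now.
                        done.submit(replicas.get(started++));
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        } finally {
            pool.shutdownNow(); // cancel the readers that lost the race
        }
    }

    /** One artificially slow replica and one fast replica holding different marker bytes. */
    static String demo() {
        Callable<byte[]> slow = () -> {
            Thread.sleep(2_000); // simulated slow DataNode
            return "slow".getBytes(StandardCharsets.UTF_8);
        };
        Callable<byte[]> fast = () -> "fast".getBytes(StandardCharsets.UTF_8);
        return new String(hedgedRead(Arrays.asList(slow, fast), 50), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the demo, the first replica sleeps for two seconds, so after the 50 ms threshold the second replica is started and its result is returned. In a real cluster all replicas hold the same bytes; the different marker strings are there only to show which reader won.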

xzhou · ‎06-03-2016

Thank you @Jing Zhao, hedged read is actually quite useful as a result of multiplexing.

Cloudera Community

How to make data read faster in HDFS