
How to make data read faster in HDFS

xzhou — Fri, 03 Jun 2016 06:39:27 GMT

For reasons we have not yet identified, reads through a DataNode can be very slow. Besides troubleshooting the root cause of the slowness, are there alternative read paths (e.g. different input channels) with the same semantics that could make reads faster? Thanks.

Re: How to make data read faster in HDFS

jing — Fri, 03 Jun 2016 07:05:12 GMT

First, you need to figure out the root cause of the slow reads: is it a network issue, or a slow disk? Identify the DataNode that serves the data, then check its metrics to help debug the issue.

In the meantime, if the read is a positional read, i.e., it is called through the API read(long, byte[], int, int), you can enable hedged reads in DFSClient by setting the configuration "dfs.client.hedged.read.threadpool.size" to a non-zero number. With hedged reads, if the reader decides that the first DataNode it is reading from is slow, it starts a second read from another DataNode (there are usually 3 replicas) before the first attempt finishes, and uses whichever response arrives first.
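
To make the idea concrete, here is a minimal JDK-only sketch of the hedging pattern described above. It is not DFSClient's actual code: the class name HedgedReadSketch, the methods hedgedRead and replica, and the simulated delays are all hypothetical, chosen only for illustration. In a real client you would not write this yourself; you would set "dfs.client.hedged.read.threadpool.size" as described above (and, as far as I know, "dfs.client.hedged.read.threshold.millis" controls how long the client waits before hedging).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustration only: hypothetical names, not the HDFS client implementation.
class HedgedReadSketch {

    // Starts with the first replica. If no result arrives within
    // thresholdMillis, starts the next replica as well. Returns the first
    // successful result and cancels any reads still in flight.
    static <T> T hedgedRead(List<Callable<T>> replicas, long thresholdMillis,
                            ExecutorService pool) throws Exception {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("no replicas to read from");
        }
        CompletionService<T> cs = new ExecutorCompletionService<>(pool);
        List<Future<T>> started = new ArrayList<>();
        Exception lastFailure = null;
        int next = 0;
        int pending = 0;
        try {
            started.add(cs.submit(replicas.get(next++)));
            pending++;
            while (pending > 0) {
                // Wait with a timeout only while there is a replica left to hedge to.
                Future<T> done = (next < replicas.size())
                        ? cs.poll(thresholdMillis, TimeUnit.MILLISECONDS)
                        : cs.take();
                if (done == null) {
                    // Threshold elapsed: the current reads look slow, so hedge.
                    started.add(cs.submit(replicas.get(next++)));
                    pending++;
                    continue;
                }
                pending--;
                try {
                    return done.get(); // first successful read wins
                } catch (ExecutionException e) {
                    lastFailure = e;
                    // The read failed outright: fall back to the next replica.
                    if (pending == 0 && next < replicas.size()) {
                        started.add(cs.submit(replicas.get(next++)));
                        pending++;
                    }
                }
            }
            throw lastFailure; // every replica failed
        } finally {
            for (Future<T> f : started) {
                f.cancel(true); // stop the losers
            }
        }
    }

    // Simulated DataNode read: sleeps for delayMillis, then returns its name.
    static Callable<String> replica(String name, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis);
            return name;
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            String winner = hedgedRead(
                    List.of(replica("slow-dn", 2000), replica("fast-dn", 50)),
                    200, pool);
            System.out.println("served by " + winner);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With the delays used in main, the first simulated replica exceeds the 200 ms threshold, so the second one is started and answers first. The cost of hedging is the extra read load on the cluster, which is presumably why the real feature is off unless the thread pool size is set to a non-zero value.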

Re: How to make data read faster in HDFS

xzhou — Fri, 03 Jun 2016 07:19:48 GMT

Thank you @Jing Zhao, hedged reads are actually quite useful as a result of multiplexing.