Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

How to make data read faster in HDFS

New Member

For reasons we have not yet identified, reading data through a DataNode can be very slow. Besides troubleshooting the root cause of the slowness, are there alternative ways (e.g. different input channels) with the same semantics that could make reads faster? Thanks.

1 ACCEPTED SOLUTION

New Member

First you may need to figure out the root cause of the slow reads: is it a network issue, or a slow disk? You can identify the DataNode that serves the data and then check its metrics to help debug the issue.

In the meantime, if the read is a positional read, i.e., it is called through the API read(long, byte[], int, int), you can enable hedged reads in DFSClient by setting the configuration "dfs.client.hedged.read.threadpool.size" to a non-zero number. With hedged reads, if the reader thinks the first DataNode it is reading from is slow, it starts reading the same data from another DataNode (there are usually 3 replicas) before the first attempt finishes, and uses whichever read returns first.
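
To make the idea concrete, here is a rough, self-contained Java sketch of the hedging pattern itself. It is not the DFSClient implementation: the class HedgedReadSketch, the method hedgedRead, and the simulated replicas are made-up names for illustration, and each Callable just stands in for a read from one replica. In a real client the wait before hedging is, as far as I know, controlled by "dfs.client.hedged.read.threshold.millis", and the thread pool by the "dfs.client.hedged.read.threadpool.size" setting mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative only: HedgedReadSketch and hedgedRead are hypothetical names,
// not part of the Hadoop API. Each Callable stands in for a read from one replica.
public class HedgedReadSketch {

    // Starts the first read; each time thresholdMillis passes with no result,
    // starts a read against the next replica. Returns the first result to arrive.
    public static <T> T hedgedRead(List<Callable<T>> replicaReads,
                                   long thresholdMillis,
                                   ExecutorService pool)
            throws InterruptedException, ExecutionException {
        if (replicaReads.isEmpty()) {
            throw new IllegalArgumentException("need at least one replica read");
        }
        ExecutorCompletionService<T> completion = new ExecutorCompletionService<>(pool);
        List<Future<T>> started = new ArrayList<>();
        try {
            for (Callable<T> read : replicaReads) {
                started.add(completion.submit(read));
                // Give the reads started so far a chance to finish before hedging again.
                Future<T> done = completion.poll(thresholdMillis, TimeUnit.MILLISECONDS);
                if (done != null) {
                    // Simplification: a failed read propagates its exception here.
                    return done.get();
                }
            }
            // Every replica has been tried; block until the first one completes.
            return completion.take().get();
        } finally {
            // Abandon the reads that lost the race.
            for (Future<T> f : started) {
                f.cancel(true);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Simulated replicas: one slow DataNode, one healthy DataNode.
        Callable<String> slowReplica = () -> { Thread.sleep(2000); return "data from slow DataNode"; };
        Callable<String> fastReplica = () -> { Thread.sleep(50); return "data from fast DataNode"; };
        try {
            System.out.println(hedgedRead(List.of(slowReplica, fastReplica), 100, pool));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note the trade-off: a hedged read can issue extra requests to DataNodes, so the threshold should be long enough that healthy reads normally finish before a second one is started.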


2 REPLIES

New Member

Thank you, @Jing Zhao. Hedged reads are actually quite useful, since the read is effectively multiplexed across replicas.