Created 12-06-2015 12:03 AM
Hadoop has long stressed moving the code to the data, both because it's faster to move the code than to move the data and, more importantly, because the network is a limited shared resource that can easily be swamped. Erasure coding would seem to require that a large proportion of the data move across the network, because the contents of a single block will reside on multiple nodes. This would presumably apply not just to the ToR switch but to the shared network as well, if the ability to tolerate the loss of a rack is preserved. Is this true, and how are these principles reconciled?
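To make the concern concrete, here is a rough sketch of where one logical block's bytes end up under striped erasure coding. I'm assuming the RS(6,3) layout with 1 MB cells described in HDFS-7285; the exact figures are my assumption, not something measured.

```python
# Sketch of where one logical block's bytes land under RS(6,3) striping with
# 1 MB cells (the layout described in HDFS-7285; exact parameters are assumed).
DATA_UNITS, PARITY_UNITS, CELL_MB, BLOCK_MB = 6, 3, 1, 128

def cell_owner(cell_index):
    """Index (0..5) of the data internal block / DataNode holding a given cell."""
    return cell_index % DATA_UNITS

cells = BLOCK_MB // CELL_MB
print("internal blocks per block group:", DATA_UNITS + PARITY_UNITS)
print("first 12 cells land on DataNodes:", [cell_owner(i) for i in range(12)])

# All 6 data nodes (plus 3 parity nodes) hold a slice of the block, so a reader
# on any single node still has to pull most of the block over the network.
# Preserving rack fault tolerance means those 9 internal blocks must also span
# enough racks that losing one rack leaves at least 6 of them readable --
# which is exactly the cross-rack traffic being asked about.
```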
Created 12-06-2015 12:59 PM
Not sure if you have seen this: https://issues.apache.org/jira/browse/HDFS-8030
HDFS Erasure Coding Phase I (HDFS-7285) enables EC based on the striped format. It achieves space savings but gives up data locality. Phase II of this project aims to support similar space savings based on the contiguous block layout.
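As a rough illustration of the space saving, here is some back-of-the-envelope arithmetic assuming the RS(6,3) scheme discussed for HDFS-7285 (the data volume is just an example figure):

```python
# Back-of-the-envelope storage arithmetic: 3x replication vs RS(6,3) striping.
raw_tb = 100.0                                  # example amount of user data, in TB

replication_factor = 3
replicated_tb = raw_tb * replication_factor     # 300 TB on disk, survives 2 lost copies

data_units, parity_units = 6, 3                 # RS(6,3), as discussed for HDFS-7285
ec_overhead = (data_units + parity_units) / data_units
ec_tb = raw_tb * ec_overhead                    # 150 TB on disk, survives 3 lost internal blocks

print(f"3x replication: {replicated_tb:.0f} TB on disk ({replication_factor:.1f}x raw)")
print(f"RS(6,3) EC:     {ec_tb:.0f} TB on disk ({ec_overhead:.1f}x raw)")
```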
Created 12-09-2015 01:42 PM
Note also that you are going to get less IO bandwidth as you move from 3 replicas (and hence 3 places to run code locally) to what is essentially a single replica, with the data spread across the network.
Erasure coding is best suited to storing cold data, where the improvement in storage density is tangible; it will hurt performance through:
- loss of locality (network layer)
- loss of replicas (disk IO layer)
- need to rebuild the raw data (CPU overhead)
I don't think we have any figures yet on the impact.
On a brighter note, 10GbE ToR switches are falling in price, so you could think about going to 10 Gb on-rack, even if the backbone remains a bottleneck.
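To put very rough numbers on those three costs, here is an illustrative sketch per 128 MB block, again assuming RS(6,3) striping. These are not measurements, just arithmetic on the assumed layout.

```python
# Rough numbers for the three costs above, per 128 MB block under RS(6,3)
# striping. Illustrative only -- not measurements.
BLOCK_MB = 128
DATA_UNITS = 6   # RS(6,3): 6 data + 3 parity internal blocks per block group

# 1. Locality (network): even a task on one of the 6 data nodes finds only
#    1/6 of the block locally; the rest comes over the network.
remote_mb = BLOCK_MB * (DATA_UNITS - 1) / DATA_UNITS
print(f"remote read per block: ~{remote_mb:.0f} MB (vs ~0 MB with a local replica)")

# 2. Replicas (disk IO): replication gives 3 nodes that can each serve the
#    whole block from local disk; striping gives none.
print("nodes holding a full local copy: 3 (replication) -> 0 (striping)")

# 3. Rebuild (CPU + network): losing one internal block means reading the 6
#    surviving internal blocks (~128/6 MB each) and running the RS decode.
rebuild_mb = (BLOCK_MB / DATA_UNITS) * DATA_UNITS
print(f"reconstruction traffic per lost internal block: ~{rebuild_mb:.0f} MB")
```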
Created 12-09-2015 04:47 PM
The documentation seems to suggest that the normal mode of use would be to have one reconstituted replica sitting around, and that reconstituting an encoded block would be done only if this isn't the case. Keeping a decoded block around by default would eliminate most of the space savings, because the data would expand from 1.6 to 2.6 times the raw file size. Why not have a policy that leaves a single full copy around for a limited time after a block is used? A "working set," as it were: if you've used a block in the last X hours, the decoded copy won't be deleted.
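To sketch what such a policy might look like: this is purely hypothetical, not an existing HDFS feature, and the class, method names, and TTL rule below are invented just to illustrate the "working set" idea.

```python
# Hypothetical "working set" policy for decoded copies (not an HDFS feature;
# the names and the eviction rule are invented purely to illustrate the idea).
import time

WORKING_SET_TTL_HOURS = 6   # keep a decoded copy this long after last use (assumed)

class DecodedBlockCache:
    """Keep one full decoded copy of a block for a while after it is read."""

    def __init__(self, ttl_hours=WORKING_SET_TTL_HOURS):
        self.ttl_seconds = ttl_hours * 3600
        self.last_access = {}          # block_id -> timestamp of last read

    def record_read(self, block_id):
        # Called whenever the block is reconstructed or read: refresh its TTL.
        self.last_access[block_id] = time.time()

    def evictable(self, block_id, now=None):
        # A decoded copy may be deleted once it has been idle longer than the
        # TTL, returning the block to its pure erasure-coded footprint.
        now = time.time() if now is None else now
        last = self.last_access.get(block_id)
        return last is None or (now - last) > self.ttl_seconds
```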