We are looking for guidance on the Accumulo Replication feature documented here (https://accumulo.apache.org/1.7/accumulo_user_manual.html#_replication).
We set up replication between a primary site A and two destination sites, B and C. What we are seeing is that if either destination site is down (powered off) or
not functioning correctly (for example, ZooKeeper on site C was down because the hard drives on some of its nodes were full), the primary site (site A) quickly starts having issues.
Based on the documentation, we did not anticipate any impact on site A. We expected the write-ahead logs (walogs) to build up on site A over time and, once the issues with the
destination sites were resolved, the backlog of logs to flush.
We saw significant degradation in the primary Accumulo cluster on the source site (site A). In the tablet server logs we saw repeated attempts to connect to the ZooKeeper ensemble
at the downed destination sites. Here are the modifications we made to the Accumulo replication settings:
Accumulo Replication Configuration Changes:
-- decrease replication.worker.threads from 4 to 1
-- decrease replication.work.attempts from 10 to 1
-- increase replication.work.assignment.sleep from 30s to 30m
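
For reference, this is roughly how those three changes would look as properties in accumulo-site.xml on site A. It is only a sketch: the property names and values are the ones listed above, but setting them in accumulo-site.xml (rather than through the Accumulo shell) is an assumption made here for illustration.

```xml
<!-- Sketch only: assumes the changes are applied in accumulo-site.xml on the source site (site A). -->

<!-- Fewer threads per tablet server performing replication work (was 4) -->
<property>
  <name>replication.worker.threads</name>
  <value>1</value>
</property>

<!-- Give up on a unit of replication work after a single attempt (was 10) -->
<property>
  <name>replication.work.attempts</name>
  <value>1</value>
</property>

<!-- Wait longer between rounds of replication work assignment (was 30s) -->
<property>
  <name>replication.work.assignment.sleep</name>
  <value>30m</value>
</property>
```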
Thoughts? Thanks in advance!