Created 03-21-2016 09:42 PM
While running the latest Sandbox (HDP 2.4 on Hortonworks Sandbox), I noticed HDFS had 500+ under-replicated blocks (via Ambari). Opening /etc/hadoop/conf/hdfs-site.xml, dfs.replication=3, which is the HDFS default (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml).
Does anyone know why the Sandbox uses an HDFS replication factor of 3, aside from the fact that it's the HDFS default? I'd assume most Sandbox users are running a virtual machine representing one node. If that's the case, dfs.replication should be set to 1 in the Sandbox to prevent under-replicated blocks. Is my assumption incorrect?
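For anyone who wants to confirm the count outside Ambari, fsck reports it directly. A minimal check, assuming you're shelled into the Sandbox with access to the hdfs superuser:

```sh
# Summarize filesystem health; the summary includes an
# "Under-replicated blocks" count that should match Ambari's alert.
sudo -u hdfs hdfs fsck /

# Narrow the output to just the under-replicated lines
# (per-file lines and the summary line differ in hyphenation).
sudo -u hdfs hdfs fsck / | grep -iE 'under[- ]replicated'
```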
Created 03-22-2016 02:05 AM
Yes, you are right: a 3x replication factor does not make sense on a single node. It is set to 3 simply because that is the default, though I have some thoughts on it.
But another way of looking at replication is that if you are going after the same table and a node is busy (which does not exactly apply here), you can run the same query on another node where a replica is available.
I would leave it at 3, in case someone adds more nodes to the VM; that way the data gets replicated correctly.
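For anyone who does want to clear the alert on a single-node Sandbox instead, keep in mind that lowering dfs.replication only affects files written afterwards; existing files keep the factor they were written with. A hedged sketch of re-marking the existing data, using the standard HDFS shell setrep command run as the hdfs superuser:

```sh
# Re-mark every existing file under / with replication factor 1.
# -w waits until the replication change has fully taken effect,
# which can take a while on a loaded Sandbox.
sudo -u hdfs hdfs dfs -setrep -w 1 /
```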
Created 03-21-2016 10:39 PM
I will escalate this; thank you for bringing it up.
Created 04-13-2016 12:07 PM
@Ryan Cicak The Sandbox ships with many of the defaults used during a normal installation. You can change the 3x replication in the configs, but the Sandbox is mainly intended for running through the tutorials.
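For reference, this is a sketch of what a single-node override of the property would look like in /etc/hadoop/conf/hdfs-site.xml. Note that on an Ambari-managed Sandbox you should make the change through Ambari's HDFS configs instead, since Ambari regenerates this file on restart:

```xml
<!-- Default block replication for newly written files. On a one-node
     Sandbox a value of 1 avoids the under-replicated alert. -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```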