Hi Cloudera Experts,
I would appreciate it if you could answer the following questions.
1. What happens if the IP address or hostname of a DataNode changes while the cluster is running? Can I still access the existing data on that DataNode?
2. What happens if the configured replication factor is less than the number of DataNodes?
For example, if there are 5 DataNodes and the replication factor is set to 3.
3. How do I calculate the replication factor? Does it depend on the number of DataNodes?
More experienced people can correct me if I am wrong.
>1. What happens if the IP address or hostname of a DataNode changes while the cluster is running? Can I still access the existing data on that DataNode?
If your cluster configuration uses IP addresses and the changed host runs a DataNode and a TaskTracker/NodeManager, then communication between the NameNode and that DataNode will fail. The same applies to communication between the JobTracker and TaskTracker, or between the ResourceManager and NodeManager.
>2. What happens if the configured replication factor is less than the number of DataNodes?
>For example, if there are 5 DataNodes and the replication factor is set to 3.
Why should anything happen? A replication factor of three means there are three copies of each file's blocks in the cluster, distributed across three DataNodes. The replicated copies are used either for performance or when a node fails. Having two extra nodes has nothing to do with this.
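To illustrate the point above, here is a minimal sketch of what "three copies on three DataNodes out of five" looks like. It is a toy model only: the function name place_block and the node names are made up for illustration, and it picks nodes at random rather than following HDFS's actual placement policy.

```python
import random

def place_block(datanodes, replication):
    """Pick `replication` distinct DataNodes to hold the copies of one block.

    Toy model: nodes are chosen at random. The real HDFS placement
    policy is more involved and is not modelled here.
    """
    if replication > len(datanodes):
        # This sketch does not model clusters with fewer nodes than replicas.
        raise ValueError("not enough DataNodes for the requested replication")
    return random.sample(datanodes, replication)

# Hypothetical 5-node cluster with a replication factor of 3.
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
replicas = place_block(nodes, 3)
print(replicas)  # three distinct node names; the other two hold no copy of this block
```

Each block ends up on three of the five nodes, and different blocks can land on different subsets, so the extra nodes simply add capacity.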
>3. How do I calculate the replication factor? Does it depend on the number of DataNodes?
If you set a replication factor of 5 because you have 5 nodes, 4 extra copies have to be written for each file, with all the overhead that involves. Is that really needed for your application/system?
vtpcnk is correct. A couple of notes, though:
1. Typically one would use hostnames for the nodes in the configuration. That way, if the IP address does need to change, then logically the cluster looks the same to Hadoop.
2. Configuring a replication factor less than the number of DataNodes is normal. If you have 3 disks and a replication factor of 3, you only have the capacity of 1 disk. If you have 9 disks and a replication factor of 3, you actually have the capacity of 3 disks with the increased reliability and increased parallelism (because 3 simultaneous jobs could access the same data without running on the same hardware).
3. A higher replication factor uses up more disk space, but increases reliability and parallel access to that data. So for small files that are very important or accessed extremely often, a high replication factor makes sense. Generally speaking, the default of 3 is a very good default.
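The disk arithmetic in note 2 can be sketched as a quick calculation. This is a rough approximation, assuming equally sized disks and ignoring space reserved for non-HDFS use; the function name usable_capacity is made up for illustration.

```python
def usable_capacity(disks, disk_size, replication):
    """Approximate usable capacity: raw capacity divided by the replication factor."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return disks * disk_size / replication

# Disk size is in arbitrary units (1.0 = one disk).
print(usable_capacity(3, 1.0, 3))  # 1.0 -> 3 disks, replication 3: capacity of 1 disk
print(usable_capacity(9, 1.0, 3))  # 3.0 -> 9 disks, replication 3: capacity of 3 disks
```

Raising the replication factor to 5 on the same 9 disks would drop the usable capacity to 1.8 disks, which is the space cost mentioned in note 3.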