Created on 06-27-2017 01:50 PM - edited 09-16-2022 04:50 AM
Hi,
I have about 200 GB of data per day coming from a Cassandra database into HDFS. What are the disadvantages of setting the replication factor to 1, other than losing data when a datanode fails?
I believe there will be a lot of pressure on the node where the data exists. I am trying to understand what happens when large chunks of data are queried from these datanodes with the replication factor set to 1.
Thanks.
Created 06-27-2017 06:27 PM
There are many disadvantages to using a replication factor of 1, and we strongly recommend against it for the reasons below:
1. Data loss --> The failure of a single datanode or disk results in data loss, because no other copy of the affected blocks exists.
2. Performance issues --> A replication factor greater than 1 allows more parallelism, because each block can be read from more than one datanode.
3. Handling failure --> With a replication factor greater than 1, the failure of a datanode doesn't result in job failure, since the blocks can still be read from another replica.
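To put rough numbers on the reasons listed above, here is a minimal Python sketch of what each replication factor costs and buys for the 200 GB/day mentioned in the question. The 128 MB block size is an assumption (the stock `dfs.blocksize` default in Hadoop 2.x and later; your cluster may differ), and `replication_footprint` is a hypothetical helper name, not part of any Hadoop API.

```python
import math

def replication_footprint(daily_gb, replication, block_mb=128):
    """Return (blocks per day, raw GB written per day, datanodes holding each block).

    block_mb=128 is an assumed block size; adjust it to match your cluster.
    """
    blocks = math.ceil(daily_gb * 1024 / block_mb)  # logical blocks, independent of replication
    raw_gb = daily_gb * replication                 # every replica consumes disk space
    replica_holders = replication                   # each block can be served by this many nodes
    return blocks, raw_gb, replica_holders

for r in (1, 2, 3):
    blocks, raw_gb, holders = replication_footprint(200, r)
    print(f"replication={r}: {blocks} blocks/day, {raw_gb} GB raw/day, "
          f"{holders} datanode(s) can serve each block")
```

The trade-off is disk space against the number of nodes that can serve any given block: with replication factor 1 there is exactly one source for every block, so both reads and failure recovery depend on that single node.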
Created 06-27-2017 02:27 PM
@PJ Even with the replication factor set to 1, the data is split into blocks that are distributed across different datanodes. So, in case of a datanode failure, you will only be able to retrieve part of the data. Another advantage of a replication factor greater than 1 is parallel processing: with copies of the data in multiple places, more machines can process it simultaneously.
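The partial-retrieval point can be illustrated with a small Python simulation. This is only a sketch under simplifying assumptions: blocks are placed round-robin on consecutive nodes, whereas real HDFS placement is rack-aware and not deterministic like this. The function names, the 10-node cluster, and the 1600-block file are all made up for illustration.

```python
def place_blocks(num_blocks, num_nodes, replication):
    """Assign each block to `replication` consecutive nodes, round-robin.

    Simplified stand-in for HDFS block placement (assumption, not the real policy).
    Requires replication <= num_nodes.
    """
    return [
        {(block + i) % num_nodes for i in range(replication)}
        for block in range(num_blocks)
    ]

def readable_fraction(placement, failed_nodes):
    """Fraction of blocks that still have at least one replica on a live node."""
    readable = sum(1 for replicas in placement if replicas - failed_nodes)
    return readable / len(placement)

if __name__ == "__main__":
    for replication in (1, 3):
        placement = place_blocks(num_blocks=1600, num_nodes=10, replication=replication)
        fraction = readable_fraction(placement, failed_nodes={3})
        print(f"replication={replication}: {fraction:.0%} of blocks readable after node 3 fails")
```

In this toy setup, losing one of ten nodes with replication factor 1 leaves every tenth block unreadable, which is enough to make any file that spans many blocks incomplete. With replication factor 3, the same single failure leaves all blocks readable.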