Support Questions

pmj · 06-27-2017

Hi,

I have about 200 GB of data per day coming from a Cassandra database into HDFS. What are the disadvantages of setting the replication factor to 1, other than losing data when a datanode fails?

I believe there will be a lot of pressure on the node where the data resides? I am trying to understand what happens when querying large chunks of data from these datanodes with the replication factor set to 1.

Thanks.

pardeep_kumar · 06-27-2017

There are several disadvantages to using a replication factor of 1, and we strongly recommend against it for the reasons below:

1. Data loss --> The failure of a single datanode or disk results in data loss, because no other copy of the affected blocks exists.

2. Performance issues --> A replication factor greater than 1 allows more parallelism, because each block can be read from more than one datanode.

3. Handling failure --> With a replication factor > 1, the failure of one or more datanodes need not cause a job to fail, because tasks can read the affected blocks from another replica.
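
If the data has already landed with a replication factor of 1, the factor can be raised afterwards. The sketch below only builds the two standard HDFS CLI commands involved (`hdfs dfs -setrep` and `hdfs fsck`) so they can be inspected before running them. The helper names and the landing directory `/data/cassandra_export` are hypothetical and not from this thread. Note that `-setrep` changes existing files only; the replication of newly written files is governed by the client's `dfs.replication` setting.

```python
import shlex


def setrep_command(path, replication, wait=True):
    """Build the `hdfs dfs -setrep` command for a path (hypothetical helper)."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    cmd = ["hdfs", "dfs", "-setrep"]
    if wait:
        # -w makes the command wait until the target replication is reached.
        cmd.append("-w")
    cmd += [str(replication), path]
    return cmd


def fsck_command(path):
    """Build the `hdfs fsck` command that reports files, blocks and replica locations."""
    return ["hdfs", "fsck", path, "-files", "-blocks", "-locations"]


if __name__ == "__main__":
    # Hypothetical landing directory for the daily Cassandra export.
    landing_dir = "/data/cassandra_export"
    print(shlex.join(setrep_command(landing_dir, 3)))
    print(shlex.join(fsck_command(landing_dir)))
```

Running the printed `fsck` command before and after the change is one way to confirm how many replicas each block actually has.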


ibhatt · 06-27-2017

@PJ Even with the replication factor set to 1, the data is split into blocks that are distributed across different datanodes. So, in case of a datanode failure, you will be able to retrieve only part of the data. Another advantage of setting the replication factor > 1 is parallel processing: copies of the data exist on multiple machines, so more of them can process it simultaneously.
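
The partial-loss effect can be illustrated with a small simulation. This is a rough sketch under stated assumptions, not how HDFS actually places blocks: it uses simple round-robin placement instead of the real rack-aware policy, a 128 MB block size (so 200 GB is about 1600 blocks), and a hypothetical 10-node cluster.

```python
def place_blocks(num_blocks, num_nodes, replication):
    """Assign each block to `replication` consecutive nodes (simplified round-robin)."""
    if not 1 <= replication <= num_nodes:
        raise ValueError("replication must be between 1 and num_nodes")
    return [
        {(block + k) % num_nodes for k in range(replication)}
        for block in range(num_blocks)
    ]


def lost_blocks(placement, failed_nodes):
    """Count blocks whose replicas all sit on failed nodes."""
    failed = set(failed_nodes)
    return sum(1 for replicas in placement if replicas <= failed)


if __name__ == "__main__":
    num_blocks = 1600  # assumption: 200 GB per day at 128 MB per block
    num_nodes = 10     # assumption: hypothetical cluster size
    for replication in (1, 2, 3):
        placement = place_blocks(num_blocks, num_nodes, replication)
        lost = lost_blocks(placement, failed_nodes=[0])
        print(f"replication={replication}: {lost} of {num_blocks} blocks lost")
```

Under these assumptions, a single failed node takes 160 of the 1600 blocks with it at replication factor 1, while at factor 2 or 3 every block still has a surviving copy.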

Disadvantages of replication factor 1 on 200GB of data per day