
Disadvantages of replication factor 1 on 200GB of data per day


Hi,

I have about 200 GB of data per day coming from a Cassandra database into HDFS. What are the disadvantages of setting the replication factor to 1, other than losing data when a datanode fails?

I believe there will be a lot of pressure on the node where the data exists? I am trying to understand what happens when querying large chunks of data from these datanodes with the replication factor set to 1.

Thanks.

1 ACCEPTED SOLUTION


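There are several disadvantages to using a replication factor of 1, and we strongly recommend against it for the reasons below:

1. Data loss --> With only one copy of each block, the failure of a single datanode or disk loses every block stored on it.

2. Performance issues --> With a replication factor greater than 1, each block can be read from more than one datanode, so reads and tasks can be spread across more nodes in parallel.

3. Handling failure --> With a replication factor > 1, a datanode failure does not have to result in job failure, because the affected blocks can still be read from another replica.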
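To make points 1 and 3 concrete, here is a small Python sketch that compares which blocks become unreadable when one datanode fails. It is only a toy model: `place_blocks` and `lost_blocks` are hypothetical helper names, and the round-robin placement is an assumption for illustration, not the placement policy HDFS actually uses.

```python
def place_blocks(num_blocks, num_nodes, replication):
    """Toy placement: block b gets replicas on `replication` consecutive nodes.

    Assumption: simple round-robin, chosen only to keep the example deterministic.
    """
    if replication > num_nodes:
        raise ValueError("replication cannot exceed the number of datanodes")
    return {
        b: {(b + k) % num_nodes for k in range(replication)}
        for b in range(num_blocks)
    }


def lost_blocks(placement, failed_nodes):
    """Return the blocks whose every replica sits on a failed node."""
    failed = set(failed_nodes)
    return sorted(b for b, nodes in placement.items() if nodes <= failed)


# 8 blocks on a hypothetical 4-node cluster, with datanode 0 failing.
rf1 = place_blocks(num_blocks=8, num_nodes=4, replication=1)
rf3 = place_blocks(num_blocks=8, num_nodes=4, replication=3)

print(lost_blocks(rf1, {0}))     # [0, 4]  -> blocks that lived only on node 0
print(lost_blocks(rf3, {0}))     # []      -> every block still has a live replica
print(len(rf1[0]), len(rf3[0]))  # 1 3     -> nodes that can serve block 0
```

The last line also hints at point 2: with three replicas there are three nodes that can serve a given block, instead of one.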


@PJ Even with the replication factor set to 1, the data is still split into blocks that are distributed across different datanodes. So in case of a datanode failure, you will only be able to retrieve part of the data. The other advantage of setting a replication factor > 1 is parallel processing: there are multiple copies of the data in multiple places, so more machines can process it simultaneously.
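As a rough illustration of the "partial retrieval" point, the Python sketch below estimates how much of one day's data stays readable after a single datanode fails with replication factor 1. The 128 MB block size, the 10-node cluster, the even round-robin spread, and the helper names are all assumptions for the example, not details from this thread.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size; check dfs.blocksize on your cluster


def block_count(data_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to store the given amount of data."""
    return math.ceil(data_size_mb / block_size_mb)


def readable_fraction(num_blocks, num_nodes, failed_node):
    """Fraction of blocks still readable with replication factor 1.

    Assumption: block b lives only on node b % num_nodes (even round-robin).
    """
    readable = sum(1 for b in range(num_blocks) if b % num_nodes != failed_node)
    return readable / num_blocks


blocks = block_count(200 * 1024)  # 200 GB of daily data
fraction = readable_fraction(blocks, num_nodes=10, failed_node=3)

print(blocks)    # 1600
print(fraction)  # 0.9
```

Note that a file with some of its blocks missing is usually not usable as a whole, so "90% readable" in this model still means data loss for any file that had a block on the failed node.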

