Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Disadvantages of replication factor 1 on 200GB of data per day

Expert Contributor

Hi,

I have about 200 GB of data per day coming from a Cassandra database into HDFS. What are the disadvantages of setting the replication factor to 1, other than losing data when a DataNode fails?

I believe there will be a lot of pressure on the node where the data resides. I am trying to understand what happens when large chunks of data are queried from DataNodes with the replication factor set to 1.

Thanks.

1 ACCEPTED SOLUTION


There are several disadvantages to using a replication factor of 1, and we strongly recommend against it for the following reasons:

1. Data loss --> The failure of a single DataNode or disk loses every block whose only copy was stored there.

2. Performance issues --> With a replication factor greater than 1, each block is available on several DataNodes, so more tasks can read it locally and in parallel.

3. Handling failure --> With a replication factor greater than 1, the failure of a DataNode need not fail the job, because reads can fall back to another replica of the affected blocks.
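
To put rough numbers on this for the 200 GB per day in the question, here is a minimal back-of-the-envelope sketch. It assumes 128 MiB blocks (the `dfs.blocksize` default in Hadoop 2 and later; your cluster may differ), treats the daily ingest as one large file, and uses a hypothetical 30 concurrent readers of one hot block that spread evenly across replicas. The function names are made up for illustration.

```python
import math

# Assumption: 128 MiB blocks (the dfs.blocksize default in Hadoop 2+); check your cluster.
BLOCK_MIB = 128

def daily_footprint(ingest_gib, replication, block_mib=BLOCK_MIB):
    """Return (logical_blocks, stored_replicas, raw_gib) for one day's ingest.

    Treats the ingest as one large file, so logical_blocks is a lower bound:
    many small files produce more (partially filled) blocks.
    """
    logical_blocks = math.ceil(ingest_gib * 1024 / block_mib)
    return logical_blocks, logical_blocks * replication, ingest_gib * replication

def readers_per_replica(concurrent_readers, replication):
    """Best-case readers per replica holder if reads of one block spread evenly."""
    return math.ceil(concurrent_readers / replication)

for rf in (1, 2, 3):
    blocks, replicas, raw_gib = daily_footprint(200, rf)
    # 30 concurrent readers is a hypothetical figure, not taken from the question.
    print(f"rf={rf}: {blocks} blocks, {replicas} stored replicas, "
          f"{raw_gib} GiB raw/day, "
          f"{readers_per_replica(30, rf)} readers per holder of a hot block")
```

The trade-off is raw disk space: every extra replica adds another 200 GiB per day, in exchange for more nodes that can serve each block.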


2 REPLIES

Rising Star

@PJ Even with the replication factor set to 1, the data is still split into blocks, and those blocks are distributed across different DataNodes. So, in case of a DataNode failure, you will be able to retrieve only part of the data. Another advantage of a replication factor greater than 1 is parallel processing: copies of each block exist on multiple machines, so more of them can process the data locally at the same time.
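
The partial-loss effect can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: replicas are placed uniformly at random on distinct nodes (real HDFS placement is rack-aware and prefers the writer's node), and the cluster size of 10 DataNodes and the 1600 blocks are hypothetical figures. `place_blocks` and `lost_blocks` are illustrative names, not HDFS APIs.

```python
import random

def place_blocks(num_blocks, num_nodes, replication, seed=0):
    """Assign each block to `replication` distinct nodes, chosen uniformly at random.

    Simplification: real HDFS placement is rack-aware, not uniform.
    """
    rng = random.Random(seed)
    return [set(rng.sample(range(num_nodes), replication))
            for _ in range(num_blocks)]

def lost_blocks(placement, failed_nodes):
    """Count blocks whose every replica sits on a failed node."""
    failed = set(failed_nodes)
    return sum(1 for holders in placement if holders <= failed)

# Hypothetical cluster: 10 DataNodes, 1600 blocks, DataNode 0 fails.
for rf in (1, 3):
    placement = place_blocks(num_blocks=1600, num_nodes=10, replication=rf)
    lost = lost_blocks(placement, failed_nodes=[0])
    print(f"rf={rf}: {lost} of {len(placement)} blocks unreadable after one node fails")
```

With one replica, every block that lived on the failed node is gone, so any file with a block there becomes only partially readable. With three replicas on distinct nodes, a single failure cannot remove all copies of any block.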
