I want to implement two fault-tolerance techniques for tasks in Hadoop:

1. Fault tolerance of tasks via checkpointing
2. Fault tolerance of tasks via replication
With checkpointing, a failed task should restart from its last checkpoint.
With replication, a failed task should restart on another node where a replica of its data is stored.
Much of this depends on what exactly you are trying to do.
Checkpointing is available in Spark: you set a checkpoint directory with `SparkContext.setCheckpointDir` and call `checkpoint()` on an RDD.
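The core idea behind checkpointing, in Spark or anywhere else, is periodically persisting progress so a restarted task resumes from the last saved state instead of starting over. Here is a minimal sketch of that idea in plain Python (the checkpoint file location and the `checkpoint_every` interval are illustrative choices, not part of any Hadoop API):

```python
import json
import os
import tempfile

# Hypothetical checkpoint location for this sketch.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "task_checkpoint.json")

def load_checkpoint():
    """Resume state from the last checkpoint, if one exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"next_index": 0, "total": 0}

def save_checkpoint(state):
    """Persist progress so a restarted task can pick up here."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)  # atomic rename: never a half-written checkpoint

def run_task(items, checkpoint_every=100):
    """Sum `items`, checkpointing periodically; restarts resume mid-list."""
    state = load_checkpoint()
    for i in range(state["next_index"], len(items)):
        state["total"] += items[i]      # the actual "work"
        state["next_index"] = i + 1
        if state["next_index"] % checkpoint_every == 0:
            save_checkpoint(state)
    save_checkpoint(state)
    return state["total"]
```

If the process dies mid-run, the next invocation of `run_task` reloads the saved state and continues from `next_index` rather than index 0, which is the same contract a checkpointed task gives you.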
HDFS is itself replicated and fault tolerant: each block is stored on multiple DataNodes.
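The replication factor is controlled by the `dfs.replication` property (default 3). A sketch of the relevant `hdfs-site.xml` fragment:

```xml
<!-- hdfs-site.xml: each block is stored on this many DataNodes -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

You can also change replication per file after the fact with `hdfs dfs -setrep -w 3 /path/to/file`.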
Fault tolerance is also available at the component level, though the implementation varies: for example, YARN supports ResourceManager high availability, and MapReduce re-executes failed task attempts on other nodes. You can look up how fault tolerance and replication are implemented in whichever other components you use.
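As one concrete component-level example, YARN ResourceManager high availability is enabled through `yarn-site.xml`. A sketch (the `rm1`/`rm2` ids and the hostnames are placeholders for your own cluster):

```xml
<!-- yarn-site.xml: run two ResourceManagers so one can fail over to the other -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
```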