Contributor
Posts: 41
Registered: ‎10-04-2017

S3 storage with Hadoop has only one replication?

Does S3 storage with Hadoop (using S3 as HDFS storage) keep only 1 replica of the data?

Posts: 1,568
Kudos: 293
Solutions: 240
Registered: ‎07-31-2013

Re: S3 storage with Hadoop has only one replication

Amazon's S3 offers data protection by means of redundancy. You can read
their data durability/protection details here:
https://aws.amazon.com/s3/faqs/#data-protection. The "replication factor"
(or its equivalent) isn't explicitly controllable, but they do offer
certain feature options around it (covered on the same page).

P.S. A technical note on the phrase about using S3 as HDFS storage:

You cannot use S3 as HDFS storage. HDFS is an independent system that
operates over disk devices. In other words, you cannot run NameNode and
DataNode daemons on top of S3, but you can use S3 in many places of the
Apache Hadoop ecosystem as an alternative data store, replacing HDFS
entirely or alongside it in a hybrid setup.
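
As a rough illustration of the distinction, Hadoop clients address storage by URI scheme: HDFS paths use hdfs:// and point at a NameNode, while S3 paths go through a connector such as S3A and use s3a:// with a bucket name. The sketch below is plain Python, not a Hadoop API, and the host, bucket, and paths in it are made-up examples.

```python
from urllib.parse import urlparse

# Illustrative labels only; the set of schemes here is an assumption for this sketch.
BACKENDS = {
    "hdfs": "HDFS (NameNode + DataNodes on disks)",
    "s3a": "S3 via the Hadoop S3A connector (no NameNode or DataNodes)",
}

def describe(path):
    """Split a Hadoop-style path URI and label the backend it targets."""
    parts = urlparse(path)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # NameNode host:port for hdfs, bucket for s3a
        "path": parts.path,
        "backend": BACKENDS.get(parts.scheme, "unknown"),
    }

if __name__ == "__main__":
    for uri in ("hdfs://namenode.example.com:8020/data/events",
                "s3a://example-bucket/data/events"):
        print(describe(uri))
```

In both cases the job sees a file system path; only the hdfs:// one involves NameNode-managed block replication.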
Backline Customer Operations Engineer
Contributor
Posts: 41
Registered: ‎10-04-2017

Re: S3 storage with Hadoop has only one replication

Hi @Harsh J

Thanks for the info. We want to use our custom S3 as HDFS storage, and we want to know whether we can have a replication factor similar to the one HDFS provides on disks.

Posts: 1,568
Kudos: 293
Solutions: 240
Registered: ‎07-31-2013

Re: S3 storage with Hadoop has only one replication

If by custom S3 you mean an alternative self-run service (such as Ceph or
Swift), then the replication factor depends on that service's
configuration. For example, Ceph lets you configure the number of replicas
per pool, as noted at http://docs.ceph.com/docs/jewel/rados/operations/pools/
(the default comes from 'osd pool default size', which is 3 in recent
releases). You'll need to configure this outside of Hadoop.
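
For instance, with Ceph the replica count is a per-pool setting changed with the ceph CLI ('ceph osd pool set <pool> size <n>'). A minimal Python sketch that only builds that command line follows; the pool name is hypothetical, and actually running the command requires a Ceph cluster and admin credentials.

```python
def ceph_set_pool_size(pool, replicas):
    """Build the ceph CLI arguments that set a pool's replica count."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return ["ceph", "osd", "pool", "set", pool, "size", str(replicas)]

if __name__ == "__main__":
    # "hadoop-data" is a made-up pool name for illustration.
    print(" ".join(ceph_set_pool_size("hadoop-data", 3)))
```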
Backline Customer Operations Engineer
Contributor
Posts: 41
Registered: ‎10-04-2017

Re: S3 storage with Hadoop has only one replication

Hi @Harsh J

Thanks for the insights. We are planning a solution with two S3 storage layers, one in each of two data centers, and we would configure each as one rack in Hadoop. The idea is to use Hadoop rack awareness to keep copies of blocks in both data centers. We are not sure whether this will work, but we have been researching this approach.

Contributor
Posts: 41
Registered: ‎10-04-2017

Re: S3 storage with Hadoop has only one replication

And does your answer mean that on AWS and Azure we will not have replication of data?
Posts: 1,568
Kudos: 293
Solutions: 240
Registered: ‎07-31-2013

Re: S3 storage with Hadoop has only one replication

Thank you for clarifying!

Rack awareness is an HDFS concept, since it is the NameNode that handles
that logic. For non-HDFS storage systems you will need to consult their own
documentation for a similar feature. Apache Hadoop cannot help with that
if you choose not to use HDFS. Hopefully this makes clearer what is an
HDFS feature and how other storage systems relate to Hadoop.
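
For reference, on HDFS rack awareness can be driven by a topology script named in the net.topology.script.file.name property: Hadoop passes it host names or IP addresses and reads back one rack path per argument. A minimal sketch of such a script in Python is below; the IP-to-rack table is invented for illustration, and it only affects where HDFS places block replicas, not any external object store.

```python
import sys

# Hypothetical mapping from DataNode IP to rack path; replace with real data.
RACKS = {
    "10.0.1.5": "/dc1/rack1",
    "10.0.2.7": "/dc2/rack1",
}
DEFAULT_RACK = "/default-rack"

def resolve_racks(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]

if __name__ == "__main__":
    # Hadoop invokes the script with one or more hosts as arguments
    # and expects the rack paths on standard output.
    print(" ".join(resolve_racks(sys.argv[1:])))
```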
Backline Customer Operations Engineer
Posts: 1,568
Kudos: 293
Solutions: 240
Registered: ‎07-31-2013

Re: S3 storage with Hadoop has only one replication

S3 and Azure both guarantee data durability by means of redundancy
(replication). It's just that the level of redundancy is abstracted away,
and is not as straightforward as running 'hadoop fs -setrep' as you would
on HDFS. Consulting these services' durability guarantees should help you
decide whether it is adequate.
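
To make the contrast concrete: on HDFS the replication factor of a path can be changed with 'hadoop fs -setrep <n> <path>', whereas for an object store the redundancy is managed by the service. The Python sketch below only builds the command for hdfs:// paths and returns None otherwise; refusing other schemes is a choice made in this sketch, not documented Hadoop behaviour, and the paths are made up.

```python
from urllib.parse import urlparse

def setrep_command(path, replication):
    """Build 'hadoop fs -setrep' arguments for an HDFS path, else None."""
    if urlparse(path).scheme != "hdfs":
        # Object stores manage redundancy themselves (sketch assumption:
        # we simply skip them rather than call setrep).
        return None
    return ["hadoop", "fs", "-setrep", str(replication), path]

if __name__ == "__main__":
    print(setrep_command("hdfs://namenode.example.com:8020/data/events", 3))
    print(setrep_command("s3a://example-bucket/data/events", 3))
```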
Backline Customer Operations Engineer
Contributor
Posts: 41
Registered: ‎10-04-2017

Re: S3 storage with Hadoop has only one replication

Hi @Harsh J

So, to sum it up: even though we set the replication factor to 3, because the data is stored in S3 there would be only 1 replica, and that is why "setrep" doesn't work. But in the backend, AWS/Azure keeps redundant copies.

Contributor
Posts: 41
Registered: ‎10-04-2017

Re: S3 storage with Hadoop has only one replication

I think I need to make my question clearer. We will be using S3 as storage for DataNodes, not NameNodes.