Created 10-18-2017 11:17 AM
Does using S3 as the storage layer in Hadoop give the same replication factor (default 3)? I see various blogs saying that when we distcp data from HDFS to S3, replication is ignored and only one replica is stored. Is that true?
Created 10-18-2017 01:27 PM
@Rajesh Reddy At this time S3 cannot be used as an outright replacement for an HDFS deployment, so the data lifecycle you describe would need to be scripted. Today, jobs with multiple MapReduce stages write their output to HDFS for the next stage. S3 can be used as a source and sink for input and final datasets, but it is not used for intermediate data. Also, rack awareness is an HDFS block-level behaviour and does not apply to S3.
If you want your dataset in S3 to also be located in another AWS region, you could set up S3 Cross-Region Replication.
https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/
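Below is a minimal sketch (Java, Hadoop MapReduce API) of the "source and sink" pattern described above. The bucket name and paths are placeholders I made up for illustration; the point is that the job reads its input from S3 and writes its final output back to S3, while the intermediate data is still handled by the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3SourceSinkJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml; defaultFS stays HDFS
        Job job = Job.getInstance(conf, "s3-source-sink");
        job.setJarByClass(S3SourceSinkJob.class);

        // Input dataset lives in S3 (the source) ...
        FileInputFormat.addInputPath(job, new Path("s3a://example-bucket/input/"));
        // ... and the final result is written back to S3 (the sink).
        FileOutputFormat.setOutputPath(job, new Path("s3a://example-bucket/output/"));

        // Mapper/Reducer setup omitted; intermediate map output is still spilled
        // to local disk / HDFS by the framework, never to S3.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}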
Created 10-18-2017 12:31 PM
@Rajesh Reddy That is correct, replication factor is an HDFS-specific setting. When you load your data into S3, however, S3 performs its own robust durability measures. In the documents linked below AWS claims "99.999999999% of durability for objects stored within a given region." S3 redundantly stores multiple copies across multiple facilities within a region for durability and performs many of the same actions that HDFS does in terms of detecting corrupt replicas and replacing them.
https://d0.awsstatic.com/whitepapers/protecting-s3-against-object-deletion.pdf
http://docs.aws.amazon.com/AmazonS3/latest/dev/DataDurability.html
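To illustrate the difference, here is a small hedged sketch (Java, Hadoop FileSystem API) comparing what HDFS and the S3A connector report for a file's replication. The namenode address, paths and bucket are placeholders; HDFS reports the replication it actually manages, while S3A only reports a nominal value, because durability is handled inside S3 itself.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // HDFS: replication is a real, per-file setting managed by the NameNode.
        FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode:8020/"), conf);
        FileStatus h = hdfs.getFileStatus(new Path("/data/file.csv"));
        System.out.println("HDFS replication: " + h.getReplication());   // typically 3

        // S3A: the connector reports a nominal replication value; the actual
        // redundancy is handled internally by S3, not by Hadoop.
        FileSystem s3 = FileSystem.get(new URI("s3a://example-bucket/"), conf);
        FileStatus s = s3.getFileStatus(new Path("s3a://example-bucket/data/file.csv"));
        System.out.println("S3A reported replication: " + s.getReplication());
    }
}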
Created 10-18-2017 01:08 PM
Hi @Joseph Niemiec Thanks for the insights. We are planning a solution where we have two custom S3 storage layers, one in each of two data centers, and will configure each as a rack in Hadoop, thus trying to use Hadoop rack awareness to keep copies of blocks in both data centers. Not sure if this is going to work, but I have been doing a lot of research on this approach.
Created 10-18-2017 02:03 PM
@Rajesh Reddy As explained in my comment above, HDFS is used for intermediate storage of many datasets between stages, depending on the workload engines being used. Additionally, if these engines (e.g. MapReduce or Spark) make use of a distributed cache for JARs, etc., those will be pushed to HDFS, not S3.
The link you provided talks about using S3 as a source and sink for a dataset; it does not describe replacing the entire defaultFS for HDFS with S3. The first page of the guide you linked also states this:
"These connectors are not a replacement for HDFS and cannot be used as a replacement for HDFS defaultFS."
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_cloud-data-access/content/intro.html
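As a hedged sketch of that point, the snippet below (Java, Hadoop FileSystem API, with placeholder addresses and credentials) shows the supported arrangement: fs.defaultFS remains an hdfs:// URI, and S3 data is only reached when a job names an explicit s3a:// path.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DefaultFsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");         // must stay HDFS (or another block store)
        conf.set("fs.s3a.access.key", "PLACEHOLDER_ACCESS_KEY");  // hypothetical credentials
        conf.set("fs.s3a.secret.key", "PLACEHOLDER_SECRET_KEY");

        // Unqualified paths resolve against the defaultFS, i.e. HDFS ...
        FileSystem defaultFs = FileSystem.get(conf);
        System.out.println("defaultFS: " + defaultFs.getUri());

        // ... while S3 datasets have to be addressed with an explicit s3a:// URI.
        FileSystem s3 = FileSystem.get(new URI("s3a://example-bucket/"), conf);
        for (FileStatus st : s3.listStatus(new Path("s3a://example-bucket/datasets/"))) {
            System.out.println(st.getPath());
        }
    }
}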
Created 10-18-2017 02:14 PM
@Joseph Niemiec Well, I think I need to be more clear on my earlier question. We will be using S3 as datanode storage, not namenode storage.
Created 10-18-2017 02:21 PM
No, you're clear; you just cannot do what you are describing. The defaultFS cannot be replaced by the S3 connector in the way you are attempting to describe. You still need a cluster with HDFS deployed, even if it is a much lower volume of HDFS space, and then your jobs can 'target' the larger datasets you have stored on S3. But HDFS is still required for all the reasons above, and as the documentation states, the connectors cannot be used as a replacement for HDFS.
@Rajesh Reddy Even if no data is stored in HDFS and everything is stored in S3, you will still require a defaultFS for the API layer and for how processing engines work today. Drop-in replacements are block storage, not object storage like S3, and include products like Isilon and Spectrum Scale.
Created 10-18-2017 02:24 PM
So, if we have some amount of disk in the datanodes, can we leverage this solution?
Created 10-18-2017 02:31 PM
That is correct @Rajesh Reddy. Think of HDFS as a performance layer for workloads to do the in-between work, and S3 as the place for datasets to live long term. You can then reduce the storage on your datanodes because it is only used for intermediate processing.
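A hedged sketch of that lifecycle (Java, Hadoop FileSystem API, placeholder paths and bucket): the heavy in-between work happens on HDFS, and the finished dataset is then pushed out to S3 for long-term storage. FileUtil.copy is used here only to keep the example short; for large datasets DistCp is the usual tool.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ArchiveToS3 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode:8020/"), conf);
        FileSystem s3   = FileSystem.get(new URI("s3a://example-bucket/"), conf);

        // Final job output sitting on the (smaller) HDFS performance layer ...
        Path finalOutput = new Path("/jobs/output/2017-10-18/");
        // ... copied out to S3, where the dataset lives long term.
        FileUtil.copy(hdfs, finalOutput,
                      s3, new Path("s3a://example-bucket/archive/2017-10-18/"),
                      false /* keep the HDFS copy for now */, conf);
    }
}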
Created 10-18-2017 01:36 PM
@Joseph Niemiec The HDP blog below says we can use per-bucket settings to access data across the globe, which I assume means from different regions. If you don't mind, could you please elaborate on your comment that "S3 can't be used as a replacement for HDFS"?
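To be concrete about what I mean by per-bucket settings, here is a hedged sketch (Java, Hadoop Configuration API; the bucket names and endpoints are placeholders): each bucket can be given its own fs.s3a.bucket.<BUCKET>.* overrides, which is how buckets in different regions are reached through different endpoints.

import org.apache.hadoop.conf.Configuration;

public class PerBucketConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Bucket hosted in the default US region, reached via the standard endpoint.
        conf.set("fs.s3a.bucket.us-datasets.endpoint", "s3.amazonaws.com");
        // Bucket hosted in eu-west-1, reached via its regional endpoint.
        conf.set("fs.s3a.bucket.eu-datasets.endpoint", "s3.eu-west-1.amazonaws.com");
        // Other fs.s3a.* options (credentials, etc.) can be overridden per bucket in the same way.
    }
}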