Support Questions

Find answers, ask questions, and share your expertise
Announcements
Welcome to the upgraded Community! Read this blog to see What’s New!

S3Guard Suggested to help fix Consistency

avatar
Explorer

Hi All

 

Is Cloudera suggesting to use S3Guard as a solution for the consistency problem in multi-step ETL? Cause in the reference architecture, the suggested option is to load data from S3 into HDFS and then write back to S3?

 

Thanks

CK

 

1 ACCEPTED SOLUTION

avatar
Master Guru

Yes that is correct, and the motivations/steps-to-use are reflected here too: https://www.cloudera.com/documentation/enterprise/6/latest/topics/cm_s3guard.html

 

Note: On your point of 'load data from S3 into HDFS', it is better stated as simply 'read data from S3', where HDFS gets used as a transient storage (where/when required). There does not need to be a 'download X GiB data from S3 to HDFS first, only then begin jobs' step, as distributed jobs can read off of S3 via s3a:// URLs in the same way they do from HDFS hdfs://.

View solution in original post

1 REPLY 1

avatar
Master Guru

Yes that is correct, and the motivations/steps-to-use are reflected here too: https://www.cloudera.com/documentation/enterprise/6/latest/topics/cm_s3guard.html

 

Note: On your point of 'load data from S3 into HDFS', it is better stated as simply 'read data from S3', where HDFS gets used as a transient storage (where/when required). There does not need to be a 'download X GiB data from S3 to HDFS first, only then begin jobs' step, as distributed jobs can read off of S3 via s3a:// URLs in the same way they do from HDFS hdfs://.

Labels