S3Guard Suggested to help fix Consistency


Hi All,

Is Cloudera suggesting S3Guard as the solution for the consistency problem in multi-step ETL? I ask because the reference architecture suggests loading data from S3 into HDFS and then writing back to S3.


Thanks

CK

1 ACCEPTED SOLUTION


Yes, that is correct. The motivation and the steps to use it are also covered here: https://www.cloudera.com/documentation/enterprise/6/latest/topics/cm_s3guard.html

Note: On your point about 'load data from S3 into HDFS', it is better stated simply as 'read data from S3', with HDFS used as transient storage where and when required. There is no need for a 'download X GiB of data from S3 to HDFS first, and only then begin jobs' step, because distributed jobs can read from S3 via s3a:// URLs in the same way they read from HDFS via hdfs:// URLs.
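
To illustrate the point about s3a:// and hdfs:// URLs, here is a minimal PySpark-style sketch. It is not taken from the Cloudera documentation: the helper names `s3guard_spark_conf` and `copy_dataset`, the Parquet format, and the table, region, bucket, and path values are all assumptions made for the example. The property names are the Hadoop 3.x S3A/S3Guard settings; on a Cloudera Manager cluster they would normally be set through the steps in the linked page rather than in job code.

```python
def s3guard_spark_conf(table: str, region: str) -> dict:
    """Build Spark options that forward S3Guard settings to the S3A connector.

    The 'spark.hadoop.' prefix makes Spark pass each key to the Hadoop
    configuration. The table and region arguments are placeholders.
    """
    return {
        "spark.hadoop.fs.s3a.metadatastore.impl":
            "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore",
        "spark.hadoop.fs.s3a.s3guard.ddb.table": table,
        "spark.hadoop.fs.s3a.s3guard.ddb.region": region,
    }


def copy_dataset(spark, src_uri: str, dst_uri: str) -> None:
    """Read Parquet from src_uri and write it to dst_uri.

    Nothing here depends on the URI scheme: an s3a:// path is read and
    written by the distributed job exactly as an hdfs:// path would be.
    """
    spark.read.parquet(src_uri).write.mode("overwrite").parquet(dst_uri)


# Hypothetical usage (requires pyspark and a configured cluster):
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("etl-step")
#   for key, value in s3guard_spark_conf("my-s3guard-table", "us-west-2").items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
#   copy_dataset(spark, "s3a://my-bucket/raw/", "hdfs:///tmp/etl/stage1/")
#   copy_dataset(spark, "hdfs:///tmp/etl/stage1/", "s3a://my-bucket/curated/")
```

In this sketch HDFS appears only as the intermediate location between two steps, which is the 'transient storage' role described above; the first step reads straight from S3 without a separate download.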
