S3Guard Suggested to help fix Consistency


Explorer

Hi All

 

Is Cloudera recommending S3Guard as the solution to the consistency problem in multi-step ETL? I ask because the reference architecture suggests loading data from S3 into HDFS and then writing it back to S3.

 

Thanks

CK

 

Accepted Solutions

Re: S3Guard Suggested to help fix Consistency

Master Guru

Yes, that is correct. The motivation and the steps to use it are covered here: https://www.cloudera.com/documentation/enterprise/6/latest/topics/cm_s3guard.html

 

Note: On your point about 'load data from S3 into HDFS', it is better stated as simply 'read data from S3', with HDFS used as transient storage where and when required. There is no need for a 'download X GiB of data from S3 to HDFS first, and only then begin jobs' step, because distributed jobs can read from S3 via s3a:// URLs in the same way they read from HDFS via hdfs:// URLs.
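
To make the point above concrete, here is a minimal Python sketch, not an official Cloudera procedure. It collects the S3Guard-related properties of the Hadoop S3A connector and renders them as spark-submit arguments using Spark's spark.hadoop.* prefix. The table name, region, and bucket paths are hypothetical placeholders, and on a CM-managed cluster you would normally set these through Cloudera Manager as the linked documentation describes rather than per job.

```python
from urllib.parse import urlparse


def s3guard_properties(table: str, region: str) -> dict:
    """Hadoop S3A properties that enable the DynamoDB-backed S3Guard store.

    The property names come from the Hadoop S3A connector; the values passed
    in (table, region) are placeholders chosen by the caller.
    """
    return {
        "fs.s3a.metadatastore.impl":
            "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore",
        "fs.s3a.s3guard.ddb.table": table,
        "fs.s3a.s3guard.ddb.region": region,
    }


def as_spark_conf_args(props: dict) -> list:
    """Render Hadoop properties as spark-submit '--conf spark.hadoop.*' pairs."""
    args = []
    for key, value in sorted(props.items()):
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


def uses_s3a(path: str) -> bool:
    """True when a job input/output path is addressed through the S3A connector."""
    return urlparse(path).scheme == "s3a"


if __name__ == "__main__":
    # Hypothetical table, region, and paths, for illustration only.
    props = s3guard_properties("my-s3guard-table", "us-west-2")
    print(" ".join(as_spark_conf_args(props)))

    # Each ETL step reads and writes s3a:// paths directly; no HDFS staging copy.
    for path in ("s3a://example-bucket/raw/", "s3a://example-bucket/stage1/"):
        print(path, "-> read directly from S3:", uses_s3a(path))
```

The design point is that the job's paths stay as s3a:// URLs at every step. S3Guard only adds a consistent metadata layer behind the connector, so no job code has to change to stage data in HDFS first.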
