Accepted Solution

S3Guard Suggested to help fix Consistency

Hi All


Is Cloudera suggesting S3Guard as the solution to the consistency problem in multi-step ETL? I ask because the reference architecture suggests loading data from S3 into HDFS and then writing the results back to S3.






Re: S3Guard Suggested to help fix Consistency


Yes, that is correct, and the motivations and steps to use it are reflected here too:


Note: On your point of 'load data from S3 into HDFS', it is better stated as simply 'read data from S3', with HDFS used as transient storage where and when required. There is no need for a separate 'download X GiB of data from S3 to HDFS first, and only then begin jobs' step, because distributed jobs can read from S3 via s3a:// URLs in the same way they read from HDFS via hdfs:// URLs.
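
To illustrate the note above, here is a rough sketch of how the paths of a multi-step job could be laid out so that only the first read and the last write touch S3, while intermediate results stay on HDFS. It is plain Python and does not call any Hadoop or Spark API; the function names, the bucket and job names, and the hdfs:///tmp/etl scratch location are all made up for illustration.

```python
from urllib.parse import urlparse


def plan_etl_paths(bucket, job, num_steps, scratch_root="hdfs:///tmp/etl"):
    """Return an (input_uri, output_uri) pair for each step of a job.

    Hypothetical helper: the first step reads straight from s3a://,
    the last step writes straight to s3a://, and everything in between
    uses HDFS as transient storage.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")

    source = f"s3a://{bucket}/{job}/input"   # assumed input layout
    final = f"s3a://{bucket}/{job}/output"   # assumed output layout

    plan = []
    current_input = source
    for step in range(1, num_steps + 1):
        is_last = step == num_steps
        # Intermediate outputs go to the HDFS scratch area, not S3.
        output = final if is_last else f"{scratch_root}/{job}/step-{step}"
        plan.append((current_input, output))
        current_input = output  # the next step reads what this one wrote
    return plan


def scheme_of(uri):
    """Return the URI scheme, which is what selects the filesystem."""
    return urlparse(uri).scheme


if __name__ == "__main__":
    for src, dst in plan_etl_paths("example-bucket", "daily-load", 3):
        print(f"{scheme_of(src):>4} -> {scheme_of(dst):<4}  {src}  =>  {dst}")
```

Each step only sees a URI, so there is no separate bulk copy from S3 to HDFS before the first step runs. Whether the intermediate steps belong on HDFS at all depends on the job.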