08-29-2019 07:51 AM
Hi CK,

Here are some tips to help answer your question.

First, use cloud storage services, like S3 on AWS or ADLS on Azure, for keeping data like Navigator lineage information. Those services provide availability and reliability automatically. Hadoop and other services can be configured to use cloud storage in various ways instead of local block (hard drive) storage.

Sometimes it's not efficient or performant enough to use cloud storage exclusively. For example, a typical data analysis job may have several stages where data is read, processed, and then written back, and those round trips to storage services can be slower and cost more money than local drive access. So, think about adjusting how data is managed, so that intermediate data resides in local block storage but final results are sent to cloud storage for safekeeping (see the sketch at the end of this post).

Once all of the important data is safe in cloud storage, it becomes less important to keep cluster instances running. You can even destroy entire clusters, including master/manager nodes, knowing that the data is safe in cloud storage. At this point, you will want to use automation tools, like Altus Director or Cloudbreak, so that you can easily spin up new clusters that are configured to pull their initial data from cloud storage. Then, you only run clusters when you need them.

If that isn't feasible, you can still do something like what you suggest, with clusters that have some permanent nodes and some transient ones. If so, ensure that those transient nodes do not keep important state that isn't safe elsewhere. For example, YARN node managers are stateless, so scaling nodes that only house those ("worker" nodes) is an easy goal to achieve. By contrast, HDFS datanodes store file data, so those aren't as easy to scale down. You can, though, as long as they are decommissioned properly using Cloudera Manager or Ambari, which the cloud automation tools handle for you.
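Here is a minimal PySpark sketch of the intermediate-local / final-to-cloud pattern described above. The bucket name, paths, and column names are all hypothetical, and the exact staging layout would depend on your cluster; treat this as an illustration, not a prescribed recipe.

```python
# Hypothetical sketch: keep intermediate data on cluster-local HDFS,
# persist only final results to S3. Bucket, paths, and schema are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-intermediate-s3-final").getOrCreate()

# Initial input is read straight off cloud storage via the s3a connector.
events = spark.read.parquet("s3a://example-bucket/raw/events/")

# Stage 1: an expensive shuffle; write the intermediate result to local
# HDFS so later stages (and retries) avoid extra S3 round trips.
sessions = events.groupBy("user_id", "session_id").count()
sessions.write.mode("overwrite").parquet("hdfs:///tmp/intermediate/sessions/")

# Stage 2: read the intermediate data back from fast local block storage.
sessions = spark.read.parquet("hdfs:///tmp/intermediate/sessions/")
totals = sessions.groupBy("user_id").sum("count")

# Final results go to S3 for safekeeping; after this, the cluster (and its
# HDFS) holds nothing irreplaceable and can be torn down.
totals.write.mode("overwrite").parquet("s3a://example-bucket/results/totals/")
```

Keeping the shuffle-heavy intermediate result on HDFS avoids repeated round trips to the storage service, while the final s3a:// write is what makes the cluster itself disposable.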
07-09-2019 12:53 AM
Yes, that is correct, and the motivation and steps to use it are covered here too: https://www.cloudera.com/documentation/enterprise/6/latest/topics/cm_s3guard.html

Note: on your point of 'load data from S3 into HDFS', it is better stated as simply 'read data from S3', with HDFS used as transient storage where and when required. There does not need to be a 'download X GiB of data from S3 to HDFS first, only then begin jobs' step, as distributed jobs can read off of S3 via s3a:// URLs in the same way they read from HDFS via hdfs:// URLs.
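To illustrate, here is a minimal PySpark sketch with a made-up bucket, path, and column name; the only difference between the HDFS read and the S3 read is the URL scheme. (With S3Guard enabled per the linked doc, fs.s3a.metadatastore.impl points at the DynamoDB metadata store so that S3 directory listings stay consistent.)

```python
# Minimal sketch: distributed reads work the same way against HDFS and S3;
# only the URL scheme differs. Bucket, path, and column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-direct-read").getOrCreate()

# Reading from HDFS...
df_hdfs = spark.read.parquet("hdfs:///data/events/")

# ...and reading directly off S3, with no staging copy into HDFS first.
df_s3 = spark.read.parquet("s3a://example-bucket/data/events/")

# Downstream processing is identical either way.
df_s3.groupBy("event_type").count().show()
```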
07-01-2019 03:34 PM
Hi Bill,

Thanks a lot for the long explanation. So two options exist (please correct me if I am wrong):

1) Use persistent master nodes and expand with extra temporary workers for on-demand workloads.
2) Alternatively, have components (Hive, Navigator, etc.) save their metadata to S3 or RDS so that the whole cluster can be torn down and then created from scratch repeatedly. (That is the better option for cost savings.)

Many thanks,
Cengiz
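As a rough sketch of option 2, the Hive metastore can be pointed at an external RDS database through the standard javax.jdo connection properties, so that no cluster node holds irreplaceable metadata. The JDBC endpoint and credentials below are placeholders; in a managed deployment these settings would normally live in hive-site.xml (set via Cloudera Manager) rather than in job code.

```python
# Hypothetical sketch: run Spark against a Hive metastore whose backing
# database is an external RDS instance. Endpoint and credentials are fake.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("external-metastore")
    # Standard Hive metastore JDBC settings, passed through to Hive via the
    # spark.hadoop.* prefix:
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:mysql://example-rds.us-east-1.rds.amazonaws.com:3306/hive")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
            "com.mysql.jdbc.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "********")
    .enableHiveSupport()
    .getOrCreate()
)
```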
05-20-2019 04:18 AM
Thanks for this. I think we can summarize it as follows:

* If only external Hive tables are used to process S3 data, the technical issues regarding consistency and scalable metadata handling would be resolved (see the sketch below).
* If external and internal Hive tables are used in combination to process S3 data, the technical issues regarding consistency, scalable metadata handling, and data locality would be resolved.
* If Spark alone is used on top of S3, the technical issues regarding consistency and scalable metadata handling would be resolved through in-memory processing, as Spark keeps transient data in memory and only reads the initial data from S3 and writes back the result.
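For the external-table case in the first bullet, here is a minimal sketch; the table name, schema, and bucket are hypothetical, and it assumes a Spark build with Hive support and a configured s3a connector.

```python
# Hedged sketch: an external Hive table whose data lives entirely in S3.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("external-table-on-s3")
    .enableHiveSupport()
    .getOrCreate()
)

# Because the table is EXTERNAL, dropping it later removes only the
# metastore entry; the Parquet files in S3 are left untouched.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        user_id STRING,
        event_time TIMESTAMP,
        payload STRING
    )
    STORED AS PARQUET
    LOCATION 's3a://example-bucket/warehouse/events/'
""")

spark.sql("SELECT COUNT(*) FROM events").show()
```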