07-09-2019 12:53 AM
Yes, that is correct, and the motivations and steps to use it are covered here as well: https://www.cloudera.com/documentation/enterprise/6/latest/topics/cm_s3guard.html Note: on your point of 'load data from S3 into HDFS', it is better stated as simply 'read data from S3', with HDFS used as transient storage where/when required. There is no need for a 'download X GiB of data from S3 to HDFS first, only then begin jobs' step: distributed jobs can read directly off S3 via s3a:// URLs in the same way they read from HDFS via hdfs:// URLs, as the sketch below shows.
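As a rough illustration (mine, not from the original post), here is a minimal PySpark sketch of a job reading straight off S3. The bucket name, paths, and column are hypothetical, and it assumes the cluster already has the S3A connector and AWS credentials configured:

    # Sketch only: a Spark job reading directly from S3 via the s3a://
    # scheme, exactly as it would read from hdfs:// -- no staging copy
    # into HDFS is needed. Bucket and paths are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-from-s3").getOrCreate()

    # Read input straight off S3 (hypothetical bucket/prefix).
    df = spark.read.parquet("s3a://my-bucket/input/events/")

    # Transform as usual; the data never lands on HDFS first.
    daily_counts = df.groupBy("event_date").count()

    # Write the result back to S3 the same way.
    daily_counts.write.parquet("s3a://my-bucket/output/daily_counts/")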
07-01-2019 03:34 PM
Hi Bill, Thanks a lot for the long explanation. So two options exist (please correct me if I am wrong):
1) Use persistent master nodes and expand with extra worker nodes on demand for temporary workloads.
2) Alternatively, have components (Hive, Navigator, etc.) save their metadata into S3 or RDS, so that the whole cluster can be torn down and recreated from scratch repeatedly. (That is the better option for cost savings; see the sketch after this list.)
Many thanks, Cengiz
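For option 2, here is a hedged sketch of what this can look like from the job side. It assumes a long-lived Hive Metastore service (e.g. backed by RDS) reachable at a hypothetical thrift://metastore-host:9083, and an S3 warehouse location; both names are placeholders:

    # Sketch only: a Spark session on a transient cluster that keeps no
    # state locally. Table metadata lives in an external Hive Metastore
    # (e.g. backed by RDS) and table data lives on S3, so the cluster
    # itself can be torn down and recreated at will.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("transient-cluster-job")
        # External metastore that outlives the cluster (hypothetical host).
        .config("hive.metastore.uris", "thrift://metastore-host:9083")
        # Warehouse directory on S3 rather than on cluster-local HDFS.
        .config("spark.sql.warehouse.dir", "s3a://my-bucket/warehouse/")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Tables created in this session register in the shared metastore and
    # store their data on S3, so they survive cluster teardown.
    spark.sql("SHOW DATABASES").show()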
05-20-2019 04:18 AM
Thanks for this. I think we can summarize it as follows:
* If only external Hive tables are used to process S3 data, the technical issues around consistency and scalable metadata handling are resolved.
* If external and internal Hive tables are used in combination to process S3 data, the technical issues around consistency, scalable metadata handling, and data locality are resolved.
* If Spark alone is used on top of S3, the technical issues around consistency and scalable metadata handling are resolved, since Spark keeps intermediate data in memory and only reads the initial input from S3 and writes the final result back.
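To make the first bullet concrete, a minimal sketch (my own illustration, with a hypothetical bucket, path, and table name) of defining an external Hive table over data that stays on S3:

    # Sketch: an external Hive table whose data lives on S3. Dropping the
    # table removes only the metastore entry; the S3 objects remain.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("external-table-on-s3")
        .enableHiveSupport()
        .getOrCreate()
    )

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS events (
            event_id   STRING,
            event_date STRING
        )
        STORED AS PARQUET
        LOCATION 's3a://my-bucket/input/events/'
    """)

    # Query it like any other Hive table; reads go straight to S3.
    spark.sql(
        "SELECT event_date, COUNT(*) FROM events GROUP BY event_date"
    ).show()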