Member since: 05-11-2019
Posts: 14
Kudos Received: 0
Solutions: 0
07-08-2019
08:19 AM
Hi All
Is Cloudera suggesting S3Guard as the solution for the consistency problem in multi-step ETL? I ask because in the reference architecture, the suggested option is to load data from S3 into HDFS and then write back to S3.
Thanks
CK
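For context, S3Guard is enabled through the s3a connector properties. A minimal sketch of what that might look like when building a Spark session; the DynamoDB table name, region and bucket below are hypothetical placeholders, not values from the reference architecture:

```python
from pyspark.sql import SparkSession

# Minimal sketch: enable S3Guard (DynamoDB-backed metadata store) for the s3a
# connector. Table name, region and bucket are hypothetical placeholders.
spark = (
    SparkSession.builder
    .appName("etl-with-s3guard")
    .config("spark.hadoop.fs.s3a.metadatastore.impl",
            "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
    .config("spark.hadoop.fs.s3a.s3guard.ddb.table", "my-s3guard-table")
    .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "eu-west-1")
    .getOrCreate()
)

# Reads and writes through s3a:// now consult DynamoDB for consistent listings.
df = spark.read.parquet("s3a://my-bucket/raw/events/")
```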
Labels:
- HDFS
07-01-2019
03:34 PM
Hi Bill
Thanks a lot for the long explanation. So two options exist (please correct me if I am wrong):
1) Use persistent master nodes and expand with extra worker nodes on demand for temporary workloads.
2) Alternatively, have components (Hive, Navigator, etc.) save their metadata to S3 or RDS so that the whole cluster can be torn down and then re-created from scratch repeatedly. (That is the better option for cost savings.)
Many Thanks,
Cengiz
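As a rough illustration of option 2, the Hive metastore can be pointed at an external database (for example an RDS instance) so the cluster itself stays disposable. A sketch under that assumption; the JDBC endpoint, database name and credentials are hypothetical placeholders, and in a managed cluster these would normally live in hive-site.xml rather than in job code:

```python
from pyspark.sql import SparkSession

# Sketch of option 2: keep Hive metadata in an external RDS database so the
# cluster can be torn down and re-created. Endpoint/credentials are placeholders.
spark = (
    SparkSession.builder
    .appName("transient-cluster-etl")
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:mysql://my-rds-endpoint:3306/hive_metastore")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
            "com.mysql.jdbc.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "********")
    .enableHiveSupport()
    .getOrCreate()
)
```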
06-28-2019
01:46 PM
Hi All
Cloud deployment of the Hadoop stack on a public cloud provides cost/performance benefits for temporary/batch workloads. But the master/management nodes need to stay permanently up and running to maintain the metadata for Navigator, all the other configuration that has been made, etc.
What is the best practice for running a temporary cluster on the public cloud without these caveats (for example: to avoid losing lineage in Navigator, the cluster has to remain permanent), and how can a minimal permanent cluster be created with data nodes added/removed daily?
Thanks
CK
Labels:
- Cloudera Navigator
05-20-2019
04:18 AM
Thanks for this. I think we can summarize it as follows:
* If only external Hive tables are used to process S3 data, the technical issues regarding consistency and scalable metadata handling would be resolved.
* If external and internal Hive tables are used in combination to process S3 data, the technical issues regarding consistency, scalable metadata handling and data locality would be resolved.
* If Spark alone is used on top of S3, the technical issues regarding consistency (via in-memory processing) and scalable metadata handling would be resolved, as Spark keeps intermediate data in memory and only reads the initial data from S3 and writes back the result.
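To illustrate the last bullet, a minimal PySpark sketch of the Spark-only pattern; the bucket, paths and transformation are made up for illustration. The initial read and final write hit S3, while all intermediate results stay in memory:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-only-on-s3").getOrCreate()

# Initial read from S3 (bucket/path are illustrative placeholders).
raw = spark.read.parquet("s3a://my-bucket/raw/events/")

# Intermediate steps stay in Spark memory (optionally cached), not on S3 or HDFS.
cleaned = raw.filter(F.col("event_type").isNotNull()).cache()
daily = cleaned.groupBy("event_date").agg(F.count("*").alias("events"))

# Final result written back to S3.
daily.write.mode("overwrite").parquet("s3a://my-bucket/curated/daily_events/")
```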
05-20-2019
12:20 AM
Hi
Thanks for that. So I assume I will have to create an external Hive table pointing to S3 and copy the data from there into another, internal Hive table on HDFS to start the ETL?
Thanks
CK
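A minimal sketch of that pattern using Spark SQL with Hive support; the database, table, column and bucket names are hypothetical. An external table is defined over the S3 data (metadata only), then an internal (managed) table on HDFS is populated from it before the ETL starts:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-to-hdfs-staging")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS staging")

# External table: metadata only, data stays on S3 (names are placeholders).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.events_s3 (
        id BIGINT, event_type STRING, event_date STRING
    )
    STORED AS PARQUET
    LOCATION 's3a://my-bucket/raw/events/'
""")

# Internal (managed) table on HDFS, loaded from the external table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS staging.events_hdfs (
        id BIGINT, event_type STRING, event_date STRING
    )
    STORED AS PARQUET
""")
spark.sql("INSERT OVERWRITE TABLE staging.events_hdfs SELECT * FROM staging.events_s3")
```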
05-19-2019
05:06 AM
Hi All
Cloudera suggests as a best practice using S3 storage only for the initial and final data; the intermediate files need to be stored in HDFS. In that case we are still using HDFS, but the cluster only runs during the batch ETL and is then torn down daily. How can we pull S3 data into HDFS for each batch ETL job, and then store the final results back to S3? If Cloudera means to use distcp, how would that work for each batch ETL job? Using distcp did not make sense to me.
Thanks
CK
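If distcp is indeed what is meant, one way it could fit a scheduled batch job is as a staging step before the ETL and a publish step after it. A rough sketch, driven from Python only to keep the examples in one language; the paths are hypothetical placeholders:

```python
import subprocess

RAW_S3 = "s3a://my-bucket/raw/events/"
STAGING_HDFS = "hdfs:///staging/events/"
RESULTS_HDFS = "hdfs:///results/daily_events/"
RESULTS_S3 = "s3a://my-bucket/curated/daily_events/"

def run(cmd):
    """Run a command and fail the batch job if it fails."""
    subprocess.run(cmd, check=True)

# 1) Stage the day's input from S3 into HDFS before the ETL starts.
run(["hadoop", "distcp", "-overwrite", RAW_S3, STAGING_HDFS])

# 2) ... run the ETL (Hive/Spark jobs) against hdfs:///staging/events/ ...

# 3) Publish the final output from HDFS back to S3; the cluster can then be torn down.
run(["hadoop", "distcp", "-overwrite", RESULTS_HDFS, RESULTS_S3])
```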
Labels:
- HDFS