Member since: 05-11-2019
Posts: 14
Kudos Received: 0
Solutions: 0
07-08-2019
08:19 AM
Hi All
Is Cloudera suggesting S3Guard as the solution for the consistency problem in multi-step ETL? I ask because in the reference architecture, the suggested option is to load data from S3 into HDFS and then write back to S3.
Thanks
CK
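For context, S3Guard is enabled through the s3a connector properties. A minimal sketch of what that might look like when building a Spark session; the DynamoDB table name, region and bucket below are hypothetical placeholders, not values from the reference architecture:

```python
from pyspark.sql import SparkSession

# Minimal sketch: enable S3Guard (DynamoDB-backed metadata store) for the s3a
# connector. Table name, region and bucket are hypothetical placeholders.
spark = (
    SparkSession.builder
    .appName("etl-with-s3guard")
    .config("spark.hadoop.fs.s3a.metadatastore.impl",
            "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
    .config("spark.hadoop.fs.s3a.s3guard.ddb.table", "my-s3guard-table")
    .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "eu-west-1")
    .getOrCreate()
)

# Reads and writes through s3a:// now consult DynamoDB for consistent listings.
df = spark.read.parquet("s3a://my-bucket/raw/events/")
```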
Labels:
- HDFS
07-01-2019
03:34 PM
Hi Bill
Thanks a lot for the long explanation. So two options exist (please correct me if I am wrong):
1) Use persistent master nodes and expand with extra worker nodes on demand for temporary workloads.
2) Alternatively, have components (Hive, Navigator, etc.) save their metadata to S3 or RDS so that the whole cluster can be torn down and then re-created from scratch repeatedly. (That is the better option for cost savings.)
Many Thanks,
Cengiz
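As a rough illustration of option 2, the Hive metastore can be pointed at an external database (for example an RDS instance) so the cluster itself stays disposable. A sketch under that assumption; the JDBC endpoint, database name and credentials are hypothetical placeholders, and in a managed cluster these would normally live in hive-site.xml rather than in job code:

```python
from pyspark.sql import SparkSession

# Sketch of option 2: keep Hive metadata in an external RDS database so the
# cluster can be torn down and re-created. Endpoint/credentials are placeholders.
spark = (
    SparkSession.builder
    .appName("transient-cluster-etl")
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:mysql://my-rds-endpoint:3306/hive_metastore")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
            "com.mysql.jdbc.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "********")
    .enableHiveSupport()
    .getOrCreate()
)
```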
06-28-2019
01:46 PM
Hi All
Cloud deployment of the Hadoop stack on a public cloud provides cost/performance benefits for temporary/batch workloads. But the master/management nodes need to stay permanently up and running to maintain the metadata for Navigator, all the other configuration that has been made, etc.
What is the best practice for running a temporary cluster on the public cloud without these caveats (for example: to avoid losing lineage in Navigator, the cluster has to remain permanent), and how can a minimal permanent cluster be created with data nodes added/removed daily?
Thanks
CK
Labels:
- Cloudera Navigator
05-20-2019
04:18 AM
Thanks for this. I think we can summarize it as follows:
* If only external Hive tables are used to process S3 data, the technical issues regarding consistency and scalable metadata handling would be resolved.
* If external and internal Hive tables are used in combination to process S3 data, the technical issues regarding consistency, scalable metadata handling and data locality would be resolved.
* If Spark alone is used on top of S3, the technical issues regarding consistency (via in-memory processing) and scalable metadata handling would be resolved, as Spark keeps intermediate data in memory and only reads the initial data from S3 and writes back the result.
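To illustrate the last bullet, a minimal PySpark sketch of the Spark-only pattern; the bucket, paths and transformation are made up for illustration. The initial read and final write hit S3, while all intermediate results stay in memory:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-only-on-s3").getOrCreate()

# Initial read from S3 (bucket/path are illustrative placeholders).
raw = spark.read.parquet("s3a://my-bucket/raw/events/")

# Intermediate steps stay in Spark memory (optionally cached), not on S3 or HDFS.
cleaned = raw.filter(F.col("event_type").isNotNull()).cache()
daily = cleaned.groupBy("event_date").agg(F.count("*").alias("events"))

# Final result written back to S3.
daily.write.mode("overwrite").parquet("s3a://my-bucket/curated/daily_events/")
```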
05-20-2019
12:20 AM
Hi
Thanks for that. So I assume I will have to create an external Hive table pointing to S3 and copy the data from there into another, internal Hive table on HDFS to start the ETL?
Thanks
CK
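A minimal sketch of that pattern using Spark SQL with Hive support; the database, table, column and bucket names are hypothetical. An external table is defined over the S3 data (metadata only), then an internal (managed) table on HDFS is populated from it before the ETL starts:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-to-hdfs-staging")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS staging")

# External table: metadata only, data stays on S3 (names are placeholders).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.events_s3 (
        id BIGINT, event_type STRING, event_date STRING
    )
    STORED AS PARQUET
    LOCATION 's3a://my-bucket/raw/events/'
""")

# Internal (managed) table on HDFS, loaded from the external table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS staging.events_hdfs (
        id BIGINT, event_type STRING, event_date STRING
    )
    STORED AS PARQUET
""")
spark.sql("INSERT OVERWRITE TABLE staging.events_hdfs SELECT * FROM staging.events_s3")
```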
05-19-2019
05:06 AM
Hi All
Cloudera suggests as a best practice using S3 storage only for the initial and final data; the intermediate files need to be stored in HDFS. In that case we are still using HDFS, but the cluster only runs during the batch ETL and is then torn down daily. How can we pull S3 data into HDFS for each batch ETL job, and then store the final results back to S3? If Cloudera means to use distcp, how would that work for each batch ETL job? Using distcp did not make sense to me.
Thanks
CK
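If distcp is indeed what is meant, one way it could fit a scheduled batch job is as a staging step before the ETL and a publish step after it. A rough sketch, driven from Python only to keep the examples in one language; the paths are hypothetical placeholders:

```python
import subprocess

RAW_S3 = "s3a://my-bucket/raw/events/"
STAGING_HDFS = "hdfs:///staging/events/"
RESULTS_HDFS = "hdfs:///results/daily_events/"
RESULTS_S3 = "s3a://my-bucket/curated/daily_events/"

def run(cmd):
    """Run a command and fail the batch job if it fails."""
    subprocess.run(cmd, check=True)

# 1) Stage the day's input from S3 into HDFS before the ETL starts.
run(["hadoop", "distcp", "-overwrite", RAW_S3, STAGING_HDFS])

# 2) ... run the ETL (Hive/Spark jobs) against hdfs:///staging/events/ ...

# 3) Publish the final output from HDFS back to S3; the cluster can then be torn down.
run(["hadoop", "distcp", "-overwrite", RESULTS_HDFS, RESULTS_S3])
```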
Labels:
- HDFS