Created on 05-19-2019 05:06 AM - edited 05-19-2019 06:17 AM
Hi All
Cloudera suggests as a best practice using S3 storage only for the initial input and final output. The intermediate files need to be stored in HDFS... In that case we are still using HDFS, but the cluster will only run during the batch ETL and then be torn down daily.
How can we pull S3 data into HDFS for each batch ETL job, and then store the final results back to S3?
If Cloudera means using distcp, how would that work for batch ETL jobs each time? Using distcp did not make sense to me...
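For context, a minimal sketch of how distcp could bracket a batch ETL run: copy the day's input from S3 into HDFS, run the job against HDFS, then copy the results back to S3 before tearing the cluster down. The bucket names, paths, and job command below are hypothetical placeholders, and this assumes the cluster's `s3a://` connector is already configured with credentials.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical locations -- substitute your own bucket and paths.
RUN_DATE=$(date +%F)
S3_IN="s3a://my-bucket/raw/${RUN_DATE}/"       # today's raw input on S3
S3_OUT="s3a://my-bucket/results/${RUN_DATE}/"  # final output location on S3
HDFS_IN="/etl/input/${RUN_DATE}"               # HDFS staging for input
HDFS_OUT="/etl/output/${RUN_DATE}"             # HDFS staging for output

# 1. Pull the batch input from S3 into HDFS.
hadoop distcp "${S3_IN}" "${HDFS_IN}"

# 2. Run the ETL job against HDFS (placeholder command).
#    Intermediate data stays entirely in HDFS.
spark-submit --class com.example.EtlJob etl.jar "${HDFS_IN}" "${HDFS_OUT}"

# 3. Push only the final results back to S3.
hadoop distcp "${HDFS_OUT}" "${S3_OUT}"
```

Since these commands require a live Hadoop cluster and S3 credentials, this is only an illustration of the in/out pattern, not something runnable standalone. Note that many engines can also read `s3a://` paths directly, which would skip the distcp staging step entirely at the cost of slower intermediate I/O.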
Thanks
CK