05-19-2019 05:06 AM - edited 05-19-2019 06:17 AM
Cloudera's suggested best practice is to use S3 only for initial and final storage. The intermediate files will need to be stored in HDFS... In that case we are still using HDFS, but the cluster only runs during the batch ETL and is then torn down each day.
How can we pull S3 data into HDFS for each batch ETL job, and then store the final results back to S3?
If Cloudera means using distcp, how would that work for each batch ETL job? Using distcp for this did not make sense to me...
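To make the question concrete, here is a rough sketch of what I imagine the daily round trip would look like if distcp is the intended tool. It is only a guess at the workflow, not something taken from Cloudera's documentation. The helper names (`distcp_command`, `run_batch`), the bucket and the paths are all hypothetical, and it assumes the S3A connector (`s3a://` URIs) is already configured with credentials on the transient cluster.

```python
import subprocess


def distcp_command(src, dst):
    """Build the argv for a plain `hadoop distcp <src> <dst>` copy."""
    return ["hadoop", "distcp", src, dst]


def run_batch(input_s3, staging_in, staging_out, output_s3, etl_job,
              runner=subprocess.run):
    """Stage input from S3 into HDFS, run the ETL, copy results back to S3.

    All URIs are supplied by the caller; `etl_job` is a hypothetical
    callable that reads from `staging_in` and writes to `staging_out`.
    `runner` defaults to subprocess.run and can be swapped for a stub.
    """
    # 1. Pull the initial data set from S3 into HDFS on the fresh cluster.
    runner(distcp_command(input_s3, staging_in), check=True)

    # 2. Run the ETL; intermediate files stay on HDFS.
    etl_job(staging_in, staging_out)

    # 3. Push only the final results back to S3 before the cluster is torn down.
    runner(distcp_command(staging_out, output_s3), check=True)
```

With made-up locations, a daily run would then be something like `run_batch("s3a://my-bucket/raw/2019-05-19", "hdfs:///etl/in", "hdfs:///etl/out", "s3a://my-bucket/final/2019-05-19", my_etl)`, where `my_etl` stands for whatever launches the actual job. Is this the pattern that is being recommended?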