
S3 loading into HDFS

Explorer

Hi all,

Cloudera suggests as a best practice using S3 storage only for the initial input and the final output; the intermediate files need to be stored in HDFS. In that case we are still using HDFS, but the cluster only runs during the batch ETL and is then torn down daily.

How can we pull S3 data into HDFS for each batch ETL job, and then store the final results back to S3?

If Cloudera means we should use DistCp, how would that work for each batch ETL job? Using DistCp for this did not make sense to me...
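
For concreteness, here is roughly what I understand the DistCp approach to look like, driven from Python. This is only a sketch: the bucket name, date and staging paths are made up, and it assumes the `hadoop` CLI is on the PATH with S3 credentials already configured.

```python
import subprocess

def distcp(src: str, dst: str) -> None:
    """Copy between filesystems with Hadoop DistCp."""
    subprocess.run(["hadoop", "distcp", src, dst], check=True)

# Stage the raw input from S3 into HDFS before the batch ETL job runs.
distcp("s3a://my-bucket/input/2024-01-01/", "hdfs:///etl/staging/input/")

# ... run the batch ETL against hdfs:///etl/staging/ ...

# Copy the final results back to S3 before the cluster is torn down.
distcp("hdfs:///etl/output/", "s3a://my-bucket/output/2024-01-01/")
```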

 

Thanks

CK

4 REPLIES

Mentor
You do not need to pull files into HDFS as a step in your processing: CDH provides built-in connectors that let jobs read input from and write output directly to S3 storage (s3a:// URIs, backed by configuration that supplies credentials and endpoints).

This page is a good starting reference for setting up S3 access on cloud installations:
https://www.cloudera.com/documentation/director/latest/topics/director_s3_object_storage.html
Make sure to check out the pages linked from the opening paragraph too.
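
As a rough illustration, a Spark job can read its input from and write its results to S3 directly. This is only a sketch: the bucket, paths and column name are placeholders, and the credentials could just as well come from core-site.xml or an instance role instead of job configuration.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-direct-etl")
    # One option for credentials; core-site.xml or IAM roles also work.
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# Read the raw input straight from S3: no HDFS staging step required.
raw = spark.read.csv("s3a://my-bucket/input/", header=True)

# ... the batch ETL transformations go here ...
result = raw.groupBy("some_column").count()

# Write the final output directly back to S3.
result.write.mode("overwrite").parquet("s3a://my-bucket/output/")

spark.stop()
```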

Explorer
Hi

Thanks for that. So I assume I will have to create an external Hive table pointing to S3 and copy the data from there into another internal Hive table on HDFS to start the ETL?
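
Something like the following is what I have in mind; it is only a sketch, and the table names, columns and bucket path are made up:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-to-hdfs-staging")
    .enableHiveSupport()
    .getOrCreate()
)

# External table: only metadata in the metastore, the data stays in S3.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_events_s3 (
        event_id STRING,
        event_ts TIMESTAMP,
        payload  STRING
    )
    STORED AS PARQUET
    LOCATION 's3a://my-bucket/raw/events/'
""")

# Internal (managed) table on HDFS holding the staged copy for the ETL.
spark.sql("""
    CREATE TABLE raw_events_hdfs
    STORED AS PARQUET
    AS SELECT * FROM raw_events_s3
""")
```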

Thanks
CK

Mentor

[Accepted solution. The full text of this reply is available only to logged-in community members.]

Explorer
Thanks for this. I think we can summarize it as follows:

* If only an external Hive table is used to process the S3 data, the technical issues around consistency and scalable metadata handling are resolved.
* If external and internal Hive tables are used in combination to process the S3 data, the technical issues around consistency, scalable metadata handling and data locality are resolved (see the write-back sketch after this list).
* If Spark alone is used on top of S3, the technical issues around consistency and scalable metadata handling are resolved by in-memory processing, since Spark keeps intermediate data in memory and only reads the initial input from S3 and writes the final result back.
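
To round out the combined pattern from the second point, the final write-back step might look something like this. Again only a sketch: the table names and results path are placeholders, and it reuses the hypothetical raw_events_hdfs staging table from the earlier example.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-to-s3-publish")
    .enableHiveSupport()
    .getOrCreate()
)

# External results table whose data lives in S3.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS results_s3 (
        event_id STRING,
        cnt      BIGINT
    )
    STORED AS PARQUET
    LOCATION 's3a://my-bucket/results/'
""")

# Publish the final output to S3; the cluster can be torn down afterwards.
spark.sql("""
    INSERT OVERWRITE TABLE results_s3
    SELECT event_id, COUNT(*) AS cnt
    FROM raw_events_hdfs
    GROUP BY event_id
""")
```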