
S3 loading into HDFS

Explorer

Hi All

 

Cloudera suggests, as a best practice, using S3 only for initial and final storage. The intermediate files will need to be stored in HDFS... In that case, we are still using HDFS, but the cluster will only run during the batch ETL and then be torn down daily.

 

How can we pull S3 data into HDFS for each batch ETL job, and then store the final results back to S3?

If Cloudera means using distcp, how would that work for each ETL batch job? Using distcp did not make sense to me...

 

Thanks

CK


4 REPLIES

Mentor
You do not need to pull files into HDFS as a step in your processing: CDH provides built-in connectors that read input from and write output directly to S3 storage (s3a:// URIs, backed by configuration that provides credentials and targets).

This page is a good starting reference for setting up S3 access on cloud installations: https://www.cloudera.com/documentation/director/latest/topics/director_s3_object_storage.html (make sure to check out the page links from the opening paragraph too).
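
For illustration, a minimal sketch of what pointing Hive directly at S3 can look like. The bucket, table, and column names below are hypothetical, and in practice the s3a credentials are usually configured cluster-wide (e.g. in core-site.xml) rather than per session:

-- Hypothetical example: adjust bucket, paths, and schema to your data.
-- Credentials are normally set cluster-wide; the per-session SET lines are
-- shown only to illustrate which properties are involved.
SET fs.s3a.access.key=YOUR_ACCESS_KEY;
SET fs.s3a.secret.key=YOUR_SECRET_KEY;

-- Read input directly from S3; no copy into HDFS is required.
CREATE EXTERNAL TABLE raw_events (
  event_id   STRING,
  event_time STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3a://my-bucket/raw/events/';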

Explorer
Hi

Thanks for that. So I assume I will have to create an external Hive table pointing to S3 and copy the data from there into another internal Hive table on HDFS to start the ETL?

Thanks
CK

Mentor
You can apply the queries directly on that external table. Hive will use HDFS for any transient storage it requires as part of the query stages.

Of course, if it is a set of queries overall, you can also store all the intermediate temporary tables on HDFS in the way you describe, but the point I am trying to make is that you do not need to copy the original data as-is; just allow Hive to read off of S3 and write into S3 at the points that matter.
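
To make that concrete, here is a rough HiveQL sketch of that flow, reusing the hypothetical raw_events external table from above and assuming event_time is stored as an ISO-style string: the input is read straight from S3, an optional intermediate table on HDFS holds the transient results, and the final output is written back to an S3-backed table.

-- Optional: stage intermediate results in a managed table on HDFS
-- between query steps (Hive also uses HDFS for its own scratch space).
CREATE TABLE stage_daily_counts STORED AS PARQUET AS
SELECT substr(event_time, 1, 10) AS event_date,
       count(*)                  AS events
FROM raw_events
GROUP BY substr(event_time, 1, 10);

-- Final output: an external table whose LOCATION points back at S3.
CREATE EXTERNAL TABLE final_daily_counts (
  event_date STRING,
  events     BIGINT
)
STORED AS PARQUET
LOCATION 's3a://my-bucket/output/daily_counts/';

INSERT OVERWRITE TABLE final_daily_counts
SELECT event_date, events FROM stage_daily_counts;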

Explorer
Thanks for this. I think we can summarize this as follows:

* If only an external Hive table is used to process S3 data, the technical issues regarding consistency and scalable metadata handling would be resolved.
* If external and internal Hive tables are used in combination to process S3 data, the technical issues regarding consistency, scalable metadata handling, and data locality would be resolved.
* If Spark alone is used on top of S3, the technical issues regarding consistency (with in-memory processing) and scalable metadata handling would be resolved, as Spark keeps transient data in memory and only reads the initial data from S3 and writes the result back.