<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question S3 loading into HDFS in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90678#M35262</link>
    <description>&lt;P&gt;Hi All&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Cloudera suggests, as a best practice, using S3 only for initial and final storage. The intermediate files will need to be stored in HDFS... In that case, we are still using HDFS, but the cluster will only run during the batch ETL and then be torn down daily.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;How can we pull S3 data into HDFS for each batch ETL job, and then store the final results back to S3?&lt;/P&gt;&lt;P&gt;If Cloudera means to use distcp, how would that work for each batch ETL run? Using distcp did not make sense to me...&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;CK&lt;/P&gt;</description>
    <pubDate>Sun, 19 May 2019 13:17:27 GMT</pubDate>
    <dc:creator>CK71</dc:creator>
    <dc:date>2019-05-19T13:17:27Z</dc:date>
    <item>
      <title>S3 loading into HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90678#M35262</link>
      <description>&lt;P&gt;Hi All&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Cloudera suggests, as a best practice, using S3 only for initial and final storage. The intermediate files will need to be stored in HDFS... In that case, we are still using HDFS, but the cluster will only run during the batch ETL and then be torn down daily.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;How can we pull S3 data into HDFS for each batch ETL job, and then store the final results back to S3?&lt;/P&gt;&lt;P&gt;If Cloudera means to use distcp, how would that work for each batch ETL run? Using distcp did not make sense to me...&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;CK&lt;/P&gt;</description>
      <pubDate>Sun, 19 May 2019 13:17:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90678#M35262</guid>
      <dc:creator>CK71</dc:creator>
      <dc:date>2019-05-19T13:17:27Z</dc:date>
    </item>
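    <!--
      A minimal sketch of the staging flow the question asks about, assuming
      distcp is the chosen mechanism. The bucket name and paths (my-bucket,
      /etl/staging) are hypothetical.

        # Stage the day's input from S3 into HDFS before the batch run
        hadoop distcp -update s3a://my-bucket/input/2019-05-19/ hdfs:///etl/staging/input/

        # ... run the batch ETL against hdfs:///etl/staging/ ...

        # Push the final results back to S3; the cluster can then be torn down
        hadoop distcp -update hdfs:///etl/staging/output/ s3a://my-bucket/output/2019-05-19/
    -->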
    <item>
      <title>Re: S3 loading into HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90685#M35263</link>
      <description>You do not need to pull files into HDFS as a step in your processing, as CDH provides built-in connectors to read input from and write output to S3 storage directly (s3a:// URIs, backed by configuration that provides credentials and targets).&lt;BR /&gt;&lt;BR /&gt;This page is a good starting reference for setting up S3 access on cloud installations:&lt;BR /&gt;&lt;A href="https://www.cloudera.com/documentation/director/latest/topics/director_s3_object_storage.html" target="_blank"&gt;https://www.cloudera.com/documentation/director/latest/topics/director_s3_object_storage.html&lt;/A&gt;&lt;BR /&gt;Make sure to check out the page links from the opening paragraph too.&lt;BR /&gt;</description>
      <pubDate>Mon, 20 May 2019 01:04:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90685#M35263</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2019-05-20T01:04:15Z</dc:date>
    </item>
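    <!--
      A minimal sketch of the direct s3a:// access the reply describes. The
      fs.s3a.* property names are the standard Hadoop S3A ones; the key values
      and bucket are hypothetical, and in practice the credentials would
      normally live in core-site.xml rather than on the command line.

        # List S3 input directly, credentials passed as Hadoop properties
        hadoop fs -D fs.s3a.access.key=AKIAEXAMPLE -D fs.s3a.secret.key=SECRETEXAMPLE -ls s3a://my-bucket/input/
    -->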
    <item>
      <title>Re: S3 loading into HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90695#M35264</link>
      <description>Hi&lt;BR /&gt;&lt;BR /&gt;Thanks for that. So I assume I will have to create an external Hive table pointing to S3 and copy the data from there into another internal Hive table on HDFS to start the ETL?&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;BR /&gt;CK&lt;BR /&gt;</description>
      <pubDate>Mon, 20 May 2019 07:20:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90695#M35264</guid>
      <dc:creator>CK71</dc:creator>
      <dc:date>2019-05-20T07:20:15Z</dc:date>
    </item>
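    <!--
      A minimal sketch of the external-plus-internal table approach asked about
      here, assuming hypothetical table and column names. The external table
      points at S3; the CTAS copies the data into a Hive-managed table on HDFS.

        hive -e "
        CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (id STRING, ts STRING, payload STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION 's3a://my-bucket/input/';

        CREATE TABLE IF NOT EXISTS staged_events STORED AS ORC
        AS SELECT * FROM raw_events;
        "
    -->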
    <item>
      <title>Re: S3 loading into HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90696#M35265</link>
      <description>You can apply the queries directly on that external table. Hive will use HDFS for any transient storage it requires as part of the query stages.&lt;BR /&gt;&lt;BR /&gt;Of course, if it is a set of queries overall, you can also store all the intermediate temporary tables on HDFS in the way you describe, but the point I am trying to make is that you do not need to copy the original data as-is; just allow Hive to read from S3 and write into S3 at the points that matter.&lt;BR /&gt;</description>
      <pubDate>Mon, 20 May 2019 08:14:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90696#M35265</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2019-05-20T08:14:15Z</dc:date>
    </item>
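    <!--
      A minimal sketch of querying the S3-backed external table directly, as
      the reply suggests, reusing the hypothetical raw_events table from the
      sketch above. Hive keeps transient query stages in its HDFS scratch
      directory (hive.exec.scratchdir); only the final INSERT lands on S3.

        hive -e "
        CREATE EXTERNAL TABLE IF NOT EXISTS results (id STRING, total BIGINT)
        STORED AS PARQUET
        LOCATION 's3a://my-bucket/output/';

        INSERT OVERWRITE TABLE results
        SELECT id, COUNT(*) FROM raw_events GROUP BY id;
        "
    -->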
    <item>
      <title>Re: S3 loading into HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90701#M35266</link>
      <description>Thanks for this. I think we can summarize this as follows:&lt;BR /&gt;&lt;BR /&gt;* If only an external Hive table is used to process S3 data, the technical issues regarding consistency and scalable metadata handling would be resolved.&lt;BR /&gt;* If external and internal Hive tables are used in combination to process S3 data, the technical issues regarding consistency, scalable metadata handling, and data locality would be resolved.&lt;BR /&gt;* If Spark alone is used on top of S3, the technical issues regarding consistency (with in-memory processing) and scalable metadata handling would be resolved, as Spark will keep transient storage in memory and only read the initial data from S3 and write the result back.&lt;BR /&gt;</description>
      <pubDate>Mon, 20 May 2019 11:18:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/S3-loading-into-HDFS/m-p/90701#M35266</guid>
      <dc:creator>CK71</dc:creator>
      <dc:date>2019-05-20T11:18:15Z</dc:date>
    </item>
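    <!--
      A minimal sketch of the Spark-only variant in the summary, assuming the
      Apache Spark spark-sql CLI is available and shares the cluster's Hive
      metastore, and reusing the hypothetical tables from the sketches above.
      Spark keeps intermediate state in memory and local shuffle files, so only
      the initial read and the final write touch S3.

        spark-sql -e "
        INSERT OVERWRITE TABLE results
        SELECT id, COUNT(*) FROM raw_events GROUP BY id;
        "
    -->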
  </channel>
</rss>