AWS S3 bucket as a primary storage for HDFS

Expert Contributor

Hi Guys,

 

I am looking for information about using an S3 bucket as the primary storage layer in place of HDFS. Has anyone done something like that? What are the pros and cons of such a solution?

 

Thanks,

Andrzej

1 ACCEPTED SOLUTION

Expert Contributor

Hi,

I don't think that's possible, given that most applications rely on HDFS semantics (strong consistency, POSIX-like file operations), while S3 simply isn't designed as a file system (it is an eventually consistent blob store). Plus, you lose data locality.
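To make the "blob store, not a file system" point concrete, here is a toy sketch in Python. It assumes the common Hadoop pattern of writing job output to a temporary directory and then renaming it into place. Both classes are hypothetical in-memory models, not real HDFS or S3 clients; they only count how many operations a directory rename needs in each model.

```python
# Toy model only: neither class is a real HDFS or S3 client.

class TreeFileSystem:
    """Hierarchical namespace: a directory is a node holding its children."""

    def __init__(self):
        self.root = {}

    def write(self, path, data):
        *dirs, name = path.strip("/").split("/")
        node = self.root
        for d in dirs:
            node = node.setdefault(d, {})
        node[name] = data

    def rename_dir(self, src, dst):
        # Top-level directories only, to keep the sketch short.
        self.root[dst] = self.root.pop(src)
        return 1  # one metadata update, however many files are inside


class FlatObjectStore:
    """Flat key space: "directories" are just shared key prefixes."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src, dst):
        # No rename primitive: copy each object to its new key, then delete the old key.
        keys = [k for k in self.objects if k.startswith(src + "/")]
        for k in keys:
            self.objects[dst + k[len(src):]] = self.objects[k]
            del self.objects[k]
        return 2 * len(keys)  # one copy plus one delete per object


fs, store = TreeFileSystem(), FlatObjectStore()
for i in range(1000):
    fs.write(f"/tmp_out/part-{i:05d}", b"x")
    store.put(f"tmp_out/part-{i:05d}", b"x")

fs_ops = fs.rename_dir("tmp_out", "final")
store_ops = store.rename_prefix("tmp_out", "final")
print(fs_ops, store_ops)
```

In the tree model the rename is a single step, so readers see either the old directory or the new one. In the flat model it is a long series of copies and deletes, so a failure partway through leaves output split across both prefixes.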

 

As far as I know, most cloud deployments still use HDFS for temporary, intermediate data and S3 for permanent, final storage.
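One way to picture that split, as a rough sketch: jobs write to HDFS while they run, and the finished output is then copied to S3 with `hadoop distcp`. The snippet below only builds the command line and does not run it. The namenode host, bucket name and directory names are placeholders, and the `s3a://` scheme assumes the S3A connector is configured on the cluster.

```python
import shlex


def distcp_command(src, dst):
    """Build the argv for copying src to dst with `hadoop distcp`."""
    return ["hadoop", "distcp", src, dst]


# Hypothetical paths: the namenode host, bucket and directories are placeholders.
intermediate = "hdfs://namenode:8020/tmp/job-output"
permanent = "s3a://example-bucket/warehouse/job-output"

cmd = distcp_command(intermediate, permanent)
print(shlex.join(cmd))  # pass cmd to subprocess.run() on a host with Hadoop installed
```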

 

There have been several studies on using HDFS as the metadata store and the cloud as the data store, but that's a huge amount of work (see HDFS-9806) and probably in the Hadoop 4/CDH 7 timeframe.

 

Hope this helps.


3 REPLIES

Mentor
Do you mean using S3 instead of HDFS? That would be a good idea for some cloud-environment clusters, such as those Cloudera Director helps run.

Keep an eye out for our upcoming 5.9 release too, which brings several further cloud environment enhancements (incl. better S3 support).

Expert Contributor

I would like to ingest all my data into S3 and make it the primary storage layer (not a backup). It would be a cloud-based environment, e.g. deployed with Cloudera Director. Is it possible to specify the storage type at deployment time?

 

I would like to run YARN, Spark, and Oozie jobs.

 

 
