AWS S3 bucket as a primary storage for HDFS
Labels: HDFS
Created on 10-16-2016 03:05 AM - last edited on 10-16-2016 05:16 AM by cjervis
Hi Guys,
I am looking for information about using an S3 bucket as the primary storage for HDFS. Has anyone done something like that? What are the pros and cons of such a solution?
Thanks,
Andrzej
Created ‎10-16-2016 10:13 AM
Hi,
I don't think that's possible, given that most applications rely on HDFS semantics (strong consistency, POSIX-like behavior), whereas S3 simply isn't designed as a file system (it is an eventually consistent blob store). You also lose data locality.
As far as I know, most cloud use cases still use HDFS as temporary, intermediate storage and S3 as permanent, final storage.
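As a rough illustration of that pattern, here is a minimal sketch that builds a `hadoop distcp` command to copy finished job output from an HDFS staging directory to S3 through the `s3a://` connector. The helper name `build_distcp_command`, the staging path, and the bucket name are made up for the example, and it assumes S3 credentials are already configured on the cluster.

```python
import shlex


def build_distcp_command(hdfs_path, bucket, prefix):
    """Build the argv list for copying an HDFS directory to S3 with distcp.

    hdfs_path: absolute path on the cluster's default HDFS (hypothetical).
    bucket:    target S3 bucket name (hypothetical).
    prefix:    key prefix inside the bucket (hypothetical).
    """
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be an absolute path")
    # An empty authority (hdfs:///...) resolves against the default NameNode.
    source = "hdfs://" + hdfs_path
    # s3a is the Hadoop S3 connector scheme; strip stray slashes from the prefix.
    destination = "s3a://{}/{}".format(bucket, prefix.strip("/"))
    return ["hadoop", "distcp", source, destination]


if __name__ == "__main__":
    cmd = build_distcp_command(
        "/staging/job-output", "example-bucket", "warehouse/job-output"
    )
    # Print the command instead of running it, so the sketch is safe anywhere.
    print(shlex.join(cmd))
```

The sketch only prints the command; on a real cluster you would pass the list to `subprocess.run` once the job writing to the staging directory has finished.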
There have been several efforts to use HDFS as the metadata store and the cloud as the data store, but that is a large piece of work (see HDFS-9806) and is probably in the Hadoop 4/CDH 7 timeframe.
Hope this helps.
Created ‎10-16-2016 08:10 AM
Also keep an eye out for our upcoming 5.9 release, which brings several further cloud environment enhancements (including better S3 support).
Created ‎10-16-2016 09:23 AM
I would like to ingest all my data into S3 and make it the primary storage layer (not a backup). It would be a cloud-based environment, e.g. deployed with Cloudera Director. Is it possible to specify the storage type during deployment?
I would like to run YARN, Spark, and Oozie jobs.
