Created on 10-16-2016 03:05 AM - last edited on 10-16-2016 05:16 AM by cjervis
Hi Guys,
I am looking for information about using an S3 bucket as primary storage in place of HDFS. Has anyone done something like that? What are the pros and cons of such a solution?
Thanks,
Andrzej
Created 10-16-2016 08:10 AM
Created 10-16-2016 09:23 AM
I would like to ingest all my data into S3 and make it the primary storage layer (not a backup). It would be a cloud-based environment, e.g. deployed with Cloudera Director. Is it possible to specify the storage type during deployment?
I would like to run YARN, Spark, and Oozie jobs.
Created 10-16-2016 10:13 AM
Hi,
I don't think that's possible, given that most applications rely on HDFS semantics (strong consistency, POSIX-like file system behavior), and S3 simply isn't designed as a file system (it is an eventually consistent blob store). Plus, you lose data locality.
As far as I know, most cloud use cases still use HDFS as temporary, intermediate storage and S3 as the permanent, final storage.
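To make that pattern concrete, here is a rough sketch (not a tested deployment recipe) of the final step: once a job has written its output to HDFS, copy it to S3 with `hadoop distcp` over the `s3a://` connector. The bucket name, paths, and the helper `distcp_to_s3` are made up for illustration, and the sketch assumes the `hadoop` CLI and S3 credentials are already configured on the node. It only builds and prints the command rather than running it.

```python
import shlex


def distcp_to_s3(hdfs_path, bucket, key_prefix):
    """Build a `hadoop distcp` command that copies an HDFS path to S3 via s3a.

    Hypothetical helper: the bucket and prefix are placeholders, not real names.
    """
    # Normalize the prefix so the destination never contains doubled slashes.
    dest = "s3a://{}/{}".format(bucket, key_prefix.strip("/"))
    return ["hadoop", "distcp", hdfs_path, dest]


# Example: job output lives on HDFS (intermediate), S3 is the permanent copy.
cmd = distcp_to_s3("hdfs:///tmp/job-output", "example-bucket", "warehouse/job-output")
print(shlex.join(cmd))
```

On a cluster you would hand `cmd` to something like `subprocess.run(cmd, check=True)` as the last step of the workflow (for example from an Oozie shell action), so HDFS only ever holds the intermediate data.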
There have been several proposals to use HDFS as the metadata store and the cloud as the data store, but that is a huge amount of work (see HDFS-9806) and is probably in the Hadoop 4/CDH 7 timeframe.
Hope this helps.