
Prerequisite:

  • Create an AWS account with access to S3 and obtain the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY

AWS Command Line:

  • For the AWS command line to work, configure AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in ~/.aws/credentials. Something like:
    • [default]
    • aws_access_key_id=$AWS_ACCESS_KEY_ID
    • aws_secret_access_key=$AWS_SECRET_ACCESS_KEY
  • You might also want to set the region and output in ~/.aws/config. Something like:
    • [default]
    • region=us-west-2
    • output=json
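
The two files above can also be written from a script. This is a minimal sketch, not the only way to do it: AWS_DIR is a hypothetical variable introduced for this example, and it defaults to a scratch directory so that existing credentials are not overwritten. The placeholder key values are assumptions and must be replaced with real ones.

```shell
#!/bin/sh
# Sketch: write the AWS CLI credentials and config files described above.
# AWS_DIR is a hypothetical override for this example; the AWS CLI itself
# reads ~/.aws. It defaults to a scratch directory to avoid clobbering
# an existing setup.
AWS_DIR="${AWS_DIR:-$(mktemp -d)}"

# Placeholder values (assumptions); export the real keys before running.
AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID:-YOUR_ACCESS_KEY_ID}"
AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY:-YOUR_SECRET_ACCESS_KEY}"

mkdir -p "$AWS_DIR"

# Credentials for the default profile.
cat > "$AWS_DIR/credentials" <<EOF
[default]
aws_access_key_id=$AWS_ACCESS_KEY_ID
aws_secret_access_key=$AWS_SECRET_ACCESS_KEY
EOF
# The file holds a secret, so restrict it to the owner.
chmod 600 "$AWS_DIR/credentials"

# Region and output format for the default profile.
cat > "$AWS_DIR/config" <<EOF
[default]
region=us-west-2
output=json
EOF

echo "Wrote $AWS_DIR/credentials and $AWS_DIR/config"
```

To write the real files, run the script with AWS_DIR="$HOME/.aws" after backing up anything already there.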

Steps:

  • Create a bucket in S3. You can create it online in the Amazon Console ( CreatingABucket.html ) or from the command line, like: aws s3 mb s3://$BUCKET_NAME
  • Modify the below properties in core-site.xml:
    • fs.defaultFS to s3a://$BUCKET_NAME
    • fs.s3a.access.key to $AWS_ACCESS_KEY_ID
    • fs.s3a.secret.key to $AWS_SECRET_ACCESS_KEY
    • fs.AbstractFileSystem.s3a.impl to org.apache.hadoop.fs.s3a.S3A (HADOOP-11262)
  • You might also want to set the below properties if you need to run some example jobs (tez.staging-dir goes in tez-site.xml; hive.exec.scratchdir is a Hive property, normally set in hive-site.xml):
    • tez.staging-dir to hdfs://$NN_HOST:8020/tmp/$user_name/staging (TEZ-3276)
    • hive.exec.scratchdir to hdfs://$NN_HOST:8020/tmp/hive (For running Hive on Tez)
  • Restart HDFS, YARN, and MAPREDUCE2
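
The property changes in the steps above can be sketched as Hadoop-style <property> entries. The script below only generates fragments in a scratch directory for review; merging them into the live configuration files is left to you. OUT_DIR, USER_NAME, the fragment file names, and all default values are assumptions made for this example. Note that the AbstractFileSystem property name ends in .impl.

```shell
#!/bin/sh
# Sketch: generate <property> fragments for the settings listed above.
# All defaults below are placeholders (assumptions), not real values.
BUCKET_NAME="${BUCKET_NAME:-my-example-bucket}"
AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID:-YOUR_ACCESS_KEY_ID}"
AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY:-YOUR_SECRET_ACCESS_KEY}"
NN_HOST="${NN_HOST:-namenode.example.com}"
USER_NAME="${USER_NAME:-exampleuser}"
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"   # hypothetical output location

# Print one <property> element. Values are not XML-escaped, so this
# assumes they contain no '&' or '<' characters.
emit_property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Entries for core-site.xml.
{
  emit_property fs.defaultFS "s3a://$BUCKET_NAME"
  emit_property fs.s3a.access.key "$AWS_ACCESS_KEY_ID"
  emit_property fs.s3a.secret.key "$AWS_SECRET_ACCESS_KEY"
  emit_property fs.AbstractFileSystem.s3a.impl org.apache.hadoop.fs.s3a.S3A
} > "$OUT_DIR/core-site-fragment.xml"

# Entry for tez-site.xml: keep the Tez staging directory on HDFS.
emit_property tez.staging-dir "hdfs://$NN_HOST:8020/tmp/$USER_NAME/staging" \
  > "$OUT_DIR/tez-site-fragment.xml"

# Hive scratch directory on HDFS, for running Hive on Tez. Written to its
# own fragment because Hive normally reads this property from hive-site.xml.
emit_property hive.exec.scratchdir "hdfs://$NN_HOST:8020/tmp/hive" \
  > "$OUT_DIR/hive-fragment.xml"

echo "Fragments written to $OUT_DIR"
```

Each fragment belongs inside the <configuration> element of its target file. Because fs.s3a.secret.key is stored in plain text, restrict read access to any file that contains it.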

You should now be able to use S3 as the Default FileSystem.

Comments

Hi @Namit Maheshwari, setting fs.defaultFS permanently to s3a is not recommended.

Dominika: I need to add that S3 is not a real filesystem. You cannot safely use AWS S3 as a replacement for HDFS without a metadata consistency layer, and even then the eventual consistency of S3 updates and deletes causes problems.

You can safely use it as a source of data. Using it as a direct destination of work takes care: consult the documentation specific to the version of Hadoop you are using before trying to make S3 the default filesystem.
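
One way to treat S3 purely as a data source is to leave fs.defaultFS on HDFS and address the bucket by its full s3a:// URI. The sketch below is illustrative only: the source and destination paths are hypothetical, the run helper is introduced for this example, and it prints the commands by default rather than executing them.

```shell
#!/bin/sh
# Sketch: read from S3 by full URI while HDFS stays the default filesystem.
# Bucket, host, and paths are placeholders (assumptions).
BUCKET_NAME="${BUCKET_NAME:-my-example-bucket}"
NN_HOST="${NN_HOST:-namenode.example.com}"
SRC="s3a://$BUCKET_NAME/input"           # hypothetical source prefix
DST="hdfs://$NN_HOST:8020/data/input"    # hypothetical HDFS destination

# Dry run by default: print each command; set RUN=1 to execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "would run: $*"
  fi
}

# List the source data in S3.
run hadoop fs -ls "$SRC"

# Copy it into HDFS so that jobs write their output to HDFS, not S3.
run hadoop distcp "$SRC" "$DST"
```

Executing the commands with RUN=1 assumes the s3a credentials are already configured for the cluster.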

Special case: third-party object stores with full consistency. There, the fact that directory renames are not atomic may still cause problems with commit algorithms and the like, but the risk of corrupt data in the absence of failures is gone.

Version history
Revision #:
1 of 1
Last update:
‎03-07-2017 02:22 AM