
Prerequisite:

  • Create an AWS account and obtain an AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for it

AWS Command Line:

  • For the AWS command line to work, configure AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in ~/.aws/credentials. Something like:
    • [default]
    • aws_access_key_id=$AWS_ACCESS_KEY_ID
    • aws_secret_access_key=$AWS_SECRET_ACCESS_KEY
  • You might also want to set the region and output in ~/.aws/config. Something like:
    • [default]
    • region=us-west-2
    • output=json
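
The two files described above could be generated with a short script. This is only a sketch: `write_aws_config` is a hypothetical helper name (it is not part of the AWS CLI), and it assumes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are already set in the environment. The region and output values are the ones from the example above.

```shell
#!/bin/sh
# Sketch: write the two AWS CLI files described above into a target directory.
# write_aws_config is a hypothetical helper, not an AWS CLI command.
write_aws_config() {
    aws_dir="$1"    # normally "$HOME/.aws"
    mkdir -p "$aws_dir"

    # Credentials file: the [default] profile holds the key pair.
    cat > "$aws_dir/credentials" <<EOF
[default]
aws_access_key_id=$AWS_ACCESS_KEY_ID
aws_secret_access_key=$AWS_SECRET_ACCESS_KEY
EOF
    # Keep the secret key readable only by the file owner.
    chmod 600 "$aws_dir/credentials"

    # Config file: region and output format from the example above.
    cat > "$aws_dir/config" <<EOF
[default]
region=us-west-2
output=json
EOF
}
```

Calling `write_aws_config "$HOME/.aws"` would overwrite any existing credentials and config files in that directory, so back them up first if they already exist.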

Steps:

  • Create a bucket in S3. You can create it in the Amazon S3 console (see CreatingABucket.html in the AWS documentation) or from the command line: aws s3 mb s3://$BUCKET_NAME
  • Modify the below properties in core-site.xml:
    • fs.defaultFS to s3a://$BUCKET_NAME
    • fs.s3a.access.key to $AWS_ACCESS_KEY_ID
    • fs.s3a.secret.key to $AWS_SECRET_ACCESS_KEY
    • fs.AbstractFileSystem.s3a.impl to org.apache.hadoop.fs.s3a.S3A (HADOOP-11262)
  • You might also want to set the properties below if you need to run some example jobs:
    • tez.staging-dir (in tez-site.xml) to hdfs://$NN_HOST:8020/tmp/$user_name/staging (TEZ-3276)
    • hive.exec.scratchdir (in hive-site.xml) to hdfs://$NN_HOST:8020/tmp/hive (for running Hive on Tez)
  • Restart HDFS, YARN, and MapReduce2
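
For reference, the core-site.xml changes listed above correspond to four `<property>` entries. The sketch below prints them so they can be pasted inside the `<configuration>` element, or compared against what your cluster manager wrote. `emit_property` and `emit_s3a_properties` are hypothetical helper names. The script assumes BUCKET_NAME, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set, and it spells the last property with its full `.impl` suffix.

```shell
#!/bin/sh
# Sketch: print the core-site.xml <property> entries for the S3A settings above.
# emit_property and emit_s3a_properties are hypothetical helper names.
emit_property() {
    # $1 = property name, $2 = property value
    # Assumes the value contains no XML special characters such as & or <.
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

emit_s3a_properties() {
    emit_property fs.defaultFS "s3a://$BUCKET_NAME"
    emit_property fs.s3a.access.key "$AWS_ACCESS_KEY_ID"
    emit_property fs.s3a.secret.key "$AWS_SECRET_ACCESS_KEY"
    emit_property fs.AbstractFileSystem.s3a.impl org.apache.hadoop.fs.s3a.S3A
}
```

Running `emit_s3a_properties` only writes to standard output and changes nothing on disk. The printed text includes the secret key, so treat it with the same care as the credentials file.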

You should now be able to use S3 as the default filesystem.

Comments

Hi @Namit Maheshwari, setting fs.defaultFS permanently to s3a is not recommended.


Dominika: I need to add: S3 is not a real filesystem. You cannot safely use AWS S3 as a replacement for HDFS without a metadata consistency layer, and even then the eventual consistency of S3 updates and deletes causes problems.

You can safely use it as a source of data. Using it as a direct destination of work takes care: consult the documentation specific to the version of Hadoop you are using before trying to make S3 the default filesystem.

Special case: third-party object stores with full consistency. The fact that directory renames are not atomic may still cause problems with commit algorithms and the like, but the risk of corrupt data in the absence of failures is gone.