
You will often need to use data that doesn't live in your Environment's Data Lake. This article covers the three scenarios in which you need to access data outside of your Data Lake:

  1. A bucket in the same account as your CDP Environment
  2. A public bucket
  3. A bucket in a different account (than your CDP Environment)

Since the steps for scenarios 1 and 2 are the same, I'll cover them together.

Because IDBroker maps CDP users to explicit IAM roles, you need to perform the following steps to access even public buckets (unless you intend to interact with them via URL).

Public Bucket / Bucket in the Same Account

First, identify the public bucket in question. For the purposes of this article, I will use the "commoncrawl" public AWS bucket (https://registry.opendata.aws/commoncrawl/).

You'll need the bucket ARN (in this case "arn:aws:s3:::commoncrawl"), the ability to create IAM policies and roles, as well as the EnvironmentAdmin role in CDP.

Next, we'll create a policy for accessing this bucket. In our example, I have created a policy with this definition:

 

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateJob",
                "s3:GetAccountPublicAccessBlock",
                "s3:HeadBucket",
                "s3:ListJobs"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowListingOfDataLakeFolder",
            "Effect": "Allow",
            "Action": [
                "s3:GetAccelerateConfiguration",
                "s3:GetAnalyticsConfiguration",
                "s3:GetBucketAcl",
                "s3:GetBucketCORS",
                "s3:GetBucketLocation",
                "s3:GetBucketLogging",
                "s3:GetBucketNotification",
                "s3:GetBucketPolicy",
                "s3:GetBucketPolicyStatus",
                "s3:GetBucketPublicAccessBlock",
                "s3:GetBucketRequestPayment",
                "s3:GetBucketTagging",
                "s3:GetBucketVersioning",
                "s3:GetBucketWebsite",
                "s3:GetEncryptionConfiguration",
                "s3:GetInventoryConfiguration",
                "s3:GetLifecycleConfiguration",
                "s3:GetMetricsConfiguration",
                "s3:GetObject",
                "s3:GetObjectAcl",
                "s3:GetObjectTagging",
                "s3:GetObjectVersion",
                "s3:GetObjectVersionAcl",
                "s3:GetObjectVersionTagging",
                "s3:GetReplicationConfiguration",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::commoncrawl",
                "arn:aws:s3:::commoncrawl/*"
            ]
        }
    ]
}
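
If you need the same kind of policy for several buckets, the two "Resource" ARNs can be generated rather than typed by hand. The following is a minimal Python sketch and not part of the original steps: the function name build_bucket_policy and the short action list are illustrative assumptions, so substitute whichever actions your use case requires. It only assembles JSON and does not call AWS.

```python
import json


def build_bucket_policy(bucket_name, actions):
    """Return an IAM policy dict granting `actions` on one bucket.

    Hypothetical helper for illustration: it only builds the JSON
    document and makes no AWS API calls.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # De-duplicate and sort so the output is stable.
                "Action": sorted(set(actions)),
                # Bucket-level actions (e.g. s3:ListBucket) match the bucket ARN;
                # object-level actions (e.g. s3:GetObject) match the "/*" ARN.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


# Example action list (an assumption, not the full list used above).
policy = build_bucket_policy(
    "commoncrawl",
    ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
)
print(json.dumps(policy, indent=4))
```

The printed document can be pasted into the IAM console's JSON policy editor.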

 

And attached it to a corresponding role:

Screen Shot 2020-08-28 at 2.34.51 PM.png

A few notes about this role:

  1. It's a role much like your "Datalake Admin Role", i.e. its trusted entity type is "Another AWS Account" where the account number is your account number.
  2. I originally attached the DynamoDB policy here as well, since my goal was to interact with this bucket via the hdfs CLI (which involved the use of S3Guard). This is no longer needed, because S3Guard is no longer needed.
  3. Here is the trust relationship (the same as your "Datalake Admin Role"):

 

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<Your Acct Number>:role/<Your ID Broker Role>"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
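
The trust relationship above differs between environments only in the account number and the IDBroker role name, so it can be produced from those two values. This is a small Python sketch under that assumption; build_trust_policy, the account number 123456789012, and the role name my-idbroker-role are placeholders introduced here for illustration.

```python
import json


def build_trust_policy(account_id, idbroker_role_name):
    """Return a trust policy letting the IDBroker role assume the new role.

    Hypothetical helper: it only formats the JSON shown above.
    """
    principal = f"arn:aws:iam::{account_id}:role/{idbroker_role_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Placeholder values: substitute your own account number and role name.
trust_policy = build_trust_policy("123456789012", "my-idbroker-role")
print(json.dumps(trust_policy, indent=2))
```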

 

Now, we can head back to CDP and update our IDBroker Mappings.

From your environment overview, click "Actions" and "Manage Access".

Screen Shot 2020-08-28 at 2.41.02 PM.png

Click on "ID Broker Mappings" and then click "Edit".

Screen Shot 2020-08-28 at 2.42.21 PM.png

From here, you can add the user (or group) you'd like to give access to the bucket in question as well as the role we just created:

Screen Shot 2020-08-28 at 2.44.28 PM.png

Click "Save & Sync" and you're good to go!

You can now interact with this bucket via the hdfs CLI. Here's an example from my Data Lake "master" node:

Screen Shot 2020-08-28 at 12.33.51 PM.png

Bucket in a Different Account

But what about a bucket that's in a different account, you say? The process is very similar...

(Thanks to Nathan Anthony (@na) for the inspiration here.)

Again, first, identify the bucket. I'll be using a bucket called "perro-ext" in an account that is different than the account that hosts my CDP environment. I will refer to the accounts as "CDP Account" and "External Account".

Next, create a policy (in the CDP Account) for accessing the bucket in the External Account:

 

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::perro-ext",
                "arn:aws:s3:::perro-ext/*"
            ]
        }
    ]
}

 

And attach it to a role in the same way as above:

Screen Shot 2020-08-28 at 2.54.26 PM.png

With the exact same trust policy as above:

 

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<CDP Account Number>:role/<Your ID Broker Role>"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

 

Here's where we briefly diverge from the first scenario. We need to edit the Bucket Policy on the bucket in the External Account.

In the External Account, head to the bucket and click on "Permissions" and then "Bucket Policy".

Screen Shot 2020-08-28 at 2.59.12 PM.png

Here's the policy I used:

 

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Your CDP Account>:role/<Your CDP Account Role>"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::perro-ext/*",
                "arn:aws:s3:::perro-ext"
            ]
        }
    ]
}
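
For cross-account access, AWS requires that both the IAM policy in the CDP Account and the bucket policy in the External Account allow an action. One quick way to compare the two documents is sketched below in Python. The helper names are hypothetical, and the comparison is a plain string match that assumes neither policy uses wildcards, "NotAction", conditions, or "Deny" statements.

```python
def allowed_actions(policy):
    """Collect the Action strings from every Allow statement in a policy dict."""
    actions = set()
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        action = statement.get("Action", [])
        # "Action" may be a single string or a list of strings.
        actions.update([action] if isinstance(action, str) else action)
    return actions


def missing_from(required_policy, granting_policy):
    """Actions allowed by `required_policy` but not by `granting_policy`."""
    return allowed_actions(required_policy) - allowed_actions(granting_policy)


# Action lists copied from the two policies in this section.
iam_policy = {"Statement": [{"Effect": "Allow", "Action": [
    "s3:GetObject", "s3:PutObject", "s3:PutObjectAcl", "s3:ListBucket",
    "s3:ListBucketMultipartUploads", "s3:ListMultipartUploadParts",
]}]}
bucket_policy = {"Statement": [{"Effect": "Allow", "Action": [
    "s3:GetObject", "s3:PutObject", "s3:PutObjectAcl", "s3:ListBucket",
    "s3:ListBucketMultipartUploads", "s3:ListMultipartUploadParts",
    "s3:GetBucketLocation",
]}]}

# Actions the IAM policy grants that the bucket policy does not.
print(sorted(missing_from(iam_policy, bucket_policy)))
# Actions the bucket policy grants that the IAM policy does not.
print(sorted(missing_from(bucket_policy, iam_policy)))
```

With the two policies from this section, the first comparison comes back empty and the second reports s3:GetBucketLocation, which the bucket policy allows but the IAM policy does not. If your tooling turns out to need that action, it would have to be added to the IAM policy as well.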

 

(The steps are loosely based on the first resolution method here).

Now we're back to the same steps as the method above. Head back to the ID Broker mappings tab in Manage Access under your Environment and add the role to the user or group of your choice:

Screen Shot 2020-08-28 at 3.04.47 PM.png

Click "Save and Sync", and you can now access the bucket in the External Account from the hdfs CLI:

Screen Shot 2020-08-28 at 12.58.15 PM.png

But What About RAZ?

If you have RAZ enabled in your environment, the steps are largely the same, but you must also attach a policy that grants access to the bucket in question to the RAZ role. The Bucket Policy (if reading from or writing to a bucket in a different account) is the same.

And don't forget to update your cm_s3 Ranger policies!

DISCLAIMER: This article is contributed by an external user. The steps may not be verified by Cloudera and may not be applicable for all use cases and may be very specific to a particular distribution. Please follow with caution and at your own risk. If needed, raise a support case to get confirmation.
