Cloudera Data Platform (CDP) Public Cloud recently introduced the ability to back up a Data Lake and restore it from a saved location. Specifically, the backup operation saves a full snapshot of the data from all SDX services:

  • Atlas:
    • Audit events, saved in HBase tables
    • Lineage data, saved as Janus graph data in HBase tables
    • Edge, vertex, and full text indexes, saved in Solr collections
  • Ranger:
    • Audit logs, saved as a Solr collection
    • Permissions and tags, saved in RDBMS tables
  • HMS Metadata, saved in RDBMS tables

 

In this article, I will detail how to run a backup and a restore in CDP Public Cloud on AWS, using the CDP CLI.

Pre-Requisites

Stop operations that could affect backup

Make sure that no operations affecting HMS are running (e.g. creating a table from CDW or from a Data Hub).

Go to your Datalake Cloudera Manager, and shut down:

  • Atlas
  • Ranger
  • HMS

Make sure you have the proper IAM permissions

Data Lake backup uses both the Ranger Audit role and the Data Lake Admin role to write the backups (see the CDP documentation for more details on these roles).

Therefore, the policies attached to both IAM roles must grant write permission to the location of your backup.

Here is an example of a policy attached to the Ranger Audit Role:

 

 

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "FullObjectAccessUnderAuditDir",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::bckp-cdp-bucket/ranger/audit/*"
        },
        {
            "Sid": "FullObjectAccessUnderBackupDir",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::bckp-cdp-bucket/backups/*"
        },
        {
            "Sid": "LimitedAccessToDataLakeBucket",
            "Effect": "Allow",
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads"
            ],
            "Resource": "arn:aws:s3:::bckp-cdp-bucket"
        }
    ]
}
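As a quick sanity check, the backup location passed to the CLI can be translated into the ARN pattern that the policy's "Resource" entry needs to cover. The following is a minimal sketch; the function name backup_location_to_arn is made up for illustration and is not part of any Cloudera or AWS tool.

```shell
#!/bin/sh
# Hypothetical helper: convert an s3a:// (or s3://) backup location into the
# S3 ARN pattern that covers every object under that location.
backup_location_to_arn() {
    location=$1
    # Strip the URI scheme, leaving "bucket/prefix/".
    path=${location#*://}
    # Drop a trailing slash, if any, so "/*" can be appended uniformly.
    path=${path%/}
    printf 'arn:aws:s3:::%s/*\n' "$path"
}

backup_location_to_arn "s3a://bckp-cdp-bucket/backups/"
# prints: arn:aws:s3:::bckp-cdp-bucket/backups/*
```

The printed value should match one of the object-level "Resource" entries in the policy above (here, the FullObjectAccessUnderBackupDir statement).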

 

Install and configure CDP CLI

This is fairly straightforward, and is documented in your Management Console, under Help > Download CLI.

Step 1: Running the backup

Initiate backup

 

$ cdp datalake backup-datalake --datalake-name bckp-cdp-dl --backup-location s3a://bckp-cdp-bucket/backups/
{
    "accountId": "558bc1d2-8867-4357-8524-311d51259233",
    "backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
    "internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=IN_PROGRESS, DATABASE=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS}",
    "status": "IN_PROGRESS",
    "startTime": "2020-10-20 21:11:27.821",
    "endTime": "",
    "backupLocation": "s3a://bckp-cdp-bucket/backups/",
    "failureReason": "null"
}
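The backupId in this response is needed later for the restore, so it can be handy to capture it in a script. Below is a rough sketch that pulls a single string field out of the flat, one-field-per-line JSON shown above; json_field is a hypothetical helper name, and a JSON-aware tool such as jq would be more robust if it is available.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a top-level string field.
# $1 = field name; the JSON document is read from stdin.
# Assumes one field per line, as in the CLI output shown in this article.
json_field() {
    sed -n 's/^[[:space:]]*"'"$1"'"[[:space:]]*:[[:space:]]*"\(.*\)",\{0,1\}[[:space:]]*$/\1/p'
}

# Demonstration on an excerpt of the response above.
sample='{
    "backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
    "status": "IN_PROGRESS"
}'
printf '%s\n' "$sample" | json_field backupId

# Possible usage against the real CLI (requires a configured CDP CLI):
#   backup_id=$(cdp datalake backup-datalake --datalake-name bckp-cdp-dl \
#       --backup-location s3a://bckp-cdp-bucket/backups/ | json_field backupId)
```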

 

Monitor backup

 

$ cdp datalake backup-datalake-status --datalake-name bckp-cdp-dl
{
    "accountId": "558bc1d2-8867-4357-8524-311d51259233",
    "backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
    "userCrn": "crn:altus:iam:us-west-1:558bc1d2-8867-4357-8524-311d51259233:user:86c4e7d9-1560-4afa-ac14-794bdeec0896",
    "internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=IN_PROGRESS, DATABASE=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS}",
    "status": "IN_PROGRESS",
    "startTime": "2020-10-20 21:11:27.821",
    "endTime": "",
    "backupLocation": "s3a://bckp-cdp-bucket/backups/",
    "backupName": "",
    "failureReason": "null"
}
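Rather than re-running the status command by hand, a small polling loop can wait for the operation to finish. This is only a sketch: wait_until_done is a made-up name, and it assumes that the top-level "status" field eventually reads SUCCESSFUL on success (the article only shows that value for individual components), treating any other non-IN_PROGRESS value as a failure.

```shell
#!/bin/sh
# Hypothetical helper: poll a status command until it stops reporting
# IN_PROGRESS.
# $1 = seconds to sleep between polls; remaining arguments = command to run.
# Returns 0 if the final status is SUCCESSFUL (assumed success value),
# and 1 for any other final status.
wait_until_done() {
    interval=$1
    shift
    while :; do
        # Extract the top-level "status" field from the command's JSON output.
        state=$("$@" | sed -n 's/^[[:space:]]*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
        echo "status: ${state:-unknown}"
        case $state in
            IN_PROGRESS) sleep "$interval" ;;
            SUCCESSFUL)  return 0 ;;
            *)           return 1 ;;
        esac
    done
}

# Possible usage (requires a configured CDP CLI):
#   wait_until_done 30 cdp datalake backup-datalake-status --datalake-name bckp-cdp-dl
```

Because the status command is passed in as arguments, the same function can be pointed at the restore status command as well.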

 

Step 2: Restoring the backup

Initiate restore

 

$ cdp datalake restore-datalake --datalake-name bckp-cdp-dl --backup-id 6c59a259-51ac-4db4-80d6-22f71f84cc4e
{
    "accountId": "558bc1d2-8867-4357-8524-311d51259233",
    "restoreId": "06c0bde4-cfc7-4b9e-a8e0-d9f2ddfcb5c5",
    "backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
    "internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, DATABASE=IN_PROGRESS, EDGE_INDEX_COLLECTION_DELETE=IN_PROGRESS, RANGER_AUDITS_COLLECTION_DELETE=IN_PROGRESS, VERTEX_INDEX_COLLECITON_DELETE=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION_DELETE=IN_PROGRESS}",
    "status": "IN_PROGRESS",
    "startTime": "2020-10-20 21:15:11.757",
    "endTime": "",
    "backupLocation": "s3a://bckp-cdp-bucket/backups/",
    "failureReason": "null"
}

 

Monitor restore

 

$ cdp datalake restore-datalake-status --datalake-name bckp-cdp-dl
{
    "accountId": "558bc1d2-8867-4357-8524-311d51259233",
    "restoreId": "06c0bde4-cfc7-4b9e-a8e0-d9f2ddfcb5c5",
    "backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
    "userCrn": "crn:altus:iam:us-west-1:558bc1d2-8867-4357-8524-311d51259233:user:86c4e7d9-1560-4afa-ac14-794bdeec0896",
    "internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=SUCCESSFUL, DATABASE=SUCCESSFUL, FULLTEXT_INDEX_COLLECTION=SUCCESSFUL, EDGE_INDEX_COLLECTION_DELETE=SUCCESSFUL, VERTEX_INDEX_COLLECITON_DELETE=SUCCESSFUL, RANGER_AUDITS_COLLECTION_DELETE=SUCCESSFUL, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION_DELETE=SUCCESSFUL}",
    "status": "IN_PROGRESS",
    "startTime": "2020-10-20 21:15:11.757",
    "endTime": "",
    "backupLocation": "s3a://bckp-cdp-bucket/backups/",
    "failureReason": "null"
}
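The internalState string packs the per-component progress into a single line, which is hard to read at a glance. The sketch below splits it into one COMPONENT=STATE pair per line and lists the components still in progress; both function names are hypothetical, and the parsing assumes the "{A=X, B=Y}" layout shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read the status JSON on stdin and print one
# "COMPONENT=STATE" pair per line from the internalState field.
internal_state_pairs() {
    sed -n 's/^[[:space:]]*"internalState"[[:space:]]*:[[:space:]]*"{\(.*\)}".*/\1/p' |
        tr ',' '\n' |
        sed 's/^[[:space:]]*//'
}

# Hypothetical helper: print only the names of components still running.
pending_components() {
    internal_state_pairs | grep '=IN_PROGRESS$' | cut -d= -f1
}

# Demonstration on a shortened excerpt of the restore status above.
sample='    "internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, DATABASE=SUCCESSFUL, ATLAS_JANUS_TABLE=IN_PROGRESS}",'
printf '%s\n' "$sample" | pending_components

# Possible usage (requires a configured CDP CLI):
#   cdp datalake restore-datalake-status --datalake-name bckp-cdp-dl | pending_components
```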

 

Note: you can also monitor these events in the CDP Control Plane.
