Hive Replication to S3

Hi Community,

In my cluster, I'm using Hive Replication to S3 to back up databases on a daily basis.

I checked the documentation but couldn't find anything on whether a database, table, or file in a user's personal directory is also deleted from S3 when it is dropped from the cluster.

 

From the documentation,

  • If you configure replication of a Hive table and then later drop that table, the table remains on the destination cluster. The table is not dropped when subsequent replications occur.

Can anyone please confirm if the above point is applicable when replicating to S3?

What approach is taken or recommended to keep the cluster and the backup on S3 in sync?

 

Thank you,

Snehasish

Re: Hive Replication to S3

Cloudera Employee
Hi Snehasish,

From your post, I see that you are referring to this passage from the documentation:

"If you configure replication of a Hive table and then later drop that table, the table remains on the destination cluster. The table is not dropped when subsequent replications occur."

Please refer to these links for more information on BDR with S3:
>> Our blog on S3 and BDR
https://blog.cloudera.com/using-amazon-s3-with-cloudera-bdr/
>> Hive Replication to/from Cloud storage
https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_bdr_hive_replication.html#concept...

To answer your query: that depends on how the delete policy of the replication job is set, and also on the Permission attributes set in the Advanced configuration of the replication schedule.

Refer to "Hive Replication to/from Cloud storage" for details on both attributes, and write back to us if you need further clarification.

Regards
V

Re: Hive Replication to S3

Hi @vaishaakb 

Thank you for the reply. 

I understand that dropping a table or database in the cluster is not propagated to the cloud backup.

We have a use case where, each month, our internal customers create some tables and databases, work on them for a few days, and then drop them once they are done. As a result, the S3 bucket holds many abandoned databases, and their number grows day by day.

It would be really helpful if you could advise a way to keep the S3 bucket in sync with Hive.
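To make the goal concrete, here is a minimal sketch of the kind of check being asked for. It is not a BDR feature and it does not call S3 or Hive. It assumes the backup stores each database under a key prefix ending in `<database>.db/`, which may not match the actual layout in the bucket, and the function name and inputs are hypothetical. It only reports which prefixes no longer have a matching database and deletes nothing.

```python
from pathlib import PurePosixPath


def orphaned_db_prefixes(hive_databases, s3_prefixes):
    """Return the S3 prefixes whose database no longer exists in Hive.

    hive_databases: database names currently present in the cluster
                    (for example, the output of SHOW DATABASES).
    s3_prefixes:    key prefixes found in the backup bucket.

    Assumption: each backed-up database lives under a prefix whose last
    path component is "<database>.db". Other prefixes are ignored.
    """
    # Hive database names are case-insensitive, so compare in lowercase.
    live = {name.lower() for name in hive_databases}
    orphans = []
    for prefix in s3_prefixes:
        leaf = PurePosixPath(prefix.rstrip("/")).name
        if not leaf.endswith(".db"):
            continue  # not a database directory under the assumed layout
        db_name = leaf[: -len(".db")].lower()
        if db_name not in live:
            orphans.append(prefix)
    return sorted(orphans)


if __name__ == "__main__":
    current = ["default", "sales"]
    backed_up = ["warehouse/sales.db/", "warehouse/tmp_jan.db/", "warehouse/logs/"]
    print(orphaned_db_prefixes(current, backed_up))
```

The result is only a list of candidates for review, so that anything still needed in the backup can be kept.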

 

Thank you,

Snehasish