Replicating HDFS data to another data center

I have scheduled a replication via Cloudera Manager to replicate one of the HDFS data directories from one datacenter to another. It is working as expected, but we noticed something odd. When one of our users ran a Spark coalesce on that directory to merge hundreds of files into two, the job wrote 2 new files and deleted the original hundreds of files. After the replication job ran, we saw that the 2 new files were replicated, but the hundreds of files deleted on the source had not been deleted in the target datacenter.

Any idea why the files that were deleted from the source directory are not removed from the target directory by the replication job?

Note: I have enabled the delete policy (delete to trash) on the replication schedule.

# Read the partition, then overwrite it in place with the data coalesced into 2 files.
rx_trans_day = spark.read.parquet("/data/mart/cp/elixir/rx_trans/TRANSACTION_DATE_ID=20190102")
rx_trans_day.persist().count()
rx_trans_day.coalesce(2).write.format("parquet").mode("overwrite").save("/data/mart/cp/elixir/rx_trans/TRANSACTION_DATE_ID=20190102")
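
For anyone who wants to reproduce the check, a quick way to compare the file listings on the two clusters is something like the sketch below. The NameNode URIs are placeholders, not our actual addresses:

import subprocess

# Placeholder NameNode URIs; substitute the real source and target addresses.
SOURCE = "hdfs://source-nn:8020/data/mart/cp/elixir/rx_trans/TRANSACTION_DATE_ID=20190102"
TARGET = "hdfs://target-nn:8020/data/mart/cp/elixir/rx_trans/TRANSACTION_DATE_ID=20190102"

def list_file_names(uri):
    # `hdfs dfs -ls` prints a "Found N items" header, then one entry per line;
    # the last whitespace-separated field of each entry is the full path.
    out = subprocess.run(["hdfs", "dfs", "-ls", uri],
                         capture_output=True, text=True, check=True).stdout
    return {line.split()[-1].rsplit("/", 1)[-1]
            for line in out.splitlines() if line.startswith(("-", "d"))}

# Files present on the target but no longer on the source should have been
# removed by the replication job's delete policy.
extra = list_file_names(TARGET) - list_file_names(SOURCE)
print(f"{len(extra)} file(s) exist on the target but not on the source:")
for name in sorted(extra):
    print(" ", name)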
 
Your help is very much appreciated.
 
Regards
~Uppal