Expert Contributor
Posts: 203
Registered: ‎01-25-2017

DistCp replication monitor

Hi,


I'm interested to know if there is an existing jar to validate the copied files between farms, for example one that lists the files for a specific partition (or partitions) in the two farms and reports the files that don't match in terms of count and size.

Posts: 456
Topics: 1
Kudos: 58
Solutions: 37
Registered: ‎08-16-2016

Re: DistCp replication monitor

I don't know of any way other than scripting up something yourself. I have done similar things in the past. I usually just get a file count and total size (hdfs dfs -du -s /path/to/check/* for the size, hdfs dfs -count for the file count).
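As a rough illustration of the count-and-size check described above, here is a minimal Python sketch. It assumes you have already captured one line of `hdfs dfs -count` output per cluster (columns: directory count, file count, content size, path). The names `CountSummary`, `parse_count_line` and `summaries_match` are made up for this example, and the sample paths are hypothetical.

```python
from typing import NamedTuple


class CountSummary(NamedTuple):
    """One line of `hdfs dfs -count` output (hypothetical helper type)."""
    dirs: int
    files: int
    size_bytes: int
    path: str


def parse_count_line(line: str) -> CountSummary:
    # Assumed column order: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
    dirs, files, size, path = line.split(None, 3)
    return CountSummary(int(dirs), int(files), int(size), path.strip())


def summaries_match(src_line: str, dst_line: str) -> bool:
    # Compare only file count and total bytes; the paths differ per cluster.
    src = parse_count_line(src_line)
    dst = parse_count_line(dst_line)
    return (src.files, src.size_bytes) == (dst.files, dst.size_bytes)


# Hypothetical sample lines, not real cluster output.
src = "          12          340            1048576 /warehouse/t/part=1"
dst = "          12          340            1048577 /backup/t/part=1"
print(summaries_match(src, dst))
```

This only tells you that something differs, not which file; a per-file listing is needed to narrow it down.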

It is worth mentioning that DistCp does a file size and checksum check as part of the process, as long as you don't use -skipcrccheck.

"involves file-size and checksum-comparisons "

The skipped files will be in the stdout of the client; the AM or container logs will list any files that it couldn't find (e.g. Hive staging data).

I have not had corruption issues with DistCp to date when used between HDFS and HDFS. It now gets its list of files to be copied in the Map phase, so it can be difficult to validate accurately unless the data is not changing.
Expert Contributor
Posts: 203
Registered: ‎01-25-2017

Re: DistCp replication monitor

I had an issue where I was not skipping the checksum and still saw different file
sizes on the two HDFS clusters.

I have 2.8 million files in the source and I'm using snapshots with
distcp.

My primary cluster is active and new data arrives all the time.
Posts: 456
Topics: 1
Kudos: 58
Solutions: 37
Registered: ‎08-16-2016

Re: DistCp replication monitor

I haven't used distcp against a snapshot, but it should then be a static source to work from. How are you comparing the files now? Is the source side larger or smaller than the destination?
Expert Contributor
Posts: 203
Registered: ‎01-25-2017

Re: DistCp replication monitor

I don't compare them today, but I checked manually a month ago and noticed the difference.


Since I will build a solution concept using DistCp, it becomes important to monitor the replication, as it also runs on different folders.

Posts: 456
Topics: 1
Kudos: 58
Solutions: 37
Registered: ‎08-16-2016

Re: DistCp replication monitor

I didn't ask when or for you to recheck them. I can only help determine if the issue lies elsewhere if I can understand how you are checking the files. Did you just use the hdfs commands du and count? Did you get the source at run time? Did you compare against the snapshot or the original file?
Expert Contributor
Posts: 203
Registered: ‎01-25-2017

Re: DistCp replication monitor

I just used du and count. I haven't tried this since I started using snapshots. I will work on writing a script that does this, but I was wondering if there is a ready-made jar that does it.

Posts: 456
Topics: 1
Kudos: 58
Solutions: 37
Registered: ‎08-16-2016

Re: DistCp replication monitor

There is not; it has to be done manually. Check it based on the snapshot, as that should be static; if there is still a difference, it can then be attributed to distcp.
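If it helps, a per-file comparison against the snapshot could be sketched roughly like this in Python. It assumes you have saved the text output of `hdfs dfs -ls -R` for the same snapshot on each cluster, and that each file line has eight whitespace-separated columns ending in size (5th), date, time and path. The function names and the sample paths below are hypothetical, not part of any existing tool.

```python
from typing import Dict, List


def parse_ls_recursive(listing: str, root: str) -> Dict[str, int]:
    """Map path-relative-to-root -> size in bytes, for files only."""
    files: Dict[str, int] = {}
    for line in listing.splitlines():
        fields = line.split(None, 7)
        # Skip "Found N items" headers, blank lines and directory entries.
        if len(fields) < 8 or fields[0].startswith("d"):
            continue
        size, path = int(fields[4]), fields[7]
        if path.startswith(root):
            path = path[len(root):].lstrip("/")
        files[path] = size
    return files


def diff_listings(src: Dict[str, int], dst: Dict[str, int]) -> Dict[str, List[str]]:
    """Report files missing, extra, or different in size on the destination."""
    return {
        "missing_on_destination": sorted(set(src) - set(dst)),
        "extra_on_destination": sorted(set(dst) - set(src)),
        "size_mismatch": sorted(p for p in src.keys() & dst.keys() if src[p] != dst[p]),
    }


# Hypothetical sample listings, not real cluster output.
SRC = """Found 2 items
-rw-r--r--   3 hdfs supergroup       1024 2017-01-25 10:00 /data/.snapshot/s0/a.txt
drwxr-xr-x   - hdfs supergroup          0 2017-01-25 10:00 /data/.snapshot/s0/sub
-rw-r--r--   3 hdfs supergroup       2048 2017-01-25 10:00 /data/.snapshot/s0/sub/b.txt"""
DST = """-rw-r--r--   3 hdfs supergroup       1024 2017-01-25 11:00 /backup/.snapshot/s0/a.txt
-rw-r--r--   3 hdfs supergroup       4096 2017-01-25 11:00 /backup/.snapshot/s0/sub/b.txt
-rw-r--r--   3 hdfs supergroup         10 2017-01-25 11:00 /backup/.snapshot/s0/c.txt"""

report = diff_listings(
    parse_ls_recursive(SRC, "/data/.snapshot/s0"),
    parse_ls_recursive(DST, "/backup/.snapshot/s0"),
)
for kind, paths in report.items():
    print(kind, paths)
```

Comparing relative paths under the snapshot root on both sides keeps live writes on the source out of the picture, so whatever difference remains is easier to pin on the copy itself.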
Expert Contributor
Posts: 203
Registered: ‎01-25-2017

Re: DistCp replication monitor

@mbigelow since I'm using snapshots and I manage the snapshot cycle, I want to make sure that I don't miss copying any data during snapshot management.


I tried checking snapshot s0 on the source and the destination (which is the state after the distcp finishes and before the next run). They should be equal in size and file count, but the destination always shows a larger size.
