Created 03-15-2018 04:41 PM
Created 03-15-2018 05:21 PM
You can use to validate option with your import and export command to validate number of rows between source and target tables (works only for single table).
If you want to validate every single value,you can compare checksums of the original and sqooped data in that source system.
Created 03-15-2018 05:44 PM
That's a very broad question. The description of data validation can vary anywhere between a row count validation to actual data validation, that is comparing the source data to sqooped data. Let's discuss a couple of them.
1. Count validation
You can use --validate option in your sqoop imports/exports to perform a count validation. It can even handle the error margin between source and destination. For example, there may be 1,000 new records inserted into your RDBMS table between the time your Sqoop ingestion initiated and it finished. You did ingested all the available records at the time when you submitted your job but since the RDBMS can be a possible storage layer for your application, you may have records inserted, updated and deleted 24X7. So your validation between source and destination will not pass as is. Taking this into consideration, the --validate option gives you the option of having a percentage margin(eg. +/- 2%) or absolute record count margin(eg. 1000 records) between source and destination.
2. Data validation
If row count is not the only validation you want to do, you can ingest the data into HDFS and call it the landing zone. This data is not available to the end user for any operations but is there only for your validation. You can use whatever validation requirements you have using some automated scripts and once the validation pass, you can move the data to "consumption layer", where the end customer can fire the queries or do whatever operations that want to on that data. As I said, there can be multiple ways of doing the validation depending on your requirements/use case. You may use checksum validation as mentioned by @rtrivedi or may use the record to record comparison, which of course is not a good idea.
Hope that helps!