Support Questions

While importing data from a database to HDFS using Sqoop, how can the data be validated (e.g., with MapReduce) before it is processed by Hive?


New Contributor
 
2 REPLIES

Re: While importing data from a database to HDFS using Sqoop, how can the data be validated (e.g., with MapReduce) before it is processed by Hive?

Contributor

You can use the --validate option with your import and export commands to validate the number of rows between the source and target tables (this works only for a single-table import).
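A minimal sketch of what such an import could look like; the JDBC URL, credentials, table name, and target directory below are placeholders, not values from this thread:

```shell
# Hypothetical single-table import with row-count validation enabled.
# Replace the connection string, user, table, and target path with your own.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/landing/orders \
  --validate
```

With --validate, Sqoop compares the row count it copied against the source table's row count after the job finishes and fails the job if they do not agree.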

If you want to validate every single value, you can compare checksums of the original data in the source system with the Sqooped data in HDFS.
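A minimal sketch of the checksum idea, using local files as stand-ins: in practice source_export.csv would be a dump from the RDBMS and hdfs_export.csv the concatenated Sqoop part files (e.g. `hdfs dfs -cat /data/landing/orders/part-* > hdfs_export.csv`). The file names here are illustrative.

```shell
# Stand-in data for the two sides of the comparison.
printf '1,alice\n2,bob\n3,carol\n' > source_export.csv
printf '3,carol\n1,alice\n2,bob\n' > hdfs_export.csv

# Sort both dumps first, because Sqoop's part files do not preserve the
# source row order; then compare the digests of the sorted streams.
src_sum=$(sort source_export.csv | md5sum | cut -d' ' -f1)
tgt_sum=$(sort hdfs_export.csv | md5sum | cut -d' ' -f1)

if [ "$src_sum" = "$tgt_sum" ]; then
  echo "checksums match"
else
  echo "checksums differ"
fi
```

Sorting before hashing is the key design choice: it makes the digest a function of the row set rather than of the (arbitrary) row order in the part files.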

Re: While importing data from a database to HDFS using Sqoop, how can the data be validated (e.g., with MapReduce) before it is processed by Hive?

That's a very broad question. "Data validation" can mean anything from a row-count check to actual data validation, i.e., comparing the source data to the Sqooped data. Let's discuss a couple of approaches.

1. Count validation

You can use the --validate option in your Sqoop imports/exports to perform a count validation. It can even handle an error margin between source and destination. For example, 1,000 new records may be inserted into your RDBMS table between the time your Sqoop ingestion started and the time it finished. You did ingest all the records that were available when you submitted the job, but since the RDBMS is a live storage layer for your application, records may be inserted, updated, and deleted 24x7, so an exact source-to-destination comparison will not pass as is. Taking this into consideration, --validate gives you the option of a percentage margin (e.g. +/- 2%) or an absolute record-count margin (e.g. 1,000 records) between source and destination.
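A hedged sketch of how the margin is configured: Sqoop's validation threshold is pluggable via --validation-threshold, and the class shown below is Sqoop's default; a percentage-style margin would be supplied the same way via a threshold class of your own (the connection details here are placeholders):

```shell
# Row-count validation with an explicit threshold class.
# AbsoluteValidationThreshold is Sqoop's built-in default; swap in a
# custom ValidationThreshold implementation for a different margin policy.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/landing/orders \
  --validate \
  --validation-threshold org.apache.sqoop.validation.AbsoluteValidationThreshold
```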

2. Data validation

If a row count is not the only validation you want to do, you can ingest the data into HDFS and call that directory the landing zone. This data is not available to end users for any operations; it is there only for your validation. You can apply whatever validation requirements you have using automated scripts, and once the validation passes, you can move the data to a "consumption layer", where the end customer can fire queries or do whatever operations they want on that data. As I said, there can be multiple ways of doing the validation depending on your requirements/use case. You may use checksum validation as mentioned by @rtrivedi, or record-to-record comparison, which of course is not a good idea at scale.
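The landing-zone flow above can be sketched as follows; the paths, connection details, and the validate_orders.sh script are all illustrative assumptions, not part of the thread:

```shell
# 1. Sqoop into a landing directory that end users cannot query.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/landing/orders

# 2. Run whatever automated validation your use case requires
#    (validate_orders.sh is a hypothetical script of yours).
if ./validate_orders.sh /data/landing/orders; then
  # 3. Promote validated data to the consumption layer Hive queries.
  hdfs dfs -mv /data/landing/orders /data/consumption/orders
else
  echo "validation failed; data held back in landing zone" >&2
  exit 1
fi
```

Keeping the promotion step as a single HDFS move means end users only ever see data that has already passed validation.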

Hope that helps!