Data validation after Sqoop command execution
Created ‎11-17-2016 10:02 AM
I am using Sqoop to transfer data from an RDBMS to HDFS, Hive, and HBase. How can I validate the data in HDFS, Hive, and HBase after the Sqoop command finishes? What would be the best strategy for data validation? Please suggest.
Created ‎11-18-2016 03:16 PM
You should use --validate in your import or export statement (single tables only, not for an entire database import/export). This validates that the row counts are identical between the source and target tables.
If you want to be extra cautious, you could compute a checksum for each side; most likely you would have to reload the data landed in Hadoop back into the source database and compare checksums there.
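For example, a minimal import with validation enabled might look like this (the connection string, credentials, table, and target directory are placeholders; this needs a live database and Hadoop cluster to run):

```shell
# Import a single table and verify that source and target row counts match.
# All names below are hypothetical -- adjust for your environment.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /landing/orders \
  --validate
```

If the row counts differ, the job fails with a validation error instead of silently leaving a partial copy.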
Created ‎11-22-2016 05:49 AM
Thanks for your response. According to the Sqoop documentation, the --validate option does not work with Hive and HBase. What strategy should I use for Hive and HBase? Please suggest.
Created ‎11-23-2016 03:34 PM
True. Here is the complete list of validation limitations: https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#validation
Validation currently only validates data copied from a single table into HDFS. The following are not supported in the current implementation:
- all-tables option
- free-form query option
- Data imported into Hive or HBase
- table import with --where argument
- incremental imports
For Hive
You can Sqoop to HDFS with an external Hive table pointed at the landing directory. You can use the --validate feature here, and if the import passes validation, you can load your Hive managed table from the landing-zone external table using INSERT ... SELECT.
Note that this validation only validates the number of rows.
If you want to validate every single value transferred from the source to Hive, you would have to Sqoop the data back to your source and compare checksums of the original and the round-tripped data in that source system.
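The checksum comparison itself can be sketched like this once both sides have been dumped to flat files (the file names and sample rows below are made up for illustration). Note that a parallel Sqoop import does not preserve row order, so sort both extracts before hashing:

```shell
# Hypothetical extracts: one dumped from the source RDBMS, one
# re-exported from the data landed in Hadoop.
printf '1,alice\n2,bob\n3,carol\n' > source_extract.csv
printf '3,carol\n1,alice\n2,bob\n' > hadoop_extract.csv

# Row order differs after a parallel import, so sort before hashing.
src_sum=$(sort source_extract.csv | md5sum | awk '{print $1}')
tgt_sum=$(sort hadoop_extract.csv | md5sum | awk '{print $1}')

if [ "$src_sum" = "$tgt_sum" ]; then
  echo "checksums match"
else
  echo "checksums differ"
fi
```

Matching hashes mean the two datasets contain identical rows, regardless of the order in which the mappers wrote them.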
For HBase
You can put a Phoenix front end on top of your HBase table and apply the same approach as with Hive above.
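As a rough sketch of that comparison (everything here is hypothetical: the source is assumed to be MySQL, the Phoenix table name and ZooKeeper quorum are placeholders, and the sqlline.py path varies by install), you would compare the source row count against a COUNT(*) run through Phoenix:

```shell
# Count rows in the source RDBMS table (hypothetical MySQL source).
src_count=$(mysql -N -e 'SELECT COUNT(*) FROM sales.orders')

# Count rows through the Phoenix SQL layer over the HBase table.
# sqlline.py takes the ZooKeeper quorum; output needs trimming in practice.
echo "SELECT COUNT(*) FROM ORDERS;" | sqlline.py zk-host:2181

# Compare the two counts manually or in a wrapper script.
```

This only checks row counts; for value-level validation you would still fall back to the checksum round-trip described for Hive.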
