Created 11-17-2016 10:02 AM
I am using sqoop for data transfer from RDBMS to HDFS, hive and HBase. How can I validate the data after sqoop command execution in HDFS, Hive and HBase? Please suggest. What will be the best strategy for data validation ?
Created 11-18-2016 03:16 PM
You should use
--validate
in your import or export statement (single tables only ... not for entire-database import/export).
This validates that the row counts are identical between the source and target tables.
If you want to be extra cautious, you could compute a checksum for each side; most likely you would have to load the data landed in Hadoop back into the source database and compare checksums there.
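A minimal sketch of such a validated single-table import (the connection string, credentials, table, and target directory are all placeholders):

```shell
# Hypothetical single-table import with post-copy row-count validation.
# --validate compares source and target row counts after the transfer
# and fails the job if they differ.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /landing/orders \
  --validate
```

This requires a running Hadoop cluster and a reachable source database, so it is a template rather than something you can run as-is.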
Created 11-22-2016 05:49 AM
Thanks for your response. Per the Sqoop documentation, the --validate option does not work with Hive and HBase. What strategy should I take for Hive and HBase? Please suggest.
Created 11-23-2016 03:34 PM
True. Here is the complete list of validation limitations: https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#validation
Validation currently only validates data copied from a single table into HDFS; the linked documentation lists the remaining limitations of the current implementation.
For Hive
You can sqoop to HDFS with an external Hive table pointed at the target directory. You can use the --validate feature here ... and if the import passes validation, you can load your Hive managed table from the landing-zone external table using INSERT ... SELECT *.
Note that this validation only checks the number of rows.
If you want to validate every single value transferred from the source to Hive, you would have to sqoop the data back to your source and compare checksums of the original and round-tripped data in that source system.
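The landing-zone pattern above could look roughly like this (paths, table names, and the column list are illustrative, and the DDL assumes a comma-delimited text import):

```shell
# 1. Sqoop the table into a landing directory; --validate checks
#    that source and target row counts match.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /landing/orders \
  --fields-terminated-by ',' \
  --validate

# 2. Point an external Hive table at the landing directory, then copy
#    into the managed table only after validation has passed.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS orders_landing (
  order_id INT, customer_id INT, amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/landing/orders';

INSERT INTO TABLE orders SELECT * FROM orders_landing;
"
```

The second step only runs if the sqoop job (and therefore its validation) succeeded, which is the point of staging in a landing zone first.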
HBase
You can put a Phoenix front end in front of your HBase table and apply the same approach as with Hive above.
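One rough way to do the row-count check for HBase, assuming a Phoenix view or table named ORDERS mapped over the HBase table (all hosts and names here are placeholders), is to count on both sides and compare the numbers yourself:

```shell
# Count rows on the source side (placeholder MySQL example) ...
sqoop eval \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --query "SELECT COUNT(*) FROM orders"

# ... and on the HBase side through the Phoenix SQL layer,
# then compare the two counts by hand or in a wrapper script.
echo "SELECT COUNT(*) FROM ORDERS;" | sqlline.py zkhost:2181
```

As with the --validate option itself, a matching row count shows the copy is complete, not that every value is byte-identical.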