Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Data validation after Sqoop command execution

avatar
Expert Contributor

I am using sqoop for data transfer from RDBMS to HDFS, hive and HBase. How can I validate the data after sqoop command execution in HDFS, Hive and HBase? Please suggest. What will be the best strategy for data validation ?

1 ACCEPTED SOLUTION

avatar
Guru

You should use

--validate

in your import or export statement (single tables only ... not for entire db import/export).

This validates that the number of rows are indentical btw source and target tables.

If you want to be super-cautious, you could do a check-sum for each, most likely you would have to reload the data landed in hadoop back to the source db and compare check-sums there.

View solution in original post

3 REPLIES 3

avatar
Guru

You should use

--validate

in your import or export statement (single tables only ... not for entire db import/export).

This validates that the number of rows are indentical btw source and target tables.

If you want to be super-cautious, you could do a check-sum for each, most likely you would have to reload the data landed in hadoop back to the source db and compare check-sums there.

avatar
Expert Contributor
@Greg Keys

Thanks for your response. As per sqoop document --validate option does not work with Hive and HBase. What strategy should I take for Hive and HBase ? Please suggest.

avatar
Guru

True. Here is the complete list of validation limitations. https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#validation

Validation currently only validates data copied from a single table into HDFS. The following are the limitations in the current implementation:

  • all-tables option
  • free-form query option
  • Data imported into Hive or HBase
  • table import with --where argument
  • incremental imports

For Hive

You can sqoop to HDFS with an external Hive table pointed to it. You can use the --validate feature here ... and if it passess validation, you can load to your Hive managed table from the landing zone external table using INSERT Select *.

Note that this validation only validates the number of rows.

If you want to validate every single value transferred from source to Hive, you would have to sqoop back to your source and compare checksums of the original and sqooped data in that source system

HBase

You can put a Phoenix front-end in front of your hbase table and do the same as with hive above.