Created 11-17-2016 10:02 AM
I am using sqoop for data transfer from RDBMS to HDFS, hive and HBase. How can I validate the data after sqoop command execution in HDFS, Hive and HBase? Please suggest. What will be the best strategy for data validation ?
Created 11-18-2016 03:16 PM
You should use
--validate
in your import or export statement (single tables only ... not for entire-database import/export).
This validates that the row counts are identical between the source and target tables.
If you want to be extra cautious, you could compute a checksum for each side; most likely you would have to load the data landed in Hadoop back into the source database and compare checksums there.
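A minimal sketch of such a validated single-table import (the connection string, credentials, table, and target directory are all placeholders):

```shell
# Hypothetical single-table import with post-copy row-count validation.
# --validate compares source and target row counts after the transfer
# and fails the job if they differ.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /landing/orders \
  --validate
```

This requires a running Hadoop cluster and a reachable source database, so it is a template rather than something you can run as-is.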
Created 11-22-2016 05:49 AM
Thanks for your response. Per the Sqoop documentation, the --validate option does not work with Hive and HBase. What strategy should I take for Hive and HBase? Please suggest.
Created 11-23-2016 03:34 PM
True. Here is the complete list of validation limitations: https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#validation
Validation currently only validates data copied from a single table into HDFS; the linked documentation lists the remaining limitations of the current implementation.
For Hive
You can sqoop to HDFS with an external Hive table pointed at the target directory. You can use the --validate feature here ... and if the import passes validation, you can load your Hive managed table from the landing-zone external table using INSERT ... SELECT *.
Note that this validation only checks the number of rows.
If you want to validate every single value transferred from the source to Hive, you would have to sqoop the data back to your source and compare checksums of the original and round-tripped data in that source system.
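The landing-zone pattern above could look roughly like this (paths, table names, and the column list are illustrative, and the DDL assumes a comma-delimited text import):

```shell
# 1. Sqoop the table into a landing directory; --validate checks
#    that source and target row counts match.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /landing/orders \
  --fields-terminated-by ',' \
  --validate

# 2. Point an external Hive table at the landing directory, then copy
#    into the managed table only after validation has passed.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS orders_landing (
  order_id INT, customer_id INT, amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/landing/orders';

INSERT INTO TABLE orders SELECT * FROM orders_landing;
"
```

The second step only runs if the sqoop job (and therefore its validation) succeeded, which is the point of staging in a landing zone first.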
HBase
You can put a Phoenix front end in front of your HBase table and apply the same approach as with Hive above.
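One rough way to do the row-count check for HBase, assuming a Phoenix view or table named ORDERS mapped over the HBase table (all hosts and names here are placeholders), is to count on both sides and compare the numbers yourself:

```shell
# Count rows on the source side (placeholder MySQL example) ...
sqoop eval \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --query "SELECT COUNT(*) FROM orders"

# ... and on the HBase side through the Phoenix SQL layer,
# then compare the two counts by hand or in a wrapper script.
echo "SELECT COUNT(*) FROM ORDERS;" | sqlline.py zkhost:2181
```

As with the --validate option itself, a matching row count shows the copy is complete, not that every value is byte-identical.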