I have a requirement to migrate from a Teradata environment to a Hadoop environment (for the sake of simplicity, assume that the data is stored in HDFS). Apart from the data, there are also stored procedures (PL/SQL-style code) to be migrated to Hadoop.
Is there any recommendation for a data validation strategy post-migration? I may not want to validate the entire data set, as it may be huge and the validation itself would be time-consuming. Is there a better way of doing this by means of sampling, etc.? If so, what would be sampled, and how?
You could write a bunch of Sqoop jobs to import the tables (into Hive) and validate them by writing a Beeline/Hive script to obtain counts. If there are a lot of tables, you can try to batch these statements.
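As a small sketch of how the count check could be batched: generate one HiveQL script covering all tables and feed it to Beeline in a single session. The table names here are made up, and the output format is just one convention you might pick.

```python
# Sketch: build one batched HiveQL script that emits 'table, count' per table,
# instead of invoking beeline once per table. Table names are hypothetical.
tables = ["orders", "customers", "line_items"]

def build_count_script(tables):
    """Return a single HiveQL script with one labelled COUNT(*) per table."""
    stmts = [
        f"SELECT '{t}' AS table_name, COUNT(*) AS row_count FROM {t};"
        for t in tables
    ]
    return "\n".join(stmts)

script = build_count_script(tables)
print(script)
```

The resulting file would be run with something like `beeline -f counts.hql`, and the same counts pulled from Teradata for comparison.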
Another way could be to use an open-source ETL tool to import the data and then add a task to validate the table output between source and target. The transfer rate will be slow, but it will do the job.
This may not scale. I am looking for solutions by means of sampling, or any other strategy that gives at least 95% correctness of validation (in the case of sampling).
It depends on how rigorous you need the result to be, and whether you need to validate the migration of the data or the migration of the code (stored procs), i.e. the data processed after the migration. With a 95% tolerance target, you've got room to move.
Our best-case migration involves a parallel run on both the old and new systems. We then compare record counts by some grouping dimension on both systems; if the record counts match, that gives us a good general level of confidence.
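In outline, the grouped-count comparison amounts to diffing two (group key, count) maps exported from each system. The dicts below stand in for those query results; the group keys and counts are invented.

```python
# Sketch: compare per-group record counts exported from each system.
# Keys/counts are illustrative stand-ins for real query output.
teradata_counts = {"2023-01": 1_000_000, "2023-02": 998_432, "2023-03": 1_204_117}
hadoop_counts   = {"2023-01": 1_000_000, "2023-02": 998_430, "2023-03": 1_204_117}

def diff_counts(a, b):
    """Return groups whose counts differ, including groups present on one side only."""
    mismatches = {}
    for key in set(a) | set(b):
        if a.get(key) != b.get(key):
            mismatches[key] = (a.get(key), b.get(key))
    return mismatches

print(diff_counts(teradata_counts, hadoop_counts))
# -> {'2023-02': (998432, 998430)}
```

Groups missing entirely on one side show up as `(count, None)` or `(None, count)`, which catches dropped partitions as well as miscounts.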
You can also sample record subsets and compare against the other system. A full outer join will provide a count of mismatches on both the new and old system. For larger data volumes, assuming well distributed data, 10K - 20K record samples will give reasonable coverage. Again, depending on the relative capacity of the Teradata and Hadoop systems, you'll probably get the fastest result pulling the samples from the Hadoop system into the Teradata system and performing the outer joins and aggregations there.
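To make the full-outer-join measure concrete, here is a minimal plain-Python illustration over keyed sample rows. In practice this would be SQL on whichever system holds both samples; this just shows the three mismatch categories the join surfaces.

```python
# Sketch of what the full outer join on a keyed sample measures:
# keys present only in the old system, only in the new, or with differing values.
def full_outer_mismatches(old_rows, new_rows):
    """old_rows/new_rows: iterables of (key, value) pairs from each system."""
    old, new = dict(old_rows), dict(new_rows)
    only_old = [k for k in old if k not in new]
    only_new = [k for k in new if k not in old]
    changed  = [k for k in old if k in new and old[k] != new[k]]
    return {"only_old": len(only_old),
            "only_new": len(only_new),
            "changed": len(changed)}
```

Zero in all three buckets on a well-sized sample is the signal you extrapolate confidence from.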
Thanks for the response.
I think I can break the question down into two parts.
1. What and how to sample?
For the comparison after the migration, how do we do effective sampling from both TD and Hadoop so that we are comparing comparable sample data sets? Are you talking about random sampling or something else?
(Random sampling may not even give matching data sets for comparison across two different systems.)
2. What data quality operations to be performed on the two data sets?
I can think of the following,
a. Count of records.
b. Number of nulls on each column
c. Number of empty values on each column
d. Number of unique values on each column
e. Column-level stats like min, max, etc., based on the column/data type
f. Top N commonly occurring values on each column
g. Full outer join may also give more insights.
Since we are talking about samples, there has to be a predefined confidence threshold.
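The per-column checks (a) through (f) above can be sketched as one profiling function. This is a plain-Python outline rather than a Hive/Teradata implementation; it treats `None` as NULL and `""` as an empty value, and the distinct count is over non-NULL values (so an empty string counts as a value).

```python
from collections import Counter

def profile_column(values, top_n=3):
    """Compute the checks listed above for one column: count, nulls, empties,
    distinct values, min/max, and the top-N most common values."""
    non_null = [v for v in values if v is not None]      # b. nulls excluded
    non_empty = [v for v in non_null if v != ""]         # c. empties excluded
    return {
        "count": len(values),                            # a. record count
        "nulls": len(values) - len(non_null),            # b.
        "empties": len(non_null) - len(non_empty),       # c.
        "distinct": len(set(non_null)),                  # d.
        "min": min(non_empty) if non_empty else None,    # e.
        "max": max(non_empty) if non_empty else None,    # e.
        "top_n": Counter(non_empty).most_common(top_n),  # f.
    }
```

Running the same profile on both sides and diffing the resulting dicts gives a column-by-column comparison without shipping row data between systems.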
In regards to sampling, I would perform a random sample on the Hadoop side and upload that sample to Teradata: tens of thousands of records as opposed to millions or billions. Then perform the full outer join on the Teradata side (small data set joined against a large data set). If the primary key of the data set matches the Teradata primary index, you should get reasonable join performance. Depending on capacity, you could also go the other way, sampling from Teradata and comparing against the full Hadoop dataset. There are a number of online tools you can use to generate sample sizes for different confidence levels.
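For sizing the sample, one standard approach (what those online calculators typically implement) is Cochran's formula with a finite-population correction. The parameters below are illustrative: z = 1.96 for 95% confidence, a 1% margin of error, and the conservative p = 0.5.

```python
import math

def sample_size(population, z=1.96, margin=0.01, p=0.5):
    """Cochran's sample-size formula with finite-population correction.
    z=1.96 ~ 95% confidence; margin is the tolerated error; p=0.5 is worst case."""
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)      # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)             # correct for finite N
    return math.ceil(n)

print(sample_size(1_000_000_000))  # even a billion-row table needs only ~9.6K rows
```

This is why the 10K-20K figure holds almost regardless of table size: the required sample barely grows once the population is large.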
The processing capacity required by either system to generate a sample set will depend on how truly random you need the sample to be or if a sampling 'short-cut' technique will work. For example, on one of our systems we have a multi-billion row table that's very well distributed. Selecting the top-n rows from each block or amp gives a useful sample, though not technically a truly statistically random sample.
Assuming you're not concerned with validating field values but only record counts (e.g. you trust the field transforms), once you have the joined sample you then have the following pieces of information (based on key matching):
Depending on how many different datasets or schemas you're dealing with, the key-matching technique is something that can be generated more easily programmatically. You just need to specify the key fields, as opposed to having to enumerate all fields.
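To illustrate "specify key fields only": the comparison query itself can be generated from the key list alone. The `_old`/`_new` table-naming convention here is an assumption made for the sketch.

```python
# Sketch: generate the full-outer-join mismatch query from just the key fields.
# The <table>_old / <table>_new naming is a hypothetical convention.
def build_outer_join_sql(table, key_fields):
    """Return SQL counting rows whose keys fail to match across the two copies."""
    on = " AND ".join(f"old.{k} = new.{k}" for k in key_fields)
    return (
        f"SELECT COUNT(*) FROM {table}_old AS old "
        f"FULL OUTER JOIN {table}_new AS new ON {on} "
        f"WHERE old.{key_fields[0]} IS NULL OR new.{key_fields[0]} IS NULL;"
    )

print(build_outer_join_sql("orders", ["order_id", "line_no"]))
```

With a small metadata file of (table, key fields) pairs, one loop emits the validation SQL for every table, which is what makes this approach scale across many schemas.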
Assuming you included the fields in the subset, not just the keys, you could then use this matched subset to also validate the field transformations. Any of the measures you've listed above could be used to build confidence between the two data sets.
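A sketch of that last step: given pairs of rows already joined on the key, compute a per-column disagreement rate. Column names and rates here are hypothetical; this is an outline of the check, not a specific tool.

```python
# Sketch: validate field transforms on the key-matched subset.
# matched_pairs: list of (old_row, new_row) dicts joined on the key.
def field_mismatch_rates(matched_pairs, columns):
    """Return, per column, the fraction of matched rows whose values disagree."""
    rates = {}
    for col in columns:
        bad = sum(1 for old, new in matched_pairs if old[col] != new[col])
        rates[col] = bad / len(matched_pairs) if matched_pairs else 0.0
    return rates
```

Columns with a nonzero rate point directly at the transforms (migrated stored-procedure logic) that need a closer look, which ties the data validation back to the code validation.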
I'm not very clear on the key-matching technique that you described. Can you please elaborate on what you generate and compare between TD and Hadoop, with examples? And also the matched subset: what if the random samples from TD and Hadoop are far apart?