I have a need to migrate the data from the traditional EDW to Hive. Post migration of the data, i have to validate if the data is migrated successfully or not i.e. running count queries, min, max etc on the tables that are migrated. Ideally, this is required to be run on the entire data set on both the EDW and Hive for comparison. However, there are few factors that may limit this and the data validation queries can be run only on the subset of the data. This means that sampling is required to choose the representative subset. What tables have to be part of the representative subset is one and the other factor is the sample size itself. Now on determining the sample size, the key aspect is the SLA of the queries. So given the SLA, the number of queries, is it possible to arrive at the sample size? I referred the excel sheet at link and arrived at the following formula; This is definitely an approximation only but does this make sense at all?
S = ( ( X * R ) / Q ) * F
S - Sample Size in GB
X - SLA in minutes * 60
R - Disk Scan Rate of the cluster in GB/sec
Q - Number of queries
F - Adjusting Factor that account for the intermediate data reads (MR), computation within the hive query etc.
Hi @learninghuman I think that all makes sense, part of this will of course depend on the type of data you're migrating from your EDW, things like the table stats in ORCfile may also help if you have similar stats for your EDW.
Overall what you propose looks like a sane, sensible approach.