Created on 10-12-2015 10:30 AM - edited 08-17-2019 02:11 PM
Introduction
This wiki page describes the script validationHiveTablesLinuxMode.sh (GitHub: https://github.com/sourygnahtw/hadoopUtils/blob/master/scripts/hive/validationHiveTablesLinuxMode.sh).
The purpose of the script is to compare one Hive table with another, typically to validate a table generated by a Hive transformation against a source-of-truth (sqooped) table.
My current project is a typical Hadoop project: offloading an ETL process currently done in a relational database (SQL Server) to a Hadoop process.
That involves rewriting all the (many) SQL queries in the ETL process to HiveQL queries.
Some criteria of success for such a project are:
So, making sure that the HiveQL queries written in such an ETL offloading project are correct is a key concern. Hence the need for a validation tool.
At the beginning of the current project, no validation tool existed and we were doing all the validation "manually": looking at the data in the SQL Server table and checking that they were the same in the generated Hive tables. Due to the high number of columns in each table and the number of rows (and also the fact that I already wear glasses...), this approach was quite limited.
I found no tool on the internet to do such validation, so the script described here was the first tool to do the desired validation (but not the last: Ryan Merriman, another colleague on the same project, has since developed a better validation tool in Java, although it is a bit more complicated to deploy).
Let's suppose that we want to validate the table "footable" and that this table is in the Hive database "bardatabase".
The process to do the validation is:
Let's show the result of a similar basic execution:
[sourygna@donzelot ~]$ ./validationHiveTablesLinuxMode.sh input_table1
#### Comparing tovalidate.input_table1 with sot.input_table1
ERROR: the number of rows is different from orig table (7868 rows) to result table (7869 rows)
Number of differences: 2 (out of 7868 rows)
To see the differences visually: vimdiff -o -c "windo set wrap foldcolumn=0" /data/sourygna/validation/tovalidate_sot/input_table1/{orig,result}/sorted.gz
The first line on the standard output describes the comparison performed by the script (this helps when you have the script validate many tables). The second line shows an important ERROR message. This message only appears when the number of rows does not match between the two tables, which implies a serious problem (for instance, a JOIN gone wrong). The third line summarizes the number of differences. In our case, we see that less than 0.1% of the rows have errors. Of course, the goal is to have zero errors, but the current number is still far better than, say, 75% of rows in error. The last line only appears when there are errors. It gives the vimdiff command that you must execute to get more insight into the errors.
Let's then execute this command:
vimdiff -o -c "windo set wrap foldcolumn=0" /data/sourygna/validation/tovalidate_sot/input_table1/{orig,result}/sorted.gz
We would get a similar output:
The screen is split into two parts: the top shows the data of the "orig" source-of-truth table, and the bottom shows the data of the Hive-generated table. In each part, each line represents a row of the table. For instance, in the source-of-truth table we can clearly see the first 11 rows. The remaining rows of the table (7857 rows) have been folded by vimdiff because they are exactly identical to the rows in the table to validate.
The columns are separated by "|" (if that character appears in your data, you will have to change the separator character in the script).
Vimdiff helps us to see the differences using the blue, red and pink colours:
What motivates me to use vimdiff is that it shows the differences clearly: not only the differences themselves, but also the whole row where each difference appears, as well as some "good" rows. With such context, it is much easier for me to understand where the error is and then correct my Hive query.
Let's suppose that we now have 3 tables to validate: table1, table2 and table3.
Let's also assume that table2 has 2 special columns:
So obviously, it makes no sense to try to validate those 2 columns. Should we do so, the script would show 100% of the rows in error for table2. What we have to do is exclude those 2 columns from our validation process. To exclude them, on the command line we append to the name of the table the list of those columns, prefixed by the ":" character. The columns in this list must be separated by "," (and if you have the bad luck of having column names that contain spaces, substitute the spaces with ".").
Since we want to validate 3 tables, we will place 3 arguments on the command line:
./validationHiveTablesLinuxMode.sh table1 table2:bar1,bar2 table3
The script will first validate table1, then table2 (without trying to validate the 2 excluded columns) and finally table3.
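As a side note, here is a minimal sketch of how an argument such as "table2:bar1,bar2" can be split in bash into the table name and the list of excluded columns (this is only an illustration of the syntax described above, not the actual code of the script):

arg="table2:bar1,bar2"
tableName=${arg%%:*}          # part before the first ':'  -> table2
excludedColumns=${arg#*:}     # part after the first ':'   -> bar1,bar2
[ "$excludedColumns" = "$arg" ] && excludedColumns=""   # no ':' in the argument: nothing to exclude
echo "table=$tableName excluded=$excludedColumns"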
Here are the requirements I consider important for a validation tool:
We can see that the script I have developed covers most of those aspects:
Apart from the variables origDatabase and resultDatabase, there are other parameters that you might want to change:
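For instance, the configuration area at the top of the script would look roughly like this (a sketch rather than a verbatim excerpt: only origDatabase, resultDatabase and baseDir are names confirmed in this article, the separator variable name is hypothetical, and the values shown are the ones visible in the example above):

origDatabase="sot"                   # Hive database containing the source-of-truth (sqooped) tables
resultDatabase="tovalidate"          # Hive database containing the tables generated by the HiveQL queries
baseDir="/data/sourygna/validation"  # local directory where the validation results are kept
separator="|"                        # (hypothetical name) column separator used in the dumps; change it if "|" appears in your data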
If some columns contain numerical values that are not integers (or big integers), then you might have a problem, because the representation of a float such as "0.2" might not be the same on the Hadoop cluster and in the relational database.
So for those columns you might see a lot of "small rounding differences".
The script does not provide any solution to cope with that problem. What I sometimes do is first check with vimdiff whether all the differences are small (for instance: 0.20001 versus 0.2). If so, I run the validation script a second time on the same table, excluding the columns with those numerical values I have already checked manually.
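For example, if the columns holding those float values in table2 were called amount and rate (hypothetical names), that second run would simply exclude them with the syntax shown earlier:

./validationHiveTablesLinuxMode.sh table2:amount,rate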
The script keeps track of all your validation activities in the baseDir directory.
In that directory, you will find a subdirectory for each pair (source-of-truth database, database to validate).
In each of those subdirectories, you will find:
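To illustrate with the paths of the example above, the layout under baseDir looks roughly like this (only the sorted.gz files are shown because they are the ones referenced by the vimdiff command; the script may keep other files as well):

/data/sourygna/validation/      # baseDir
  tovalidate_sot/               # one directory per couple of databases (here: tovalidate and sot)
    input_table1/               # one directory per validated table
      orig/sorted.gz            # sorted dump of the source-of-truth table
      result/sorted.gz          # sorted dump of the table to validate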
Created on 11-23-2017 05:16 AM
Recently I have developed a new program that does the same (Hive comparisons) in a much more efficient way (totally scalable, better visualization of the differences, skew handling, etc.). Please consider this new approach instead of the old one I published in this article 2 years ago:
https://github.com/bolcom/hive_compared_bq
The code is again open source, so I welcome feedback, suggestions for new features and also contributions 🙂
Created on 07-31-2018 11:08 PM
That's a very neat script. It comes with the cost of copying the data locally first and then doing the comparison. Imagine the data in TBs; even a few hundred GBs can make the whole script go nuts.
Created on 06-28-2019 05:53 AM
Hi,
Thanks for the script, it solves most of my automation problems where I need to compare Hive tables.
There are a few things I am trying to modify, and they are not working for me.
We have a cluster with Hive installed on multiple nodes (a load balancer is enabled for HS2), and we are using beeline (instead of the Hive CLI) to execute the queries that fetch the data locally. As the cluster uses a load balancer, the two queries are executed on two different nodes, so the local data ends up on two nodes and the script cannot get the actual data and fails.
I am not sure how to make it work on only a single node using beeline.
The cluster uses Kerberos and Sentry, and HS2 is enabled for load balancing.