The purpose of the script is to sqoop in parallel many tables and to store them into Hive.
In a previous project, I needed to download around 400 tables (out of 500) of a SQLserver database.
Most of the tables were quite small (less than a few MB), which means that the overhead of fetching the metadata in Sqoop (establishing the connexions, getting the DDL of the table...) is very important compared to the time to do the real job (download the data). Getting the 400 tables could take around 6 hours.
To speed up the downloading process, Sqoop usually "splits" a table in 4 parts, to parallelise the downloading process in 4 streams. In my project, such approach was not working:
Many tables had not clear primary keys or the distribution of those keys was not uniform. Trying to find better split columns for 400 tables was a waste of time...
Trying to split a table in several parts for downloading is sometimes not optimal for the database (how are those rows stored on disk?)
Most of the tables have few data so trying to speed the download of data won't make lot of improvement. We need to speedup the fetching of the metadata.
So the idea of the script is, instead of parallelising the downloading of the data for 1 single table into several stream, to download several tables at the same time. And each table will have only 1 single stream of download.
With that approach, I was able to download the 400 tables in less than 1 hour.
Examples of execution
The easiest way to execute the script is (take care to first configure the SQL driver, user and password. See next chapter):
With fileWithListOfTable being a file that lists all the tables we want to Sqoop.
For instance, if we want to Sqoop the 6 tables table1, table2, table3, table4, table5, table6, then the file must contain only 6 lines, 1 for each table:
The script will launch first 4 Sqoop processes, to download the first 4 tables. When one of those first processes finishes, the script launch another Sqoop process for "table5". So that we will always have 4 active Sqoop processes till there is no more table to sqoop.
You can also use some options to tune the script behaviour. For instance:
In this case, we change the name of the relational and Hive databases. We also change the parallelism to have 6 Sqoop processes working at the same time. And we choose the "etl" Yarn queue instead of the default one.
The default configuration can be encountered at the beginning of the script. You will have to change the default values or override them on the command line.
Here are the variables that can be modified:
origServer=myRelationalDatabase.example.com # The FQDN of the relational database you want to fetch (option: -o)
origDatabase=myDatabase # The names of the database that contains the tables to fetch (option: -d)
hiveDatabase=myHiveDatabase # The name of the Hive database that will get the tables fetched (option: -H)
parallelism=4 # The number of tables (sqoop processes) you want to download at the same time (option: -p)
queue=default # The queue used in Yarn (option: -q)
baseDir=/tmp/sqoopTables # Base directory where will be stored the log files (option: -b)
dirJavaGeneratedCode=/tmp/doSqoopTable-`id -u`-tmp # Directory for the java code generated by Sqoop (option: -c)
targetTmpHdfsDir=/tmp/doSqoopTable-`id -u`-tmp/$$ # Temporary directory in HDFS to store downloaded data before moving it to Hive
This script is focused to SQLserver. Search the "sqoop import" line (in the middle of the script) and change the header of the URL appropriately. Take care also to change in this line the user and password needed to connect to the relation database.
Some few more notes
The script shows on the standard output the name of each table it has started to download, so that you can easily know how much part of the work defined in the "listOfTables" table has been accomplished.
It also stores more information in the logging directory (by default: /tmp/sqoopTables). For each parallelisation stream (4 streams by default), you will have 2 kind of files available:
process_N-summary.log: an overview log. After executing the script, you should always have a look at those files. You may see the Hive stats (for instance: "[numFiles=1, totalSize=1263]"). But you must also get sure that there is no java error, due to an error when trying to Sqoop some tables (the standard output won't show up errors, this is why you must have a look at those files).
process_N-raw.log: this is the whole standard output of each Sqoop execution for all the tables downloaded by this stream. After having finishing downloading a table, the script will "tail" the last 6 lines of this file and write them in the process_N-summary.log file. That is why the "summary log" is a quick way to detect errors. The "raw log" enables you to get enough details to debug any issue that might happen.
By default, the script uses 4 streams, meaning that 4 tables will be sqooped at the same time (thus, 4 connections to the relational database will be established). This number was chosen because Sqoop uses 4 as a default.
However, this number is quite conservative and you might easily put a higher degree of parallelism (even more if your tables are quite small).
In my SQLserver project for instance, I have set that number to 12. For another Teradata project, I used 54 (more due to a limitation of containers on the Hadoop side than a limitation on the Teradata).
To avoid wasting containers, the script makes use of ubertask.
Mapping of the name of the tables
The names of the tables in Hive might be a bit different from the names in the relational database.
In the middle of the doSqoop() function, there is an example (commented) showing how to establish some mappings.
For instance, you might want to change all the table names that start with "raw_" by "ro_".