Created on 03-18-2016 02:40 PM
This wiki page describes the script sqoopTables.sh (GitHub: https://github.com/sourygnahtw/hadoopUtils/blob/master/scripts/sqoop/sqoopTables.sh).
The purpose of the script is to import many tables in parallel with Sqoop and to store them in Hive.
In a previous project, I needed to download around 400 tables (out of 500) from a SQL Server database.
Most of the tables were quite small (less than a few MB), which means that the overhead of fetching the metadata in Sqoop (establishing the connections, getting the DDL of the table...) is very significant compared to the time spent on the real job (downloading the data). Getting the 400 tables could take around 6 hours.
To speed up the download, Sqoop by default "splits" a table into 4 parts and downloads them in 4 parallel streams. In my project, that approach was not working.
So the idea of the script is, instead of splitting the download of a single table into several streams, to download several tables at the same time, each table using a single download stream.
With that approach, I was able to download the 400 tables in less than 1 hour.
The easiest way to execute the script is the following (take care to first configure the SQL driver, user and password; see the next chapter):
./sqoopTables.sh fileWithListOfTable
Here, fileWithListOfTable is a file that lists all the tables we want to sqoop.
For instance, if we want to Sqoop the 6 tables table1, table2, table3, table4, table5, table6, then the file must contain only 6 lines, 1 for each table:
table1
table2
table3
table4
table5
table6
The script first launches 4 Sqoop processes to download the first 4 tables. When one of those processes finishes, the script launches another Sqoop process for "table5". That way, there are always 4 active Sqoop processes until there are no more tables to sqoop.
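The scheduling described above can be sketched with a small worker pool. This is only an illustration and not necessarily how sqoopTables.sh implements it: it assumes an `xargs` that supports `-P` (GNU and BSD versions do), and `FETCH_CMD` is a hypothetical stand-in for the real per-table Sqoop call.

```shell
# Hypothetical stand-in for the real per-table "sqoop import" call.
# "$1" is the table name passed by xargs.
FETCH_CMD='echo "started $1"; echo "finished $1"'

run_all() {
    # $1 = file listing one table per line
    # $2 = number of tables to process at the same time
    # xargs starts a new command as soon as one of the running ones ends,
    # so up to $2 commands stay active until the list is exhausted.
    xargs -P "$2" -I{} sh -c "$FETCH_CMD" _ {} < "$1"
}

# Example list with the 6 tables used in the text.
list=$(mktemp)
printf '%s\n' table1 table2 table3 table4 table5 table6 > "$list"

run_all "$list" 4
rm -f "$list"
```

Because the tables run concurrently, the order of the "started" and "finished" lines can differ from one run to the next.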
You can also use some options to tune the script behaviour. For instance:
./sqoopTables.sh -d myDatabase2 -H myHiveDatabase3 -p 6 -q etl listOfTables
In this case, we change the names of the relational and Hive databases. We also change the parallelism to have 6 Sqoop processes working at the same time, and we choose the "etl" YARN queue instead of the default one.
The default configuration can be found at the beginning of the script. You will have to change the default values or override them on the command line.
Here are the variables that can be modified:
origServer=myRelationalDatabase.example.com # The FQDN of the relational database you want to fetch (option: -o)
origDatabase=myDatabase # The names of the database that contains the tables to fetch (option: -d)
hiveDatabase=myHiveDatabase # The name of the Hive database that will get the tables fetched (option: -H)
parallelism=4 # The number of tables (sqoop processes) you want to download at the same time (option: -p)
queue=default # The queue used in Yarn (option: -q)
baseDir=/tmp/sqoopTables # Base directory where will be stored the log files (option: -b)
dirJavaGeneratedCode=/tmp/doSqoopTable-`id -u`-tmp # Directory for the java code generated by Sqoop (option: -c)
targetTmpHdfsDir=/tmp/doSqoopTable-`id -u`-tmp/$$ # Temporary directory in HDFS to store downloaded data before moving it to Hive
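To illustrate how command-line options can override such defaults, here is a minimal sketch based on the shell built-in `getopts`. It only covers the -d, -H, -p and -q options. The function name `parse_opts` and the variable `listFile` are assumptions made for this example, and the real parsing code in sqoopTables.sh may look different.

```shell
# Defaults, as listed above.
origDatabase=myDatabase
hiveDatabase=myHiveDatabase
parallelism=4
queue=default

parse_opts() {
    OPTIND=1    # reset, so that the function can be called more than once
    while getopts "d:H:p:q:" opt; do
        case "$opt" in
            d) origDatabase=$OPTARG ;;
            H) hiveDatabase=$OPTARG ;;
            p) parallelism=$OPTARG ;;
            q) queue=$OPTARG ;;
            *) echo "unknown option" >&2; return 1 ;;
        esac
    done
    shift $((OPTIND - 1))
    listFile=$1    # first remaining argument: the file with the list of tables
}

parse_opts -d myDatabase2 -H myHiveDatabase3 -p 6 -q etl listOfTables
echo "$origDatabase $hiveDatabase $parallelism $queue $listFile"
# prints: myDatabase2 myHiveDatabase3 6 etl listOfTables
```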
This script targets SQL Server. Search for the "sqoop import" line (in the middle of the script) and change the header of the JDBC URL to match your database. Take care to also change, in that same line, the user and password needed to connect to the relational database.
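As a rough idea of what such a line can look like, the sketch below only prints a Sqoop command for one table (a dry run) instead of executing it. The flags are standard Sqoop 1 import options, but the exact line in sqoopTables.sh may differ. The user name `myUser` and the password file path are placeholders, and `--password-file` is an assumption used here in place of a password written directly in the script.

```shell
# Placeholder values; adapt them to your environment.
origServer=myRelationalDatabase.example.com
origDatabase=myDatabase
hiveDatabase=myHiveDatabase
queue=default

build_sqoop_cmd() {
    # $1 = table name. Prints the command instead of running it.
    # For another database, change the URL header, for example
    # jdbc:mysql://host/db or jdbc:postgresql://host/db.
    printf '%s ' sqoop import \
        "-Dmapreduce.job.queuename=$queue" \
        --connect "jdbc:sqlserver://$origServer;databaseName=$origDatabase" \
        --username myUser \
        --password-file /user/myUser/.dbPassword \
        --table "$1" \
        -m 1 \
        --hive-import \
        --hive-database "$hiveDatabase"
    printf '\n'
}

build_sqoop_cmd table1
```

The `-m 1` option asks for a single mapper, which matches the "1 stream per table" approach. When running the command for real, keep the JDBC URL quoted, since it contains a semicolon.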
The script prints on the standard output the name of each table it has started to download, so that you can easily see how much of the work defined in the "listOfTables" file has been done.
It also stores more information in the logging directory (by default: /tmp/sqoopTables). For each parallelisation stream (4 streams by default), you will have 2 kinds of files available.
By default, the script uses 4 streams, meaning that 4 tables will be sqooped at the same time (thus, 4 connections to the relational database will be established). This number was chosen because Sqoop uses 4 as a default.
However, this number is quite conservative and you can easily set a higher degree of parallelism (even more so if your tables are quite small). In my SQL Server project for instance, I set that number to 12. For another project, on Teradata, I used 54 (a limit due more to the number of containers available on the Hadoop side than to Teradata itself).
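To avoid wasting containers, the script makes use of ubertask (MapReduce "uber" mode), in which a small enough job runs inside the ApplicationMaster's own container instead of requesting additional containers.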
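Uber mode is controlled by the Hadoop property `mapreduce.job.ubertask.enable`. The snippet below is a sketch of how that property can be passed to Sqoop, not a copy of what the script does. The helper name `uber_args` is an assumption, and `<jdbc-url>` is left as a placeholder.

```shell
# Generic Hadoop options: Sqoop expects them right after the tool name
# ("import"), before tool-specific arguments such as --connect.
uber_args() {
    printf '%s\n' "-Dmapreduce.job.ubertask.enable=true"
}

# Dry run: only print where the option goes on the command line.
echo "sqoop import $(uber_args) --connect <jdbc-url> --table table1 -m 1"
```

Whether a given job really runs in uber mode also depends on cluster-side limits (number of mappers and reducers, input size, memory of the ApplicationMaster), so enabling the property is a request, not a guarantee.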
The names of the tables in Hive might be a bit different from the names in the relational database.
In the middle of the doSqoop() function, there is a commented-out example showing how to set up such mappings.
For instance, you might want to replace the "raw_" prefix with "ro_" in all the table names that start with "raw_".
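Such a prefix mapping can be written with plain shell parameter expansion. The sketch below is an illustration, not the commented code from doSqoop(). The function name `map_hive_name` and the table names `raw_customers` and `orders` are made up for the example.

```shell
map_hive_name() {
    # $1 = table name in the relational database
    case "$1" in
        raw_*) printf 'ro_%s\n' "${1#raw_}" ;;   # strip "raw_", prepend "ro_"
        *)     printf '%s\n' "$1" ;;             # other names stay unchanged
    esac
}

for t in raw_customers orders; do
    echo "$t -> $(map_hive_name "$t")"
done
# prints:
# raw_customers -> ro_customers
# orders -> orders
```

The mapped name could then be given to Sqoop through its `--hive-table` option.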