
Sqoop import: composite primary key and textual primary key

Super Collaborator

Stack: Installed HDP-2.3.2.0-2950 using Ambari 2.1

The source DB schema is on SQL Server, and it contains several tables whose primary key is either:

  • A single varchar column
  • Composite: two varchar columns, one varchar + one int column, or two int columns. There is also a large table with ? rows whose PK spans three columns: one int + two varchar columns.

As per the Sqoop documentation:

Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.

The first question is: what is meant by 'manually choose a splitting column'? Do I have to sacrifice the PK and split on just one of its columns, or am I missing some concept?

I proceeded with the import; the source table has 7909097 records:

sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusname --password somepass --as-textfile --fields-terminated-by '|&|'  --table ChassiECU --num-mappers 8  --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose

The worrisome warnings, and the incorrect map input/output record counts:

16/05/13 10:59:04 WARN manager.CatalogQueryManager: The table ChassiECU contains a multi-column primary key. Sqoop will default to the column ChassiNo only for this job.
16/05/13 10:59:08 WARN db.TextSplitter: Generating splits for a textual index column.
16/05/13 10:59:08 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/05/13 10:59:08 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/05/13 10:59:38 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=1168400
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1128
                HDFS: Number of bytes written=209961941
                HDFS: Number of read operations=32
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=16
        Job Counters
                Launched map tasks=8
                Other local map tasks=8
                Total time spent by all maps in occupied slots (ms)=62785
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=62785
                Total vcore-seconds taken by all map tasks=62785
                Total megabyte-seconds taken by all map tasks=128583680
        Map-Reduce Framework
                Map input records=15818167
                Map output records=15818167
                Input split bytes=1128
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=780
                CPU time spent (ms)=45280
                Physical memory (bytes) snapshot=2219433984
                Virtual memory (bytes) snapshot=20014182400
                Total committed heap usage (bytes)=9394716672
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=209961941
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Transferred 200.2353 MB in 32.6994 seconds (6.1235 MB/sec)
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Retrieved 15818167 records.

Created table:

CREATE EXTERNAL TABLE IF NOT EXISTS ChassiECU(`ChassiNo` varchar(8),
`ECU_Name` varchar(15)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'  LOCATION '/dataload/tohdfs/reio/odpdw/may2016/ChassiECU';

Awful result (without errors). PROBLEM: 15818167 records in Hive vs 7909097 records in SQL Server:

 > select count(1) from ChassiECU;
Query ID = hive_20160513110313_8e294d83-78aa-4e52-b90f-b5640268b8ac
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1446726117927_0059)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     14         14        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 6.12 s
--------------------------------------------------------------------------------
OK
_c0
15818167

Surprisingly, I got either an accurate count or a mismatch of fewer than 10 records whenever the composite key contained an int (which was used for splitting), but I am still apprehensive about those as well!

How shall I proceed ?

1 ACCEPTED SOLUTION

Super Collaborator

Try having a look at the "--split-by" option (see the documentation: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_syntax). It allows you to choose exactly which column to split on, instead of letting Sqoop pick a default one ("ChassiNo" in your example), which might not be optimal.

When choosing the column, Sqoop recommends that you not take a String column (which can be problematic if your SQL Server database sorts that column in a case-insensitive way). You may also choose a column which is not part of the primary key. The important thing is for the column to have an even distribution (you want to avoid skewed Sqoop fetch tasks), and ideally for each split to correspond to a contiguous disk location in the source table (to get sequential reads).
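For instance, a sketch of the import with an explicit split column (YourIntColumn is a placeholder for whichever evenly distributed integer column exists in ChassiECU; the other options are taken from your original command):

# same import, but splitting on an integer column instead of ChassiNo
sqoop import \
  --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
  --username someusname --password somepass \
  --table ChassiECU \
  --split-by YourIntColumn \
  --num-mappers 8 \
  --as-textfile \
  --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016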


4 REPLIES


Super Collaborator

Some confusion, can you help?

  • Can a wrong choice of the --split-by column cause the mismatch in the number of rows between the source schema and HDFS?
  • Can a wrong choice of --num-mappers cause the mismatch in the number of rows between the source schema and HDFS?
  • Can a combination of the above two cause the mismatch in the number of rows between the source schema and HDFS?

Super Collaborator

Looking at the mismatch, you have:

  • 7909097 rows in your source table
  • 15818167 rows in your Hive table. This number is nearly double the one in your source table, which rather confirms the warning Sqoop gave you: "If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records". And as you said, when you split on an int column you don't get that kind of duplication. (By the way, you don't need to do a "select count(1)" in Hive to know the number of rows: you can just check the counters "Map input records=15818167" and "Map output records=15818167" in your Sqoop job; that gives you as much or more information and will help you debug.) You can verify the collation on the source side, as sketched below.

As a test, you could try "--num-mappers 1"; that should remove the duplication, since no splitting happens at all. You'll be sacrificing speed, but that might not be an issue if you Sqoop a lot of tables in parallel, as I mentioned in my other post.
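Sketched out (same connection options as your original command, just forcing a single mapper):

# one mapper: the whole table is read in a single query, no split column needed
sqoop import \
  --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
  --username someusname --password somepass \
  --table ChassiECU \
  --num-mappers 1 \
  --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016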

You could also run another test choosing a split column that holds a numeric value, so that you won't suffer duplication due to lower/upper case.
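To compare the counts after each test without opening a SQL client, "sqoop eval" can run the count on the source side for you (a sketch, reusing the connection options from your command):

# run the count directly against SQL Server and compare with "Retrieved N records"
sqoop eval \
  --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
  --username someusname --password somepass \
  --query 'SELECT COUNT(*) FROM ChassiECU'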

Super Collaborator

If you have many tables to Sqoop, choosing a specific split column for each one of them can be cumbersome.

In that case, you might consider having just 1 mapper for each table and launching many Sqoop processes in parallel. You can refer to this article for that:

https://community.hortonworks.com/articles/23602/sqoop-fetching-lot-of-tables-in-parallel.html
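As a minimal illustration of that approach (a sketch only; tables.txt is a hypothetical file with one table name per line):

# launch one single-mapper import per table, all in the background
while read table; do
  sqoop import \
    --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
    --username someusname --password somepass \
    --table "$table" --num-mappers 1 \
    --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 &
done < tables.txt
wait  # block until every background import has finished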