Sqoop import: composite primary key and textual primary key
Labels: Apache Hive
Created 05-13-2016 09:11 AM
Stack: HDP-2.3.2.0-2950, installed using Ambari 2.1.
The source DB schema is on SQL Server and contains several tables whose primary key is either:
- A single varchar column
- Composite: two varchar columns, one varchar + one int column, or two int columns. There is also a large table with ? rows whose PK has three columns: one int + two varchar columns.
As per the Sqoop documentation:
Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
The first question is: what is meant by 'manually choose a splitting column'? Do I have to sacrifice the composite PK and split on just one of its columns, or am I missing some concept?
I proceeded with the import; the source table has 7909097 records:
sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
  --username someusname --password somepass \
  --as-textfile --fields-terminated-by '|&|' \
  --table ChassiECU --num-mappers 8 \
  --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose
The worrisome warnings and the incorrect mapper input and record counts:
16/05/13 10:59:04 WARN manager.CatalogQueryManager: The table ChassiECU contains a multi-column primary key. Sqoop will default to the column ChassiNo only for this job.
16/05/13 10:59:08 WARN db.TextSplitter: Generating splits for a textual index column.
16/05/13 10:59:08 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/05/13 10:59:08 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/05/13 10:59:38 INFO mapreduce.Job: Counters: 30
  File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=1168400
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=1128
    HDFS: Number of bytes written=209961941
    HDFS: Number of read operations=32
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=16
  Job Counters
    Launched map tasks=8
    Other local map tasks=8
    Total time spent by all maps in occupied slots (ms)=62785
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=62785
    Total vcore-seconds taken by all map tasks=62785
    Total megabyte-seconds taken by all map tasks=128583680
  Map-Reduce Framework
    Map input records=15818167
    Map output records=15818167
    Input split bytes=1128
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=780
    CPU time spent (ms)=45280
    Physical memory (bytes) snapshot=2219433984
    Virtual memory (bytes) snapshot=20014182400
    Total committed heap usage (bytes)=9394716672
  File Input Format Counters
    Bytes Read=0
  File Output Format Counters
    Bytes Written=209961941
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Transferred 200.2353 MB in 32.6994 seconds (6.1235 MB/sec)
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Retrieved 15818167 records.
Created table:
CREATE EXTERNAL TABLE IF NOT EXISTS ChassiECU(
  `ChassiNo` varchar(8),
  `ECU_Name` varchar(15))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/dataload/tohdfs/reio/odpdw/may2016/ChassiECU';
Awful result (without errors). PROBLEM: 15818167 records in Hive vs 7909097 in SQL Server:
> select count(1) from ChassiECU;
Query ID = hive_20160513110313_8e294d83-78aa-4e52-b90f-b5640268b8ac
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1446726117927_0059)
--------------------------------------------------------------------------------
VERTICES         STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........  SUCCEEDED     14         14        0        0       0       0
Reducer 2 ......  SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 6.12 s
--------------------------------------------------------------------------------
OK
_c0
15818167
Surprisingly, I got either an accurate count or a mismatch of fewer than 10 records when the composite key included an int (which was used for splitting), but I am still apprehensive about those as well!
How shall I proceed ?
Created 05-13-2016 10:09 AM
Try having a look at the "--split-by" option (see the documentation: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_syntax). This lets you choose exactly which column to split on, instead of letting Sqoop pick a default one ("ChassiNo" in your example), which might not be optimal.
When choosing the column, Sqoop recommends against a String column (which can be problematic if your SQL Server database sorts that column case-insensitively). You may also use a column that is not part of the primary key. The important things are that the column has an even distribution (you want to avoid skewed Sqoop fetch tasks) and, ideally, that each split maps to a contiguous disk location in your source table (try to get sequential reads).
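For example, here is a sketch of the same import splitting on a numeric column ("SomeIntColumn" is a hypothetical, evenly distributed int column of ChassiECU; substitute a real one):

sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
  --username someusname --password somepass \
  --as-textfile --fields-terminated-by '|&|' \
  --table ChassiECU --split-by SomeIntColumn --num-mappers 8 \
  --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose
# --split-by overrides the split column Sqoop would otherwise derive from the primary key.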
Created 05-13-2016 10:14 AM
Some confusion, can you help?
- Can a wrong choice of the split-by column cause the mismatch in the number of rows between the source schema and HDFS?
- Can a wrong choice of --num-mappers cause the mismatch in the number of rows between the source schema and HDFS?
- Can a combination of the two above cause the mismatch in the number of rows between the source schema and HDFS?
Created 05-13-2016 11:26 AM
Looking at the mismatch, you have:
- 7909097 rows in your source table
- 15818167 rows in your Hive table. This number is nearly double the one in your source table, which kind of confirms the warning Sqoop gave you: "If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records". And as you said, if you have an int column, you don't get that kind of duplication. (By the way, you don't need to do a "select count()" in Hive to know the number of rows. You can just check the counters "Map input records=15818167" and "Map output records=15818167" in your Sqoop job; that will give you as much or more information and will help you debug it.)
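If you want to double-check the source side without leaving Sqoop, here is a sketch using "sqoop eval", which runs an arbitrary SQL query against the source database (connection placeholders taken from your import command):

sqoop eval --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
  --username someusname --password somepass \
  --query 'SELECT COUNT(*) FROM ChassiECU'
# Should print 7909097; compare it with the "Map output records" counter of the Sqoop job.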
As a test, you could try using "--num-mappers 1"; that should remove the duplication. You'll be sacrificing speed, but that might not be an issue if you Sqoop a lot of tables in parallel, as I mentioned in my other post.
You could also run another test choosing a numeric split column, so that you won't suffer duplication due to lower/upper case.
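For instance, the single-mapper test could look like this (a sketch reusing the placeholders from your import command; with one mapper Sqoop doesn't split at all, so no split column is needed):

sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
  --username someusname --password somepass \
  --as-textfile --fields-terminated-by '|&|' \
  --table ChassiECU --num-mappers 1 \
  --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose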
Created 05-13-2016 10:13 AM
If you have many tables to Sqoop, choosing a specific split column for each of them can be cumbersome.
In that case, you might consider using just 1 mapper for each table and launching many Sqoop processes in parallel. You can refer to this article for that:
https://community.hortonworks.com/articles/23602/sqoop-fetching-lot-of-tables-in-parallel.html
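A minimal sketch of that pattern in shell, assuming a file tables.txt with one table name per line (connection placeholders as in the import command above):

while read table; do
  sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
    --username someusname --password somepass \
    --as-textfile --table "$table" --num-mappers 1 \
    --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 &
done < tables.txt
wait  # block until all background imports finish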
