
SQOOP importing a 2-million-record table from SQL Server

Expert Contributor

The objective is to import a table of 2 million records. The MapReduce job seems to have completed OK and there is no error message, but the table did not appear in the target Hive DB.

Before this import, I was able to successfully import another table with just 7 records, and that table did appear in the Hive DB.

I compared the outputs of the two imports and found that the failing one stopped at this output:

19/05/20 13:10:28 INFO mapreduce.ImportJobBase: Transferred 2.3685 GB in 526.6253 seconds (4.6054 MB/sec)
19/05/20 13:10:28 INFO mapreduce.ImportJobBase: Retrieved 2925046 records.

But the successfully completed import had a lot more output after reporting “Retrieved # records”:

19/05/20 11:48:18 INFO mapreduce.ImportJobBase: Transferred 176 bytes in 67.7923 seconds (2.5962 bytes/sec)
19/05/20 11:48:18 INFO mapreduce.ImportJobBase: Retrieved 7 records.
19/05/20 11:48:18 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM [DimDepartmentGroup] AS t WHERE 1=0
19/05/20 11:48:18 INFO hive.HiveImport: Loading uploaded data into Hive
19/05/20 11:48:20 INFO hive.HiveImport: SLF4J: Class path contains multiple SLF4J bindings.

Another difference between the two imports is that the successful one did the cleanup while the unsuccessful one did not. I can re-run the successful import without error, but when I re-ran the unsuccessful import, I got an error that a folder under the hdfs://user/hive/ path already exists. I had to delete that folder so the MapReduce job could continue, although it still did not create the table at the end.
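
For reference, the cleanup I had to do before each re-run was simply removing the leftover folder reported in the “already exists” error, along these lines (the path below is only a placeholder for whatever folder the error names):

# remove the directory left behind by the failed import (example path only)
hdfs dfs -rm -r /user/hive/<leftover_folder>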

10 REPLIES

Mentor

@Haijin Li

Can you share your sqoop import command? There may be an issue with the command.

You need to explicitly tell Sqoop to create a Hive table, as below. After the run you should have a table haijin_hive created:

--hive-table haijin_hive \
--create-hive-table \
--target-dir /tmp/hive_temp

Add the above to your command and let me know
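
Assembled into a full command it would look something like this (the host, credentials, database and table name are placeholders; adjust them to your environment):

# example only: substitute your own connection details and source table
sqoop import --connect "jdbc:sqlserver://yourhost:1433;database=YOURDB" \
  --username youruser --password yourpass \
  --table "YourTable" \
  --hive-import \
  --hive-table haijin_hive \
  --create-hive-table \
  --target-dir /tmp/hive_temp

Note that with --create-hive-table the job fails if the Hive table already exists, so drop it first when re-testing.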


Expert Contributor

@Geoffrey Shelton Okot

Thanks for the reply.

I tried including the --create-hive-table parameter and still got the same result. I did not apply the --target-dir parameter because I want Hive to control this.

This is the import command that ran successfully. I am able to re-run this one several times without error. Of course, I dropped the TestHive.DimDepartmentGroup table before each rerun.

sqoop import --connect "jdbc:sqlserver://abc:1433;database=ABC" --username abc --password abc --table "DimDepartmentGroup" --hive-import --hive-table TestHive.DimDepartmentGroup

This is the import command that did not create the table.

sqoop import --connect 'jdbc:sqlserver://xyz\\sql1:57830;database=XYZ' --username xyz --password xyz  --table 'Dim_Member' --split-by Member_SKey -- --schema mydw --hive-import --hive-table Testdb.dim_member --create-hive-table

You can see there are three key differences between the two imports, but I do not think any of them matters.

  • Source SQL Server instance – default vs. named
  • Source SQL DB schema name – default (dbo) vs. user-defined (mydw)
  • Use of --split-by – not used vs. used

Expert Contributor

Additional testing showed that the source SQL Server database schema name is the root cause. After I re-created Dim_Member (a SQL view) in the dbo schema, the import command below works.

sqoop import --connect 'jdbc:sqlserver://xyz\\sql1:57830;database=XYZ' --username xyz --password xyz  --table 'Dim_Member' --split-by Member_SKey --hive-import --hive-table Testdb.dim_member

I cannot find any documentation on specifying the database schema in the Sqoop tool. My only reference is this link: https://community.hortonworks.com/questions/50557/sqoop-import-sql-server-non-default-schema.html

I think the syntax for specifying the database schema is correct, because at least the MapReduce job was created successfully and got about halfway through.
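
One thing I am not sure about: as I understand Sqoop's argument parsing, everything after the bare -- separator is handed to the connector as extra arguments rather than parsed as normal Sqoop options, so in my failing command the --hive-import / --hive-table / --create-hive-table options that came after -- --schema mydw may simply have been ignored. If that is right, a variant worth testing would put the extra argument at the very end (untested sketch, same obfuscated connection details as my earlier command):

sqoop import --connect 'jdbc:sqlserver://xyz\\sql1:57830;database=XYZ' --username xyz --password xyz --table 'Dim_Member' --split-by Member_SKey --hive-import --hive-table Testdb.dim_member -- --schema mydw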

Mentor

@Haijin Li

Can you include the --target-dir as I pointed out before?

--hive-table haijin_hive \
--create-hive-table \
--target-dir /tmp/hive_temp

And let me know 🙂

Expert Contributor

@Geoffrey Shelton Okot

I tested again with --target-dir specified and got the same result. There is no error, but the process stopped at the following output.

19/05/22 12:02:24 INFO mapreduce.ImportJobBase: Transferred # bytes in # seconds
19/05/22 12:02:24 INFO mapreduce.ImportJobBase: Retrieved # records.

I checked the path specified as target-dir and it is empty.
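
For reference, this is roughly how I checked (using the /tmp/hive_temp example path from above; substitute whatever you pass as --target-dir, and your own Hive database name):

# list the import target directory and the tables in the target Hive database
hdfs dfs -ls /tmp/hive_temp
hive -e 'USE Testdb; SHOW TABLES;'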

Mentor

@Haijin Li

The target-dir is a scratch directory. It's supposed to load the data into the Hive table if you used --create-hive-table, isn't it?

Expert Contributor

@Geoffrey Shelton Okot

Target-dir is not a scratch directory.

https://stackoverflow.com/questions/37415989/difference-between-warehouse-dir-and-target-dir-command...

But anyway, I did see the scratch directory created under the hdfs://user/hive path every time, and there were files inside. I expect the scratch directory to be deleted automatically if the import command runs through to the end. In fact, I had to manually delete this scratch folder every time before I ran my test, because my test import never completed and always stopped half-way.

Thanks again for your help.

Mentor

@Haijin Li

Wrong word, scratch dir!!
So the sqoop job still doesn't complete and create the Hive tables, right?

Expert Contributor

@Geoffrey Shelton Okot

My sqoop import job still doesn't complete and create the hive tables.

Currently, my workaround is to create a dbo view on top of my warehouse view/table and then run the import job.
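
On the SQL Server side the workaround view is essentially just a pass-through, roughly like this (names taken from my earlier commands; adjust to your own schema and object):

-- pass-through view exposing the warehouse view under the default dbo schema
CREATE VIEW dbo.Dim_Member AS
SELECT * FROM mydw.Dim_Member;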
