
Import from SQL Server, data types not converted properly

Expert Contributor

Stack: Installed HDP-2.3.2.0-2950 using Ambari 2.1

Objective:

  1. Import tables from SQL Server onto HDFS in Avro format
  2. Create EXTERNAL Hive Avro(SerDe) tables which have all the data
  3. Create EXTERNAL Hive ORC tables and insert into ORC select * from Avro tables
  4. Drop the Avro tables and perform testing on the ORC tables
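A rough sketch of what steps 2 and 3 could look like in HiveQL (table names, column list, and paths here are hypothetical):

```
-- Step 2: external Avro-backed table over the Sqoop output (hypothetical paths)
CREATE EXTERNAL TABLE dim_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/dataload/tohdfs/reio/odpdw/may2016/DimECUDTCCode'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/sqoop/DimECUDTCCode.avsc');

-- Step 3: external ORC table populated from the Avro table
CREATE EXTERNAL TABLE dim_orc (ecu_dtc_id INT /* ... remaining columns ... */)
STORED AS ORC
LOCATION '/apps/data/orc/DimECUDTCCode';

INSERT INTO TABLE dim_orc SELECT * FROM dim_avro;
```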

One of the tables:

ECU_DTC_ID			int
DTC_CDE				nchar(20)
ECU_NAME			nvarchar(15)
ECU_FAMILY_NAME		nvarchar(15)
DTC_DESC			nvarchar(MAX)
INSERTED_BY			nvarchar(64)
INSERTION_DATE		datetime
DTC_CDE_DECIMAL		int

When I execute a normal Sqoop import, datetime is converted to long, and nchar/nvarchar to string. The resultant .avsc file is shown below. When I then create a Hive Avro table, it doesn't pick up the generated Avro files, leaving behind an empty table:

{
  "type" : "record",
  "name" : "DimECUDTCCode",
  "doc" : "Sqoop import of DimECUDTCCode",
  "fields" : [ {
    "name" : "ECU_DTC_ID",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "ECU_DTC_ID",
    "sqlType" : "4"
  }, {
    "name" : "DTC_CDE",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "DTC_CDE",
    "sqlType" : "-15"
  }, {
    "name" : "ECU_NAME",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "ECU_NAME",
    "sqlType" : "-9"
  }, {
    "name" : "ECU_FAMILY_NAME",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "ECU_FAMILY_NAME",
    "sqlType" : "-9"
  }, {
    "name" : "DTC_DESC",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "DTC_DESC",
    "sqlType" : "-9"
  }, {
    "name" : "INSERTED_BY",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "INSERTED_BY",
    "sqlType" : "-9"
  }, {
    "name" : "INSERTION_DATE",
    "type" : [ "null", "long" ],
    "default" : null,
    "columnName" : "INSERTION_DATE",
    "sqlType" : "93"
  }, {
    "name" : "DTC_CDE_DECIMAL",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "DTC_CDE_DECIMAL",
    "sqlType" : "4"
  } ],
  "tableName" : "DimECUDTCCode"

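For reference, the long that Sqoop emits for INSERTION_DATE is the epoch timestamp in milliseconds, so the original value is recoverable. A quick check with GNU date (the value below is just an illustrative example, not from the actual data):

```shell
# Sqoop stores SQL Server datetime values in the Avro "long" as epoch milliseconds.
millis=1463744848000
date -u -d "@$((millis / 1000))" '+%Y-%m-%d %H:%M:%S'   # prints 2016-05-20 11:47:28
```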
I decided to include --map-column-java:

sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusername --password somepassword --as-avrodatafile --num-mappers 8 --table DimECUDTCCode --map-column-java DTC_CDE=string,ECU_NAME=string,ECU_FAMILY_NAME=string,DTC_DESC=string,INSERTED_BY=string,INSERTION_DATE=timestamp --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose

but I get the following error:

16/05/12 09:43:12 INFO orm.ClassWriter: Overriding type of column DTC_CDE to string
16/05/12 09:43:12 INFO orm.ClassWriter: Overriding type of column ECU_NAME to string
16/05/12 09:43:12 INFO orm.ClassWriter: Overriding type of column ECU_FAMILY_NAME to string
16/05/12 09:43:12 INFO orm.ClassWriter: Overriding type of column DTC_DESC to string
16/05/12 09:43:12 INFO orm.ClassWriter: Overriding type of column INSERTED_BY to string
16/05/12 09:43:12 INFO orm.ClassWriter: Overriding type of column INSERTION_DATE to timestamp
16/05/12 09:43:12 INFO orm.ClassWriter: Overriding type of column DTC_CDE to string
16/05/12 09:43:12 ERROR orm.ClassWriter: No ResultSet method for Java type string
16/05/12 09:43:12 INFO orm.ClassWriter: Overriding type of column ECU_NAME to string
16/05/12 09:43:12 ERROR orm.ClassWriter: No ResultSet method for Java type string
16/05/12 09:43:12 INFO orm.ClassWriter: Overriding type of column ECU_FAMILY_NAME to string
16/05/12 09:43:12 ERROR orm.ClassWriter: No ResultSet method for Java type string
16/05/12 09:43:12 INFO orm.ClassWriter: Overriding type of column DTC_DESC to string
16/05/12 09:43:12 ERROR orm.ClassWriter: No ResultSet method for Java type string
16/05/12 09:43:12 INFO orm.ClassWriter: Overriding type of column INSERTED_BY to string
16/05/12 09:43:12 ERROR orm.ClassWriter: No ResultSet method for Java type string
16/05/12 09:43:12 INFO orm.ClassWriter: Overriding type of column INSERTION_DATE to timestamp
16/05/12 09:43:12 ERROR orm.ClassWriter: No ResultSet method for Java type timestamp
16/05/12 09:43:12 INFO orm.ClassWriter: Overriding type of column DTC_CDE to string
16/05/12 09:43:12 ERROR tool.ImportTool: Imported Failed: No ResultSet method for Java type string
[sqoop@l1038lab root]$

What am I missing?

EDIT-1:

Changed the sqoop command as suggested by @Donald Campbell:

sqoop import --connect 'jdbc:sqlserver://SESOCO1898;database=ODP_DW' --username hadoop --password hadoop --as-avrodatafile --num-mappers 8  --query 'SELECT CAST(ECU_DTC_ID AS INTEGER) AS ECU_DTC_ID,CAST(DTC_CDE AS VARCHAR) AS DTC_CDE,CAST(ECU_NAME AS VARCHAR) AS ECU_NAME,CAST(ECU_FAMILY_NAME AS VARCHAR) AS ECU_FAMILY_NAME,CAST(DTC_DESC AS VARCHAR) AS DTC_DESC,CAST(INSERTED_BY AS VARCHAR) AS INSERTED_BY,CAST(INSERTION_DATE AS TIMESTAMP) AS INSERTION_DATE,CAST(DTC_CDE_DECIMAL AS INTEGER) AS DTC_CDE_DECIMAL FROM DimECUDTCCode WHERE $CONDITIONS' --split-by ECU_DTC_ID --target-dir /dumphere/sqoop/odp_dw/DimECUDTCCode  --verbose

The HDFS files:

[sqoop@l1038lab root]$ hadoop fs -ls /dumphere/sqoop/odp_dw/DimECUDTCCode
Found 13 items
-rw-rw-rw-   3 sqoop hdfs       1305 2016-05-20 13:43 /dumphere/sqoop/odp_dw/DimECUDTCCode/QueryResult.avsc
-rw-rw-rw-   3 sqoop hdfs      15163 2016-05-20 13:43 /dumphere/sqoop/odp_dw/DimECUDTCCode/QueryResult.class
-rw-rw-rw-   3 sqoop hdfs       5765 2016-05-20 13:43 /dumphere/sqoop/odp_dw/DimECUDTCCode/QueryResult.jar
-rw-rw-rw-   3 sqoop hdfs      25686 2016-05-20 13:43 /dumphere/sqoop/odp_dw/DimECUDTCCode/QueryResult.java
-rw-rw-rw-   3 sqoop hdfs          0 2016-05-20 13:42 /dumphere/sqoop/odp_dw/DimECUDTCCode/_SUCCESS
-rw-rw-rw-   3 sqoop hdfs     282863 2016-05-20 13:42 /dumphere/sqoop/odp_dw/DimECUDTCCode/part-m-00000.avro
-rw-rw-rw-   3 sqoop hdfs     296604 2016-05-20 13:42 /dumphere/sqoop/odp_dw/DimECUDTCCode/part-m-00001.avro
-rw-rw-rw-   3 sqoop hdfs     297730 2016-05-20 13:42 /dumphere/sqoop/odp_dw/DimECUDTCCode/part-m-00002.avro
-rw-rw-rw-   3 sqoop hdfs     307128 2016-05-20 13:42 /dumphere/sqoop/odp_dw/DimECUDTCCode/part-m-00003.avro
-rw-rw-rw-   3 sqoop hdfs      51645 2016-05-20 13:42 /dumphere/sqoop/odp_dw/DimECUDTCCode/part-m-00004.avro
-rw-rw-rw-   3 sqoop hdfs        978 2016-05-20 13:42 /dumphere/sqoop/odp_dw/DimECUDTCCode/part-m-00005.avro
-rw-rw-rw-   3 sqoop hdfs     167303 2016-05-20 13:42 /dumphere/sqoop/odp_dw/DimECUDTCCode/part-m-00006.avro
-rw-rw-rw-   3 sqoop hdfs     310558 2016-05-20 13:42 /dumphere/sqoop/odp_dw/DimECUDTCCode/part-m-00007.avro

Created an external table:

CREATE EXTERNAL TABLE DimECUDTCCode
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.url'='hdfs://l1031lab.sss.com:8020/dumphere/sqoop/odp_dw/DimECUDTCCode/QueryResult.avsc');

The description of the table seems incorrect:

  1. Instead of timestamp, the column insertion_date is binary
  2. Normally, the table shows the number of associated files, which I suppose should be the 8 Avro part files
hive (scratchpad)>
                 >
                 > describe formatted dimecudtccode;
OK
# col_name              data_type               comment
ecu_dtc_id              int
dtc_cde                 string
ecu_name                string
ecu_family_name         string
dtc_desc                string
inserted_by             string
insertion_date          binary
dtc_cde_decimal         int
# Detailed Table Information
Database:               scratchpad
Owner:                  hive
CreateTime:             Fri May 20 13:47:28 CEST 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://l1031lab.sss.com:8020/apps/hive/warehouse/scratchpad.db/dimecudtccode
Table Type:             EXTERNAL_TABLE
Table Parameters:
        EXTERNAL                TRUE
        avro.schema.url         hdfs://l1031lab.sss.com:8020/dumphere/sqoop/odp_dw/DimECUDTCCode/QueryResult.avsc
        transient_lastDdlTime   1463744848
# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.avro.AvroSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.408 seconds, Fetched: 35 row(s)

The select statement doesn't yield any rows!

hive (scratchpad)>
                 > select * from dimecudtccode;
OK
Time taken: 0.101 seconds

Anything that I have missed?

4 REPLIES

Re: Import from SQL Server, data types not converted properly

Rising Star

You should use Java types with --map-column-java, i.e. you need to use String, Timestamp, ...
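For example, a sketch of the same import using Java class names (untested; whether the Avro export accepts java.sql.Timestamp also depends on your Sqoop version):

```
sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
  --username someusername --password somepassword \
  --as-avrodatafile --num-mappers 8 --table DimECUDTCCode \
  --map-column-java DTC_CDE=String,ECU_NAME=String,ECU_FAMILY_NAME=String,DTC_DESC=String,INSERTED_BY=String,INSERTION_DATE=java.sql.Timestamp \
  --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose
```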

Re: Import from SQL Server, data types not converted properly

Expert Contributor

Do you mean simply that the case is wrong (not 'string' but 'String')? If so, that doesn't matter; the error is the same.

Re: Import from SQL Server, data types not converted properly

Rising Star

Have you also tried with the full class name, i.e. java.lang.String?

Re: Import from SQL Server, data types not converted properly

New Contributor

When using SQL Server, I have previously gotten around data type conversions by using the --query switch with explicit casts. Because this approach pushes the conversions onto SQL Server, you can confirm that the data reaches your cluster in the correct format by running the same query through a SQL Server client.

sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusername --password somepassword --as-avrodatafile --num-mappers 8 --query 'SELECT CAST(DTC_CDE AS VARCHAR) AS DTC_CDE, CAST(INSERTION_DATE AS TIMESTAMP) AS INSERTION_DATE, ... FROM DimECUDTCCode WHERE $CONDITIONS' --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose --split-by [primary key]

If your SELECT query gets a bit long, you can use configuration files to shorten the length of the command line call.

https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_using_options_files_to_pass_arguments
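As a sketch (file name and values hypothetical), an options file holds one option or value per line, with '#' lines treated as comments, and is passed via --options-file; remaining arguments can still follow on the command line:

```
# import.txt -- shared connection options, one per line
import
--connect
jdbc:sqlserver://somedbserver;database=somedb
--username
someusername
```

Invoked as: sqoop --options-file /path/to/import.txt --table DimECUDTCCode --as-avrodatafile ...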
