How to load and store nvarchar

Super Collaborator

Stack: Installed HDP-2.3.2.0-2950 using Ambari 2.1.

The steps that I am following:

  1. Load SQL server tables onto HDFS using Sqoop
  2. Create EXTERNAL tables in Hive

I didn't use anything pertaining to charset/unicode/utf-8 while executing the sqoop import commands.
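
For reference, the import commands were of this shape (the server, credentials, and mapper count below are placeholders, not the exact values used):

sqoop import \
  --connect 'jdbc:sqlserver://<host>:1433;database=<db>' \
  --username <user> -P \
  --table DimSampleDesc \
  --target-dir /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc \
  --as-avrodatafile \
  -m 4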

While creating the Hive external table, I wondered which data type to select for the nvarchar columns in the original SQL Server table; now I am worried that this also needs to be addressed in Sqoop during the import.

  1. I couldn't find relevant charset/nvarchar options in Sqoop import.
  2. In Hive, can varchar/string simply be used in place of nvarchar?

Edit: Added further code.

I now have a problem: Sqoop is automatically converting nvarchar to string.

  • When I specify --as-avrodatafile, .avro files and a schema file are created, as follows (note that in the schema, sqlType 12 is java.sql.Types.VARCHAR and -9 is NVARCHAR, yet both map to the Avro string type):
[sqoop@l1038lab root]$ hadoop fs -cat /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc
{
  "type" : "record",
  "name" : "DimSampleDesc",
  "doc" : "Sqoop import of DimSampleDesc",
  "fields" : [ {
    "name" : "SmapiName_ver",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "SmapiName_ver",
    "sqlType" : "12"
  }, {
    "name" : "SmapiColName",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "SmapiColName",
    "sqlType" : "12"
  }, {
    "name" : "ChartType",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "ChartType",
    "sqlType" : "4"
  }, {
    "name" : "X_Indexet",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "X_Indexet",
    "sqlType" : "4"
  }, {
    "name" : "Y_Indexet",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "Y_Indexet",
    "sqlType" : "4"
  }, {
    "name" : "X_Tick",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_Tick",
    "sqlType" : "-9"
  }, {
    "name" : "Y_Tick",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_Tick",
    "sqlType" : "-9"
  }, {
    "name" : "X_TickRange",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_TickRange",
    "sqlType" : "-9"
  }, {
    "name" : "X_TickRangeFrom",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_TickRangeFrom",
    "sqlType" : "-9"
  }, {
    "name" : "X_TickRangeTom",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_TickRangeTom",
    "sqlType" : "-9"
  }, {
    "name" : "Y_TickRange",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_TickRange",
    "sqlType" : "-9"
  }, {
    "name" : "Y_TickRangeFrom",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_TickRangeFrom",
    "sqlType" : "-9"
  }, {
    "name" : "Y_TickRangeTom",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_TickRangeTom",
    "sqlType" : "-9"
  }, {
    "name" : "IndexCount",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "IndexCount",
    "sqlType" : "4"
  }, {
    "name" : "X_IndexCount",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "X_IndexCount",
    "sqlType" : "4"
  }, {
    "name" : "Y_IndexCount",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "Y_IndexCount",
    "sqlType" : "4"
  }, {
    "name" : "X_Symbol",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_Symbol",
    "sqlType" : "-9"
  }, {
    "name" : "X_SymbolName",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_SymbolName",
    "sqlType" : "-9"
  }, {
    "name" : "X_SymbolDescr",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_SymbolDescr",
    "sqlType" : "-9"
  }, {
    "name" : "Y_Symbol",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_Symbol",
    "sqlType" : "-9"
  }, {
    "name" : "Y_SymbolName",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_SymbolName",
    "sqlType" : "-9"
  }, {
    "name" : "Y_SymbolDescr",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_SymbolDescr",
    "sqlType" : "-9"
  }, {
    "name" : "SmapiName",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "SmapiName",
    "sqlType" : "12"
  }, {
    "name" : "Incorrect_Ver_FL",
    "type" : [ "null", "boolean" ],
    "default" : null,
    "columnName" : "Incorrect_Ver_FL",
    "sqlType" : "-7"
  } ],
  "tableName" : "DimSampleDesc"
}
[sqoop@l1038lab root]$
  • I then simply create a Hive external table:
hive>
    > CREATE EXTERNAL TABLE DimSampleDesc
    >   ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    >   STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    >   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    >   TBLPROPERTIES ('avro.schema.url'='hdfs://l1031lab.sss.se.com:8020/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc');
OK
Time taken: 0.37 seconds
hive>
  • The table is described as follows:
hive (odp_dw_may2016)>
                     >
                     > describe formatted dimsampledesc;
OK
col_name        data_type       comment
# col_name              data_type               comment
smapiname_ver           string
smapicolname            string
charttype               int
x_indexet               int
y_indexet               int
x_tick                  string
y_tick                  string
x_tickrange             string
x_tickrangefrom         string
x_tickrangetom          string
y_tickrange             string
y_tickrangefrom         string
y_tickrangetom          string
indexcount              int
x_indexcount            int
y_indexcount            int
x_symbol                string
x_symbolname            string
x_symboldescr           string
y_symbol                string
y_symbolname            string
y_symboldescr           string
smapiname               string
incorrect_ver_fl        boolean
# Detailed Table Information
Database:               odp_dw_may2016
Owner:                  hive
CreateTime:             Mon May 09 16:37:08 CEST 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://l1031lab.sss.se.com:8020/apps/hive/warehouse/odp_dw_may2016.db/dimsampledesc
Table Type:             EXTERNAL_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        EXTERNAL                TRUE
        avro.schema.url         hdfs://l1031lab.sss.se.com:8020/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc
        numFiles                6
        numRows                 -1
        rawDataSize             -1
        totalSize               8566342
        transient_lastDdlTime   1462804628
# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.avro.AvroSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.441 seconds, Fetched: 56 row(s)
hive (odp_dw_may2016)>

What's the risk here, and how do I proceed?

1 ACCEPTED SOLUTION

@Kaliyug Antagonist

Unicode file:

[root@test test]# pwd
/root/test
[root@test test]# cat xyz
Les caractères accentués (Français)
En données nous avons confiance
Données, données, partout
et tous les noeuds étaient déconnecté
Données, données, partout
[root@test test]#

External table DDL:

create external table demand_db.unicode
(data string)
COMMENT 'External table for data cleansing'
LOCATION '/tmp/test/';

External table location:

[root@test ~]# hdfs dfs -mkdir -p /tmp/test
[root@test ~]# hdfs dfs -chmod -R 777 /tmp/test
[root@test ~]# hdfs dfs -ls /tmp
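
(One step is implicit between the commands above and the output below: the test file has to be copied into the table location, presumably with something like the following.)

hdfs dfs -put xyz /tmp/test/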

Output:

hive> create external table unicode
    > (data string)
    > COMMENT 'External table for data cleansing'
    > LOCATION '/tmp/test/';
OK
Time taken: 0.502 seconds
hive> select * from unicode;
OK
Les caractères accentués (Français)
En données nous avons confiance
Données, données, partout
et tous les noeuds étaient déconnecté
Données, données, partout
Time taken: 0.897 seconds, Fetched: 8 row(s)
hive>

Conclusion: you do not need to convert the Unicode character set, and string works fine in this case.

Thanks


8 REPLIES

Explorer

Hive is UTF-8 by default. I would recommend doing a Sqoop import into a Hive table directly: define the table DDL and import into the table. Yes, you can use string for nvarchar.
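
A minimal sketch of that direct-to-Hive import (connection details are placeholders; table names reuse those from this thread):

sqoop import \
  --connect 'jdbc:sqlserver://<host>:1433;database=<db>' \
  --username <user> -P \
  --table DimSampleDesc \
  --hive-import \
  --create-hive-table \
  --hive-table odp_dw_may2016.dimsampledesc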

Super Collaborator

Hive has both varchar and string data types; which one should be used in case of nvarchar?

I am also in a dilemma about whether to import directly into Hive or to go first to HDFS and then to Hive; maybe I will try both.

Guru

By default, Sqoop will import nvarchar directly to string (if you let Sqoop create the Hive tables).

If you create an HCatalog table stored as ORC in advance, you can write directly to Hive as ORC and gain performance and compression. So I suggest you try putting the data directly into Hive.
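
A rough sketch of that route, reusing names from this thread (the ORC table name and column lengths are illustrative, the column list is cut down to three of the 24 columns, and connection details are placeholders):

hive> CREATE TABLE odp_dw_may2016.dimsampledesc_orc (
    >   smapiname_ver varchar(200),
    >   smapicolname  varchar(200),
    >   charttype     int
    > ) STORED AS ORC;

sqoop import \
  --connect 'jdbc:sqlserver://<host>:1433;database=<db>' \
  --username <user> -P \
  --table DimSampleDesc \
  --hcatalog-database odp_dw_may2016 \
  --hcatalog-table dimsampledesc_orc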

Super Collaborator

I proceeded with varchar. What is your opinion? Shall I use string or varchar?

@Kaliyug Antagonist

You should use varchar as much as possible, because:

1. Silent truncation can occur with string.

2. string by default maps to varchar(32762); for smaller columns this might lead to performance issues.
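
For example, instead of letting every text column default to string, declared lengths would look like this (the table name and lengths are illustrative; size them from the source nvarchar definitions):

create table dimsampledesc_sized
(smapiname_ver varchar(200),
 smapicolname varchar(200));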

Thanks

Super Collaborator

I edited my original post with the Sqoop and Hive outputs; nvarchar is automatically converted to string. Can you help?

@Kaliyug Antagonist

Good question.

Try this as your Sqoop argument (--map-column-java takes Java types, so nvarchar columns map to String):

--map-column-java id=String,value=String

There is another option when creating Hive tables from Sqoop:

--map-column-hive id=VARCHAR(n),value=VARCHAR(n)

Choose your n; id and value above are placeholder column names.
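
Applied to the table in this thread, that could look something like the following (lengths are placeholders; size them from the SQL Server column definitions):

sqoop import \
  --connect 'jdbc:sqlserver://<host>:1433;database=<db>' \
  --username <user> -P \
  --table DimSampleDesc \
  --hive-import \
  --hive-table odp_dw_may2016.dimsampledesc \
  --map-column-hive X_Tick=VARCHAR(100),Y_Tick=VARCHAR(100),X_Symbol=VARCHAR(50)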

Thanks