Created 05-04-2016 04:29 PM
Stack: HDP-2.3.2.0-2950 installed using Ambari 2.1.
The steps I am following:
I didn't use anything pertaining to charset/Unicode/UTF-8 while executing the Sqoop import commands.
While creating the Hive external table, I wondered which data type to select for the nvarchar columns of the original SQL Server table; now I am worried that this also needs to be addressed in Sqoop during the import.
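For context, the import command was along these lines (a sketch: the connection string and credentials are placeholders; the Avro format and target directory match the output shown below):

sqoop import \
  --connect 'jdbc:sqlserver://<host>:1433;database=<database>' \
  --username <user> --password <password> \
  --table DimSampleDesc \
  --as-avrodatafile \
  --target-dir /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc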
Edit: added further code and output below.
I now have a problem: Sqoop is automatically converting nvarchar to string. In the Avro schema below, the nvarchar columns appear with sqlType -9 (java.sql.Types.NVARCHAR) but are typed as Avro string.
[sqoop@l1038lab root]$ hadoop fs -cat /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc
{
  "type" : "record",
  "name" : "DimSampleDesc",
  "doc" : "Sqoop import of DimSampleDesc",
  "fields" : [
    { "name" : "SmapiName_ver", "type" : [ "null", "string" ], "default" : null, "columnName" : "SmapiName_ver", "sqlType" : "12" },
    { "name" : "SmapiColName", "type" : [ "null", "string" ], "default" : null, "columnName" : "SmapiColName", "sqlType" : "12" },
    { "name" : "ChartType", "type" : [ "null", "int" ], "default" : null, "columnName" : "ChartType", "sqlType" : "4" },
    { "name" : "X_Indexet", "type" : [ "null", "int" ], "default" : null, "columnName" : "X_Indexet", "sqlType" : "4" },
    { "name" : "Y_Indexet", "type" : [ "null", "int" ], "default" : null, "columnName" : "Y_Indexet", "sqlType" : "4" },
    { "name" : "X_Tick", "type" : [ "null", "string" ], "default" : null, "columnName" : "X_Tick", "sqlType" : "-9" },
    { "name" : "Y_Tick", "type" : [ "null", "string" ], "default" : null, "columnName" : "Y_Tick", "sqlType" : "-9" },
    { "name" : "X_TickRange", "type" : [ "null", "string" ], "default" : null, "columnName" : "X_TickRange", "sqlType" : "-9" },
    { "name" : "X_TickRangeFrom", "type" : [ "null", "string" ], "default" : null, "columnName" : "X_TickRangeFrom", "sqlType" : "-9" },
    { "name" : "X_TickRangeTom", "type" : [ "null", "string" ], "default" : null, "columnName" : "X_TickRangeTom", "sqlType" : "-9" },
    { "name" : "Y_TickRange", "type" : [ "null", "string" ], "default" : null, "columnName" : "Y_TickRange", "sqlType" : "-9" },
    { "name" : "Y_TickRangeFrom", "type" : [ "null", "string" ], "default" : null, "columnName" : "Y_TickRangeFrom", "sqlType" : "-9" },
    { "name" : "Y_TickRangeTom", "type" : [ "null", "string" ], "default" : null, "columnName" : "Y_TickRangeTom", "sqlType" : "-9" },
    { "name" : "IndexCount", "type" : [ "null", "int" ], "default" : null, "columnName" : "IndexCount", "sqlType" : "4" },
    { "name" : "X_IndexCount", "type" : [ "null", "int" ], "default" : null, "columnName" : "X_IndexCount", "sqlType" : "4" },
    { "name" : "Y_IndexCount", "type" : [ "null", "int" ], "default" : null, "columnName" : "Y_IndexCount", "sqlType" : "4" },
    { "name" : "X_Symbol", "type" : [ "null", "string" ], "default" : null, "columnName" : "X_Symbol", "sqlType" : "-9" },
    { "name" : "X_SymbolName", "type" : [ "null", "string" ], "default" : null, "columnName" : "X_SymbolName", "sqlType" : "-9" },
    { "name" : "X_SymbolDescr", "type" : [ "null", "string" ], "default" : null, "columnName" : "X_SymbolDescr", "sqlType" : "-9" },
    { "name" : "Y_Symbol", "type" : [ "null", "string" ], "default" : null, "columnName" : "Y_Symbol", "sqlType" : "-9" },
    { "name" : "Y_SymbolName", "type" : [ "null", "string" ], "default" : null, "columnName" : "Y_SymbolName", "sqlType" : "-9" },
    { "name" : "Y_SymbolDescr", "type" : [ "null", "string" ], "default" : null, "columnName" : "Y_SymbolDescr", "sqlType" : "-9" },
    { "name" : "SmapiName", "type" : [ "null", "string" ], "default" : null, "columnName" : "SmapiName", "sqlType" : "12" },
    { "name" : "Incorrect_Ver_FL", "type" : [ "null", "boolean" ], "default" : null, "columnName" : "Incorrect_Ver_FL", "sqlType" : "-7" }
  ],
  "tableName" : "DimSampleDesc"
}
[sqoop@l1038lab root]$
hive> CREATE EXTERNAL TABLE DimSampleDesc
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > TBLPROPERTIES ('avro.schema.url'='hdfs://l1031lab.sss.se.com:8020/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc');
OK
Time taken: 0.37 seconds
hive>
hive (odp_dw_may2016)> describe formatted dimsampledesc;
OK
# col_name              data_type       comment
smapiname_ver           string
smapicolname            string
charttype               int
x_indexet               int
y_indexet               int
x_tick                  string
y_tick                  string
x_tickrange             string
x_tickrangefrom         string
x_tickrangetom          string
y_tickrange             string
y_tickrangefrom         string
y_tickrangetom          string
indexcount              int
x_indexcount            int
y_indexcount            int
x_symbol                string
x_symbolname            string
x_symboldescr           string
y_symbol                string
y_symbolname            string
y_symboldescr           string
smapiname               string
incorrect_ver_fl        boolean

# Detailed Table Information
Database:               odp_dw_may2016
Owner:                  hive
CreateTime:             Mon May 09 16:37:08 CEST 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://l1031lab.sss.se.com:8020/apps/hive/warehouse/odp_dw_may2016.db/dimsampledesc
Table Type:             EXTERNAL_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        EXTERNAL                TRUE
        avro.schema.url         hdfs://l1031lab.sss.se.com:8020/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc
        numFiles                6
        numRows                 -1
        rawDataSize             -1
        totalSize               8566342
        transient_lastDdlTime   1462804628

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.avro.AvroSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.441 seconds, Fetched: 56 row(s)
hive (odp_dw_may2016)>
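One way to spot-check whether the accented characters in the former nvarchar columns survived the import (the column choice here is just illustrative):

hive (odp_dw_may2016)> select smapiname_ver, x_tick from dimsampledesc limit 5;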
What is the risk here, and how do I proceed?
Created 05-06-2016 04:34 PM
Unicode file:
[root@test test]# pwd
/root/test
[root@test test]# cat xyz
Les caractères accentués (Français)
En données nous avons confiance
Données, données, partout
et tous les noeuds étaient déconnecté
Données, données, partout
[root@test test]#
External table DDL:
create external table demand_db.unicode
(data string)
COMMENT 'External table for data cleansing'
LOCATION '/tmp/test/';
External table location:
[root@test ~]# hdfs dfs -mkdir -p /tmp/test
[root@test ~]# hdfs dfs -chmod -R 777 /tmp/test
[root@test ~]# hdfs dfs -ls /tmp
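The copy of the local file into the table location is not shown above; it would have been something like:

[root@test ~]# hdfs dfs -put /root/test/xyz /tmp/test/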
Output:
hive> create external table unicode
> (data string)
> COMMENT 'External table for data cleansing'
> LOCATION '/tmp/test/';
OK
Time taken: 0.502 seconds
hive> select * from unicode;
OK
Les caractères accentués (Français)
En données nous avons confiance
Données, données, partout
et tous les noeuds étaient déconnecté
Données, données, partout
Time taken: 0.897 seconds, Fetched: 8 row(s)
hive>
Conclusion: You do not need to convert the Unicode character set. Also, string works perfectly in this case.
Thanks
Created 05-04-2016 10:16 PM
By default Hive is UTF-8, so I would recommend doing a Sqoop import into a Hive table directly: define the table DDL and import into the table. Yes, you can use string for nvarchar.
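A sketch of such a direct-to-Hive import (connection details are placeholders; database and table names are taken from the outputs above):

sqoop import \
  --connect 'jdbc:sqlserver://<host>:1433;database=<database>' \
  --username <user> --password <password> \
  --table DimSampleDesc \
  --hive-import \
  --hive-database odp_dw_may2016 \
  --hive-table DimSampleDesc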
Created 05-06-2016 06:36 AM
Hive has both varchar and string data types; which one should be used in the case of nvarchar?
I am also in a dilemma about whether to import directly into Hive or to land the data in HDFS first and then load it into Hive; maybe I will try both.
Created 05-06-2016 06:42 AM
By default, Sqoop will directly import nvarchar to string (if you let Sqoop create the Hive tables).
If you create an HCatalog table stored as ORC in advance, you can write directly to Hive as ORC and gain performance and compression. So I suggest you try importing directly into Hive, as sketched below.
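A sketch of that approach (connection details are placeholders; the column list is abbreviated and the varchar lengths are assumptions):

hive> create table odp_dw_may2016.dimsampledesc_orc
    > (smapiname_ver varchar(100), smapicolname varchar(100), charttype int)
    > stored as orc;

sqoop import \
  --connect 'jdbc:sqlserver://<host>:1433;database=<database>' \
  --username <user> --password <password> \
  --table DimSampleDesc \
  --hcatalog-database odp_dw_may2016 \
  --hcatalog-table dimsampledesc_orc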
Created 05-09-2016 11:39 AM
I proceeded with varchar. What is your opinion: shall I use string or varchar?
Created 05-10-2016 05:26 AM
@Kaliyug Antagonist
You should use varchar as much as possible, because:
1. Silent truncation with string.
2. String by default maps to varchar(32762); for smaller columns this might lead to performance issues.
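For example, if a source column is nvarchar(50), the matching Hive definition would be sized the same way (table and column names here are illustrative):

hive> create table samples (smapi_name varchar(50));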
Thanks
Created 05-10-2016 08:38 AM
I edited my original post with the Sqoop and Hive outputs - the nvarchar is automatically converted to string - can you help?
Created 05-10-2016 11:20 PM
@Kaliyug Antagonist
Good question.
Try this in your Sqoop arguments (here "id" and "value" stand for your column names; --map-column-java takes Java types, so the nvarchar columns map to String):
--map-column-java id=String,value=String
There is another option while creating Hive tables from Sqoop, which takes Hive types:
--map-column-hive id=VARCHAR(n),value=VARCHAR(n)
Choose your n.
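Applied to this import, that could look like the following (column names are from the Avro schema above; connection details are placeholders and the lengths are assumptions to be adjusted to the source table):

sqoop import \
  --connect 'jdbc:sqlserver://<host>:1433;database=<database>' \
  --username <user> --password <password> \
  --table DimSampleDesc \
  --hive-import \
  --hive-database odp_dw_may2016 \
  --hive-table DimSampleDesc \
  --map-column-hive 'X_Tick=VARCHAR(50),Y_Tick=VARCHAR(50)'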
Thanks