Created 06-19-2018 12:24 PM
Creating an ORC table with a varchar(11) column and inserting a value with a special character:
create table special_char (col1 varchar(11)) stored as orc;
insert into special_char values('1ºTrimestre');
select * from special_char;
+--------------------+--+
| special_char.col1  |
+--------------------+--+
| 1ºTrimestr         |
+--------------------+--+
Creating a textfile table for comparison:
create table special_char_text (col1 varchar(11));
insert into special_char_text values('1ºTrimestre');
select * from special_char_text;
+--------------------------+--+
| special_char_text.col1   |
+--------------------------+--+
| 1ºTrimestre              |
+--------------------------+--+
Can someone explain why the value in the ORC table's column is truncated?
Created 06-19-2018 02:50 PM
Hi @Hugo Almeida!
Guess you're from Brazil, right? As I can see from your insert value (it's in Portuguese hehe).
Unfortunately I wasn't able to reproduce this issue on Hive (see my test below).
0: jdbc:hive2://node3:10000/default> create table special_char (x varchar(11)) stored as orc;
No rows affected (0.393 seconds)
0: jdbc:hive2://node3:10000/default> insert into special_char values('1ºTrimestre');
INFO : Session is already open
INFO : Dag name: insert into special_c...alues('1ºTrimestre')(Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1529333672124_0002)
--------------------------------------------------------------------------------
VERTICES      STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ....... SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 5.75 s
--------------------------------------------------------------------------------
INFO : Loading data to table default.special_char from hdfs://Admin-TrainingNS/apps/hive/warehouse/special_char/.hive-staging_hive_2018-06-19_14-41-46_630_2155266942579102600-1/-ext-10000
INFO : Table default.special_char stats: [numFiles=1, numRows=1, totalSize=260, rawDataSize=96]
No rows affected (6.689 seconds)
0: jdbc:hive2://node3:10000/default> select * from special_char;
+-----------------+--+
| special_char.x  |
+-----------------+--+
| 1ºTrimestre     |
+-----------------+--+
1 row selected (0.109 seconds)

But AFAIK, internally the varchar column receives a string value, and if the string can't fit into the varchar length, it gets truncated.
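To see where the extra byte could come from, you can compare the character count with the raw bytes. A quick sketch (octet_length() only exists in newer Hive releases, so I'm using hex() here, which dumps the UTF-8 bytes of a string; the expected outputs in the comments are what I'd anticipate, not something I ran on your cluster):

select length('1ºTrimestre');   -- should be 11 characters
select hex('1ºTrimestre');      -- expect something like 31C2BA5472696D6573747265, i.e. 12 bytes, because 'º' takes 2 bytes in UTF-8

So an 11-character value can still be 12 bytes on disk.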
PS: I'm not sure if this is your case, or if we're running the same version.
Hope this helps!
Created 06-20-2018 10:28 AM
Hi @Vinicius Higa Murakami, I'm from Portugal 😄 Thanks for trying to repro the issue.
There's something different; can you also check these:
hive --version
Hive 1.2.1000.2.6.1.0-129
describe extended special_char;
| Detailed Table Information | Table(tableName:special_char, dbName:ustemp, owner:hive, createTime:1529404515, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col1, type:varchar(11), comment:null)], location:adl://azuredatalakestoretest.azuredatalakestore.net/apps/hive/warehouse/ustemp.db/special_char, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{totalSize=246, numRows=1, rawDataSize=95, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=1, transient_lastDdlTime=1529404574}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE) |
I have to say we're using Azure Data Lake Store as a data repository for Hadoop.
Created 06-21-2018 12:47 AM
Hi @Hugo Almeida!
Ahh glad to hear that! I'm from Brazil, but living in Chile hehe 🙂
So regarding your hive issue, we're running almost the same version of hive:
[root@node3 ~]# hive --version
Hive 1.2.1000.2.6.5.0-292
Could you recreate the ORC table with a bigger varchar, like 12?
My guess would be that your special character is somehow consuming more space than expected in the ORC file.
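Something like this, just as a sketch (the new table name is only an example):

create table special_char_12 (col1 varchar(12)) stored as orc;
insert into special_char_12 values('1ºTrimestre');
select * from special_char_12;   -- check whether the value survives intact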
Anyway, I also ran some other tests against this special char:
hive> select md5('º') from special_char;
OK
91f9bcb8e28b5cb34d63c456efe3a29c
Time taken: 0.077 seconds, Fetched: 1 row(s)
hive> select md5('º') from special_char_text;
OK
91f9bcb8e28b5cb34d63c456efe3a29c
Time taken: 0.064 seconds, Fetched: 1 row(s)
hive> select ascii('º') from special_char_text;
OK
-62
Time taken: 0.07 seconds, Fetched: 1 row(s)
hive> select ascii('º') from special_char;
OK
-62
Time taken: 0.072 seconds, Fetched: 1 row(s)

Hope this helps!
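PS: that -62 is actually a hint. ascii() only returns the first byte of the string as a signed value, and 0xC2, the first of the two UTF-8 bytes of 'º', is -62 when interpreted as a signed byte. If you want to double-check, a small sketch (the expected values in the comments are my assumption):

select hex('º');     -- expect C2BA, i.e. 'º' occupies 2 bytes
select ascii('º');   -- -62, the signed value of 0xC2, the first byte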
Created 06-21-2018 09:43 AM
Yes, you're right: my special character is consuming more than it should in the ORC file. With varchar(12) I don't have any problem.
I also checked the ascii and md5 functions and they return the same values for both tables.
Did you find any differences with the describe extended command?
Thanks
Created 06-21-2018 06:15 PM
Olá @Hugo Almeida!
Hm, gotcha.
Yeah, our outputs from describe extended are basically the same.
Could you try to execute the following command, please?
[hive@node3 ~]$ hive --orcfiledump hdfs://Admin-TrainingNS/apps/hive/warehouse/special_char/000000_0
Processing data file hdfs://Admin-TrainingNS/apps/hive/warehouse/special_char/000000_0 [length: 262]
Structure for hdfs://Admin-TrainingNS/apps/hive/warehouse/special_char/000000_0
File Version: 0.12 with HIVE_13083
18/06/21 18:10:50 INFO orc.ReaderImpl: Reading ORC rows from hdfs://Admin-TrainingNS/apps/hive/warehouse/special_char/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
18/06/21 18:10:50 INFO orc.RecordReaderImpl: Reader schema not provided -- using file schema struct<_col0:varchar(11)>
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:varchar(11)>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false min: 1ºTrimestre max: 1ºTrimestre sum: 12

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false min: 1ºTrimestre max: 1ºTrimestre sum: 12

Stripes:
  Stripe: offset: 3 data: 21 rows: 1 tail: 46 index: 49
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 38
    Stream: column 1 section DATA start: 52 length 15
    Stream: column 1 section LENGTH start: 67 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 262 bytes
Padding length: 0 bytes
Padding ratio: 0%
Hope this helps!
Created 06-22-2018 01:45 PM
We have different outputs.
I still don't know why this is happening. Any clues?
File Version: 0.12 with HIVE_13083
18/06/22 14:05:55 INFO orc.ReaderImpl: Reading ORC rows from adl://test.azuredatalakestore.net/apps/hive/warehouse/ustemp.db/special_char/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
18/06/22 14:05:55 INFO orc.RecordReaderImpl: Reader schema not provided -- using file schema struct<col1:varchar(11)>
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<col1:varchar(11)>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false min: 1ºTrimestr max: 1ºTrimestr sum: 11

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false min: 1ºTrimestr max: 1ºTrimestr sum: 11

Stripes:
  Stripe: offset: 3 data: 20 rows: 1 tail: 35 index: 48
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 37
    Stream: column 1 section DATA start: 51 length 14
    Stream: column 1 section LENGTH start: 65 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 246 bytes
Padding length: 0 bytes
Padding ratio: 0%
Created 06-22-2018 04:11 PM
Hey @Hugo Almeida!
Hm that's totally strange 😞
Let's dig into the ORC behaviour; could you check your ORC version?
Here's mine.
/usr/hdp/2.6.5.0-292/hive2/lib/hive-orc-2.1.0.2.6.5.0-292.jar
BTW, I'm just wondering whether this issue is blocking your job. If so, you can try changing your datatype to string (which is generally recommended anyway).
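For example, a rough sketch (table and column names are only placeholders based on your describe output):

create table special_char_str (col1 string) stored as orc;
-- reload from a source that still has the full value (e.g. your textfile table), not from the already-truncated ORC table
insert into special_char_str select col1 from special_char_text;
select * from special_char_str;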
Ah, one last thing: are you using Hive on Tez or Hive on MR? Because I did my test with Hive on Tez.
Hope this helps