String using the VARCHAR type is silently truncated in an ORC table, but the issue is not seen with a textfile table

Creating an ORC table with a varchar(11) column and inserting a value containing a special character:

create table special_char (col1 varchar(11)) stored as orc;
insert into special_char values('1ºTrimestre');
select * from special_char;
+--------------------+--+
| special_char.col1  |
+--------------------+--+
| 1ºTrimestr         |
+--------------------+--+

Creating a textfile table:

create table special_char_text (col1 varchar(11)) stored as textfile;
insert into special_char_text values('1ºTrimestre');
select * from special_char_text;
+-------------------------+--+
| special_char_text.col1  |
+-------------------------+--+
| 1ºTrimestre             |
+-------------------------+--+

Can someone explain why the value in the ORC table's column is truncated?

7 REPLIES

Hi @Hugo Almeida!
I guess you're from Brazil, right? I can tell from your insert value (it's in Portuguese hehe).
Unfortunately I wasn't able to reproduce this issue on Hive (see my test below).

0: jdbc:hive2://node3:10000/default> create table special_char (x varchar(11)) stored as orc;
No rows affected (0.393 seconds)
0: jdbc:hive2://node3:10000/default> insert into special_char values('1ºTrimestre');
INFO  : Session is already open
INFO  : Dag name: insert into special_c...alues('1ºTrimestre')(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1529333672124_0002)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 5.75 s     
--------------------------------------------------------------------------------
INFO  : Loading data to table default.special_char from hdfs://Admin-TrainingNS/apps/hive/warehouse/special_char/.hive-staging_hive_2018-06-19_14-41-46_630_2155266942579102600-1/-ext-10000
INFO  : Table default.special_char stats: [numFiles=1, numRows=1, totalSize=260, rawDataSize=96]
No rows affected (6.689 seconds)
0: jdbc:hive2://node3:10000/default> select * from special_char;
+-----------------+--+
| special_char.x  |
+-----------------+--+
| 1ºTrimestre     |
+-----------------+--+
1 row selected (0.109 seconds)
But AFAIK, internally the varchar receives a string value, and if the string doesn't fit within the declared length, it will be silently truncated.
There's a better and clearer explanation than mine in the Hive cWiki:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-varchar
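
For example, the documented character-based truncation is easy to see with a plain cast (a minimal sketch; FROM-less SELECTs work on Hive 0.13 and later):

-- A 12-character ASCII string cast to varchar(11) keeps only the first 11
-- characters, per the cWiki semantics:
select cast('abcdefghijkl' as varchar(11));  -- expected: abcdefghijk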

PS: I'm not sure if this is your case, or whether we're running the same version.
Hope this helps!

Hi @Vinicius Higa Murakami, I'm from Portugal 😄 Thanks for trying to repro the issue.

There's something different; can you also check these on your side:

hive --version

Hive 1.2.1000.2.6.1.0-129
describe extended special_char;

| Detailed Table Information  | Table(tableName:special_char, dbName:ustemp, owner:hive, createTime:1529404515, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col1, type:varchar(11), comment:null)], location:adl://azuredatalakestoretest.azuredatalakestore.net/apps/hive/warehouse/ustemp.db/special_char, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{totalSize=246, numRows=1, rawDataSize=95, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=1, transient_lastDdlTime=1529404574}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)  |          |

I have to say we're using Azure Data Lake Store as a data repository for Hadoop.

Hi @Hugo Almeida!
Ahh, glad to hear that! I'm from Brazil, but living in Chile hehe 🙂
So regarding your Hive issue, we're running almost the same version of Hive:

[root@node3 ~]# hive --version
Hive 1.2.1000.2.6.5.0-292

Could you recreate the ORC table with a bigger varchar, like varchar(12)?
My guess would be that your special character takes more than one byte, and somewhere in the ORC path the varchar length is being enforced in bytes rather than characters.
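A quick hedged check of the byte count, using the encode() and hex() UDFs (available since Hive 0.12):

-- '1ºTrimestre' is 11 characters but 12 bytes, because 'º' (U+00BA) encodes
-- as two bytes (0xC2 0xBA) in UTF-8:
select length('1ºTrimestre');                -- expected: 11 (characters)
select hex(encode('1ºTrimestre', 'UTF-8'));  -- expected: 31C2BA5472696D6573747265 (12 bytes)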
Anyway, I also ran some other tests against this special character:

hive> select md5('º') from special_char;
OK
91f9bcb8e28b5cb34d63c456efe3a29c
Time taken: 0.077 seconds, Fetched: 1 row(s)
hive> select md5('º') from special_char_text;
OK
91f9bcb8e28b5cb34d63c456efe3a29c
Time taken: 0.064 seconds, Fetched: 1 row(s)
hive> select ascii('º') from special_char_text;
OK
-62
Time taken: 0.07 seconds, Fetched: 1 row(s)
hive> select ascii('º') from special_char;
OK
-62
Time taken: 0.072 seconds, Fetched: 1 row(s)
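That -62 is itself a hint: ascii() is returning the first byte of the string as a signed value, and the UTF-8 lead byte of 'º' is 0xC2 = 194, which reads as -62 when interpreted as a signed byte. A small sketch of the arithmetic:

-- 0xC2 (the first UTF-8 byte of 'º') is 194 unsigned; as a signed byte,
-- 194 - 256 = -62, matching the output above:
select ascii('º') + 256;  -- expected: 194, i.e. 0xC2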
Hope this helps!

Olá @Vinicius Higa Murakami

Yes, you're right: my special character consumes more than one byte, and with varchar(12) I don't have any problem.
I also checked the ascii() and md5() functions, and they return the same values for both tables.

Did you find any differences with the describe extended command?
Thanks

Olá @Hugo Almeida!
Hm, gotcha.
Yeah, our outputs from describe extended are practically the same.
Could you try to execute the following command, please?

[hive@node3 ~]$ hive --orcfiledump hdfs://Admin-TrainingNS/apps/hive/warehouse/special_char/000000_0 
Processing data file hdfs://Admin-TrainingNS/apps/hive/warehouse/special_char/000000_0 [length: 262]
Structure for hdfs://Admin-TrainingNS/apps/hive/warehouse/special_char/000000_0
File Version: 0.12 with HIVE_13083
18/06/21 18:10:50 INFO orc.ReaderImpl: Reading ORC rows from hdfs://Admin-TrainingNS/apps/hive/warehouse/special_char/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
18/06/21 18:10:50 INFO orc.RecordReaderImpl: Reader schema not provided -- using file schema struct<_col0:varchar(11)>
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:varchar(11)>
Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false min: 1ºTrimestre max: 1ºTrimestre sum: 12
File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false min: 1ºTrimestre max: 1ºTrimestre sum: 12
Stripes:
  Stripe: offset: 3 data: 21 rows: 1 tail: 46 index: 49
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 38
    Stream: column 1 section DATA start: 52 length 15
    Stream: column 1 section LENGTH start: 67 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
File length: 262 bytes
Padding length: 0 bytes
Padding ratio: 0%
_____________________

Note how Column 1 shows sum: 12, which lines up with the 12-byte UTF-8 length of the full string.
Hope this helps!

Hi @Vinicius Higa Murakami

We have different outputs: your dump shows min/max of 1ºTrimestre with sum: 12, while mine shows 1ºTrimestr with sum: 11, so the value is already truncated inside the ORC file itself.
I still don't know why this is happening. Any clues?

File Version: 0.12 with HIVE_13083
18/06/22 14:05:55 INFO orc.ReaderImpl: Reading ORC rows from adl://test.azuredatalakestore.net/apps/hive/warehouse/ustemp.db/special_char/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
18/06/22 14:05:55 INFO orc.RecordReaderImpl: Reader schema not provided -- using file schema struct<col1:varchar(11)>
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<col1:varchar(11)>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false min: 1ºTrimestr max: 1ºTrimestr sum: 11

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false min: 1ºTrimestr max: 1ºTrimestr sum: 11

Stripes:
  Stripe: offset: 3 data: 20 rows: 1 tail: 35 index: 48
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 37
    Stream: column 1 section DATA start: 51 length 14
    Stream: column 1 section LENGTH start: 65 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 246 bytes
Padding length: 0 bytes
Padding ratio: 0%

Hey @Hugo Almeida!
Hmm, that's really strange 😞
Let's attack the ORC behaviour: could you check your ORC version?
Here's mine:

/usr/hdp/2.6.5.0-292/hive2/lib/hive-orc-2.1.0.2.6.5.0-292.jar

BTW, I'm just concerned that this issue may be blocking your job. If so, you can try changing your datatype to string (which is generally recommended anyway), along the lines of the sketch below.
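
A minimal sketch of that workaround (the table name special_char_str is just illustrative):

-- string has no declared maximum length, so nothing gets truncated no matter
-- how many bytes each character needs:
create table special_char_str (col1 string) stored as orc;
insert into special_char_str values ('1ºTrimestre');
select * from special_char_str;  -- expected: 1ºTrimestre, untruncated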

Ah, one last thing: are you using Hive on Tez or Hive on MR? Because I ran my tests with Hive on Tez.
Hope this helps!