Support Questions
Find answers, ask questions, and share your expertise

Hive STRING vs VARCHAR Performance

New Contributor

Hi,

We are prototyping a data warehouse solution on Hadoop/Hive. One of the issues that arose was whether we should use the equivalent datatypes, e.g. VARCHAR2 in Oracle converts to VARCHAR in Hive, NUMBER in Oracle to INT in Hive, DATE in Oracle to TIMESTAMP in Hive, etc.

I want to understand whether any performance issues arise if we use STRING for VARCHAR2 and BIGINT for NUMBER.

This would help us speed up the DDL process, since we could standardize on a fixed set of datatypes without loss of data or precision.
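To illustrate the kind of simplified mapping being asked about, here is a sketch (the table and column names are hypothetical, not from any real schema):

```sql
-- Hypothetical Oracle source: CUSTOMER(ID NUMBER, NAME VARCHAR2(100), CREATED DATE)
-- Simplified Hive DDL using only a fixed set of datatypes:
CREATE TABLE customer (
  id      BIGINT,     -- NUMBER   -> BIGINT (no precision/scale to maintain)
  name    STRING,     -- VARCHAR2 -> STRING (no length specifier to maintain)
  created TIMESTAMP   -- DATE     -> TIMESTAMP (Oracle DATE includes a time component)
);
```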

Regards

Upendra

1 ACCEPTED SOLUTION

Personal anecdotal evidence - with EXTERNAL (externally managed) tables pointing to text files, we saw no performance difference whether we used string, varchar, or char. With ORC files we always used String as the datatype, so I can't speak to any potential performance differences (except for vectorization - see below).

I would recommend string if at all possible - you are correct that it is very handy not to be limited by a length specifier. Even if the data coming in is only Varchar(30) in length, your ELT/ETL processing will not fail if you send in 31 characters while using a string datatype.

Also note that in Hive versions older than 0.14 (i.e. on HDP 2.1 and older), varchar/char will not use vectorization with your ORC files, whereas String can. That can be a big difference in query performance. Use string if possible. 😉
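If it helps, you can toggle vectorization and check whether it applies to a given query from the EXPLAIN plan (a quick sketch; the table here is hypothetical):

```sql
-- Vectorized execution applies to ORC-backed tables; on older Hive versions
-- it only kicks in for primitive types such as string/int, not varchar/char.
SET hive.vectorized.execution.enabled = true;
CREATE TABLE t_orc (id INT, val STRING) STORED AS ORC;
-- The plan shows "Execution mode: vectorized" when vectorization applies:
EXPLAIN SELECT count(*) FROM t_orc WHERE val = 'x';
```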

** Edited to add more info about varchar vs. string **

Here is an example of what happens when you push too much data into a varchar(20) field.

create table if not exists test_tbl (id int, stringval string, varchar20_val varchar(20)) ;

-- insert 20 characters of text into both fields
insert into test_tbl values (1, '01234567890123456789', '01234567890123456789'); 

-- insert 25 characters of text into the string field, 20 chars into the varchar(20) field
insert into test_tbl values (2, '0123456789012345678901234', '01234567890123456789'); 

-- insert 25 characters of text into both fields
insert into test_tbl values (3, '0123456789012345678901234', '0123456789012345678901234'); 

-- Note - row 3's varchar field has the last 5 characters silently truncated. This is bad.
select id, 
       length(stringval) as `length_string`, stringval, 
       length(varchar20_val) as `length_varchar`, varchar20_val 
from test_tbl; 

The output is as follows:

id length_string stringval                  length_varchar varchar20_val
1  20            01234567890123456789       20             01234567890123456789
2  25            0123456789012345678901234  20             01234567890123456789
3  25            0123456789012345678901234  20             01234567890123456789
Time taken: 1.584 seconds, Fetched: 3 row(s)

In row 3 you have lost the last 5 characters of data without warning. IMHO this is the most compelling reason to use String instead of Varchar.

View solution in original post

15 REPLIES

Rising Star

Hi Upendra,

The recommendation is to use VARCHAR and the integer types (TINYINT, SMALLINT, INT, BIGINT) wherever possible instead of String. In Hive, String is treated as VARCHAR(32762). So if you have data that is no more than 50 characters in length, using string incurs some overhead. The same applies to the integer types.

Hope this helps.

Interesting that there is an upper bound for string datatype. Thanks @Kashif Khan.

I am pretty sure that Hive Strings are not limited to 32k; I think the limit is something like 2GB. If there is a lower limit, it is probably specific to a client or something. But I will verify that when I come back.

The link below seems to corroborate that, and hive.apache.org also doesn't give a max.

http://stackoverflow.com/questions/35030936/is-there-maximum-size-of-string-data-type-in-hive

Also, since both Varchar and String are string values and use dictionaries, I am pretty sure the serialization will be pretty much identical. So why would Varchar be better? As I said, I don't know, but I would be surprised if there were a real difference. I assume VARCHAR is simply a layer on top of String that checks values during insert.
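One quick way to see that "layer on top of String" behavior is a cast, which truncates to the declared length (a minimal sketch):

```sql
-- Casting a 25-character literal to varchar(20) silently drops the tail:
SELECT cast('0123456789012345678901234' AS varchar(20));
-- returns '01234567890123456789' (truncated to 20 characters)
```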

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-StringColumnSe...


@Upendra Bhandari

The maximum length of a STRING is 2GB.

New Contributor

Hi,

Can you please confirm which datatype we should use in place of VARCHAR. Will STRING be OK? Can I confirm that STRING and VARCHAR performance will be the same?

Regards

How about you just try it? I am pretty sure it will be the same, but just make two CTAS tables and test it quickly.
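A minimal sketch of that quick CTAS comparison (the source table and column names here are hypothetical):

```sql
-- Build two copies of the same data, differing only in the column type,
-- then time an identical query against each.
CREATE TABLE t_string  STORED AS ORC AS
  SELECT cast(col AS string)      AS val FROM source_tbl;
CREATE TABLE t_varchar STORED AS ORC AS
  SELECT cast(col AS varchar(50)) AS val FROM source_tbl;

SELECT count(*) FROM t_string  WHERE val LIKE '%abc%';
SELECT count(*) FROM t_varchar WHERE val LIKE '%abc%';
```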

New Contributor

Hi,

We tried it and performance was pretty much the same - a difference of a couple of seconds. But if we can get official confirmation, we can move ahead without any worries for the production environment.

I see vectorization support for String only as a major push for going with the String datatype.

Regards

Hi @Upendra Bhandari. FYI I updated my original answer to add more info about why to use string instead of varchar.

New Contributor

@bpreachuk

@ "Also note that varchar/char will not use Vectorization with your ORC files, Whereas String can use vectorization."

Vectorization is supported for the char and varchar datatypes as per the link below:

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/query-vectorizat...

Thanks for the clarification @Vijayandra Mahadik. You are correct... it was implemented in Hive 0.14. My comments were out of date. I will correct my original statement above. https://issues.apache.org/jira/browse/HIVE-5760

New Contributor

What about BIGINT and INT?

If I have a column with BIGINT as the datatype and it holds a value in INT's range, will it take all 8 bytes or just 4 bytes of storage?


Mentor

@Gunjan Dhawas Please open this question as a new thread, a good practice on HCC is to create a separate question when an accepted answer is available.

Contributor

Something to consider is the downstream use of Hive as well. We used String initially (mainly because, for the data we were loading, we weren't given the data types, so String allowed us to ignore them). However, when we started using SAS, those String fields were all converted to varchar(32k) and caused headaches on that end. We converted to varchar(x).

Hi @Jeff Watson. You are correct about SAS use of String datatypes. Good catch! One of my customers also had to deal with this. String datatype conversions can perform very poorly in SAS.

With SAS/ACCESS to Hadoop you can set the libname option DBMAX_TEXT (added with SAS 9.4m1 release) to globally restrict the character length of all columns read into SAS.

However, for restricting column size, SAS specifically recommends using the VARCHAR datatype in Hive whenever possible.

http://support.sas.com/documentation/cdl/en/acreldb/67473/HTML/default/viewer.htm#n1aqglg4ftdj04n1ey...

Use Case

  • Large table, all columns of type String: Table A stored in Hive has 40 columns, all of type String, with 500M rows.
  • By default, SAS/ACCESS converts String to $32K, i.e. 32K in length for char.
  • The math for a table this size yields a 1.2MB row length x 500M rows.
  • This brings the system to a halt - too large to store in LASR or WORK.

The following techniques can be used to work around the challenge in SAS, and they all work:

  • Use char and varchar in Hive instead of String.
  • Set the libname option DBMAX_TEXT to globally restrict the character length of all columns read in.
  • In Hive, use ALTER TABLE ... SET TBLPROPERTIES to add "SASFMT" formats for SAS to the schema in Hive.
  • Add formatting to SAS code during inbound reads.
    • example: Sequence Length 8 Informat 10. format 10.
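The TBLPROPERTIES route in the list above looks roughly like this (the column name and length here are hypothetical; the property syntax follows the SAS/ACCESS documentation):

```sql
-- Tells SAS/ACCESS to read this Hive string column as CHAR(30)
-- instead of the default $32767 length:
ALTER TABLE test_tbl SET TBLPROPERTIES ('SASFMT:stringval'='CHAR(30)');
```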

I hope this helps.
