Support Questions
Find answers, ask questions, and share your expertise

Hive STRING vs VARCHAR Performance

New Contributor

Hi,

We are prototyping a data warehouse solution on Hadoop/Hive. One of the issues that arose was whether we should use the equivalent datatypes, e.g. VARCHAR2 in Oracle converts to VARCHAR in Hive, NUMBER in Oracle to INT in Hive, DATE in Oracle to TIMESTAMP in Hive, etc.

I want to understand whether any performance issues arise if we use STRING for VARCHAR2 and BIGINT for NUMBER.

This would help us speed up the DDL process, since we could standardize on a fixed set of datatypes without loss of data or precision.
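To illustrate the kind of simplified mapping being asked about, here is a sketch (the table and column names are hypothetical, not from any real schema):

```sql
-- Hypothetical Oracle source: CUSTOMER(ID NUMBER, NAME VARCHAR2(100), CREATED DATE)
-- Simplified Hive DDL using only a fixed set of datatypes:
CREATE TABLE customer (
  id      BIGINT,     -- NUMBER   -> BIGINT (no precision/scale to maintain)
  name    STRING,     -- VARCHAR2 -> STRING (no length specifier to maintain)
  created TIMESTAMP   -- DATE     -> TIMESTAMP (Oracle DATE includes a time component)
);
```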

Regards

Upendra

1 ACCEPTED SOLUTION

Personal anecdotal evidence - with EXTERNAL (externally managed) tables pointing to text files, we saw no performance difference whether we used string, varchar, or char. With ORC files we always used String as the datatype, so I can't speak to any potential performance differences (except for vectorization - see below).

I would recommend string if at all possible - you are correct that it is very handy not to be limited by a length specifier. Even if the data coming in is only Varchar(30) in length, your ELT/ETL processing will not fail if you send in 31 characters while using a string datatype.

Also note that in Hive versions older than 0.14 (i.e. on HDP 2.1 and older), varchar/char will not use vectorization with your ORC files, whereas String can. That can be a big difference in query performance. Use string if possible. 😉
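If it helps, you can toggle vectorization and check whether it applies to a given query from the EXPLAIN plan (a quick sketch; the table here is hypothetical):

```sql
-- Vectorized execution applies to ORC-backed tables; on older Hive versions
-- it only kicks in for primitive types such as string/int, not varchar/char.
SET hive.vectorized.execution.enabled = true;
CREATE TABLE t_orc (id INT, val STRING) STORED AS ORC;
-- The plan shows "Execution mode: vectorized" when vectorization applies:
EXPLAIN SELECT count(*) FROM t_orc WHERE val = 'x';
```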

** Edited to add more info about varchar vs. string **

Here is an example of what happens when you push too much data into a varchar(20) field.

create table if not exists test_tbl (id int, stringval string, varchar20_val varchar(20)) ;

-- insert 20 characters of text into both fields
insert into test_tbl values (1, '01234567890123456789', '01234567890123456789'); 

-- insert 25 characters of text into the string field, 20 chars into the varchar(20) field
insert into test_tbl values (2, '0123456789012345678901234', '01234567890123456789'); 

-- insert 25 characters of text into both fields
insert into test_tbl values (3, '0123456789012345678901234', '0123456789012345678901234'); 

-- Note - row 3's varchar field has the last 5 characters silently truncated. This is bad.
select id, 
       length(stringval) as `length_string`, stringval, 
       length(varchar20_val) as `length_varchar`, varchar20_val 
from test_tbl; 

The output is as follows:

id length_string stringval                  length_varchar varchar20_val
1  20            01234567890123456789       20             01234567890123456789
2  25            0123456789012345678901234  20             01234567890123456789
3  25            0123456789012345678901234  20             01234567890123456789
Time taken: 1.584 seconds, Fetched: 3 row(s)

In row 3 you have lost the last 5 characters of data without warning. IMHO this is the most compelling reason to use String instead of Varchar.

View solution in original post

15 REPLIES

Rising Star

Hi Upendra,

The recommendation is to use VARCHAR and the integer types (TINYINT, SMALLINT, INT, BIGINT) wherever possible instead of String. In Hive, String is treated as VARCHAR(32762). So if you have data that is no more than 50 characters in length, using string incurs some overhead. The same applies to the integer types.

Hope this helps.

Interesting that there is an upper bound for string datatype. Thanks @Kashif Khan.

I am pretty sure that Hive Strings are not limited to 32k; I think the limit is something like 2GB. If there is a lower limit, it is probably specific to a client or something. But I will verify that when I come back.

The link below seems to corroborate that, and hive.apache.org also doesn't give a max.

http://stackoverflow.com/questions/35030936/is-there-maximum-size-of-string-data-type-in-hive

Also, since both Varchar and String are string values and use dictionaries, I am pretty sure the serialization will be pretty much identical. So why would Varchar be better? As I said, I don't know, but I would be surprised if there were a real difference. I assume VARCHAR is simply a layer on top of String that checks values during insert.
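One quick way to see that "layer on top of String" behavior is a cast, which truncates to the declared length (a minimal sketch):

```sql
-- Casting a 25-character literal to varchar(20) silently drops the tail:
SELECT cast('0123456789012345678901234' AS varchar(20));
-- returns '01234567890123456789' (truncated to 20 characters)
```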

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-StringColumnSe...


@Upendra Bhandari

The maximum length of a STRING is 2GB.

New Contributor

Hi,

Can you please confirm which datatype we should use in place of VARCHAR. Will STRING be OK? Can I confirm that STRING and VARCHAR performance will be the same?

Regards

How about you just try it? I am pretty sure it will be the same, but just make two CTAS tables and test it quickly.
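A minimal sketch of that quick CTAS comparison (the source table and column names here are hypothetical):

```sql
-- Build two copies of the same data, differing only in the column type,
-- then time an identical query against each.
CREATE TABLE t_string  STORED AS ORC AS
  SELECT cast(col AS string)      AS val FROM source_tbl;
CREATE TABLE t_varchar STORED AS ORC AS
  SELECT cast(col AS varchar(50)) AS val FROM source_tbl;

SELECT count(*) FROM t_string  WHERE val LIKE '%abc%';
SELECT count(*) FROM t_varchar WHERE val LIKE '%abc%';
```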

New Contributor

Hi,

We tried it and performance was pretty much the same - a difference of a couple of seconds. But if we can get official confirmation, we can move ahead without any worries for the production environment.

I see vectorization support for String only as a major push for going with the String datatype.

Regards

Hi @Upendra Bhandari. FYI I updated my original answer to add more info about why to use string instead of varchar.

New Contributor

@bpreachuk

@ "Also note that varchar/char will not use Vectorization with your ORC files, Whereas String can use vectorization."

Vectorization is supported for the char and varchar datatypes as per the link below:

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/query-vectorizat...

Thanks for the clarification @Vijayandra Mahadik. You are correct... it was implemented in Hive 0.14. My comments were out of date. I will correct my original statement above. https://issues.apache.org/jira/browse/HIVE-5760

New Contributor

What about BIGINT and INT?

If I have a column with BIGINT as the datatype and it holds a value in INT's range, will it take all 8 bytes or just 4 bytes of storage?


Mentor

@Gunjan Dhawas Please open this question as a new thread, a good practice on HCC is to create a separate question when an accepted answer is available.

Contributor

Something to consider is the downstream use of Hive as well. We used String initially (mainly because, for the data we were loading, we weren't given the data types, so String allowed us to ignore them). However, when we started using SAS, those String fields were all converted to varchar(32k) and caused headaches on that end. We converted to varchar(x).

Hi @Jeff Watson. You are correct about SAS use of String datatypes. Good catch! One of my customers also had to deal with this. String datatype conversions can perform very poorly in SAS.

With SAS/ACCESS to Hadoop you can set the libname option DBMAX_TEXT (added with SAS 9.4m1 release) to globally restrict the character length of all columns read into SAS.

However, for restricting column size, SAS specifically recommends using the VARCHAR datatype in Hive whenever possible.

http://support.sas.com/documentation/cdl/en/acreldb/67473/HTML/default/viewer.htm#n1aqglg4ftdj04n1ey...

Use Case

  • Large table, all columns of type String: Table A stored in Hive has 40 columns, all of type String, with 500M rows.
  • By default, SAS/ACCESS converts String to $32K, i.e. 32K in length for char.
  • The math for a table this size yields a 1.2MB row length x 500M rows.
  • This brings the system to a halt - too large to store in LASR or WORK.

The following techniques can be used to work around the challenge in SAS, and they all work:

  • Use char and varchar in Hive instead of String.
  • Set the libname option DBMAX_TEXT to globally restrict the character length of all columns read in.
  • In Hive, use ALTER TABLE ... SET TBLPROPERTIES to add "SASFMT" formats for SAS to the schema in Hive.
  • Add formatting to SAS code during inbound reads.
    • example: Sequence Length 8 Informat 10. format 10.
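The TBLPROPERTIES route in the list above looks roughly like this (the column name and length here are hypothetical; the property syntax follows the SAS/ACCESS documentation):

```sql
-- Tells SAS/ACCESS to read this Hive string column as CHAR(30)
-- instead of the default $32767 length:
ALTER TABLE test_tbl SET TBLPROPERTIES ('SASFMT:stringval'='CHAR(30)');
```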

I hope this helps.
