Created 07-29-2016 04:15 PM
Hi,
We have prototyping a datawarehouse solution on Hadoop/Hive. One of the issue arised was whther we should use same datatype e.g. varchar2 in Oracle converts to varchar in Hive. number in Oracle to int in Hive, date in Oracle to timestamp in Hive etc.
I want to understand whether any performance issue arise if we use STRING for VARCHAR2, BIGINT for NUMBER?
This helps us to fasten the DDL process if we have fixed number of datatypes without loss of data/precisions
Regards
Upendra
Created 07-29-2016 04:31 PM
Personal anecdotal evidence - with EXTERNAL (externally managed) tables pointing to text files we saw no performance differences whether string/varchar/char. With ORC files we always used String as the datatype, so can't speak to any potential performance differences (except for vectorization - see below).
I would recommend string if at all possible - You are correct that it is very handy to not be limited by a length specifier. Even if the data coming in is only Varchar(30) in length, your ELT/ETL processing will not fail if you send in 31 characters while using a string datatype.
Also note that in Hive versions older than 0.14 (AKA on HDP 2.1 and older) varchar/char will not use Vectorization with your ORC files, Whereas String can use vectorization. That can be a big difference in query performance. Use string if possible. 😉
** Edited to add more info about varchar vs. string **
Here is an example of what happens when you push too much data into a varchar(20) field.
create table if not exists test_tbl (id int, stringval string, varchar20_val varchar(20)) ; # insert 20 characters of text into both fields insert into test_tbl values (1, '01234567890123456789', '01234567890123456789'); # insert 25 characters of text into string field, 20 chars into varchar(20) field insert into test_tbl values (2, '0123456789012345678901234', '01234567890123456789'); # insert 25 characters of text into both fields insert into test_tbl values (3, '0123456789012345678901234', '0123456789012345678901234'); # Note - row 3 varchar field has the last 5 characters silently truncated. This is bad. select id, length(stringval) as `length_string`, stringval, length(varchar20_val) as `length_varchar`, varchar20_val from test_tbl;
The output is as follows:
id length_string stringval length_varchar varchar20_val 1 20 01234567890123456789 20 01234567890123456789 2 25 0123456789012345678901234 20 01234567890123456789 3 25 0123456789012345678901234 20 01234567890123456789 Time taken: 1.584 seconds, Fetched: 3 row(s)
In row 3 you have lost the last 5 characters of data without warning. IMHO this is the most compelling reason to use String instead of Varchar.
Created 07-29-2016 04:27 PM
Hi Upendra,
The recommendation is to use VARCHAR and Integer Types (TINYINT, SMALLINT, INT, BIGINT) where ever possible instead of using String. In hive String is treated as VARCHAR(32762). So, if you have data that is not more than 50 characters in length, using string here will have some overhead. Same for integer types.
Hope this helps.
Created 07-29-2016 04:33 PM
Interesting that there is an upper bound for string datatype. Thanks @Kashif Khan.
Created 07-29-2016 09:24 PM
I am pretty sure that Hive Strings are not max. 32k long. I think the limit is something like 2GB. I am pretty sure if there exists a lower limit then it will be specific to a client or something. But I will verify that when I come back.
That link seems to collaborate that and hive.apache.org also doesn't give a max.
http://stackoverflow.com/questions/35030936/is-there-maximum-size-of-string-data-type-in-hive
Also since both Varchar and String are String values and use dictionaries I am pretty sure the serialization will be pretty much identical. So why would Varchar be better? As I said I don't know but I would be surprised if there was a real difference. I assumed VARCHAR is simply a layer on top of String that checks values during insert.
Created 07-29-2016 04:31 PM
Personal anecdotal evidence - with EXTERNAL (externally managed) tables pointing to text files we saw no performance differences whether string/varchar/char. With ORC files we always used String as the datatype, so can't speak to any potential performance differences (except for vectorization - see below).
I would recommend string if at all possible - You are correct that it is very handy to not be limited by a length specifier. Even if the data coming in is only Varchar(30) in length, your ELT/ETL processing will not fail if you send in 31 characters while using a string datatype.
Also note that in Hive versions older than 0.14 (AKA on HDP 2.1 and older) varchar/char will not use Vectorization with your ORC files, Whereas String can use vectorization. That can be a big difference in query performance. Use string if possible. 😉
** Edited to add more info about varchar vs. string **
Here is an example of what happens when you push too much data into a varchar(20) field.
create table if not exists test_tbl (id int, stringval string, varchar20_val varchar(20)) ; # insert 20 characters of text into both fields insert into test_tbl values (1, '01234567890123456789', '01234567890123456789'); # insert 25 characters of text into string field, 20 chars into varchar(20) field insert into test_tbl values (2, '0123456789012345678901234', '01234567890123456789'); # insert 25 characters of text into both fields insert into test_tbl values (3, '0123456789012345678901234', '0123456789012345678901234'); # Note - row 3 varchar field has the last 5 characters silently truncated. This is bad. select id, length(stringval) as `length_string`, stringval, length(varchar20_val) as `length_varchar`, varchar20_val from test_tbl;
The output is as follows:
id length_string stringval length_varchar varchar20_val 1 20 01234567890123456789 20 01234567890123456789 2 25 0123456789012345678901234 20 01234567890123456789 3 25 0123456789012345678901234 20 01234567890123456789 Time taken: 1.584 seconds, Fetched: 3 row(s)
In row 3 you have lost the last 5 characters of data without warning. IMHO this is the most compelling reason to use String instead of Varchar.
Created 07-30-2016 01:01 AM
Created 08-01-2016 09:03 AM
Hi,
Can you please confirm the datatype we should use for VARCHAR. Will String be OK? Can I confirm that STRING and VARCHAR performance will be same?
Regards
Created 08-01-2016 09:42 AM
How about you just try it? I am pretty sure it will be the same but just make two CTAS tables and test it quickly.
Created 08-01-2016 10:04 AM
Hi,
We tried it and performance was pretty much same , difference of couple of seconds. But if we can get official confirmation, we can surely move ahead without any worries for production environment.
I see Vectorization support for String only as a major push for going with String datatype.
Regards
Created 08-01-2016 11:27 AM
Hi @Upendra Bhandari. FYI I updated my original answer to add more info about why to use string instead of varchar.