Created 07-29-2016 04:15 PM
Hi,
We are prototyping a data warehouse solution on Hadoop/Hive. One of the issues that arose was whether we should use the equivalent datatypes, e.g. VARCHAR2 in Oracle converts to varchar in Hive, NUMBER in Oracle to int in Hive, DATE in Oracle to timestamp in Hive, and so on.
I want to understand whether any performance issues arise if we use STRING for VARCHAR2 and BIGINT for NUMBER.
This would speed up the DDL process, since we could work with a fixed set of datatypes without losing data or precision.
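For illustration, a minimal sketch of the simplified mapping in question (the table and column names are made up, and DECIMAL is shown only as an option where fractional precision matters):

-- hypothetical mapping: VARCHAR2 -> string, NUMBER -> bigint (or decimal(p,s)), DATE -> timestamp
create table if not exists customer_hive (
  customer_id   bigint,         -- Oracle NUMBER
  customer_name string,         -- Oracle VARCHAR2(100)
  balance       decimal(18,2),  -- Oracle NUMBER(18,2)
  created_at    timestamp       -- Oracle DATE
);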
Regards
Upendra
Created 07-29-2016 04:31 PM
Personal anecdotal evidence - with EXTERNAL (externally managed) tables pointing to text files we saw no performance difference among string, varchar, and char. With ORC files we always used string as the datatype, so I can't speak to any potential performance differences there (except for vectorization - see below).
I would recommend string if at all possible - you are correct that it is very handy not to be limited by a length specifier. Even if the data coming in is only varchar(30) in length, your ELT/ETL processing will not fail if you send in 31 characters when using a string datatype.
Also note that in Hive versions older than 0.14 (i.e. on HDP 2.1 and older), varchar/char will not use vectorization with your ORC files, whereas string can. That can be a big difference in query performance. Use string if possible. 😉
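As a rough sketch of what that looks like (the table name and query are assumptions, not from this thread), vectorized execution applies to ORC-backed tables, and on those older releases only primitive types such as string benefited:

-- assumed example: enable vectorized execution for ORC reads
set hive.vectorized.execution.enabled = true;
create table if not exists events_orc (id bigint, payload string) stored as orc;
-- a scan over the string column can be vectorized; pre-0.14, a varchar/char column could not
select count(*) from events_orc where payload like 'error%';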
** Edited to add more info about varchar vs. string **
Here is an example of what happens when you push too much data into a varchar(20) field.
create table if not exists test_tbl (id int, stringval string, varchar20_val varchar(20));

-- insert 20 characters of text into both fields
insert into test_tbl values (1, '01234567890123456789', '01234567890123456789');

-- insert 25 characters of text into the string field, 20 chars into the varchar(20) field
insert into test_tbl values (2, '0123456789012345678901234', '01234567890123456789');

-- insert 25 characters of text into both fields
insert into test_tbl values (3, '0123456789012345678901234', '0123456789012345678901234');

-- Note - row 3's varchar field has the last 5 characters silently truncated. This is bad.
select id,
       length(stringval) as `length_string`, stringval,
       length(varchar20_val) as `length_varchar`, varchar20_val
from test_tbl;
The output is as follows:
id    length_string    stringval                    length_varchar    varchar20_val
1     20               01234567890123456789         20                01234567890123456789
2     25               0123456789012345678901234    20                01234567890123456789
3     25               0123456789012345678901234    20                01234567890123456789
Time taken: 1.584 seconds, Fetched: 3 row(s)
In row 3 you have lost the last 5 characters of data without warning. IMHO this is the most compelling reason to use String instead of Varchar.
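A hedged follow-up, reusing test_tbl from the example above: one way to catch this before loading into a bounded column is to check the incoming lengths first.

-- find rows whose value would be silently truncated by a varchar(20) target
select id, length(stringval) as incoming_length
from test_tbl
where length(stringval) > 20;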
Created 12-19-2016 12:22 AM
Thanks for the clarification @Vijayandra Mahadik. You are correct... It was implemented for Hive 0.14. My comments were out of date. I will correct my original statement above. https://issues.apache.org/jira/browse/HIVE-5760
Created 02-22-2017 06:18 PM
What about BIGINT and INT?
If I have a column with BIGINT as the datatype and it holds a value within INT's range, will it take the full 8 bytes or just 4 bytes of storage?
Created 02-22-2017 07:43 PM
@Gunjan Dhawas Please open this question as a new thread; good practice on HCC is to create a separate question when an accepted answer is already available.
Created 11-03-2017 07:08 PM
Something to consider is the downstream use of Hive as well. We used String initially (mainly because we weren't given the data types for the data we were loading, so String allowed us to ignore that). However, when we started using SAS, those String fields were all converted to varchar(32k) and caused headaches on that end. We converted to varchar(x).
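For anyone facing the same conversion, a hedged sketch of the kind of change involved (the table and column names are made up); Hive can narrow a string column to a bounded varchar in place, though for some file formats such type changes are metadata-only and worth testing first:

-- hypothetical illustration: bound a string column so SAS does not widen it to varchar(32k)
alter table web_logs change column user_agent user_agent varchar(500);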
Created 11-04-2017 12:19 PM
Hi @Jeff Watson. You are correct about SAS's handling of String datatypes. Good catch! One of my customers also had to deal with this. String datatype conversions can perform very poorly in SAS.
With SAS/ACCESS to Hadoop you can set the libname option DBMAX_TEXT (added with SAS 9.4m1 release) to globally restrict the character length of all columns read into SAS.
However, for restricting column size, SAS specifically recommends using the VARCHAR datatype in Hive whenever possible.
The techniques above can be used to work around this challenge in SAS, and they all work.
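As one hedged illustration of the VARCHAR approach (the target table name is made up, and it reuses test_tbl from earlier in this thread), a SAS-facing copy can be produced with bounded varchar columns while the source table keeps string for flexible ELT:

-- assumed example: cast string columns to varchar(n) for downstream SAS access
create table sas_facing_tbl stored as orc as
select id, cast(stringval as varchar(100)) as stringval
from test_tbl;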
I hope this helps.