Created 07-29-2016 04:15 PM
Hi,
We are prototyping a data warehouse solution on Hadoop/Hive. One of the issues that arose was whether we should use the equivalent datatypes, e.g. VARCHAR2 in Oracle converts to varchar in Hive, NUMBER in Oracle to int in Hive, DATE in Oracle to timestamp in Hive, and so on.
I want to understand whether any performance issues arise if we use STRING for VARCHAR2 and BIGINT for NUMBER.
This would speed up the DDL process, since we could work with a fixed set of datatypes without losing data or precision.
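For illustration, a minimal sketch of the simplified mapping in question (the table and column names are made up, and DECIMAL is shown only as an option where fractional precision matters):

-- hypothetical mapping: VARCHAR2 -> string, NUMBER -> bigint (or decimal(p,s)), DATE -> timestamp
create table if not exists customer_hive (
  customer_id   bigint,         -- Oracle NUMBER
  customer_name string,         -- Oracle VARCHAR2(100)
  balance       decimal(18,2),  -- Oracle NUMBER(18,2)
  created_at    timestamp       -- Oracle DATE
);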
Regards
Upendra
Created 07-29-2016 04:31 PM
Personal anecdotal evidence - with EXTERNAL (externally managed) tables pointing to text files we saw no performance difference among string, varchar, and char. With ORC files we always used string as the datatype, so I can't speak to any potential performance differences there (except for vectorization - see below).
I would recommend string if at all possible - you are correct that it is very handy not to be limited by a length specifier. Even if the data coming in is only varchar(30) in length, your ELT/ETL processing will not fail if you send in 31 characters when using a string datatype.
Also note that in Hive versions older than 0.14 (i.e. on HDP 2.1 and older), varchar/char will not use vectorization with your ORC files, whereas string can. That can be a big difference in query performance. Use string if possible. 😉
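As a rough sketch of what that looks like (the table name and query are assumptions, not from this thread), vectorized execution applies to ORC-backed tables, and on those older releases only primitive types such as string benefited:

-- assumed example: enable vectorized execution for ORC reads
set hive.vectorized.execution.enabled = true;
create table if not exists events_orc (id bigint, payload string) stored as orc;
-- a scan over the string column can be vectorized; pre-0.14, a varchar/char column could not
select count(*) from events_orc where payload like 'error%';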
** Edited to add more info about varchar vs. string **
Here is an example of what happens when you push too much data into a varchar(20) field.
create table if not exists test_tbl (id int, stringval string, varchar20_val varchar(20));

-- insert 20 characters of text into both fields
insert into test_tbl values (1, '01234567890123456789', '01234567890123456789');

-- insert 25 characters of text into the string field, 20 chars into the varchar(20) field
insert into test_tbl values (2, '0123456789012345678901234', '01234567890123456789');

-- insert 25 characters of text into both fields
insert into test_tbl values (3, '0123456789012345678901234', '0123456789012345678901234');

-- Note - row 3's varchar field has the last 5 characters silently truncated. This is bad.
select id,
       length(stringval) as `length_string`, stringval,
       length(varchar20_val) as `length_varchar`, varchar20_val
from test_tbl;
The output is as follows:
id    length_string    stringval                    length_varchar    varchar20_val
1     20               01234567890123456789         20                01234567890123456789
2     25               0123456789012345678901234    20                01234567890123456789
3     25               0123456789012345678901234    20                01234567890123456789
Time taken: 1.584 seconds, Fetched: 3 row(s)
In row 3 you have lost the last 5 characters of data without warning. IMHO this is the most compelling reason to use String instead of Varchar.
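A hedged follow-up, reusing test_tbl from the example above: one way to catch this before loading into a bounded column is to check the incoming lengths first.

-- find rows whose value would be silently truncated by a varchar(20) target
select id, length(stringval) as incoming_length
from test_tbl
where length(stringval) > 20;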
Created 12-19-2016 12:22 AM
Thanks for the clarification @Vijayandra Mahadik. You are correct... It was implemented for Hive 0.14. My comments were out of date. I will correct my original statement above. https://issues.apache.org/jira/browse/HIVE-5760
Created 02-22-2017 06:18 PM
What about BIGINT and INT?
If I have a column with BIGINT as the datatype and it holds a value within INT's range, will it take the full 8 bytes or just 4 bytes of storage?
Created 02-22-2017 07:43 PM
@Gunjan Dhawas Please open this question as a new thread; good practice on HCC is to create a separate question when an accepted answer is already available.
Created 11-03-2017 07:08 PM
Something to consider is the downstream use of Hive as well. We used String initially (mainly because we weren't given the data types for the data we were loading, so String allowed us to ignore that). However, when we started using SAS, those String fields were all converted to varchar(32k) and caused headaches on that end. We converted to varchar(x).
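For anyone facing the same conversion, a hedged sketch of the kind of change involved (the table and column names are made up); Hive can narrow a string column to a bounded varchar in place, though for some file formats such type changes are metadata-only and worth testing first:

-- hypothetical illustration: bound a string column so SAS does not widen it to varchar(32k)
alter table web_logs change column user_agent user_agent varchar(500);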
Created 11-04-2017 12:19 PM
Hi @Jeff Watson. You are correct about SAS's handling of String datatypes. Good catch! One of my customers also had to deal with this. String datatype conversions can perform very poorly in SAS.
With SAS/ACCESS to Hadoop you can set the libname option DBMAX_TEXT (added with SAS 9.4m1 release) to globally restrict the character length of all columns read into SAS.
However, for restricting column size, SAS specifically recommends using the VARCHAR datatype in Hive whenever possible.
The techniques above can be used to work around this challenge in SAS, and they all work.
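As one hedged illustration of the VARCHAR approach (the target table name is made up, and it reuses test_tbl from earlier in this thread), a SAS-facing copy can be produced with bounded varchar columns while the source table keeps string for flexible ELT:

-- assumed example: cast string columns to varchar(n) for downstream SAS access
create table sas_facing_tbl stored as orc as
select id, cast(stringval as varchar(100)) as stringval
from test_tbl;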
I hope this helps.