
Hive STRING vs VARCHAR Performance


Re: Hive STRING vs VARCHAR Performance

New Contributor

@bpreachuk

@ "Also note that varchar/char will not use Vectorization with your ORC files, Whereas String can use vectorization."

Vectorization is in fact supported for the char and varchar datatypes, per the link below:

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/query-vectorizat...
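
For anyone who finds this later, vectorized execution is enabled per session and applies to ORC-backed tables. A minimal sketch (the web_logs table is hypothetical; the two SET properties are standard Hive configuration settings):

    -- Enable vectorized execution (char/varchar support came with HIVE-5760 in Hive 0.14)
    SET hive.vectorized.execution.enabled = true;
    SET hive.vectorized.execution.reduce.enabled = true;

    -- Query an ORC-backed table; EXPLAIN should report
    -- "Execution mode: vectorized" where vectorization kicks in.
    SELECT browser, COUNT(*) AS hits
    FROM web_logs
    GROUP BY browser;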


Re: Hive STRING vs VARCHAR Performance

Thanks for the clarification @Vijayandra Mahadik. You are correct... vectorized support for char/varchar was implemented in Hive 0.14. My comments were out of date; I will correct my original statement above. https://issues.apache.org/jira/browse/HIVE-5760


Re: Hive STRING vs VARCHAR Performance

New Contributor

What about BIGINT and INT?

If I have a column of datatype BIGINT and insert a value that falls within INT's range, will it take the full 8 bytes (as for BIGINT), or only 4 bytes since the value fits in an INT?


Re: Hive STRING vs VARCHAR Performance

Mentor

@Gunjan Dhawas Please open this as a new question in its own thread; good practice on HCC is to create a separate question when a thread already has an accepted answer.


Re: Hive STRING vs VARCHAR Performance

Contributor

Something else to consider is what sits downstream of Hive. We used String initially (mainly because we weren't given data types for the data we were loading, so String allowed us to ignore the question). However, when we started using SAS, those String fields were all converted to varchar(32k) and caused headaches on that end. We converted them to varchar(x).
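
For anyone making the same change, the column type can be switched in place. A minimal sketch (table, column, and length are placeholders; size the VARCHAR to your actual data, since values longer than the declared length are truncated when read through the new type):

    -- Change a STRING column to a bounded VARCHAR.
    -- This is a metadata-only change; existing data files are not rewritten.
    ALTER TABLE web_logs CHANGE user_agent user_agent VARCHAR(1024);

    -- Confirm the new column type
    DESCRIBE web_logs;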

Re: Hive STRING vs VARCHAR Performance

Hi @Jeff Watson. You are correct about SAS's handling of the String datatype. Good catch! One of my customers also had to deal with this; String datatype conversions can perform very poorly in SAS.

With SAS/ACCESS to Hadoop you can set the libname option DBMAX_TEXT (added in the SAS 9.4M1 release) to globally restrict the character length of all columns read into SAS.

However, for restricting column size, SAS specifically recommends using the VARCHAR datatype in Hive whenever possible.

http://support.sas.com/documentation/cdl/en/acreldb/67473/HTML/default/viewer.htm#n1aqglg4ftdj04n1ey...
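
As a quick sketch of that option (the server name, credentials, and the 300-character cap are placeholders):

    /* SAS/ACCESS to Hadoop: cap every character column read through
       this libref at 300 characters instead of the 32,767 default */
    libname hdp hadoop server="hive.example.com" user=sasuser password=XXXXXX
            DBMAX_TEXT=300;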

Use Case

  • Large table, all columns of type String: Table A stored in Hive has 40 columns, all of type String, with 500M rows.
  • By default, SAS/ACCESS converts String to $32K, i.e. a 32K-character field per column.
  • The math for a table this size: 40 columns x 32K ≈ 1.3MB per row, x 500M rows ≈ 650TB.
  • This brings the system to a halt; the result is far too large to store in LASR or WORK.

The following techniques can all be used in SAS to work around the challenge:

  • Use char and varchar in Hive instead of String.
  • Set the libname option DBMAX_TEXT to globally restrict the character length of all columns read in.
  • In Hive, set SASFMT table properties ("ALTER TABLE ... SET TBLPROPERTIES") to attach SAS formats to the schema in Hive; see the sketch after this list.
  • Add formatting to SAS code during inbound reads
    • example: the variable Sequence with Length 8, Informat 10., Format 10.
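
Here is a sketch of the SASFMT table-property technique from the third bullet (table and column names are illustrative; SAS/ACCESS picks these properties up and applies the given format when it reads the column):

    -- Have SAS read this STRING column as CHAR(100) rather than $32K
    ALTER TABLE web_logs SET TBLPROPERTIES ('SASFMT:user_agent'='CHAR(100)');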

I hope this helps.
