In what scenario should we choose the binary type rather than the string type in an ORC table? My understanding is that we have to use binary when the string value is larger than 2 GB, and that we can use string when it is below 2 GB. Is that correct? I've already seen this: https://orc.apache.org/docs/encodings.html
When we use the string type with values smaller than 2 GB, we get the error below, but when we switch to binary it works as expected. Any idea what could be the cause?
" java.lang.RuntimeException: java.io.IOException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. "
The error message "java.lang.RuntimeException: java.io.IOException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit." says that you are hitting the size limit for Protocol Buffers (which ORC uses to store metadata and column statistics), not a limitation of the field type.
Looking at the ORCMetadataReader class, this limit, PROTOBUF_MESSAGE_MAX_LIMIT, is currently set to 1 GB. It applies to the whole message, not to a single data field.
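To illustrate why a limit on the whole message can be hit even when no single field is anywhere near it, here is a minimal Python sketch. It is not Hive or protobuf code: the function `message_too_large` and the field sizes are hypothetical, and only the 1 GB figure comes from the limit mentioned above.

```python
# Limit quoted above: 1 GB for the whole protobuf message.
PROTOBUF_MESSAGE_MAX_LIMIT = 1024 << 20  # 1,073,741,824 bytes


def message_too_large(field_sizes, limit=PROTOBUF_MESSAGE_MAX_LIMIT):
    """Return True if the combined size of all fields exceeds the limit.

    Hypothetical helper: it only models the idea that the check is made
    against the total message size, not against each field separately.
    """
    return sum(field_sizes) > limit


# Hypothetical message with four fields of 300 MB each.
fields = [300 * 1024 * 1024] * 4

# Each field alone is well under the limit ...
each_field_ok = all(not message_too_large([size]) for size in fields)
# ... but the message as a whole is over it.
whole_message_too_large = message_too_large(fields)
```

With these assumed sizes, every field passes on its own while the message as a whole fails, which is the situation the error message describes.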
Please look at this https://issues.apache.org/jira/browse/HIVE-11592
Generally, if there are too many small stripes with many columns, or a few columns with huge values in a string column, the overhead of storing metadata (column stats) can exceed the protobuf size limit.
In ORC, index data helps to filter out stripes using the min/max values of a column.
For a string column, the min and max values and the sum of the lengths of the values are recorded in metadata. When those min/max values are huge, the metadata grows past the protobuf max limit, which throws this exception.
For a binary column, only the total length of all the values is recorded in metadata. This will not exceed the protobuf max limit, but there are also no min/max values available to filter the stripes.
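The difference between the two column types can be sketched as a rough size estimate. This is a simplified model in Python, not the actual ORC writer: the helper names, the 8-byte size assumed for the length sum, and the example figures (200 MB values, 3 stripes) are all assumptions made for illustration.

```python
PROTOBUF_MESSAGE_MAX_LIMIT = 1024 << 20  # 1 GB, the limit discussed above
LENGTH_SUM_BYTES = 8  # assumed size of the "sum of lengths" counter


def string_stats_bytes(min_value_len, max_value_len):
    """String column: min value + max value + sum of lengths (model only)."""
    return min_value_len + max_value_len + LENGTH_SUM_BYTES


def binary_stats_bytes():
    """Binary column: only the total length is kept (model only)."""
    return LENGTH_SUM_BYTES


def metadata_bytes(per_stripe_bytes, stripe_count):
    """Total statistics size when every stripe stores its own stats."""
    return per_stripe_bytes * stripe_count


# Hypothetical table: 3 stripes, min and max values of about 200 MB each.
value_len = 200 * 1024 * 1024
stripes = 3

string_total = metadata_bytes(string_stats_bytes(value_len, value_len), stripes)
binary_total = metadata_bytes(binary_stats_bytes(), stripes)

string_over_limit = string_total > PROTOBUF_MESSAGE_MAX_LIMIT
binary_over_limit = binary_total > PROTOBUF_MESSAGE_MAX_LIMIT
```

Under these assumptions the string column's statistics alone pass 1 GB, while the binary column's statistics stay at a few bytes, even though no individual value is close to 2 GB.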
If the column values are huge and your queries will not filter on those columns, it's better to use the binary data type, since it stores less information in metadata.