I would like to use Impala in an organisation with data kept in Hebrew.
I read that Impala has some limitations when dealing with Unicode characters.
Does the limitation apply only to string comparison and string functions, or also to storing and selecting?
Is there a way around it?
Impala treats all string data as byte arrays and does nothing special if the data is Unicode. Impala
can still store, select, and compare such strings for equality, so depending on your use case this might be sufficient.
The string functions don't work correctly on Unicode data; they operate on the data byte by byte rather than character by character. Take the following call as an example:
substr("áele", 1, 1) will return � because it only returns the first byte of the 2-byte character "á".
The same is true for other functions such as length: length("áele") returns 5, the number of bytes, rather than 4, the number of characters.
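The byte-oriented behaviour described above can be reproduced outside Impala. The following is only a rough Python sketch, assuming the data is stored as UTF-8; byte_substr and byte_length are hypothetical helpers written for this illustration and are not part of Impala or any library.

```python
# Sketch of byte-oriented string functions, assuming UTF-8 encoded data.
# byte_substr and byte_length are hypothetical helpers for illustration only.

def byte_substr(s: str, start: int, length: int) -> str:
    """Take `length` bytes starting at 1-based byte position `start`."""
    data = s.encode("utf-8")
    chunk = data[start - 1 : start - 1 + length]
    # A slice that cuts a multi-byte character in half is not valid UTF-8,
    # so it is shown as the replacement character U+FFFD.
    return chunk.decode("utf-8", errors="replace")


def byte_length(s: str) -> int:
    """Count bytes rather than characters."""
    return len(s.encode("utf-8"))


latin = "\u00e1ele"                 # "áele": "á" takes 2 bytes in UTF-8
hebrew = "\u05e9\u05dc\u05d5\u05dd"  # four Hebrew letters, 2 bytes each

print(byte_length(latin), len(latin))    # byte count vs. character count
print(byte_length(hebrew), len(hebrew))
print(byte_substr(latin, 1, 1))          # half of "á": replacement character
print(byte_substr(latin, 1, 2))          # both bytes of "á": the full character
```

For Hebrew text every letter is two bytes in UTF-8, so byte-based lengths come out at double the character count and odd-sized slices cut a letter in half.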
This isn't to bash Impala, but to make sure no one is misled by this thread into thinking that the string functions will work for manipulating Unicode strings.