A customer is facing issues with the French character set when data is loaded into Hive.
Records are getting split when French characters are encountered.
Checking blogs online, the only recommendation I can find is to implement a custom SerDe.
Are there any options for handling French characters in Hive after loading the data?
Or is it recommended to pre-process the French characters prior to loading?
Custom SerDes are always a last resort. What is the encoding of the data itself? Hive expects UTF-8 data. If the encoding is, say, ISO/IEC 8859-1, you will need to either convert the data to UTF-8 before loading it or try the feature added in https://issues.apache.org/jira/browse/HIVE-7142
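For the "convert the data" option, here is a minimal sketch using iconv, assuming the source really is ISO-8859-1. The file names are hypothetical, and the sample file is created only so the example runs on its own:

```shell
# Assumption: the input is ISO-8859-1; both file names are hypothetical.
# Build a tiny ISO-8859-1 sample: "café", with é stored as the single byte 0xE9.
printf 'caf\351\n' > sample_latin1.txt

# Convert to UTF-8 before loading the file into Hive.
iconv -f ISO-8859-1 -t UTF-8 sample_latin1.txt > sample_utf8.txt

# Dump the bytes: é should now be the two-byte UTF-8 sequence c3 a9.
od -An -tx1 sample_utf8.txt
```

If the real encoding turns out to be something else (Windows-1252, for example), the -f argument would need to change accordingly.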
Thank you, Carter.
Another thing to check is your locale, since it has been known to cause problems:
In Linux, for instance, check the current setting:
$ echo $LANG
and set it to a UTF-8 locale if it is not one already:
$ export LANG=en_US.UTF-8
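If you want to script that check, something along these lines should work. It is only a sketch: is_utf8_locale is a made-up helper name, and it relies on the standard `locale charmap` command being available on the machine:

```shell
# Hypothetical helper: succeeds only when the active locale's character map is UTF-8.
is_utf8_locale() {
  [ "$(locale charmap 2>/dev/null)" = "UTF-8" ]
}

if is_utf8_locale; then
  echo "locale is UTF-8"
else
  echo "locale is not UTF-8; consider exporting a UTF-8 LANG"
fi
```

Checking the character map rather than the LANG string also catches cases where LC_ALL or LC_CTYPE overrides LANG.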
Let us know if this helps.