Created 06-30-2016 08:03 PM
Hi,
I am looking for an answer to one of our RFP questions:
The customer's core systems are Unicode-based and support multiple languages, even though English is their corporate language. Can you let me know which languages are supported?
I am not really sure of the answer. Can you please help?
Thanks,
Sujitha
Created 07-01-2016 08:55 AM
Hive, Pig (by means of PigStorage), and Spark all support UTF-8. However, it is not easy to say which languages are completely supported. UTF-8 itself can encode every Unicode character, but some rarely used CJK characters (for example, those found in historical texts) lie outside the so-called Basic Multilingual Plane (BMP) and are not well supported in practice. It is therefore better to list the languages you plan to use and ask whether each one is supported. In summary, if a language's alphabet is completely included in the BMP, then it is completely supported.
Edit: For a fun read (over the weekend?), see this: Would UTF-8 be able to support the inclusion of a vast alien language with millions of new character...
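To apply the BMP rule of thumb above to your own data, you could scan a sample of text for characters whose code points are above U+FFFF. The following is only a minimal Python sketch: the function name `non_bmp_chars` and the sample strings are made up for illustration, and it says nothing about how any particular Hadoop component behaves.

```python
def non_bmp_chars(text):
    """Return the characters in `text` that lie outside the BMP.

    The BMP covers code points U+0000 through U+FFFF, so anything
    above U+FFFF is a supplementary-plane character.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]


# Hypothetical sample strings, one per language you plan to use.
samples = {
    "english": "hello",
    "japanese": "日本語",
    "rare_cjk": "\U00020000",  # a CJK Extension B ideograph, outside the BMP
}

for name, text in samples.items():
    outside = non_bmp_chars(text)
    status = "BMP only" if not outside else "contains non-BMP characters"
    # UTF-8 can encode every sample; only the encoded length differs.
    print(name, status, len(text.encode("utf-8")), "bytes in UTF-8")
```

A sample that reports non-BMP characters is still valid UTF-8 (each such character takes four bytes), but it is the kind of text worth testing explicitly with the tools you intend to use.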
Created 06-30-2016 08:50 PM
If the question is about data in Hive, then Hive data is UTF-8 by default, so languages that are supported by UTF-8 will work out of the box. The same goes for HDFS.
Created 07-01-2016 12:40 AM
Hi @Ravi Mutyala,
Thanks for the response. With this, I understand that the whole Hadoop ecosystem uses UTF-8. Is that correct? Can you confirm?
Thanks,
Sujitha