Support Questions


Which languages are supported for multilingual (Unicode) data?

Super Collaborator

Hi,

I am looking for an answer to one of the RFP questions:

The customer's core systems are Unicode-based and support multiple languages, even though English is their corporate language. Can you let me know which languages are supported?

I'm not really sure of the answer. Can you please help?

Thanks,

Sujitha

1 ACCEPTED SOLUTION

Master Guru

Hive, Pig (by means of PigStorage), and Spark all support UTF-8. However, it's not easy to say which languages are completely supported, because, for example, some rarely used CJK characters (such as those in historical texts) lie outside the so-called Basic Multilingual Plane (BMP) and are not well supported in practice. Therefore, it's better to list the languages you plan to use and ask whether they are supported. In summary, if a language's alphabet is completely included in the BMP, then it's completely supported.
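To check the BMP point against your own sample data, something like the following Python sketch may help. It is only an illustration: the function name `non_bmp_chars` and the sample strings are made up here and are not part of Hive, Pig, or Spark. It assumes Python 3, where a string is a sequence of Unicode code points, and treats any code point above U+FFFF as outside the BMP.

```python
def non_bmp_chars(text):
    """Return the characters in text whose code point is above U+FFFF (outside the BMP)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# Hypothetical samples; replace them with text from the languages you plan to use.
samples = {
    "English": "Hello",
    "Japanese": "こんにちは",
    "Rare CJK (Extension B)": "\U00020000",  # U+20000 lies outside the BMP
}

for name, text in samples.items():
    outside = non_bmp_chars(text)
    utf8_len = len(text.encode("utf-8"))  # size of the text once encoded as UTF-8
    print(f"{name}: {len(text)} code points, {utf8_len} UTF-8 bytes, "
          f"{len(outside)} outside the BMP")
```

If the list comes back empty for all of your sample text, everything is in the BMP. If not, those are the characters worth testing explicitly in each tool you plan to use.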

Edit: For a fun read (over the weekend?), see this: Would UTF-8 be able to support the inclusion of a vast alien language with millions of new character...


3 REPLIES

Guru

If the question is about data in Hive, then by default Hive data is UTF-8, so languages that can be represented in UTF-8 will work out of the box. The same goes for HDFS.

Super Collaborator

Hi @Ravi Mutyala,

Thanks for the response. From this I understand that the whole Hadoop ecosystem uses UTF-8. Is that correct? Can you confirm?

Thanks,

Sujitha
