the multiple languages supported including which languages are supported?

Expert Contributor

Hi,

I am looking for an answer to one of the questions in an RFP:

The customer's core systems are Unicode-based and support multiple languages, even though English is their corporate language. Can you let me know which languages are supported?

I'm not really sure of the answer. Can you please help?

Thanks,

Sujitha

Accepted Solutions

Re: the multiple languages supported including which languages are supported?

Hive, Pig (by means of PigStorage), and Spark all support UTF-8. However, it's not easy to say which languages are completely supported, because, for example, some rarely used CJK characters (such as those in historical texts) lie outside the so-called Basic Multilingual Plane (BMP) and are not well supported in practice. Therefore, it's better to list the languages you plan to use and ask whether each one is supported. In summary, if a language's alphabet is completely included in the BMP, then it's completely supported.

Edit: For a cool reading (over the weekend?) see this: Would UTF-8 be able to support the inclusion of a vast alien language with millions of new character...
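
To check whether sample text in a given language stays inside the BMP, something like the following rough Python sketch might help. It relies only on the facts that BMP code points run from U+0000 to U+FFFF and that UTF-8 needs four bytes for anything above that range. The helper names `non_bmp_chars` and `utf8_len` and the sample strings are made up for illustration; they are not part of Hive, Pig, or Spark.

```python
def non_bmp_chars(text):
    """Return the characters in text whose code points lie outside the BMP."""
    # The BMP covers U+0000..U+FFFF; anything above is a supplementary character.
    return [ch for ch in text if ord(ch) > 0xFFFF]


def utf8_len(ch):
    """Return the number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Hypothetical samples; replace them with text from the customer's systems.
samples = {
    "English": "hello",
    "Japanese": "日本語",
    "rare CJK": "\U00020000",  # U+20000, CJK Unified Ideographs Extension B
}

for name, text in samples.items():
    outside = non_bmp_chars(text)
    if outside:
        status = f"{len(outside)} character(s) outside the BMP"
    else:
        status = "BMP only"
    print(f"{name}: {status}")
```

Running such a check over representative data gives a concrete list of characters to ask about, rather than a yes/no answer per language.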

Re: the multiple languages supported including which languages are supported?

Guru

If the question is about data in Hive, then by default Hive data is UTF-8, so languages that can be represented in UTF-8 will work out of the box. The same applies to HDFS.

Re: the multiple languages supported including which languages are supported?

Expert Contributor

Hi @Ravi Mutyala,

Thanks for the response. From this I understand that the whole Hadoop ecosystem uses UTF-8. Is that correct? Can you confirm?

Thanks,

Sujitha
