Unicode Standard

The Unicode Standard provides a unique number for each character, no matter the platform, program, or language. It is the universal character encoding standard used to represent text for computer processing.
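As a minimal sketch of what "a unique number for each character" means in practice, the Python snippet below looks up the number (the code point) for a few characters. The helper name code_point_label is hypothetical and is used only for illustration; ord and chr are Python built-ins.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the Unicode code point of a one-character string.
    return f"U+{ord(ch):04X}"


for ch in ["A", "é", "Й", "中"]:
    print(ch, code_point_label(ch))

# chr() is the inverse: it turns a code point back into the character.
print(chr(0x0419))  # Й
```

The same character yields the same number on any platform, which is the property the paragraph above describes.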

Unicode provides a consistent way to encode multilingual plain text, which makes it easier to exchange text files internationally. It defines codes for the characters of the world's major languages, including punctuation marks, diacritics, mathematical and technical symbols, arrows, and dingbats.

Before Unicode was invented, there were hundreds of different encoding systems. The European Union alone required several different systems, and even a single language like English required more than one to cover all its letters, punctuation, and technical symbols.

Unlike older systems, Unicode allows multiple writing systems to coexist in one data file. Systems that recognize Unicode can consistently read and process data from different languages.
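To illustrate several writing systems coexisting in one file, here is a small Python sketch that writes mixed-script text to a temporary file and reads it back. The function name round_trip_utf8 and the sample text are assumptions made for this example, and UTF-8 is chosen only as one common Unicode encoding form.

```python
import os
import tempfile


def round_trip_utf8(text: str) -> str:
    """Write text to a temporary UTF-8 file and return what is read back."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    finally:
        # Clean up the temporary file whether or not the read succeeded.
        os.remove(path)


sample = "English, Русский, Ελληνικά, 日本語, العربية"
print(round_trip_utf8(sample) == sample)  # True
```

A legacy single-byte encoding could not store all of these scripts in one file, which is the limitation the paragraph above refers to.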

Unicode was originally designed as a 16-bit encoding with room for just over 65,000 characters. The current standard defines a code space of 1,114,112 code points, from U+0000 to U+10FFFF. The first 65,536 code points form the Basic Multilingual Plane, which holds most of the characters used in the major languages of the world. The Unicode Standard and ISO/IEC 10646 define three encoding forms for this code space: UTF-8, UTF-16, and UTF-32. In UTF-16, each code point in the Basic Multilingual Plane is stored as a single 16-bit code unit, and each of the roughly one million code points beyond it is stored as a pair of 16-bit code units called a surrogate pair. None of these encoding forms relies on complex modes or escape codes. This is sufficient for all known character-encoding requirements, including full coverage of all historic scripts of the world.
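The way UTF-16 reaches characters beyond the first 65,536 code points can be sketched in Python as follows. The function names surrogate_pair and utf16_code_units are hypothetical helpers for this example; the arithmetic follows the standard UTF-16 surrogate-pair scheme, and the result is cross-checked against Python's built-in "utf-16-be" codec.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def utf16_code_units(ch: str) -> list:
    """Return the 16-bit code units Python's codec produces for ch."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# A character in the first 65,536 code points needs one 16-bit unit.
print([hex(u) for u in utf16_code_units("Й")])            # ['0x419']

# U+1F600 lies beyond that range and needs two units.
print([hex(u) for u in surrogate_pair(0x1F600)])          # ['0xd83d', '0xde00']
print([hex(u) for u in utf16_code_units("\U0001F600")])   # ['0xd83d', '0xde00']
```

The surrogate values come from a range reserved for this purpose, so no mode switch or escape code is needed to tell a pair apart from ordinary characters.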

Note that Unicode encodes scripts rather than languages. Writing systems that are used for more than one language share sets of graphical symbols that have historically related derivations. The union of all those graphical symbols is treated as a single collection of characters for encoding and is identified as a single script. Many scripts (especially Latin) are used to write many languages.

Unicode covers all the languages that can be written in the following scripts: Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, Ethiopic, Cherokee, Canadian Aboriginal Syllabics, Khmer, Mongolian, Han (Japanese, Chinese, Korean ideographs), Hiragana, Katakana, Yi, and many more. Please see Appendix 03 List of Supported Scripts.

Depending on the level of Unicode support in the browser used and whether the necessary fonts are installed, you may have display problems for some of the translations, particularly with complex scripts such as Arabic.

The encoding standard that is saved with a text file provides the information that the computer needs to display the text on the screen. For example, in the Cyrillic (Windows) encoding, the character Й has the numeric value 201. When a file that contains this character is opened on a computer that uses the Cyrillic (Windows) encoding, the computer reads the numeric value 201 and displays Й on the screen. However, if the same file is opened on a computer that uses a different encoding, the computer displays whatever character corresponds to the numeric value 201 in its default encoding. For example, if the computer uses the Western European (Windows) encoding, the character in the original Cyrillic-based file will be displayed as É instead.
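The mismatch described above can be reproduced with a short Python sketch. It assumes that the Cyrillic (Windows) and Western European (Windows) encodings correspond to Python's "cp1251" and "cp1252" codecs; the helper name decode_with is hypothetical.

```python
def decode_with(raw: bytes, encoding: str) -> str:
    """Interpret the same bytes under a given legacy encoding."""
    return raw.decode(encoding)


# Save the character under the Cyrillic (Windows) encoding.
raw = "Й".encode("cp1251")
print(list(raw))                    # [201]

# Read back with the matching encoding, then with a mismatched one.
print(decode_with(raw, "cp1251"))   # Й
print(decode_with(raw, "cp1252"))   # É

# With Unicode, the two characters have different numbers and cannot be confused.
print(hex(ord("Й")), hex(ord("É")))  # 0x419 0xc9
```

The stored byte never changes; only the table used to interpret it does, which is why the same file shows different characters on differently configured computers.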

In this section, GlobalVision describes best practices relating to the Unicode Standard.

Install all necessary encoding scripts on your computer

Validate your fonts for all languages used

Use fonts that support Unicode