THE TOWER OF BABEL

or the problem of displaying scripts on the Internet and the Unicode solution.



Now the whole earth had one language and few words ... They said, “Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves, lest we be scattered abroad upon the face of the whole earth.” (Genesis 11:1 and 4)

 

IN THE BEGINNING

Some of the most complex things are made of a few very simple elements. DNA, the blueprint of life, uses only four bases to create the unique code for every living thing on Earth. Similarly, every computer uses only two symbols – 0 and 1 – to perform even the most complicated operations. Whether calculating the path of the Cassini-Huygens mission to Saturn or a route through Los Angeles traffic, all the information is encoded in strings of 0s and 1s (i.e. bit strings of binary code). However, humans do not think or process information in bit strings. We need sound, we need pictures, we need writing on the wall in order to get the message. For that, we rely on layers of software to translate between the human and machine worlds.

To enter commands into the computer in a form familiar to us – characters – we use a keyboard. Every key on the keyboard is encoded by a particular scheme that is ultimately translated by the software into machine language (binary code). For information encoded on one machine to be interpreted correctly on another, both machines need to use the same encoding/decoding scheme. In the beginning, that was not really a problem – a computer was one huge machine with a number of terminals attached to it, and all input and output stayed within that closed circle. Each country could enter its data in its local script or language. Global exchange of data between different computers was limited and easy to control.

In the United States, where most of the early computers and software applications were produced, computers used a keyboard with English key mapping (QWERTY). Every key was assigned a 7-bit code (usually stored with an eighth parity or padding bit) and, together with a set of nonprinting control characters, this formed an encoding scheme known as ASCII (American Standard Code for Information Interchange). All software was programmed to receive commands based on the ASCII keyboard-encoding scheme; all monitors displayed only ASCII characters and all printers printed only ASCII fonts.
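
As a quick illustration (a minimal sketch in Python, used here only for demonstration, not part of the original standardization work), each key on an ASCII keyboard maps to a small numeric code that fits in 7 bits:

    # A few keyboard characters and their ASCII codes (decimal and 7-bit binary).
    for ch in ["A", "a", "0", " "]:
        code = ord(ch)                              # numeric ASCII code of the character
        print(repr(ch), code, format(code, "07b"))  # e.g. 'A' 65 1000001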

As soon as room-sized computers evolved into PCs and started multiplying around the world, it became clear that ASCII-based software needed to evolve too. Software users in the non-English-speaking world had serious problems processing and outputting information in their native languages and scripts. Every country solved the problem in its own way, changing the keyboard mappings to local standards, usually following the local convention for typewriter keyboards. Each software company offered its own character-mapping solutions for display and printing. But as long as all text exchange remained local, things were manageable.

However, with the rise of the Internet, the problem exploded on everybody’s screens. E-mails written in Moscow displayed as a series of blank squares when received in the United States. Because everybody mapped their keyboards in a different way, it was not only a question of a few accented letters getting lost; sometimes a whole file was illegible when opened on a different computer. Different scripts mapped their characters over the characters of other scripts, and the mappings were software- and platform-specific. The only way to ensure that nothing would be lost in the transfer between computers was, in effect, to send a copy of the printout. To exchange text between computers electronically, standardization of character mappings was essential.
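
A small sketch of the problem (in Python, with the KOI8-R and Latin-1 code pages standing in for a Russian and a Western setup): the very same bytes decode into readable Cyrillic on one machine and into gibberish on the other.

    # The same bytes, read under two different national code pages.
    text = "Привет"                    # "Hello" in Russian
    raw = text.encode("koi8-r")        # bytes as a Moscow machine might store them
    print(raw.decode("koi8-r"))        # Привет  -- correct on a matching setup
    print(raw.decode("latin-1"))       # ðÒÉ×ÅÔ  -- gibberish on a Western setup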

During the 1980s, in the Xerox and Apple research labs, the idea emerged of a unique mapping of characters in which every character of every script in the world would be assigned a code associated only with that character – across languages, software programs and operating platforms. The Latin letter A would have the character (hex) code 0041 (bit string 01000001) regardless of the font face, the software company or the language in which it was typed. This unified coding system was named Unicode.
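
A brief sketch of that principle (again in Python, chosen only for illustration): each character reports the same code point whatever the font, program or platform.

    # One code point per character, everywhere.
    for ch in ["A", "Ж", "א", "あ"]:
        print(ch, "U+{:04X}".format(ord(ch)))
    # A U+0041   Ж U+0416   א U+05D0   あ U+3042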

To ease the transition to a unified system, the most widely used standards, such as ASCII (basic Latin), ISO 8859 (extended Latin), KOI8 (Cyrillic) and ASMO (Arabic), were incorporated into the scheme. The encoding expanded the 8 bits used by most of those schemes to 16 bits – enough space, in the original design, to encode every script ever used (the range was later extended even further). The Chinese/Japanese/Korean (CJK) ideographs alone number in the tens of thousands and already require full 16-bit (2-byte) codes.
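
The sketch below (Python; it uses UTF-8, one of the Unicode encoding forms, which the paragraph above does not name) shows how the old standards fit in: plain ASCII characters keep their original one-byte values, while characters from other scripts take longer byte sequences.

    # Unicode code points and their UTF-8 byte sequences.
    for ch in ["A", "é", "Ж", "中"]:
        data = ch.encode("utf-8")
        print(ch, "U+{:04X}".format(ord(ch)), data.hex(), len(data), "byte(s)")
    # A  U+0041  41      1 byte(s)   -- identical to its ASCII code
    # é  U+00E9  c3a9    2 byte(s)
    # Ж  U+0416  d096    2 byte(s)
    # 中 U+4E2D  e4b8ad  3 byte(s)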

This unique mapping makes it easy to internationalize computer applications, particularly browsers, to switch between interfaces in different languages and scripts, and to use multiple scripts on the same page. However, one still needs the appropriate fonts installed in order to display the characters (glyphs) on screen or print them. Unicode does not solve all multilingual, multi-script problems, most of which arise from the interdependency between culture, language and script. But it provides the principle that gives every script, living or extinct, a unique space and identity in the electronic world.
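
As a closing sketch (once more in Python, and assuming the fonts needed for display are installed), a single Unicode string can carry several scripts side by side and survive a round trip through encoding without loss:

    # Several scripts in one string, one encoding for all of them.
    page = "Hello Привет Γειά σου 你好"
    data = page.encode("utf-8")              # one byte stream for the whole page
    print(data.decode("utf-8") == page)      # True: nothing is lost in the round trip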