THE TOWER OF BABEL

or the problem of displaying scripts on the Internet and the Unicode solution.



Evolution of computer data storage devices (San Diego Computer Museum)

And the Lord said, “Behold, they are one people, and they have all one language; and this is only the beginning of what they will do; and nothing that they propose to do will now be impossible for them.” (Genesis 11:6)


UNICODE

Most character encodings in use today have their origins in ASCII (American Standard Code for Information Interchange), an encoding based on the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that work with text.

The ASCII code has its origins in the 7-bit telegraphic codes used in Bell telecommunication equipment. Although ASCII was declared a standard by the American Standards Association in 1963, it was not until 1967 that it acquired all of its present features and non-printing characters. The last update of the code was in 1986, and today the full chart includes 33 non-printing and 95 printing characters. Most languages written in non-English Latin alphabets extended ASCII and developed their own standards. The most widely used standard in Europe was ISO 8859, which split into sub-groups (8859-1 through 8859-16, and the related Windows-125x series) according to language groups. The main issue with these different encoding schemes is that they are often incompatible with one another, tied to particular software and platforms, and that their character mappings overlap, which created confusion, particularly on the Internet.
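The overlap is easy to demonstrate. In the sketch below (Python, with the four legacy encodings chosen purely for illustration), the very same byte decodes to four different characters depending on which standard the receiving software assumes:

    # One byte, four legacy 8-bit encodings, four different characters.
    raw = bytes([0xE6])
    for encoding in ("latin-1", "iso8859-2", "koi8-r", "cp1253"):
        print(encoding, raw.decode(encoding))
    # latin-1    -> æ  (Latin small letter ae)
    # iso8859-2  -> ć  (Latin small letter c with acute)
    # koi8-r     -> Ф  (Cyrillic capital letter Ef)
    # cp1253     -> ζ  (Greek small letter zeta)

A document labelled with the wrong encoding, or not labelled at all, is therefore displayed as the wrong characters, which is exactly the kind of confusion described above.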

In 1988, with computers and the Internet developing into global phenomena, the multitude of character standards became a serious problem. The companies most affected by the chaos (publishers, large international corporations, and software producers) established a consortium, which ultimately designed the universal character encoding standard known as Unicode.

According to The Unicode Standard, Version 3.0 (2000, p. 4), the standard was designed to be:

UNIVERSAL – the set of characters should be large enough to encompass all characters that are likely to be used in general text interchange, including those in major international, national, and industry character sets.

EFFICIENT – plain text is simple to parse: software does not have to maintain state or look for special escape sequences, and character synchronization from any point in a character stream is quick and unambiguous.

UNIFORM – A fixed character code allows for efficient sorting, searching, display, and editing of text.

UNAMBIGUOUS – Any given 16-bit value always represents the same character.

The first version of the code was designed around 16-bit units, but it soon became clear that to achieve its goal the code needed more room, especially for the large Asian (CJK) character sets. To ease the transition and accommodate existing practice, established standards such as ASCII, ISO 8859, KOI8 (developed for Cyrillic), and ASMO (developed for the Arabic script) were incorporated into the encoding scheme. The ultimate goal of the system is to provide “a unique number for every character, no matter what the platform, no matter what the program, no matter what the language,” as the Consortium declares on its web site.
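One practical consequence of carrying the old repertoires over, sketched below in Python, is that the first 256 Unicode code points coincide with ISO 8859-1 (whose lower half is ASCII): a character's Latin-1 byte value and its Unicode code point are the same number.

    # Latin-1 (ISO 8859-1) byte values equal the corresponding Unicode
    # code points, because that repertoire was carried over unchanged.
    for ch in "A~é":
        latin1_byte = ch.encode("latin-1")[0]
        print(ch, hex(latin1_byte), hex(ord(ch)), latin1_byte == ord(ch))
    # A 0x41 0x41 True
    # ~ 0x7e 0x7e True
    # é 0xe9 0xe9 True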

Although the design was intended to be culturally neutral, it still draws criticism, especially from countries that use kanji characters. Although the system is built like a mosaic of standardized pieces, it is designed with the Latin alphabet as its cornerstone. The Unicode Consortium has tried to address this issue by expanding the kanji character sets and by working more closely with local encoding experts.

One of the most admired features of Unicode is its ability to manage the display of different scripts on the same page regardless of writing direction. Early software, developed for the Latin script, had serious problems processing bidirectional (BiDi) texts. In Unicode, all textual information (the characters) is stored and processed in the order of input (the way it is keyed in), and the software works out in which direction the script should be displayed on the page or screen. Mixing scripts on the same page is common on international news web sites, where links are displayed in the original script, signalling that the link leads to text in another script or language. University libraries and research centers with international users and books in many languages also make good use of this feature; see, for example, the library of Tel Aviv University, where the text about the Jaffe collection is written in English but the bibliography of sources is in Hebrew.
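The logical storage order can be inspected directly. In the Python sketch below (the mixed English/Hebrew sample string is made up for illustration), the characters come out in the order they were typed, and each carries a bidirectional class ('L' for left-to-right, 'R' for right-to-left) that a Unicode-aware renderer uses to lay out the line:

    import unicodedata

    # "Unicode" followed by its Hebrew spelling, stored in typing order.
    text = "Unicode \u05D9\u05D5\u05E0\u05D9\u05E7\u05D5\u05D3"
    for ch in text:
        print(f"U+{ord(ch):04X}  {unicodedata.bidirectional(ch):2}  {unicodedata.name(ch)}")
    # The Latin letters report class 'L', the space 'WS', and the Hebrew
    # letters 'R'; the renderer reverses the Hebrew run only for display.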

Example of multi script display (from Radović 1981, p. 24)

Cyrillic (Serbian variant): Знање је досадно, незнање је много занимљивије. Кад знамо, сви знамо исто, а кад не знамо свако не зна друкчије.

Latin (Croatian variant): Znanje je dosadno, neznanje je mnogo zanimljivije. Kad znamo, svi znamo isto, a kad ne znamo, svako ne zna drukčije.

English: Knowledge is boring; ignorance is much more interesting. When we know, we all know the same thing, but when we do not know, each of us is ignorant in a different way.


Unicode unifies characters across languages that use the same script, but not across different scripts, as can be seen in the example of the word Znanje, which means the same and is pronounced the same in Serbian and Croatian. “Latin small letter a” has the hex code 0061 in the Basic Latin chart, but the visually identical “a” in the Cyrillic chart is designated “Cyrillic small letter a” with the hex code 0430. It would be possible to consolidate across scripts, but that would make searching quite difficult. It is true that Unicode is rather memory-hungry, and it would not have been practical to implement without modern computer hardware.

Example of the hex Unicode encoding of the word for “knowledge” in different scripts

Cyrillic (Serbian)   З      н      а      њ      е
Unicode (hex)        0417   043D   0430   045A   0435

Latin (Croatian)     Z      n      a      nj            e
Unicode (hex)        005A   006E   0061   006E + 006A   0065
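These values can be reproduced with a few lines of Python; the output also shows the point made above, that the Latin a and the look-alike Cyrillic а receive different numbers:

    # Print the code point of each letter of the word in both scripts.
    for word in ("Знање", "Znanje"):
        print(word, " ".join(f"U+{ord(ch):04X}" for ch in word))
    # Знање  U+0417 U+043D U+0430 U+045A U+0435
    # Znanje U+005A U+006E U+0061 U+006E U+006A U+0065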


However, encoding alone does not solve the problem of character display (glyphs). To render the encoded text visually, a computer needs appropriate software and font files. Most modern machines support Unicode, but it is not feasible to install on every computer font files for every character encoded by the standard. The printed edition of Version 3.0 from 2000, with all its data charts, was a folio volume roughly a thousand pages thick. As of 2006 the standard is in its fifth version, which includes extinct scripts such as Cuneiform, Phoenician, and Linear B. For now, software companies address the problem by preloading operating systems with fonts for a particular region, along with the most common scripts from other regions. A further problem is that not all browsers support all Unicode character sets. For the moment, everyone is catching up with the standard or easing out of the old ones.

The Unicode standard for character encoding, together with supporting software architecture, can solve our technical difficulties with different scripts, but it will not solve language- and script-related problems. As many librarians working in multi-script, multilingual environments have experienced, the main limitation is not the technology but the expectations and preferences of users. While librarians would prefer transcription of foreign titles, users expect transliteration and cataloging in local terms (Kasparova in Byrum 1998, pp. 42-47). This creates indexing nightmares and requires the maintenance of extensive authority records, yet users simply avoid unfamiliar tools.

Efficiency and convenience are readily sacrificed to cultural convention, as the librarian of a university library in Saudi Arabia experienced first hand. Facing a huge backlog of Arabic material, the lack of cataloging rules in Arabic, and an integrated English and Arabic collection, the library tried to maintain a unified catalog by transliterating the Arabic titles and using LC cataloging rules and numbers. However, this went “against the general feeling of the Arabic speakers who take pride in their native language and show strong opposition to subordinating Arabic to another language in bibliographic records” (Khurshid 1992). When it finally decided to automate its catalog, the library went to special lengths to develop OCLC support in Arabic (in addition to English) to accommodate its users.

Modern technology, the Internet (as the World Wide Web), Unicode, and markup languages make it possible to develop and maintain extensive multilingual sites such as the International Children’s Digital Library, with interfaces in different languages and catalogues that can be searched in multiple languages (Starr 2005). However, improvements in communication tools do not necessarily improve our ability to communicate with each other.