11.07.09

Unicode or Malayalam?

Posted in Uncategorized at 10:46 pm by Pirate Praveen

When the Unicode Consortium decided that, from Unicode 5.1, Malayalam ‘nta’ would no longer be ‘na chandrakkala rra’ but ‘chillu na chandrakkala rra’, a thought occurred to me: “The Unicode Consortium used to encode Malayalam, and now whatever the Unicode Consortium decides to encode becomes Malayalam in the digital world!”

This post is an attempt to summarise my experiences with the Unicode encoding process, mainly based on my participation in the Indic Unicode mailing list discussions on Malayalam encoding issues. (If you are a participating member, or have direct access to participating members, and happen to see this post, you may correct any mistakes by commenting below.)

Free/Open Standard definition ( http://www.fsfe.org/projects/os/def.en.html )

An Open Standard refers to a format or protocol that is

1. subject to full public assessment and use without constraints in a manner equally available to all parties;
2. without any components or extensions that have dependencies on formats or protocols that do not meet the definition of an Open Standard themselves;
3. free from legal or technical clauses that limit its utilisation by any party or in any business model;
4. managed and further developed independently of any single vendor in a process open to the equal participation of competitors and third parties;
5. available in multiple complete implementations by competing vendors, or as a complete implementation equally available to all parties.

What is Unicode? Unicode promise. Unicode process.

The Unicode Consortium maintains the Unicode character encoding standard, in which every character in the world is given a unique number. Its members include technology companies like Microsoft, Google, Adobe … The Ministry of Information Technology and the Tamil Nadu government are the only two members from India.

The Unicode Consortium promises to enable people around the world to use computers in their own language. It publishes a character set/table which contains all the characters of the supported languages of the world, along with locale data specific to each language. Locale-specific data includes details of the local currency, time format, days of the week…

The Unicode Consortium has different types of members, including individuals and organizations. Individuals do not have voting rights. The different levels of membership and the privileges associated with each level are explained at http://www.unicode.org/consortium/levels.html

Why is it not a Free/Open Standard? Prohibitive costs for participation.

It fails the second part of the fourth point in the Open Standards definition, i.e. “managed in a process open to the equal participation of competitors and third parties”. This is arguably a weak point, though, as participation in standardisation is generally accepted to be costly and people without money or power are generally not expected to participate. Full membership costs 15,000 USD, i.e. around 7 lakh INR. (If you don’t have that much money* you can be a half member with half a vote for only 7,500 USD!) Individuals and students get lower rates of 150 USD and 50 USD respectively, but they cannot vote.

In addition to failing the Open Standards criteria, its basic premise of encoding the representational forms fails to address the logic of conjunct formation inherent to many languages including most Indian languages.

Some issues came up in the recent past that encouraged me to look at the whole process in more basic terms.

1. Lifetime of data

One important characteristic of a digital standard is being able to interact with (read/write) the data at any point in its lifetime. What use does a digital standard serve if, years from now, it cannot handle the data created today?

When the issue of giving separate code points to Malayalam chillu characters was discussed, questions were raised about handling existing data encoded in the current standard. Unicode did not respond to this concern officially; there were only comments floating around on the Indic Unicode discussion email list. The suggestion was to convert the data to the new encoding, as was done for Myanmar.

“Myanmar is document[ed] as an exceptional case. If more languages like Malayalam, Sinhala, Hindi, Marathi etc later be added to this exception list, it is better to amend the Stability policy and declare any sequence or codepoint may be deprecated at any time if any bureaucratic request is made and the committee becomes “convinced”.

Can the committee members be held responsible for any damages made by encoding Chillus atomically ? ”

Ralminov Rosnovski on 08th August 2007 (Indic list)

The responsibility for being able to decode data at a later time is bounced to the special abilities of applications. This is a serious loss of credibility for the standard.
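To make the "convert the data" suggestion concrete, here is a minimal sketch of what such a conversion could look like. The function name and the mapping table are my own illustration, assuming the older convention of writing a chillu as consonant + chandrakkala + ZWJ; it is not an official conversion tool, and a real migration would have to deal with many more cases.

```python
# Sketch only: the mapping below is an assumption made for illustration,
# not an official Unicode conversion table.
VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA (chandrakkala)
ZWJ = "\u200d"     # ZERO WIDTH JOINER

# Older style chillu sequence (consonant + chandrakkala + ZWJ) -> atomic chillu
OLD_TO_ATOMIC = {
    "\u0d23" + VIRAMA + ZWJ: "\u0d7a",  # NNA -> CHILLU NN
    "\u0d28" + VIRAMA + ZWJ: "\u0d7b",  # NA  -> CHILLU N
    "\u0d30" + VIRAMA + ZWJ: "\u0d7c",  # RA  -> CHILLU RR
    "\u0d32" + VIRAMA + ZWJ: "\u0d7d",  # LA  -> CHILLU L
    "\u0d33" + VIRAMA + ZWJ: "\u0d7e",  # LLA -> CHILLU LL
}


def to_atomic_chillus(text: str) -> str:
    """Replace every ZWJ-based chillu sequence with its atomic code point."""
    for old, new in OLD_TO_ATOMIC.items():
        text = text.replace(old, new)
    return text


old_chillu_n = "\u0d28" + VIRAMA + ZWJ
print([f"U+{ord(ch):04X}" for ch in to_atomic_chillus(old_chillu_n)])
```

The point of the sketch is that every application holding existing Malayalam text would need something like this, and would need to know when to run it.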

2. Backward compatibility/Stability Policy

The Unicode Technical Committee (UTC) promises to keep its standard backward compatible with earlier versions. It has now broken that promise by encoding chillus in a different way, without providing canonical equivalence or any directive on how the existing encoding should be handled.

“In each new version of the Unicode Standard, the Unicode Consortium may add characters or make certain changes to characters that were encoded in a previous version of the standard. However, the Consortium imposes limitations on the types of changes that can be made, in an effort to minimize the impact on existing implementations. ”

Unicode Character Encoding Stability Policy ( http://unicode.org/policies/stability_policy.html )
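The lack of canonical equivalence can be checked with a few lines of Python. This is a small sketch using the standard `unicodedata` module; the variable and function names are mine, and the "old" sequence assumes the consonant + chandrakkala + ZWJ convention for chillu na.

```python
import unicodedata

OLD_CHILLU_N = "\u0d28\u0d4d\u200d"  # NA + chandrakkala + ZWJ
ATOMIC_CHILLU_N = "\u0d7b"           # MALAYALAM LETTER CHILLU N (Unicode 5.1)


def equivalent_under(form: str, a: str, b: str) -> bool:
    """True if both strings are identical after the given normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, equivalent_under(form, OLD_CHILLU_N, ATOMIC_CHILLU_N))
```

None of the four normalization forms maps one spelling to the other, so software that relies on normalization alone treats the old and new data as different text.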

3. Dual/multiple encoding

The role of an encoding standard is to represent each character uniquely. (Unicode had decided earlier to encode only basic characters.) The logic of the language decides the encoding of conjuncts and other derived forms. An earlier proposal suggested using the zero width joiner to represent the chillu form of a consonant (or pure consonant). Now both forms can be used to encode chillu characters. In the case of the conjunct ‘nta’, two forms already exist: ‘na chandrakkala rra’, based on the conjunct formation logic of Malayalam, and ‘chillu na chandrakkala rra’ in the Unicode 5.1 standard. These two sequences are differentiated as ‘Malayalam nta’ and ‘Microsoft nta’ (the story behind this illogical encoding is explained later). More sequences are coming now that both base consonants of nta get new characters from Grantha lipi. A total of 5 encodings for nta will then be possible.
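For readers who want to see the two existing ‘nta’ sequences at the code point level, here is a short sketch. The dictionary keys and the helper name are just labels I chose for this post; the character names come from Python's `unicodedata` module.

```python
import unicodedata

# The two spellings of 'nta' discussed above, as code point sequences.
NTA_SEQUENCES = {
    "na chandrakkala rra": "\u0d28\u0d4d\u0d31",
    "chillu na chandrakkala rra": "\u0d7b\u0d4d\u0d31",
}


def describe(sequence: str) -> list:
    """List each code point of the sequence with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in sequence]


for label, sequence in NTA_SEQUENCES.items():
    print(label)
    for line in describe(sequence):
        print("   ", line)
```

The two strings differ only in their first code point, yet both are meant to display as the same conjunct, which is exactly the dual encoding problem.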

4. Selective addressing of same issues

ZWJ and ZWNJ** share the same issue of zero collation weight. Even after the encoding of atomic chillus, some words still have to use ZWNJ. Different base characters forming the same graphical representation (ra/rra chillu) was mentioned as another reason for encoding chillus atomically, but the same issue is present for the pre-base form of these characters too.

So it appears the only interest is encoding chillus atomically rather than solving the underlying issues.
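The zero-weight issue with ZWJ and ZWNJ can be illustrated with a toy comparison. This is only a sketch under a strong simplification: the key function below is hypothetical and simply drops the two joiners, which is much cruder than a real collation algorithm, but it shows how two different spellings can end up comparing as equal.

```python
ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def key_ignoring_joiners(text: str) -> str:
    """Toy comparison key that gives ZWJ and ZWNJ no weight by dropping them."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


with_zwj = "\u0d28\u0d4d" + ZWJ    # NA + chandrakkala + ZWJ
with_zwnj = "\u0d28\u0d4d" + ZWNJ  # NA + chandrakkala + ZWNJ
plain = "\u0d28\u0d4d"             # NA + chandrakkala

print(with_zwj == plain)
print(key_ignoring_joiners(with_zwj) == key_ignoring_joiners(plain))
print(key_ignoring_joiners(with_zwnj) == key_ignoring_joiners(plain))
```

Note that the ZWNJ spelling collapses in exactly the same way as the ZWJ one, which is why encoding atomic chillus addresses only half of the issue.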

5. Integrity of standardization process

A standardization body like UTC or ISO earns respect and trust based on the integrity of its process. After Microsoft was able to push OOXML through ISO by resorting to all kinds of underground tactics, ISO cannot expect to receive the same kind of respect and trust from users of its standards. Some of the recent decisions made by UTC (Microsoft nta, for example) have contributed to this erosion of trust in standardization bodies.

6. Secretive and privileged decision making process

“The atomic Chillus were accepted because Canadian representative Umamaheswaran was satisfied and showed the rest of us that there were minimal pairs that couldn’t be resolved without them.”

Michael Everson, Evertype on 3rd Aug 2007 (Indic list)

Face to face access to the members was the only criterion for arriving at a decision.

7. No consideration for logic of the language

‘nta’ was encoded to be compatible with Microsoft’s Karthika font, which did not recognise the logical sequence of ‘na chandrakkala rra’. Instead it was encoded as ‘chillu na chandrakkala rra’, forgetting the very function of a chillu character, which is to prevent a conjunct from forming when a sequence would otherwise form one.

8. Language experts are not included in the process

The recommendations of language experts were not even read by the decision makers, and personal access to the members was the single criterion for encoding atomic chillus.

“I noticed the timing of such letters by the sender(s) .. always sent a couple of days prior to an important meeting and the delegates are traveling; there is iffy internet connections to one’s own email beihind some firewall … and thereby no way to consider these email messages prior to that meeting !!”

Umamaheswaran, IBM Toronto Lab on 08th August 2007 (Indic list)

At least the honesty in admitting the fact that they did not have time to consider expert opinions is to be appreciated.

9. Proprietary business interests override language logic and expert opinions

Even though a proposed change has far reaching consequences, there is no responsibility on the part of the proposer to respond to concerns.

“You’re asking supporters to make their case. Note, though, that with the characters included in the ballot, theirs becomes the default position: they don’t need to answer these questions to achieve their goal unless they have reason to believe that ISO member national bodies will be voting against the ballot with comments that these characters need to be removed from the ballot.
At this point, for better or worse, I suspect it is highly unlikely that national bodies will ask for these to be removed from the ballot.”

Peter Constable, Microsoft on 3rd Aug 2007 (Indic list)

10. UTC is there to serve the interest of a handful of proprietary companies

Microsoft’s Karthika font did not form a conjunct with ‘na chandrakkala rra’, so the named sequence for nta is specified as ‘chillu na chandrakkala rra’. This is like cutting your leg to fit the shoe.

This is the case with a closely monitored language like Malayalam. We can only imagine how other languages are affected. Encoding Dravidian zha in Devanagari is one case that comes to mind.

So how do we move forward? Time for making some difficult choices.

Unicode or Malayalam?

* You are eligible for this offer only if you have less than 500 employees.
** Zero Width Joiner (ZWJ) and Zero Width Non Joiner (ZWNJ) are Unicode special characters used in Malayalam. (They are also used by many other languages like Sinhala and Persian.)

PS: There is not a single Malayalam font that I know of that is Unicode 5.1 compliant.