CURRENT_MEETING_REPORT_

Reported by Borka Jerman-Blazic/Jozef Stefan Institute

Minutes of the UCS Character Set BOF (UCS)

Introduction

A brief introductory tutorial was given by Borka Jerman-Blazic. She described some of the problems which appear on the network due to the lack of support for the national character sets used for inputting, outputting, processing and displaying text written in languages used all over the world. She stressed the need for proper maintenance of character integrity over the network. The requirement for processing and interchanging different character sets correctly is especially relevant for Internet services dealing with names of persons or organizations.

Presentation of the Problems

Peter Svanberg gave a short overview of the level of support for non-ASCII character sets in different Internet protocols. Some of the protocols were identified as hostile to 8-bit characters. Among them are: DNS, SMTP, FTP, NNTP, WAIS, MIME Text/Enhanced, NFS, AFS, Whois, URN, Gopher, etc. The more recently developed protocols, such as MIME part 1 and part 2, as well as some currently on-going projects such as Whois++, as mentioned by Simon Spero, support 16-bit coding and the repertoires provided by such coding. He also mentioned that several IETF groups developing new protocols/services consider proper support of character sets to be an important problem. The level of support for extended character sets in some protocols used on the Internet is included in the Annex below.

The next speaker was Masataka Ohta. He presented his view regarding the idea that the International Universal Coding system be recommended for use over the Internet. He identified five properties which are required to be present in the recommended coding system:

   1. Identity for encoding and decoding, which he understands as a unique mapping between a particular graphic character and its code (bit combination);

   2. Causality, understood as independence of a processed coded character from the other incoming characters in the data stream;

   3. Finite state recognition, i.e., bounded state dependence of the code required for presentation/display of multi-octet coded data;

   4. Finite resynchronizability, which means that the state of the automaton can be determined uniquely by reading a fixed, finite number of octets; and

   5. Equality, the requirement that a character coded with a different coding system can always be recognized as the same character.

Masataka looked for the required properties in ISO 10646 and found that full ISO 10646 (UCS4) satisfies none of them. He also pointed out that ISO 10646 level 1 satisfies all of the required properties for the European languages. He proposed extending the existing UCS code system with five additional bits, which would allow the deficiencies of the UCS coding system to be overcome.

The discussion showed that the proposed solution is not in the mainstream of the development of standard character set codes and their applications in computing systems. One possible solution to the problems identified by Masataka could be the use of the whole model of UCS, i.e., the four envisaged octets, which define the cell and row position of a character in the Basic Multilingual Plane of ISO 10646 as well as in additional planes and groups. There was a proposal that the required five additional bits be coded as a private plane in the UCS scheme. John Klensin noted that such an approach could clash with the reassignment of such a plane in the standardization process of ISO JTC1/SC2.

In the discussion the problem of handling bidirectional text was also identified. Masataka said that one of the five additional bits in his scheme is intended to indicate bidirectional text.
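The finite resynchronizability property listed above can be illustrated with a minimal sketch. It uses UTF-8 as the example encoding, which is an assumption made here for illustration and not something presented at the BOF; the helper names is_continuation and resync are likewise hypothetical. The sketch shows a receiver that starts reading at an arbitrary octet offset and finds the next character boundary after skipping only a small, bounded number of octets.

```python
def is_continuation(octet: int) -> bool:
    """True for UTF-8 continuation octets, which have the bit pattern 10xxxxxx."""
    return (octet & 0xC0) == 0x80


def resync(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after pos.

    Assumes data is valid UTF-8. A character then occupies at most 4 octets,
    so at most 3 continuation octets are skipped before a boundary is found.
    """
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


text = "naïve 日本語"
encoded = text.encode("utf-8")

# Start reading at every possible octet offset, as a receiver that joins
# the stream in the middle would.
for start in range(len(encoded)):
    boundary = resync(encoded, start)
    # The number of octets skipped is bounded, regardless of stream length.
    assert boundary - start <= 3
    # Decoding from the boundary succeeds and yields a suffix of the text.
    assert text.endswith(encoded[boundary:].decode("utf-8"))
```

By contrast, a stateful scheme based on ISO 2022 code extension depends on the most recent escape sequence, which may lie arbitrarily far back in the stream, so no fixed number of octets is guaranteed to recover the decoder state.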
Harald Alvestrand pointed out that what is happening now is a sort of transition period between 8-bit coding and the 16-bit coding provided by UCS. Another parallel stream for support of different national character sets is ``character switching'', which is enabled by use of the code extension technique of ISO 2022. It was obvious that this scheme is not of practical use for the Internet except for special cases, e.g., the Japanese e-mail solution.

Conclusions

The attendees then discussed possible work items which will result if the IESG approves the formation of a working group. The chair identified several documents which deal with character set problems, such as: RFC 1345, ``Character Mnemonics & Character Sets,'' the Internet-Draft, ``X.400 use of extended character sets,'' and the Internet-Draft, ``Characters and character sets for various languages.'' John Klensin pointed out that special precautions have to be taken in recommending UTF-2 as a data interchange method over the Internet, in connection with the possible assignment of additional coding planes by JTC1/SC2. He also recommended the use of a mailing list already working within the IETF, ietf-charsets@innosoft.com. The mailing list of the RARE working group on character sets could be added to that mailing list.

Other items were discussed and proposed by the BOF attendees. It was decided that the IESG will be asked to consider the possibility of setting up a working group to produce the following:

o A document defining how UCS can be used in a uniform way in Internet protocols, especially taking into consideration the UTF-2 encoding of UCS. The document will provide guidance to other protocols which have to deal with these items over the Internet.
o A document identifying the languages and the characters required for coding text written in a particular natural language (a sort of guideline for services dealing with multilinguality, such as NIR services based on the usage of plain text).
o A document defining a tool for coded character set conversion to be provided within some services, such as an e-mail user agent, including fall-back representation of incoming characters that are outside the supported character repertoire of the receiver.
o A proposal for extending the mandatory issues which have to be covered in the RFC standardization process to include character set consideration and support.

Annex

The level of support for extended character sets in some Internet Standard protocols.

 _________________________________________________________________________
| CharSet |                      | CharSet |                              |
| Support | Protocol             | Support | ``Next Generation'' Protocol |
|_________|______________________|_________|______________________________|
|    1    | SMTP                 |    3    | ESMTP                        |
|    1    | RFC822               |    4    | MIME part 1 + part 2         |
|    1    | DNS                  |         |                              |
|    2    | FTP                  |         |                              |
|    3    | Telnet               |         |                              |
|    2    | NNTP                 |         |                              |
|    2    | Finger               |         |                              |
|    2    | POP3                 |         |                              |
|    2    | IMAP2                |    3    | IMAP2bis                     |
|    1    | NFS                  |         |                              |
|    1    | AFS                  |         |                              |
|    2    | MIME Text/Enhanced   |         |                              |
|    ?    | MIME Text/simplemail |         |                              |
|    3    | STIF                 |         |                              |
|    2    | Gopher               |    3    | Gopher+                      |
|    1    | WAIS                 |         |                              |
|    ?    | Prospero             |         |                              |
|    2    | HTML                 |         |                              |
|    2    | Whois                |    3    | Whois++                      |
|    2    | URL                  |         |                              |
|    2    | URN                  |         |                              |
|    3    | URM                  |         |                              |
|_________|______________________|_________|______________________________|

Legend:

   1 -- hostile to 8-bit characters
   2 -- no support for different character sets
   3 -- some support for different character sets
   4 -- well thought-out support for different character sets
   5 -- uniform treatment of all characters

Attendees

Harald Alvestrand        Harald.Alvestrand@uninett.no
Piet Bovenga             p.bovenga@uci.kun.nl
Maria Dimou-Zacharova    dimou@dxcern.cern.ch
Tim Dixon                dixon@rare.nl
Olle Jarnefors           ojarnef@admin.kth.se
Borka Jerman-Blazic      jerman-blazic@ijs.si
Tomaz Kalin              kalin@rare.nl
John Klensin             Klensin@infoods.unu.edu
Pekka Kytolaakso         pekka.kytolaakso@csc.fi
Thomas Lenggenhager      lenggenhager@switch.ch
Jun Matsukata            jm@eng.isas.ac.jp
Keith Moore              moore@cs.utk.edu
Masataka Ohta            mohta@cc.titech.ac.jp
Geir Pedersen            Geir.Pedersen@usit.uio.no
Luc Rooijakkers          lwj@cs.kun.nl
Rickard Schoultz         schoultz@admin.kth.se
Milan Sova               sova@feld.cvut.cz
Simon Spero              simon_spero@unc.edu
Peter Svanberg           psv@nada.kth.se
Guido van Rossum         guido@cwi.nl