Internationalization of URLs
(mduerst@ifi.unizh.ch) This
represents my view as of 1996 and is outdated. For an update, click
here. This document summarizes my impressions and insights from earlier
discussions on the net, as well as some proposals for progress.
Overview
URLs are identifiers for resources on the internet. Due to the way their
syntax is currently defined, they are in some cases obviously self-explanatory
(hypothetical example: mailto:john.smith@ibm.com), whereas in other cases
they are not (imagine a similar address in Japan that
has to be written in Latin letters to conform to URL syntax, but that is
much less understandable to a reader who is used mainly to Japanese characters).
There is therefore a clear need for the internationalization
of URLs. However, this need has so far met only with obstacles and
has not yet found an appropriate solution.
URLs as Seen by their Inventors
To understand the full magnitude of the problem of internationalizing
URLs, it is important to understand what the creators of URLs and related
items (URNs, URIs,...) had intended them for.
- The general URL syntax only fixes the first part of the URL, up to
a ":", which indicates the "scheme". Although some commonalities
exist among schemes with similar specification requirements, the greater
part of a URL is more related to the specific scheme/protocol than to
URLs in general.
- URLs use (ASCII) characters for their representation on paper or monitors,
but they actually are defined as a sequence of octets, without any general
specification of the character semantics of these octets.
- Any octet value can be represented, but for those octet values outside
a well-defined set (coinciding with a subset of corresponding printable
ASCII characters), an escape mechanism of the form %XX (the XX representing
the octet in hexadecimal representation) has to be used.
- Their inventors see URLs as abstract identifiers that, much like
telephone numbers, do not need to convey any association with
the content or location of the resources they point to.
- URLs are mainly intended for computer use, and not necessarily for
direct use by a human.
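The octet view and the %XX escape mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the octet sequence is a made-up example, and urllib.parse is a present-day Python library, not something the URL definition refers to.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Hypothetical octet sequence for part of a URL. The octet 0xFC lies outside
# the set that may appear unescaped, so it has to be written as %FC.
raw = b"md\xfcrst"

escaped = quote_from_bytes(raw)
print(escaped)  # md%FCrst

# Unescaping gives back octets, not characters.
recovered = unquote_to_bytes(escaped)

# The URL does not say which characters the octets stand for: the same
# octets read differently under two different (assumed) character sets.
as_latin1 = recovered.decode("iso-8859-1")
as_cyrillic = recovered.decode("iso-8859-5")
print(as_latin1 == as_cyrillic)  # False
```

The last two lines show why the lack of any general character semantics matters: the escaped form is unambiguous as a sequence of octets, but not as text.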
URLs in Practice
The practical daily use of URLs contrasts with the abstract view above:
- URLs are more or less mnemonic, and are designed to be mnemonic. In
some cases, this results from the fact that some schemes more or less directly
represent notations such as domain names, file names, and user names that
are themselves mostly mnemonic. In other cases, these are specifically
arranged to result in self-explanatory and easily memorable URLs.
- Memorable and self-explanatory URLs are a very important competitive
advantage on the internet, as can be seen from problems with domain name
reservations.
- URLs are, in the overwhelming majority of cases, interpreted as
ASCII character strings, i.e. despite the fact that this is not specified
in the URL definition, there is a one-to-one correspondence between octet
values and meaningful ASCII characters.
- URLs are used not only among computers, but in daily life on paper.
The main places today are the press and name cards.
- The fact that only English and very few other languages can be directly
represented in URLs (assuming ASCII interpretation) puts most users
worldwide at a competitive disadvantage.
- There is also the problem of identification. Only users of English (and
a few other languages) find their names, the names of their companies,
and so on, in URLs in the form they are used to seeing in daily life.
Urgency of Internationalization of Different Schemes and Parts
For different schemes/protocols and for different parts of these schemes,
e.g. representing domain names, path names, file names, or arbitrary data,
the urgency of internationalization differs widely. Roughly, the following
categories can be distinguished:
- Resources that should be accessible by anyone around the world, without
any specific script, language, or keyboard knowledge. Email addresses are
a good example. If my email address were mdürst@ifi.unizh.ch, I could
easily print this on my name card. But people trying to contact me
from the U.S. or Japan, or many other countries, would probably not be
able to type this address into their computer.
- Resources that are in a particular language or script, and for which
an identification in the same language and script therefore seems most appropriate.
An English/ASCII identification may help an outsider understand some general
structure, but it may well be argued that a reader who cannot make sense
of the identifier would not be able to understand the resource itself
anyway. Examples here are document
names as they appear frequently in the ftp and http schemes.
- Resource identifications that are in and of themselves in a particular
language and script. As an example, take the input to a Japanese-English
dictionary, formulated as a URL with query part.
Some Possible Solutions
The need for serious improvement of the situation is well recognized
among the community concerned with internationalization. It seems, however,
to be less of a concern for the communities controlling URLs in general
and for those controlling the respective protocols. The decisive factor
for a serious internationalization of URLs might therefore be more organizational/political
than technical. On the technical side, solutions that don't break existing
implementations (difficult to check) and are otherwise unobtrusive may
be preferred. On the organizational level, I see it as important that the
I18N community agrees on
- What do we really need internationalization for, and where can we live
without it?
- What is our preferred solution?
Without such a consensus, I see it as very difficult to get others to accept
our proposals. Before becoming too pessimistic, here are a few actual proposals,
some of which can be combined:
- Define a "default" character semantic for URL octet sequences.
Obviously, this would be ASCII+something.
- One possibility is UTF-8, as proposed by François Yergeau. It
looks feasible for Western European languages, but the problem is that
it would expand a single ideographic character into NINE URL characters
(an ideographic character being represented with 3 octets in UTF-8, each
of which again is expanded to 3 characters with the %XX URL escape mechanism).
- Another possibility is UTF-7. UTF-7 has very good expansion properties,
in particular because the characters used from BASE64 do not have to
be escaped by %XX (maybe with the exception of "/"). UTF-7 might,
on the other hand, clash with already existing URLs, e.g. a URL ftp://ftp.xxx.com/a+-b,
whose current interpretation for the last part is "a+-b", which
however with UTF-7 would mean "a+b", or, to conserve the original
meaning, would have to be written as "a+--b". This problem is
probably about as large as the problem of clashes in case UTF-8 is used.
- Use a special escape mechanism for extended character semantics. A
simple proposal would be to prefix all UTF-7 sequences by a "%".
Because in current URLs, the sequence "%+" does not appear, this
is completely unambiguous. For the above example, it would be either "a+-b"
or "a%+--b", both expressing the same thing. The problem here is that
such a syntax might break existing implementations.
- Another convention might be to indicate the encoding (MIME "charset")
directly in the URL, as is done in MIME headers. Clearly, this might have
been necessary at the time the extensions for MIME headers were designed,
but now that UCS is available, it would be a great anachronism.
- A special problem here is that with such conventions, there will be
two representations of URLs, both of which might be necessary for a transitional
phase. For example, it might be possible to give a URL with Chinese characters,
but its representation in standard URL syntax (with %XX escapes and
so on) would also be necessary for use on old-style browsers.
- Introduce new protocols or protocol features that work with internationalization.
E.g. define a new protocol "iftp" and the corresponding URL scheme
that allows filenames from UCS (universal character set), with conversions
from local character sets used to encode filenames to a common notation.
- Introduce conventions that help in identifying already used character
encodings. For example, add a facility to the ftp protocol that allows
the browser to know in what encoding the server keeps its file names, so
that it can translate these file names appropriately for the user. This
could be as easy as defining a specific file to contain the MIME "charset"
parameter value for the server-wide filename encoding.
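The expansion figures and the UTF-7 clash discussed in the UTF-8 and UTF-7 proposals above can be checked with a short Python sketch. It assumes Python's built-in "utf-8" and "utf-7" codecs as stand-ins for the proposed conventions; the helper names utf8_escaped and utf7_form are hypothetical and exist only for this illustration.

```python
from urllib.parse import quote

def utf8_escaped(text: str) -> str:
    """Percent-escape the UTF-8 octets of text, as the UTF-8 proposal implies."""
    return quote(text, safe="", encoding="utf-8")

def utf7_form(text: str) -> str:
    """UTF-7 form of text, using Python's built-in 'utf-7' codec."""
    return text.encode("utf-7").decode("ascii")

ideograph = "\u65e5"  # a single ideographic character (U+65E5)

# UTF-8: three octets, each expanded to three URL characters by %XX.
print(utf8_escaped(ideograph), len(utf8_escaped(ideograph)))  # %E6%97%A5 9

# UTF-7: a shorter form built from BASE64 characters.
print(len(utf7_form(ideograph)) < len(utf8_escaped(ideograph)))  # True

# The clash with existing URLs: under UTF-7, "+-" stands for a literal "+".
existing = "a+-b"
print(existing.encode("ascii").decode("utf-7"))  # a+b
print(utf7_form(existing))                       # a+--b
```

Note that BASE64 output can contain "/", which is the exception mentioned above and would still need attention in a path component.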
Other Proposals/Discussions
Conclusions
There is great need for an internationalization of URLs, but there are
still many problems on a conceptual level, on a technical level, and
on an organizational/political level that have to be attacked. I hope that
we can make as much progress as possible at the upcoming workshop.
Last updated May 4th, 1996, mduerst@ifi.unizh.ch