Introduction to Extended Characters
A variety of solutions exist to overcome the differences between
character sets with a 1:1 relation between bytes and characters and
character sets with ratios of 2:1 or 4:1. The remainder of this
section gives a few examples to help understand the design decisions
made while developing the functionality of the C library.
A distinction we have to make right away is between internal and
external representation. Internal representation means the
representation used by a program while keeping the text in memory.
External representations are used when text is stored or transmitted
through whatever communication channel. Examples of external
representations include files lying in a directory that are going to be
read and parsed.
Traditionally there has been no difference between the two representations.
It was equally comfortable and useful to use the same single-byte
representation internally and externally. This changes with more and
larger character sets.
One of the problems to overcome with the internal representation is
handling text that is externally encoded using different character
sets. Assume a program which reads two texts and compares them using
some metric. The comparison can be usefully done only if the texts are
internally kept in a common format.
For such a common format (= character set) eight bits are certainly
no longer enough. So the smallest entity will have to grow: wide
characters will now be used. Instead of one byte per character, two or
four will be used. (Three bytes per character are awkward to address
in memory, and more than four bytes do not seem necessary.)
As shown in some other part of this manual,
there exists a completely new family of functions which can handle texts
of this kind in memory. The most commonly used character sets for such
internal wide character representations are Unicode and ISO 10646
(also known as UCS for Universal Character Set). Unicode was originally
planned as a 16-bit character set, whereas ISO 10646 was designed to
be a 31-bit large code space. The two standards are practically identical.
They have the same character repertoire and code table, but Unicode specifies
added semantics. At the moment, only characters in the first 0x10000
code positions (the so-called Basic Multilingual Plane, BMP) have been
assigned, but the assignment of more specialized characters outside this
16-bit space is already in progress. A number of encodings have been
defined for Unicode and ISO 10646 characters:
- UCS-2 is a 16-bit word that can only represent characters from the BMP.
- UCS-4 is a 32-bit word that can represent any Unicode and ISO 10646
character.
- UTF-8 is an ASCII-compatible encoding where ASCII characters are
represented by ASCII bytes and non-ASCII characters by sequences of
2-6 non-ASCII bytes.
- UTF-16 is an extension of UCS-2 in which pairs of certain UCS-2 words
can be used to encode non-BMP characters up to 0x10ffff.
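To make the UTF-16 extension concrete, the following sketch splits a
non-BMP code point into the surrogate pair of 16-bit words just
described. The helper name utf16_surrogates is invented for this
illustration; it is not a library function.

     /* Illustrative only: split a code point in the range
        0x10000..0x10ffff into a UTF-16 surrogate pair.  */
     #include <stdint.h>
     #include <stdio.h>

     static void
     utf16_surrogates (uint32_t cp, uint16_t *high, uint16_t *low)
     {
       cp -= 0x10000;                  /* 20 bits remain */
       *high = 0xd800 | (cp >> 10);    /* upper 10 bits */
       *low = 0xdc00 | (cp & 0x3ff);   /* lower 10 bits */
     }

     int
     main (void)
     {
       uint16_t hi, lo;
       utf16_surrogates (0x1d11e, &hi, &lo);   /* MUSICAL SYMBOL G CLEF */
       printf ("U+1D11E -> 0x%04x 0x%04x\n", hi, lo);   /* 0xd834 0xdd1e */
       return 0;
     }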
To represent wide characters the char type is not suitable. For
this reason the ISO C standard introduces a new type, wchar_t, which is
designed to hold one character of a wide character string. To maintain
the similarity there is also a type, wint_t, corresponding to int for
those functions which take a single wide character.

wchar_t is used as the base type for wide character strings.
I.e., arrays of objects of this type are the equivalent of char[]
for multibyte character strings. The type is defined in stddef.h.
The ISO C90 standard, where this type was introduced, does not say
anything specific about the representation. It only requires that this
type is capable of storing all elements of the basic character set.
Therefore it would be legitimate to define wchar_t as
char. This might make sense for embedded systems.
But for GNU systems this type is always 32 bits wide. It is therefore
capable of representing all UCS-4 values and so covers all of
ISO 10646. Some Unix systems define wchar_t as a 16-bit type and
thereby follow Unicode very strictly. This is perfectly fine with the
standard, but it also means that to represent all characters from Unicode
and ISO 10646 one has to use UTF-16 surrogate characters, which is in
fact a multi-wide-character encoding. But this contradicts the purpose
of the wchar_t type.
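A minimal sketch, assuming a GNU system with a 32-bit wchar_t, shows
the difference between the number of characters in a wide string and
the number of bytes it occupies in memory; wcslen is the standard wide
string length function:

     #include <stdio.h>
     #include <wchar.h>

     int
     main (void)
     {
       wchar_t text[] = L"abc";    /* wide character string */

       printf ("sizeof (wchar_t) = %zu\n", sizeof (wchar_t)); /* 4 on GNU systems */
       printf ("characters = %zu\n", wcslen (text));          /* 3 */
       printf ("bytes in memory = %zu\n", sizeof text);       /* 16, incl. L'\0' */
       return 0;
     }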
wint_t is a data type used for parameters and variables which
contain a single wide character. As the name suggests it is the
equivalent of int when using normal char strings. The
types wchar_t and wint_t often have the same
representation if their size is 32 bits, but if wchar_t is
defined as char the type wint_t must be defined as
int due to parameter promotion.
This type is defined in wchar.h and was introduced in
Amendment 1 to ISO C90.
As for the char data type, there also exist macros
specifying the minimum and maximum value representable in an object of
type wchar_t.
The macro WCHAR_MIN evaluates to the minimum value representable
by an object of type wchar_t.
This macro was introduced in Amendment 1 to ISO C90.
The macro WCHAR_MAX evaluates to the maximum value representable
by an object of type wchar_t.
This macro was introduced in Amendment 1 to ISO C90.
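A short example simply prints the two limits, using intmax_t so that
the printf format does not depend on how wchar_t happens to be defined:

     #include <stdio.h>
     #include <stdint.h>
     #include <wchar.h>

     int
     main (void)
     {
       /* On GNU systems wchar_t is 32 bits wide, so these limits
          cover the whole UCS-4 range.  */
       printf ("WCHAR_MIN = %jd\n", (intmax_t) WCHAR_MIN);
       printf ("WCHAR_MAX = %jd\n", (intmax_t) WCHAR_MAX);
       return 0;
     }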
Another special wide character value is the equivalent of EOF.
The macro WEOF evaluates to a constant expression of type
wint_t whose value is different from any member of the extended
character set.
WEOF need not be the same value as EOF and unlike
EOF it also need not be negative. I.e., sloppy code like
     {
       int c;
       ...
       while ((c = getc (fp)) < 0)
         ...
     }
has to be rewritten to explicitly use WEOF when wide characters
are used.
     {
       wint_t c;
       ...
       while ((c = getwc (fp)) != WEOF)
         ...
     }
This macro was introduced in Amendment 1 to ISO C90 and is
defined in wchar.h.
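Putting the pieces together, here is a small complete program built on
the corrected loop above. It counts the wide characters on standard
input; the setlocale call is needed so that multibyte input is
interpreted according to the user's locale (a UTF-8 locale, for
example).

     #include <locale.h>
     #include <stdio.h>
     #include <wchar.h>

     int
     main (void)
     {
       wint_t c;
       unsigned long count = 0;

       setlocale (LC_ALL, "");             /* select the user's locale */
       while ((c = getwc (stdin)) != WEOF)
         ++count;
       printf ("%lu wide characters read\n", count);
       return 0;
     }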
These internal representations present problems when it comes to storage
and transmission: since a single wide character consists of more
than one byte, it is affected by byte ordering. I.e., machines with
different byte order would see different values when accessing the same
data. This also applies to communication protocols, which are all
byte-based, so the sender has to decide how to split the wide
character into bytes. A last (but not least important) point is that wide
characters often require more storage space than a customized
byte-oriented character set.
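The byte-ordering issue can be made visible with a small sketch that
dumps the bytes a single wide character occupies in memory; the output
noted in the comments assumes a GNU system with a 32-bit wchar_t:

     #include <stdio.h>
     #include <wchar.h>

     int
     main (void)
     {
       wchar_t wc = L'A';                        /* 0x00000041 in UCS-4 */
       unsigned char *p = (unsigned char *) &wc;

       for (size_t i = 0; i < sizeof wc; ++i)
         printf ("%02x ", p[i]);    /* "41 00 00 00" little endian,
                                       "00 00 00 41" big endian */
       putchar ('\n');
       return 0;
     }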
For all the above reasons, an external encoding which is different
from the internal encoding is often used if the latter is UCS-2 or UCS-4.
The external encoding is byte-based and can be chosen appropriately for
the environment and for the texts to be handled. A variety of different
character sets can be used for this external encoding. This information
will not be exhaustively presented here; instead, a description of the
major groups will suffice. All of the ASCII-based character sets fulfill
one requirement: they are "filesystem safe". This means that the
character '/' is used in the encoding only to represent itself. Things
are a bit different for character sets like EBCDIC (Extended Binary
Coded Decimal Interchange Code, a character set family used by IBM),
but if the operating system does not understand EBCDIC directly the
parameters to system calls have to be converted first anyhow.
- The simplest character sets are single-byte character sets. There can be
only up to 256 characters (for 8-bit character sets), which is not
sufficient to cover all languages but might be sufficient to handle a
specific text. Another reason to choose such a character set is
constraints from interaction with other programs (which might not be
8-bit clean).
- The ISO 2022 standard defines a mechanism for extended character
sets where one character can be represented by more than one
byte. This is achieved by associating a state with the text. Embedded
in the text can be characters which can be used to change the state.
Each byte in the text might have a different interpretation in each
state. The state might even influence whether a given byte stands for a
character on its own or whether it has to be combined with some more
bytes.
In most uses of ISO 2022 the defined character sets do not allow
state changes which cover more than the next character. This has the
big advantage that whenever one can identify the beginning of the byte
sequence of a character one can interpret a text correctly. Examples of
character sets using this policy are the various EUC character sets
(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
or SJIS (Shift-JIS, a Japanese encoding).
But there are also character sets using a state which is valid for more
than one character and has to be changed by another byte sequence.
Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
- Early attempts to fix 8-bit character sets for other languages using the
Roman alphabet led to character sets like ISO 6937. Here bytes
representing characters like the acute accent do not produce output
themselves: one has to combine them with other characters to get the
desired result. E.g., one writes the byte sequence 0xc2 0x61
(non-spacing acute accent, followed by lower-case `a') to get the
"small a with acute" character. To get the acute accent character on
its own, one has to write 0xc2 0x20 (the non-spacing acute followed by
a space). This type of character set is used in some embedded systems
such as teletex.
- Instead of converting the Unicode or ISO 10646 text used internally,
it is often also sufficient to simply use an encoding different from
UCS-2/UCS-4. The Unicode and ISO 10646 standards even specify such an
encoding: UTF-8. This encoding is able to represent the full 31-bit
code space of ISO 10646 in byte strings of length one to six (see the
sketch after this list).
There were a few other attempts to encode ISO 10646, such as UTF-7,
but UTF-8 is today the only encoding which should be used. In fact,
UTF-8 will hopefully soon be the only external encoding that has to be
supported. It proves to be universally usable and its only disadvantage
is that it favors Roman languages by making the byte string
representation of other scripts (Cyrillic, Greek, Asian scripts) longer
than necessary if using a specific character set for these scripts.
Methods like the Unicode compression scheme can alleviate these
problems.
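As announced in the UTF-8 item above, here is a sketch of the
one-to-six byte rule for the original 31-bit ISO 10646 code space. The
function encode_utf8 is invented for this illustration and is not a
C library interface:

     /* Sketch of the UTF-8 rule described above: encode a 31-bit
        ISO 10646 value in 1 to 6 bytes.  Not a library function.  */
     #include <stdint.h>
     #include <stdio.h>

     static int
     encode_utf8 (uint32_t c, unsigned char *buf)
     {
       int len;
       if (c < 0x80) { buf[0] = c; return 1; }   /* plain ASCII byte */
       else if (c < 0x800) len = 2;              /* up to 11 bits */
       else if (c < 0x10000) len = 3;            /* up to 16 bits */
       else if (c < 0x200000) len = 4;           /* up to 21 bits */
       else if (c < 0x4000000) len = 5;          /* up to 26 bits */
       else len = 6;                             /* up to 31 bits */

       /* Trailing bytes carry six bits each and start with binary 10.  */
       for (int i = len - 1; i > 0; --i)
         {
           buf[i] = 0x80 | (c & 0x3f);
           c >>= 6;
         }
       /* The first byte starts with len one bits followed by a zero bit.  */
       buf[0] = ((0xff00 >> len) & 0xff) | c;
       return len;
     }

     int
     main (void)
     {
       unsigned char buf[6];
       int n = encode_utf8 (0x20ac, buf);        /* U+20AC, EURO SIGN */
       for (int i = 0; i < n; ++i)
         printf ("%02x ", buf[i]);               /* prints "e2 82 ac" */
       putchar ('\n');
       return 0;
     }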
The question remaining is: how to select the character set or encoding
to use. The answer: you cannot decide about it yourself, it is decided
by the developers of the system or the majority of the users. Since the
goal is interoperability one has to use whatever the other people one
works with use. If there are no constraints the selection is based on
the requirements the expected circle of users will have. I.e., if a
project is expected to only be used in, say, Russia it is fine to use
KOI8-R or a similar character set. But if at the same time people from,
say, Greece are participating one should use a character set which allows
all people to collaborate.
The most widely useful solution seems to be: go with the most general
character set, namely ISO 10646. Use UTF-8 as the external encoding
and problems about users not being able to use their own language
adequately are a thing of the past.
One final comment about the choice of the wide character representation
is necessary at this point. We have said above that the natural choice
is using Unicode or ISO 10646. This is not required, but at least
encouraged, by the ISO C standard. The standard defines the macro
__STDC_ISO_10646__, which is defined only on systems where
the wchar_t type encodes ISO 10646 characters. If this
symbol is not defined one should as much as possible avoid making
assumptions about the wide character representation. If the programmer
uses only the functions provided by the C library to handle wide
character strings there should not be any compatibility problems with
other systems.
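For example, a program can test the macro before relying on specific
wchar_t code point values; the messages below are of course only for
the illustration:

     #include <stdio.h>

     int
     main (void)
     {
     #ifdef __STDC_ISO_10646__
       /* wchar_t values are ISO 10646 code points on this system.  */
       printf ("wchar_t encodes ISO 10646 (value %ld)\n",
               (long) __STDC_ISO_10646__);
     #else
       /* Make no assumptions; stick to the standard wide character
          functions of the C library.  */
       printf ("no ISO 10646 guarantee for wchar_t\n");
     #endif
       return 0;
     }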