A variety of solutions exist to overcome the differences between character sets with a 1:1 relation between bytes and characters and character sets with ratios of 2:1 or 4:1. The remainder of this section gives a few examples to help understand the design decisions made while developing the functionality of the C library.
A distinction we have to make right away is between internal and external representation. Internal representation means the representation used by a program while keeping the text in memory. External representations are used when text is stored or transmitted through some communication channel. Examples of external representations are files lying in a directory that are going to be read and parsed.
Traditionally there was no difference between the two representations. It was equally comfortable and useful to use the same one-byte representation internally and externally. This changes with more and larger character sets.
One of the problems to overcome with the internal representation is handling text which is externally encoded using different character sets. Assume a program which reads two texts and compares them using some metric. The comparison can be usefully done only if the texts are internally kept in a common format.
For such a common format (= character set) eight bits are certainly no longer enough. So the smallest entity will have to grow: wide characters will now be used. Instead of one byte per character, two or four will be used. (Three bytes are not good to address in memory, and more than four bytes do not seem to be necessary.)
As shown in some other part of this manual, there exists a completely new family of functions which can handle texts of this kind in memory. The most commonly used character sets for such internal wide character representations are Unicode and ISO 10646. The former is a subset of the latter and is used when wide characters are chosen to be 2 bytes (= 16 bits) wide. The standard names of the encodings used in these cases are UCS2 (= 16 bits) and UCS4 (= 32 bits).
To represent wide characters the char type is not suitable. For this reason the ISO C standard introduces a new type which is designed to keep one character of a wide character string. To maintain the similarity there is also a type corresponding to int for those functions which take a single wide character.

This new type is wchar_t. It is used as the base type for wide character strings; i.e., arrays of objects of this type are the equivalent of char[] for multibyte character strings. The type is defined in `stddef.h'. The ISO C89 standard, where this type was introduced, does not say anything specific about the representation. It only requires that this type is capable of storing all elements of the basic character set. Therefore it would be legitimate to define wchar_t as char, which might make sense for embedded systems.
But for GNU systems this type is always 32 bits wide. It is therefore capable of representing all UCS4 values and therefore covers all of ISO 10646. Some Unix systems define wchar_t as a 16-bit type and thereby follow Unicode very strictly. This is perfectly fine with the standard, but it also means that to represent all characters from Unicode and ISO 10646 one has to use surrogate characters, which is in fact a multi-wide-character encoding. But this contradicts the purpose of the wchar_t type.
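As a small sketch of what this means in practice (the string used here is just made up for illustration), an array of wchar_t holds one element per character regardless of the external encoding, and sizeof (wchar_t) shows whether the platform uses 2 or 4 bytes per wide character (4 on GNU systems, as just described):

  #include <stdio.h>
  #include <wchar.h>

  int
  main (void)
  {
    /* A wide character string: one wchar_t element per character,
       no matter how many bytes an external encoding would need.  */
    const wchar_t greeting[] = L"hello";

    /* wcslen counts wchar_t elements, just as strlen counts chars.  */
    printf ("%zu wide characters\n", wcslen (greeting));

    /* 4 bytes (UCS4) on GNU systems; systems following UCS2 use 2.  */
    printf ("sizeof (wchar_t) = %zu bytes\n", sizeof (wchar_t));
    return 0;
  }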
wint_t is a data type used for parameters and variables which contain a single wide character. As the name already suggests, it is the equivalent of int when using normal char strings. The types wchar_t and wint_t often have the same representation if their size is 32 bits, but if wchar_t is defined as char the type wint_t must be defined as int due to the parameter promotion rules. This type is defined in `wchar.h' and was introduced in the second amendment to ISO C89.
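The following is a minimal sketch of wint_t in this role; btowc and towupper (declared in `wchar.h' and `wctype.h') take and return wint_t, just as getchar and toupper work with int for narrow characters:

  #include <stdio.h>
  #include <wchar.h>
  #include <wctype.h>

  int
  main (void)
  {
    /* btowc returns a wint_t: either a wide character or WEOF.  */
    wint_t wc = btowc ('a');

    if (wc != WEOF)
      /* towupper also takes and returns wint_t.  */
      printf ("upper case: %lc\n", towupper (wc));
    return 0;
  }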
As for the char data type, there also exist macros specifying the minimum and maximum values representable in an object of type wchar_t.

WCHAR_MIN evaluates to the minimum value representable by an object of type wchar_t. This macro was introduced in the second amendment to ISO C89.

WCHAR_MAX evaluates to the maximum value representable by an object of type wchar_t. This macro was introduced in the second amendment to ISO C89.
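As a sketch, one can simply print the two macros; the casts to long long are only there so the same format string works whether wchar_t is a signed or an unsigned type on a given platform:

  #include <stdio.h>
  #include <wchar.h>

  int
  main (void)
  {
    /* Print the range of wchar_t; the casts make the format work
       whether wchar_t is a signed or an unsigned type.  */
    printf ("WCHAR_MIN = %lld\n", (long long) WCHAR_MIN);
    printf ("WCHAR_MAX = %lld\n", (long long) WCHAR_MAX);
    return 0;
  }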
Another special wide character value is the equivalent of EOF. WEOF evaluates to a constant expression of type wint_t whose value is different from any member of the extended character set. WEOF need not be the same value as EOF, and unlike EOF it also need not be negative. I.e., sloppy code like

  {
    int c;
    ...
    while ((c = getc (fp)) < 0)
      ...
  }

has to be rewritten to explicitly use WEOF when wide characters are used:

  {
    wint_t c;
    ...
    while ((c = getwc (fp)) != WEOF)
      ...
  }

This macro was introduced in the second amendment to ISO C89 and is defined in `wchar.h'.
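Put together as a complete, minimal sketch, such a loop reading wide characters from standard input might look as follows; the setlocale call is included so that the external encoding of the input is interpreted according to the user's locale:

  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>

  int
  main (void)
  {
    wint_t c;

    /* Select the user's locale so the external encoding of the
       input is converted correctly to wide characters.  */
    setlocale (LC_ALL, "");

    /* Read wide characters until end of file or an error;
       getwc returns WEOF in both cases, never a negative "char".  */
    while ((c = getwc (stdin)) != WEOF)
      putwc (c, stdout);
    return 0;
  }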
These internal representations present problems when it comes to storage and transmission. Since a single wide character consists of more than one byte, it is affected by byte ordering; i.e., machines with different endiannesses would see different values when accessing the same data. This also applies to communication protocols, which are all byte-based, so the sender has to decide how to split the wide character into bytes. A last (but not least important) point is that wide characters often require more storage space than a customized byte-oriented character set.
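The byte-ordering problem can be made visible with a small sketch which stores one arbitrarily chosen wide character value and prints its individual bytes; the output differs between little-endian and big-endian machines, which is exactly why wide characters cannot simply be written out as they are:

  #include <stdio.h>
  #include <wchar.h>

  int
  main (void)
  {
    /* An arbitrary wide character value; on GNU systems 0xe4 is the
       ISO 10646 code point for 'a' with diaeresis.  */
    wchar_t wc = L'\xe4';
    const unsigned char *p = (const unsigned char *) &wc;

    /* With a 32-bit wchar_t a little-endian machine prints
       "e4 00 00 00", a big-endian machine "00 00 00 e4".  */
    for (size_t i = 0; i < sizeof wc; ++i)
      printf ("%02x ", (unsigned int) p[i]);
    putchar ('\n');
    return 0;
  }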
For all the above reasons, an external encoding which is different from the internal encoding is often used if the latter is UCS2 or UCS4. The external encoding is byte-based and can be chosen appropriately for the environment and for the texts to be handled. A variety of different character sets can be used for this external encoding; they will not be presented exhaustively here, as a description of the major groups will suffice. All of the ASCII-based character sets fulfill one requirement: they are "filesystem safe". This means that the character '/' is used in the encoding only to represent itself. Things are a bit different for character sets like EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set family used by IBM), but if the operating system does not understand EBCDIC directly the parameters to system calls have to be converted first anyhow.
A completely different kind of character set uses non-spacing (combining) characters: an accented character is written as a byte for the accent followed by the byte for the base character. E.g., one writes 0xc2 0x61 (non-spacing acute accent, followed by lower-case `a') to get the "small a with acute" character. To get the acute accent character on its own one has to write 0xc2 0x20 (the non-spacing acute followed by a space). This type of character set is quite frequently used in embedded systems such as videotext.
The question remaining is: how does one select the character set or encoding to use? The answer: you cannot decide about it yourself; it is decided by the developers of the system or the majority of the users. Since the goal is interoperability, one has to use whatever the other people one works with use. If there are no constraints, the selection is based on the requirements the expected circle of users will have. I.e., if a project is expected to be used only in, say, Russia, it is fine to use KOI8-R or a similar character set. But if at the same time people from, say, Greece are participating, one should use a character set which allows all people to collaborate.
The most widely useful solution seems to be: go with the most general character set, namely ISO 10646. Use UTF-8 as the external encoding, and problems with users not being able to use their own language adequately become a thing of the past.
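As a hedged sketch of this internal/external split, the following converts a UTF-8 encoded byte string to wide characters with mbstowcs; it assumes a UTF-8 locale is installed (the name "C.UTF-8" is used here, which is common but not universal):

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <wchar.h>

  int
  main (void)
  {
    /* External representation: UTF-8 bytes; "\xc3\xa4" encodes one
       character ('a' with diaeresis) in two bytes.  */
    const char *external = "a\xc3\xa4";
    wchar_t internal[16];

    /* Assumption: a UTF-8 locale named "C.UTF-8" is installed;
       any other UTF-8 locale name would work as well.  */
    if (setlocale (LC_ALL, "C.UTF-8") == NULL)
      {
        fputs ("UTF-8 locale not available\n", stderr);
        return 1;
      }

    /* Convert the external multibyte string into the internal
       wide character representation.  */
    size_t n = mbstowcs (internal, external, 16);
    if (n == (size_t) -1)
      {
        perror ("mbstowcs");
        return 1;
      }

    printf ("%zu bytes externally, %zu wide characters internally\n",
            strlen (external), n);
    return 0;
  }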
One final comment about the choice of the wide character representation is necessary at this point. We have said above that the natural choice is using Unicode or ISO 10646. This is not specified in any standard, though. The ISO C standard does not say anything specific about the wchar_t type; there might be systems where the developers decided differently. Therefore one should avoid, as much as possible, making assumptions about the wide character representation, although GNU systems will always work as described above. If the programmer uses only the functions provided by the C library to handle wide character strings, there should not be any compatibility problems with other systems.
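When code nevertheless has to know whether wchar_t values are ISO 10646 code points, the standard macro __STDC_ISO_10646__ can be tested at compile time; the following is only a sketch of such a check, not a recommendation to depend on a particular representation:

  #include <stdio.h>

  int
  main (void)
  {
  #ifdef __STDC_ISO_10646__
    /* wchar_t values are ISO 10646 code points; the macro expands
       to the yyyymmL date of the supported revision.  */
    printf ("wchar_t is ISO 10646, revision %ld\n",
            (long) __STDC_ISO_10646__);
  #else
    printf ("wchar_t representation is unspecified here\n");
  #endif
    return 0;
  }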