There are a couple open issues about UTF-8 and Unicode. I was going to write this as a comment on one of them, but I wanted to make a new issue to address Unicode support in general.
(I'm happy to begin working on Unicode implementation, as soon as the issues mentioned below are discussed.)
I have been contemplating what it would take to integrate Unicode into `CWEB`. There are several things to consider. I am assuming that UTF-8 is the only input/output encoding that need be supported.
What should the internal representation of characters be?
- Keeping them in UTF-8 form is attractive because the code can continue using `char` without fear; however, at some point a certain amount of decoding is required. The full extent depends on how much error checking we want to do and on the preferred action of `CTANGLE`. As an aesthetic choice, `eight_bits` or a new, synonymous type `octet` could be substituted for `char` when the value is an octet of UTF-8 input.
- UTF-16 is, I think you'll agree, a silly choice. To adopt it would have no benefits that I can see over UTF-32, other than that it takes less space.
- Decoding the input fully into UTF-32 form, storing every character's full code point, is a viable strategy. One advantage is that encoding/decoding code can be separated from the parts of the programs that work with characters in memory. Unfortunately, all code would have to be modified to work with `uint_fast32_t` or whatever (probably hidden behind a `code_point` typedef) instead of `char`. The other major issue is that ASCII characters, which constitute the majority of typical C text, unconditionally occupy four times more storage than is necessary. But this isn't the greatest concern nowadays. It is convenient that every character takes up a single value.
The programs often advance to the next character in a string by incrementing a pointer by 1. If UTF-8 is chosen as the internal representation, then all such increments will have to be adjusted to compensate. Using UTF-32 would avoid this problem.
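To make the needed adjustment concrete, here is a minimal sketch of what a UTF-8 pointer advance might look like; the names `utf8_seq_len` and `utf8_next` are hypothetical, not anything in the current sources:

```c
#include <assert.h>
#include <stddef.h>

/* Number of octets in the UTF-8 sequence whose lead octet is c
   (1 for ASCII, 0 for an octet that cannot start a sequence). */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;          /* 0xxxxxxx: ASCII */
    if (c < 0xC0) return 0;          /* 10xxxxxx: continuation, not a lead */
    if (c < 0xE0) return 2;          /* 110xxxxx */
    if (c < 0xF0) return 3;          /* 1110xxxx */
    if (c < 0xF8) return 4;          /* 11110xxx */
    return 0;                        /* invalid in modern UTF-8 */
}

/* What a plain p++ would have to become: */
static const char *utf8_next(const char *p)
{
    size_t n = utf8_seq_len((unsigned char)*p);
    return p + (n ? n : 1);          /* skip one octet on malformed input */
}
```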
In summary, storing characters in UTF-32 form takes up more space, forces encoding/decoding, and requires altering most declarations related to characters; storing characters in UTF-8 saves space and allows declarations to remain unchanged, but most operations on characters would have to be changed.
Encoding or decoding could happen at the following points:
- When storing names for sorting (see the heading “Collation” below).
- When `CTANGLE` is reading `@'…'`. We probably want to extend the notation so that it “expands” into the ordinal value of any single character in the string, provided that that character corresponds to one code point. (Thus no notice is taken of combining characters.)
- When `CTANGLE` is converting names for output, if it must transliterate (see the heading “Transliteration” below).
It might be easier to do encoding/decoding manually, not by trying to use any of C's “wide character” facilities. (Frankly, I find them obnoxious. Also, many uses of C input/output functions would have to be changed.)
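Doing it manually is not much code. Here is a sketch of a decoder that reads one code point; the name `decode_utf8` is a placeholder, and a real version should additionally reject overlong forms and surrogates:

```c
#include <assert.h>
#include <stdint.h>

/* Read one code point from the NUL-terminated octet string s into *cp;
   return the number of octets consumed, or 0 on malformed input.
   The && operators short-circuit, so we never read past a NUL. */
static int decode_utf8(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = (uint32_t)(s[0] & 0x1F) << 6 | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80
                              && (s[2] & 0xC0) == 0x80) {
        *cp = (uint32_t)(s[0] & 0x0F) << 12
            | (uint32_t)(s[1] & 0x3F) << 6 | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80
            && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *cp = (uint32_t)(s[0] & 0x07) << 18 | (uint32_t)(s[1] & 0x3F) << 12
            | (uint32_t)(s[2] & 0x3F) << 6 | (s[3] & 0x3F);
        return 4;
    }
    return 0;
}
```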
One good thing about UTF-8 is that it is quite naturally expressed in octal, so `CWEB`'s preference could be maintained through the transition.
Unicode character data
In any case, the hardest part about supporting Unicode beyond simple encoding and decoding is dealing with the Unicode character database. Unicode 13.0 assigns (gives meaning to) 143 859 out of 1 114 112 possible code points. Every character has many properties that describe it.
Unicode distributes a bunch of plain text files that contain the property data for all characters. Unfortunately, there is no file that consolidates all information into one place, except for the Unicode XML database.
I'm going to ignore the task of reading the data in for now. The more interesting problem is this: How do we store information about every character? A full implementation of Unicode would be forced to have a way to get the value of any property, but `CWEB` needs only a limited set.
Width.
`CWEB`'s error reporting routine indicates the current position in the buffer by printing it out like this:

    first part of line
                      second, unread part of line
The problem is that the code assumes that all characters occupy the same amount of horizontal space. In reality, some characters have no width, some are wider than one column, etc. The amount of effort it would take to get this correct probably far outweighs the utility of the feature. But it's certainly possible; GCC handles cursor position in Unicode input just fine.
Transliteration.
For `CTANGLE`, we must be able to associate some string of text with a character, defining its transliteration. All that's needed is a `char *`.
C99 and C++98 added a syntactic feature called a “universal character name”, which is basically a four- or eight-digit hexadecimal character code embedded in regular source text. For example, `a\u200Bb` gives you `ab`, where the two characters are separated by a zero-width space. According to Annex D of the C standard and lex.name.allowed in the C++ standard, this is a perfectly valid identifier. However, both languages prohibit many characters from appearing as universal character names in identifiers. It is tempting to change `CTANGLE`'s default transliteration to insert an equivalent universal character name, but the restrictions complicate matters.
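The mechanical part of such a default transliteration is simple; this sketch (the name `ucn_for` is hypothetical) emits the UCN spelling, leaving aside the check against the languages' restricted ranges:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Write the universal character name for code point cp into buf,
   which must hold at least 11 bytes ("\U00010000" plus NUL). */
static void ucn_for(uint32_t cp, char *buf)
{
    if (cp <= 0xFFFF)
        sprintf(buf, "\\u%04X", (unsigned)cp);    /* four-digit form */
    else
        sprintf(buf, "\\U%08X", (unsigned)cp);    /* eight-digit form */
}
```

A real version would consult the identifier restrictions mentioned above before deciding whether the UCN is usable at all.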
Normalization.
Some strings of Unicode characters are effectively identical while not being exactly (i.e., numerically) equal. For example, a precomposed character like “ü” (U+00FC LATIN SMALL LETTER U WITH DIAERESIS) should usually be treated identically to its decomposed counterpart “ü” (U+0075 LATIN SMALL LETTER U and U+0308 COMBINING DIAERESIS).
Therefore Unicode defines (in UAX 15) a process of normalization, which converts strings to a canonical form. There are a few kinds of normalization, depending on whether you want to tend towards decomposing characters or towards composing characters and how you want to handle compatibility characters.
Several properties are associated with normalization, including Canonical_Combining_Class (a nonnegative integer below 256), Decomposition_Type (one of sixteen values), and Decomposition_Mapping (a string of at most eighteen code points).
It would probably be best for `CWEB` to normalize all strings before entering them into the character/byte memory.
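One self-contained ingredient of normalization is the canonical ordering pass: runs of characters with nonzero Canonical_Combining_Class are sorted stably by that class, with class-0 starters acting as barriers. A sketch, with a stand-in `ccc` lookup covering just two real combining marks:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the real property lookup: U+0308 COMBINING DIAERESIS
   has ccc 230, U+0323 COMBINING DOT BELOW has ccc 220; starters 0. */
static uint8_t ccc(uint32_t cp)
{
    if (cp == 0x0308) return 230;
    if (cp == 0x0323) return 220;
    return 0;
}

/* Stable insertion sort that never moves a mark past a starter. */
static void canonical_order(uint32_t *s, size_t n)
{
    for (size_t i = 1; i < n; i++)
        for (size_t j = i;
             j > 0 && ccc(s[j]) != 0 && ccc(s[j-1]) > ccc(s[j]); j--) {
            uint32_t t = s[j-1]; s[j-1] = s[j]; s[j] = t;
        }
}
```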
Identifiers.
If we want “extended characters” to be allowed in identifiers, we need to know exactly which code points can begin an identifier and which code points can continue an identifier. Luckily there are properties just for this, thanks to UAX 31. Specifically, if a character has the property XID_Start, it can begin an identifier, and if a character has the property XID_Continue, it can be a part of an identifier.
(There are also ID_Start and ID_Continue. The X variants are for normalized text only.)
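Given the two property lookups, the identifier check itself is a one-liner loop. In this sketch the lookups are ASCII-only stand-ins (note that in the real UCD, `_` has XID_Continue but not XID_Start; C grants it start status on its own):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASCII-only stand-ins for the real XID property lookups. */
static bool xid_start(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
}
static bool xid_continue(uint32_t cp)
{
    return xid_start(cp) || (cp >= '0' && cp <= '9') || cp == '_';
}

/* An identifier is well-formed iff its first code point has XID_Start
   and every subsequent one has XID_Continue. */
static bool valid_identifier(const uint32_t *s, size_t n)
{
    if (n == 0 || !xid_start(s[0])) return false;
    for (size_t i = 1; i < n; i++)
        if (!xid_continue(s[i])) return false;
    return true;
}
```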
Collation.
Here's the big one. The entirety of `CWEAVE`'s Phase III is devoted to sorting and outputting an index. Sorting the index involves putting names in order, according to a collating sequence; in the current version of `CWEAVE`, the collation is represented by the `collate` array. Unicode collation is much more complex, due to the expanded character set.
Full details of the Unicode collation algorithm can be found in UTS 10. It is based on four levels of comparison between strings. The specification requires that strings be normalized before comparison.
Collation needs a collation element table to work. The Default Unicode Collation Table (DUCET) can be found here; like the rest of the Unicode data, it is stored in a plain text file. In the DUCET, only three of the four levels of comparison are used, in order to allow implementations to extend the order for whatever internal reason. Other collation element tables exist for specific languages or conventions.
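The level-by-level comparison can be sketched independently of the table itself: compare all the primary weights of two sort keys, then all the secondaries, and so on, skipping weights that are zero ("ignorable" at that level). The weights in the test are made up, not real DUCET values:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A collation element: three 16-bit weights, as in the DUCET. */
struct coll_elem { uint16_t w[3]; };

static int compare_level(const struct coll_elem *a, size_t na,
                         const struct coll_elem *b, size_t nb, int lvl)
{
    size_t i = 0, j = 0;
    for (;;) {
        while (i < na && a[i].w[lvl] == 0) i++;   /* skip ignorables */
        while (j < nb && b[j].w[lvl] == 0) j++;
        uint16_t x = i < na ? a[i++].w[lvl] : 0;  /* 0 = key exhausted */
        uint16_t y = j < nb ? b[j++].w[lvl] : 0;
        if (x != y) return x < y ? -1 : 1;
        if (i >= na && j >= nb) return 0;
    }
}

/* All primaries first, then secondaries, then tertiaries. */
static int compare_keys(const struct coll_elem *a, size_t na,
                        const struct coll_elem *b, size_t nb)
{
    for (int lvl = 0; lvl < 3; lvl++) {
        int c = compare_level(a, na, b, nb, lvl);
        if (c) return c;
    }
    return 0;
}
```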
Storing the data.
In general, we want a way to map a twenty-one-bit number (probably held in a thirty-two-bit integer) to some data structure containing the character properties we are interested in. Storing all the needed information straightforwardly in a statically-allocated array would occupy about 45 megabytes on a sixty-four-bit system. I'm counting
- The transliteration (`char *`)
- Canonical combining class (`uint8_t`)
- Decomposition type (`short`)
- Decomposition mapping (`char *` or `code_point *` depending on the internal representation of characters)
- XID start (`bool`)
- XID continue (`bool`)
- Collation element (`struct { uint16_t a, b, c, d; }`)
We would have the transliteration string be `NULL` if no transliteration was given; then `CTANGLE` would compute it automatically.
I think that more attributes must be stored for normalization, so 45 megabytes is really a lower bound.
There are many ways of compressing this, of course. Full Unicode implementations typically use a kind of trie for looking up properties, because the entire set of properties for a single character takes up a lot of space. Compression is also possible because long runs of characters tend to share properties.
Since `CWEAVE` doesn't do transliteration, and since `CTANGLE` doesn't do collation, the two areas of storage could be put into a union.
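A sketch of what the record and a simple two-stage (trie-like) lookup might look like; all field types are illustrative, and only two demo blocks are allocated here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-character record; the CTANGLE-only transliteration
   and the CWEAVE-only collation element share a union. */
struct char_props {
    uint8_t  ccc;                    /* Canonical_Combining_Class */
    short    decomp_type;            /* Decomposition_Type */
    const uint32_t *decomp;          /* Decomposition_Mapping */
    bool     xid_start, xid_continue;
    union {
        const char *translit;                    /* CTANGLE */
        struct { uint16_t a, b, c, d; } coll;    /* CWEAVE */
    } u;
};

/* Two-stage lookup: the high bits of a code point select a block, the
   low eight bits an entry within it.  Long runs of characters with
   identical properties can share one block, which is where the
   compression comes from. */
enum { NBLOCKS = 0x110000 / 256 };
static uint16_t block_of[NBLOCKS];       /* cp >> 8  ->  block id */
static struct char_props blocks[2][256]; /* zero-initialized demo data */

static const struct char_props *lookup(uint32_t cp)
{
    return &blocks[block_of[cp >> 8]][cp & 0xFF];
}
```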
Actually getting the data.
I glossed over this earlier, but it's important. How can `CWEB` read the character information into memory? There is far too much to compile directly into the programs; should it be read at initialization? Ideally we could do what TeX does and save the program's state after initialization, but I'm not sure if there is a good, portable way.
The property information we want is found in the files `UnicodeData.txt`, `DerivedCoreProperties.txt`, `allkeys.txt`, and `DerivedNormalizationProps.txt`. Thus if `CWEAVE` or `CTANGLE` is starting up from scratch, it must read in four very large files.
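The files are at least easy to parse. `UnicodeData.txt`, for example, is one record per line with fields separated by `;`: field 0 is the code point in hex, field 3 the canonical combining class, field 5 the decomposition mapping, and so on. A sketch of a line reader (no validation; only two fields extracted):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct udata_record { uint32_t cp; int ccc; };

/* Record where each ';'-separated field starts, then convert the two
   fields we care about.  Returns 0 on success, -1 on a short line. */
static int parse_udata_line(const char *line, struct udata_record *out)
{
    const char *field[15];
    int n = 0;
    field[n++] = line;
    for (const char *p = line; *p && n < 15; p++)
        if (*p == ';') field[n++] = p + 1;
    if (n < 4) return -1;
    out->cp = (uint32_t)strtoul(field[0], NULL, 16);
    out->ccc = atoi(field[3]);          /* atoi stops at the next ';' */
    return 0;
}
```

(Splitting by remembering field starts, rather than with `strtok`, matters: `strtok` collapses the empty fields that these files are full of.)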
Alternatively, we could write a program to extract only the relevant data from the relevant files and write it in an especially compact form to a new file, which would be read by `CWEAVE` and `CTANGLE`. I think that the most recent version of such a file should be distributed with `CWEB`, but I can certainly see arguments to the contrary.
[The program could be a more general utility (serving as another example of `CWEB`) that creates a compressed file containing a specified set of properties for each character. For instance, you might want to know only the names and aliases of characters; you can open the program, enter “`name,alias`”, and it would output a file accordingly.]
Or use a library.
I'm against this option. One of `CWEB`'s appeals is that it is very easy to set up. It has no dependencies except on the C standard library; all you need is a C compiler to run `CWEB`. Existing Unicode implementations are bulky and annoying, and they wouldn't fit in with the rest of `CWEB`.