2: Character Encoding

Glk has two separate, but parallel, APIs for managing text input and output. The basic functions deal entirely in 8-bit characters; their arguments are arrays of bytes (octets). These functions all assume the Latin-1 character encoding. Equivalently, they may be said to use code points U+00..U+FF of Unicode.

Latin-1 is an 8-bit character encoding; it maps numeric codes in the range 0 to 255 into printed characters. The values from 32 to 126 are the standard printable ASCII characters (' ' to '~'). Values 0 to 31 and 127 to 159 are reserved for control characters, and have no printed equivalent.

[Note that the basic Glk text API does not use UTF-8, or any other Unicode character form. Each character is represented by a single byte -- even characters in the 128..255 range.]

The extended, or "Unicode", Glk functions deal entirely in 32-bit words. They take arrays of words, not bytes, as arguments. They can therefore cope with any Unicode code point. The extended functions have names ending in "_uni".
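Because Latin-1 codes coincide with Unicode code points 0x00 to 0xFF, text held in the byte form of the basic API can be widened to the word form without any table lookup. The following is a minimal sketch of that idea, not part of Glk: the helper name latin1_to_words is hypothetical, and uint32_t stands in for Glk's glui32.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Glk function. Copies len Latin-1 bytes into
   an array of 32-bit words. Each byte value is already the Unicode code
   point of the character, so the copy is a plain zero-extension. */
static void latin1_to_words(const unsigned char *src, uint32_t *dst, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        dst[i] = src[i];
}
```

The source is declared as unsigned char so that values in the 128..255 range are not sign-extended during the copy.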

[Since these functions deal in arrays of 32-bit words, they can be said to use the UTF-32 character encoding form. (But not the UTF-32 character encoding scheme -- that's a stream of bytes which must be interpreted in big-endian or little-endian mode. Glk Unicode functions operate on long integers, not bytes.) UTF-32 is also known as UCS-4, according to the Unicode spec (appendix C.2), modulo some semantic requirements which we will not deal with here. For practical purposes, we can ignore the whole encoding issue, and assume that we are dealing with sequences of numeric code points.]

[Why not UTF-8? It is a reasonable bare-bones compression algorithm for Unicode character streams; but IF systems typically have their own compression models for text. Compositing the two ideas causes more problems than it solves. The other advantage of UTF-8 is that 7-bit ASCII is automatically valid UTF-8; but this is not compelling for IF systems, in which the compiler can be tasked with generating consistent textual data. And UTF-8 is a variable-width encoding. Nobody ever wept at the prospect of avoiding that kettle of eels.]

[What about bi-directional text? It's a good idea, and may show up in future versions of this document. It is not in this version because we want to get something simple implemented soon. For the moment, print out all text in reading order (not necessarily left-to-right) and hope for the best. Current suggestions include a stylehint_Direction, which the game can set to indicate that text in the given style should be laid out right-to-left. Top-to-bottom (or bottom-to-top) may be desirable too. The direction stylehints might only apply to full paragraphs (like justification stylehints); or they might apply to any text, thus requiring the library to lay out "zig-zag" blocks. The possibilities remain to be explored. Page layout is hard.]

[Another possibility is to let the library determine the directionality of text from the character set. This is not impossible -- MacOSX text widgets do it. It may be too difficult.]

[In the meantime, it is worth noting that the Windows Glk library does not autodetect directionality, but the CheapGlk library running on MacOSX does. Therefore, there is no platform-independent way to handle right-to-left fonts at present.]

2.1: Testing for Unicode Capabilities

The basic text functions will be available in every Glk library. The Unicode functions may or may not be available. Before calling them, you should use the following gestalt selectors:

glui32 res;
res = glk_gestalt(gestalt_Unicode, 0);

This returns 1 if the core Unicode functions are available. If it returns 0, you should not try to call them. They may print nothing, print gibberish, or cause a run-time error. The Unicode functions include glk_buffer_to_lower_case_uni, glk_buffer_to_upper_case_uni, glk_buffer_to_title_case_uni, glk_put_char_uni, glk_put_string_uni, glk_put_buffer_uni, glk_put_char_stream_uni, glk_put_string_stream_uni, glk_put_buffer_stream_uni, glk_get_char_stream_uni, glk_get_buffer_stream_uni, glk_get_line_stream_uni, glk_request_char_event_uni, glk_request_line_event_uni, glk_stream_open_file_uni, glk_stream_open_memory_uni.

If you are writing a C program, there is an additional complication. A library which does not support Unicode may not implement the Unicode functions at all. Even if you put gestalt tests around your Unicode calls, you may get link-time errors. If the glk.h file is so old that it does not declare the Unicode functions and constants, you may even get compile-time errors.

To avoid this, you can perform a preprocessor test for the existence of GLK_MODULE_UNICODE. If this is defined, so are all the Unicode functions and constants. If not, not.

glui32 res;
res = glk_gestalt(gestalt_UnicodeNorm, 0);

This returns 1 if the Unicode normalization functions are available. If it returns 0, you should not try to call them. The Unicode normalization functions include glk_buffer_canon_decompose_uni and glk_buffer_canon_normalize_uni.

The equivalent preprocessor test for these functions is GLK_MODULE_UNICODE_NORM.
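The compile-time and run-time tests can be combined in one place. The sketch below is an illustration under stated assumptions rather than a required pattern: the helper name unicode_available is hypothetical, and the __has_include guard is there only so the fragment compiles even where no glk.h is installed. A real Glk program would simply include glk.h.

```c
/* Pull in glk.h only when the build can find it, so that this sketch
   also compiles on its own. This guard is an assumption of the example,
   not something the Glk spec asks for. */
#if defined(__has_include)
#  if __has_include("glk.h")
#    include "glk.h"
#  endif
#endif

/* Hypothetical helper: returns 1 if the core Unicode functions may be
   called, 0 otherwise. */
static int unicode_available(void)
{
#ifdef GLK_MODULE_UNICODE
    /* The functions and constants are declared, so this compiles and
       links. The gestalt call says whether they work at run time. */
    return glk_gestalt(gestalt_Unicode, 0) != 0;
#else
    /* Old glk.h, or none at all: the Unicode functions are not declared,
       so no code that mentions them may be compiled. */
    return 0;
#endif
}
```

The same shape applies to the normalization functions, using GLK_MODULE_UNICODE_NORM and gestalt_UnicodeNorm. Any call to a "_uni" function must itself sit inside the matching #ifdef, not merely behind the run-time check.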

2.2: Output

When you are sending text to a window, or to a file open in text mode, you can print any of the printable Latin-1 characters: 32 to 126, 160 to 255. You can also print the newline character (linefeed, control-J, decimal 10, hex 0x0A.)

It is not legal to print any other control characters (0 to 9, 11 to 31, 127 to 159). You may not print even common formatting characters such as tab (control-I), carriage return (control-M), or page break (control-L). [As usual, the behavior of the library when you print an illegal character is undefined. It is preferable that the library display a numeric code, such as "\177" or "0x7F", to warn the user that something illegal has occurred. The library may skip illegal characters entirely; but you should not rely on this.]
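The rules above for the basic API reduce to three ranges. As a minimal sketch (the helper name latin1_output_is_legal is hypothetical and not part of Glk), a program could filter its output like this before printing to a window or text-mode file:

```c
#include <stdint.h>

/* Hypothetical helper, not a Glk function. Returns 1 if ch may legally
   be printed to a window or text-mode file through the basic (Latin-1)
   API, and 0 otherwise. */
static int latin1_output_is_legal(uint32_t ch)
{
    if (ch == 0x0A)
        return 1;       /* newline, the one legal control character */
    if (ch >= 32 && ch <= 126)
        return 1;       /* printable ASCII */
    if (ch >= 160 && ch <= 255)
        return 1;       /* printable upper half of Latin-1 */
    return 0;           /* other control codes, or outside Latin-1 */
}
```

This check says only that printing the character is legal. Whether a given library can actually display it is a separate question, answered by gestalt_CharOutput.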

Printing Unicode characters above 255 is a more complicated matter -- too complicated to be covered precisely by this specification. Refer to the Unicode specification, and good luck to you.

[Unicode combining characters are a particular nuisance. Printing a combining character may alter the appearance of the previous character printed. The library should be prepared to cope with this -- even if the characters are printed by two separate glk_put_char_uni() calls.]

Note that when you are sending data to a file open in binary mode, you can print any byte value, without restriction. See section 5.6.3, "File Streams".

A particular implementation of Glk may not be able to display all the printable characters. It is guaranteed to be able to display the ASCII characters (32 to 126, and the newline 10.) Other characters may be printed correctly, printed as multi-character combinations (such as "ae" for the one-character "ae" ligature (æ)), or printed as some placeholder character (such as a bullet or question mark, or even an octal code.)

You can test for this by using the gestalt_CharOutput selector. If you set ch to a character code (Latin-1 or higher), and call

glui32 res, len;
res = glk_gestalt_ext(gestalt_CharOutput, ch, &len, 1);

then res will be one of gestalt_CharOutput_CannotPrint, gestalt_CharOutput_ApproxPrint, or gestalt_CharOutput_ExactPrint.

In all cases, len (the glui32 value pointed at by the third argument) will be the number of actual glyphs which will be used to represent the character. In the case of gestalt_CharOutput_ExactPrint, this will always be 1; for gestalt_CharOutput_CannotPrint, it may be 0 (nothing printed) or higher; for gestalt_CharOutput_ApproxPrint, it may be 1 or higher. This information may be useful when printing text in a fixed-width font.

[As described in section 1.9, "Other API Conventions", you may skip this information by passing NULL as the third argument in glk_gestalt_ext(), or by calling glk_gestalt() instead.]

This selector will always return gestalt_CharOutput_CannotPrint if ch is an unprintable eight-bit character (0 to 9, 11 to 31, 127 to 159.)

[Make sure you do not get confused by signed byte values. If you set a "signed char" variable ch to 0xFE, the small-thorn character (þ), it will wind up as -2. (The same is true of a "char" variable, if your compiler treats "char" as signed!) If you then call

res = glk_gestalt(gestalt_CharOutput, ch);

then (by the definition of C/C++) ch will be sign-extended to 0xFFFFFFFE, which is not a legitimate character, even in Unicode. You should write

res = glk_gestalt(gestalt_CharOutput, (unsigned char)ch);

instead.]
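The sign-extension problem can be reproduced without any Glk call. This is a small standalone sketch; the two function names are hypothetical, and uint32_t stands in for glui32.

```c
#include <stdint.h>

/* What happens implicitly when a signed char is passed where a 32-bit
   unsigned argument is expected: a negative value wraps around. */
static uint32_t widen_signed(signed char ch)
{
    return (uint32_t)ch;
}

/* The fix: convert to unsigned char first, then widen. */
static uint32_t widen_via_unsigned(signed char ch)
{
    return (uint32_t)(unsigned char)ch;
}
```

With ch holding -2 (the bit pattern 0xFE), the first function yields 0xFFFFFFFE and the second yields 0xFE. For ASCII values the two agree, which is why the bug tends to go unnoticed until accented characters appear.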

[Unicode includes the concept of non-spacing or combining characters, which do not represent glyphs; and double-width characters, whose glyphs take up two spaces in a fixed-width font. Future versions of this spec may recognize these concepts by returning a len of 0 or 2 when gestalt_CharOutput_ExactPrint is used. For the moment, we are adhering to a policy of "simple stuff first".]

2.3: Line Input

You can request that the player enter a line of text. See section 4.2, "Line Input Events".

This text will be placed in a buffer of your choice. There is no length field or null terminator in the buffer. (The length of the text is returned as part of the line-input event.)
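Since the buffer is not null-terminated, code that wants an ordinary C string has to add the terminator itself, using the length from the event. A minimal sketch follows; the helper name line_to_cstring is hypothetical, and count is assumed to be the length reported by the line-input event.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not a Glk function. buf is the line-input buffer
   and count is the length reported by the line-input event. Copies the
   text into out as a null-terminated C string, truncating if out is too
   small. Returns the number of characters copied, not counting the
   terminator. */
static size_t line_to_cstring(const char *buf, uint32_t count,
                              char *out, size_t outsize)
{
    size_t n;
    if (outsize == 0)
        return 0;
    n = (count < outsize - 1) ? count : outsize - 1;
    memcpy(out, buf, n);
    out[n] = '\0';
    return n;
}
```

An alternative is to request input into a buffer one element larger than the maximum length passed to the library, and write the terminator in place at index count.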

If you use the basic text API, the buffer will contain only printable Latin-1 characters (32 to 126, 160 to 255).

A particular implementation of Glk may not be able to accept all Latin-1 printable characters as input. It is guaranteed to be able to accept the ASCII characters (32 to 126.)

You can test for this by using the gestalt_LineInput selector. If you set ch to a character code, and call

glui32 res;
res = glk_gestalt(gestalt_LineInput, ch);

then res will be TRUE (1) if that character can be typed by the player in line input, and FALSE (0) if not. Note that if ch is a nonprintable Latin-1 character (0 to 31, 127 to 159), then this is guaranteed to return FALSE.

2.4: Character Input

You can request that the player hit a single key. See section 4.1, "Character Input Events".

If you use the basic text API, the character code which is returned can be any value from 0 to 255. The printable character codes have already been described. The remaining codes are typically control codes: control-A to control-Z and a few others.

There are also a number of special codes, representing special keyboard keys, which can be returned from a char-input event. These are represented as 32-bit integers, starting with 4294967295 (0xFFFFFFFF) and working down. The special key codes are defined in the glk.h file. They include keycode_Return, keycode_Tab, and keycode_Home, among others.

Implementations of Glk vary widely in which characters the player can enter. The most obvious limitation is that some characters are mapped to others. For example, most keyboards return a control-I code when the tab key is pressed. The Glk library, if it can recognize this at all, will generate a keycode_Tab event (value 0xFFFFFFF7) when this occurs. Therefore, for these keyboards, no keyboard key will generate a control-I event (value 9.) The Glk library will probably map many of the control codes to the other special keycodes.

[On the other hand, the library may be very clever and discriminate between tab and control-I. This is legal. The idea is, however, that if your program asks the player to "press the tab key", you should check for a keycode_Tab event as opposed to a control-I event.]

Some characters may not be enterable simply because they do not exist. [Not all keyboards have a home or end key. A pen-based platform may not recognize any control characters at all.]

Some characters may not be enterable because they are reserved for the purposes of the interface. For example, the Mac Glk library reserves the tab key for switching between different Glk windows. Therefore, on the Mac, the library will never generate a keycode_Tab event or a control-I event.

[Note that the linefeed or control-J character, which is the only printable control character, is probably not typable. This is because, in most libraries, it will be converted to keycode_Return. Again, you should check for keycode_Return if your program asks the player to "press the return key".]

[The delete and backspace keys are merged into a single keycode because they have such an astonishing history of being confused in the first place... this spec formally waives any desire to define the difference. Of course, a library is free to distinguish delete and backspace during line input. This is when it matters most; conflating the two during character input should not be a large problem.]

You can test for this by using the gestalt_CharInput selector. If you set ch to a character code, or a special code (from 0xFFFFFFFF down), and call

glui32 res;
res = glk_gestalt(gestalt_CharInput, ch);

then res will be TRUE (1) if that character can be typed by the player in character input, and FALSE (0) if not.

[Glk porters take note: it is not a goal to be able to generate every single possible key event. If the library says that it can generate a particular keycode, then game programmers will assume that it is available, and ask players to use it. If a keycode_Home event can only be generated by typing escape-control-A, and the player does not know this, the player will be lost when the game says "Press the home key to see the next hint." It is better for the library to say that it cannot generate a keycode_Home event; that way the game can detect the situation and ask the user to type H instead.]

[Of course, it is better not to rely on obscure keys in any case. The arrow keys and return are nearly certain to be available; the others are of gradually decreasing reliability, and you (the game programmer) should not depend on them. You must be certain to check for the ones you want to use, including the arrow keys and return, and be prepared to use different keys in your interface if gestalt_CharInput says they are not available.]

2.5: Upper and Lower Case

You can convert Latin-1 characters between upper and lower case with two Glk utility functions:

unsigned char glk_char_to_lower(unsigned char ch);
unsigned char glk_char_to_upper(unsigned char ch);

These have a few advantages over the standard ANSI tolower() and toupper() macros. They work for the entire Latin-1 character set, including accented letters; they behave consistently on all platforms, since they're part of the Glk library; and they are safe for all characters. That is, if you call glk_char_to_lower() on a lower-case character, or a character which is not a letter, you'll get the argument back unchanged.

The case-sensitive characters in Latin-1 are the ranges 0x41..0x5A, 0xC0..0xD6, 0xD8..0xDE (upper case) and the ranges 0x61..0x7A, 0xE0..0xF6, 0xF8..0xFE (lower case). These are arranged in parallel; so glk_char_to_lower() will add 0x20 to values in the upper-case ranges, and glk_char_to_upper() will subtract 0x20 from values in the lower-case ranges.
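The mapping just described is small enough to write out in full. The following is an illustrative re-implementation built only from the ranges listed above; the names latin1_lower and latin1_upper are hypothetical, and real programs should call glk_char_to_lower() and glk_char_to_upper() instead.

```c
/* Hypothetical re-implementation of the Latin-1 case mapping, for
   illustration only. Characters outside the listed ranges are returned
   unchanged. */
static unsigned char latin1_lower(unsigned char ch)
{
    if ((ch >= 0x41 && ch <= 0x5A) ||
        (ch >= 0xC0 && ch <= 0xD6) ||
        (ch >= 0xD8 && ch <= 0xDE))
        return (unsigned char)(ch + 0x20);
    return ch;
}

static unsigned char latin1_upper(unsigned char ch)
{
    if ((ch >= 0x61 && ch <= 0x7A) ||
        (ch >= 0xE0 && ch <= 0xF6) ||
        (ch >= 0xF8 && ch <= 0xFE))
        return (unsigned char)(ch - 0x20);
    return ch;
}
```

The gaps in the ranges are deliberate: 0xD7 and 0xF7 are the multiplication and division signs, and 0xDF and 0xFF are letters with no case partner inside Latin-1, so all four pass through unchanged.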

Unicode character conversion is trickier, and must be applied to character arrays, not single characters.

glui32 glk_buffer_to_lower_case_uni(glui32 *buf, glui32 len, glui32 numchars);
glui32 glk_buffer_to_upper_case_uni(glui32 *buf, glui32 len, glui32 numchars);
glui32 glk_buffer_to_title_case_uni(glui32 *buf, glui32 len, glui32 numchars, glui32 lowerrest);

These functions provide two length arguments because a string of Unicode characters may expand when its case changes. The len argument is the available length of the buffer; numchars is the number of characters in the buffer initially. (So numchars must be less than or equal to len. The contents of the buffer after numchars do not affect the operation.)

The functions return the number of characters after conversion. If this is greater than len, the characters in the array will be safely truncated at len, but the true count will be returned. (The contents of the buffer after the returned count are undefined.)
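The return convention means a caller has to compare the result against len before trusting the buffer. Here is one way that check might be wrapped, as a sketch under stated assumptions: transform_checked is a hypothetical helper, uint32_t stands in for glui32, and toy_expand_sharp_s is a toy stand-in for a Glk call so that the fragment can run without a Glk library.

```c
#include <stdint.h>

/* Same shape as glk_buffer_to_upper_case_uni() and its siblings, with
   uint32_t standing in for glui32. */
typedef uint32_t (*transform_fn)(uint32_t *buf, uint32_t len, uint32_t numchars);

/* Hypothetical wrapper. Runs fn, stores the number of valid characters
   now in buf into *valid, and returns 1 if the whole result fit or 0 if
   it was truncated at len. */
static int transform_checked(transform_fn fn, uint32_t *buf, uint32_t len,
                             uint32_t numchars, uint32_t *valid)
{
    uint32_t needed = fn(buf, len, numchars);
    *valid = (needed > len) ? len : needed;
    return needed <= len;
}

/* Toy stand-in for a Glk call, for illustration only: it replaces each
   sharp s (0xDF) with two 'S' characters and leaves everything else
   alone. It is NOT the Unicode upper-casing algorithm. Assumes the
   expanded text is at most 64 characters. */
static uint32_t toy_expand_sharp_s(uint32_t *buf, uint32_t len, uint32_t numchars)
{
    uint32_t tmp[64];
    uint32_t out = 0;           /* true length after expansion */
    uint32_t i, k;

    for (i = 0; i < numchars; i++) {
        uint32_t copies = (buf[i] == 0xDF) ? 2 : 1;
        uint32_t ch = (buf[i] == 0xDF) ? 'S' : buf[i];
        for (k = 0; k < copies; k++) {
            if (out < 64)
                tmp[out] = ch;
            out++;
        }
    }
    /* Write back only what fits, but report the true count. */
    for (i = 0; i < out && i < len && i < 64; i++)
        buf[i] = tmp[i];
    return out;
}
```

When the wrapper reports truncation, one reasonable response is to retry with a larger buffer and a fresh copy of the original text, since the truncated buffer no longer holds the full result.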

The lower_case and upper_case functions do what you'd expect: they convert every character in the buffer (the first numchars of them) to its lower or upper-case equivalent, if there is such a thing.

The title_case function has an additional (boolean) flag. If the flag is zero, the function changes the first character of the buffer to upper-case, and leaves the rest of the buffer unchanged. If the flag is nonzero, it changes the first character to upper-case and the rest to lower-case.

See the Unicode spec (chapter 3.13, chapter 4.2, etc) for the exact definitions of upper, lower, and title-case mapping.

[Unicode has some strange case cases. For example, a combined character that looks like "ss" might properly be upper-cased into two "S" characters. Title-casing is even stranger; "ss" (at the beginning of a word) might be title-cased into a different combined character that looks like "Ss". The glk_buffer_to_title_case_uni() function is actually title-casing the first character of the buffer. If it makes a difference.]

[Earlier drafts of this spec had a separate function which title-cased the first character of every word in the buffer. I took this out after reading Unicode Standard Annex #29, which explains how to divide a string into words. If you want it, feel free to implement it.]

2.6: Unicode String Normalization

Comparing Unicode strings is difficult, because there can be several ways to represent a piece of text as a Unicode string. For example, the one-character string "è" (an accented "e") will be displayed the same as the two-character string containing "e" followed by Unicode character 0x0300 (COMBINING GRAVE ACCENT). These strings should be considered equal.
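A plain element-by-element comparison does not see the two forms as equal. The sketch below makes that concrete; codepoints_equal is a hypothetical helper, and uint32_t stands in for glui32.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: naive comparison of two code point arrays.
   Returns 1 only if both have the same length and the same values. */
static int codepoints_equal(const uint32_t *a, size_t alen,
                            const uint32_t *b, size_t blen)
{
    size_t i;
    if (alen != blen)
        return 0;
    for (i = 0; i < alen; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Given the one-element array { 0xE8 } and the two-element array { 0x65, 0x0300 }, this function returns 0, even though both display as the same accented "e".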

Therefore, a Glk program that accepts line input should convert its text to a normalized form before parsing it. These functions offer those conversions. The algorithms are defined by the Unicode spec (chapter 3.7) and Unicode Standard Annex #15.

glui32 glk_buffer_canon_decompose_uni(glui32 *buf, glui32 len, glui32 numchars);

This transforms a string into its canonical decomposition ("Normalization Form D"). Effectively, this takes apart multipart characters into their individual parts. For example, it would convert "è" (character 0xE8, an accented "e") into the two-character string containing "e" followed by Unicode character 0x0300 (COMBINING GRAVE ACCENT). If a single character has multiple accent marks, they are also rearranged into a standard order.

glui32 glk_buffer_canon_normalize_uni(glui32 *buf, glui32 len, glui32 numchars);

This transforms a string into its canonical decomposition and recomposition ("Normalization Form C"). Effectively, this takes apart multipart characters, and then puts them back together in a standard way. For example, this would convert the two-character string containing "e" followed by Unicode character 0x0300 (COMBINING GRAVE ACCENT) into the one-character string "è" (character 0xE8, an accented "e").

The canon_normalize function includes decomposition as part of its implementation. You never have to call both functions on the same string.

Both of these functions are idempotent.

These functions provide two length arguments because a string of Unicode characters may expand when it is transformed. The len argument is the available length of the buffer; numchars is the number of characters in the buffer initially. (So numchars must be less than or equal to len. The contents of the buffer after numchars do not affect the operation.)

The functions return the number of characters after transformation. If this is greater than len, the characters in the array will be safely truncated at len, but the true count will be returned. (The contents of the buffer after the returned count are undefined.)

[The Unicode spec also defines stronger forms of these functions, called "compatibility decomposition and recomposition" ("Normalization Form KD" and "Normalization Form KC".) These do all of the accent-mangling described above, but they also transform many other obscure Unicode characters into more familiar forms. For example, they split ligatures apart into separate letters. They also convert Unicode display variations such as script letters, circled letters, and half-width letters into their common forms.]

[The Glk spec does not currently provide these stronger transformations. Glk's expected use of Unicode normalization is for line input, and an OS facility for line input will generally not produce these alternate character forms (unless the user goes out of his way to type them). Therefore, the need for these transformations does not seem to be worth the extra data table space.]

2.6.1: A Note on Unicode Case-Folding and Normalization

With all of these Unicode transformations hovering about, an author might reasonably ask about the right way to handle line input. Our recommendation is: call glk_buffer_to_lower_case_uni(), followed by glk_buffer_canon_normalize_uni(), and then parse the result. The parsing process should of course match against strings that have been put through the same process.
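The recommended sequence might be chained as follows. This is a sketch, not a definitive implementation: prepare_line is a hypothetical helper, uint32_t stands in for glui32, and the two toy functions are ASCII-only stand-ins that let the fragment run without a Glk library. In a real program the two function arguments would be glk_buffer_to_lower_case_uni and glk_buffer_canon_normalize_uni.

```c
#include <stdint.h>

/* Same shape as glk_buffer_to_lower_case_uni() and
   glk_buffer_canon_normalize_uni(), with uint32_t standing in for
   glui32. */
typedef uint32_t (*buffer_fn)(uint32_t *buf, uint32_t len, uint32_t numchars);

/* Hypothetical helper. Lower-cases and then normalizes buf in place.
   Returns the number of valid characters left in buf. */
static uint32_t prepare_line(buffer_fn to_lower, buffer_fn normalize,
                             uint32_t *buf, uint32_t len, uint32_t numchars)
{
    uint32_t n = to_lower(buf, len, numchars);
    if (n > len)
        n = len;    /* truncated; numchars must not exceed len below */
    n = normalize(buf, len, n);
    if (n > len)
        n = len;
    return n;
}

/* Toy stand-ins so the sketch can be exercised on its own. They handle
   ASCII only and are NOT the Unicode algorithms. */
static uint32_t toy_lower_ascii(uint32_t *buf, uint32_t len, uint32_t numchars)
{
    uint32_t i;
    (void)len;
    for (i = 0; i < numchars; i++)
        if (buf[i] >= 'A' && buf[i] <= 'Z')
            buf[i] += 0x20;
    return numchars;
}

static uint32_t toy_identity(uint32_t *buf, uint32_t len, uint32_t numchars)
{
    (void)buf;
    (void)len;
    return numchars;
}
```

The clamp between the two steps matters because each function may return a count larger than len, and that count cannot be passed on as numchars. Dictionary words and other strings the parser matches against would go through the same helper once, ahead of time.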

The Unicode spec (chapter 3.13) gives a different, three-step process: decomposition, case-folding, and decomposition again. Our recommendation comes through a series of practical compromises:

[We may revisit these recommendations in future versions of the spec.]

