Text Conversion Functions

This text describes the functions rtl_convertTextToUnicode() and rtl_convertUnicodeToText(), the meaning of all the accompanying RTL_TEXTTOUNICODE_FLAGS_XXX, RTL_TEXTTOUNICODE_INFO_XXX, RTL_UNICODETOTEXT_FLAGS_XXX and RTL_UNICODETOTEXT_INFO_XXX flags, and the conversion context conventions.

Conversion Context

It is valid to pass a null pointer instead of an rtl_TextToUnicodeContext or rtl_UnicodeToTextContext to the conversion functions. In that case, the functions behave as if they received an initial context, as obtained by rtl_createTextToUnicodeContext(), rtl_resetTextToUnicodeContext(), rtl_createUnicodeToTextContext(), or rtl_resetUnicodeToTextContext(), and simply do not return any context information (which is effectively lost). This implies that you should always specify the FLAGS_FLUSH flag when using a null context, for otherwise it is not possible in general to find out whether the input buffer has been completely converted.

Handling of Undefined Codes

An undefined code is any of the following:

In the text-to-Unicode direction, the conversion functions distinguish between single-byte and multi-byte undefined codes. For example, 0xA5 in ISO 8859-3 and 0x80 in GB-18030 are single-byte undefined codes, while 0xA2A1 in EUC-CN and 0xFE39FE39 in GB-18030 are multi-byte undefined codes.

When encountering an undefined code, the conversion functions allow any of the following behaviours (which are mutually exclusive):

FLAGS_UNDEFINED_ERROR
FLAGS_MBUNDEFINED_ERROR
Read past the undefined code in the input buffer, set the INFO_UNDEFINED or INFO_MBUNDEFINED flag (whichever applies) together with the INFO_ERROR flag, and immediately quit the conversion (ignoring any FLAGS_FLUSH flag).
FLAGS_UNDEFINED_IGNORE
FLAGS_MBUNDEFINED_IGNORE
Read past the undefined code in the input buffer, set the INFO_UNDEFINED or INFO_MBUNDEFINED flag, and continue with the conversion.
FLAGS_UNDEFINED_MAPTOPRIVATE
If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the undefined code in the input buffer, set the INFO_UNDEFINED flag, write U+F1xx into the output buffer (where xx is the single-byte undefined code), and continue with the conversion.
FLAGS_UNDEFINED_0
If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the undefined code in the input buffer, set the INFO_UNDEFINED flag, write an (appropriately encoded) ASCII NUL character (0x00) into the output buffer, and continue with the conversion.
FLAGS_UNDEFINED_QUESTIONMARK
If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the undefined code in the input buffer, set the INFO_UNDEFINED flag, write an (appropriately encoded) ASCII “?” character (0x3F) into the output buffer, and continue with the conversion.
FLAGS_UNDEFINED_UNDERLINE
If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the undefined code in the input buffer, set the INFO_UNDEFINED flag, write an (appropriately encoded) ASCII “_” character (0x5F) into the output buffer, and continue with the conversion.
FLAGS_UNDEFINED_DEFAULT
If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the undefined code in the input buffer, set the INFO_UNDEFINED flag, write some output-encoding–specific character (currently U+FFFD for Unicode and “?” for all other encodings) into the output buffer, and continue with the conversion.
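
The single-byte behaviours listed above can be illustrated with a small self-contained model. This is only a sketch under stated assumptions, not the rtl implementation: every toy_ and TOY_ name is hypothetical, the input encoding is a made-up one in which bytes 0x00 to 0x7F map to the same Unicode code point and every byte from 0x80 upwards is a single-byte undefined code, and the output is a buffer of 16-bit units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the mutually exclusive FLAGS_UNDEFINED_XXX values. */
enum toy_policy {
    TOY_UNDEFINED_ERROR,
    TOY_UNDEFINED_IGNORE,
    TOY_UNDEFINED_MAPTOPRIVATE,
    TOY_UNDEFINED_0,
    TOY_UNDEFINED_QUESTIONMARK,
    TOY_UNDEFINED_UNDERLINE,
    TOY_UNDEFINED_DEFAULT
};

/* Hypothetical stand-ins for the INFO_XXX result flags. */
enum {
    TOY_INFO_ERROR = 1u << 0,
    TOY_INFO_UNDEFINED = 1u << 1,
    TOY_INFO_DESTBUFFERTOSMALL = 1u << 2
};

/* Decode the toy encoding. Returns the number of 16-bit units written;
   *consumed receives the number of input bytes read. */
size_t toy_decode(const unsigned char *src, size_t n_src,
                  uint16_t *dst, size_t n_dst,
                  enum toy_policy policy,
                  unsigned *info, size_t *consumed)
{
    size_t in = 0, out = 0;
    assert(src && dst && info && consumed);
    *info = 0;
    while (in < n_src) {
        unsigned char b = src[in];
        uint16_t ch;
        if (b < 0x80) {
            ch = b;                               /* defined code */
        } else if (policy == TOY_UNDEFINED_ERROR) {
            ++in;                                 /* read past, then quit */
            *info |= TOY_INFO_UNDEFINED | TOY_INFO_ERROR;
            break;
        } else if (policy == TOY_UNDEFINED_IGNORE) {
            ++in;                                 /* read past, write nothing */
            *info |= TOY_INFO_UNDEFINED;
            continue;
        } else {
            switch (policy) {
            case TOY_UNDEFINED_MAPTOPRIVATE: ch = (uint16_t)(0xF100 | b); break;
            case TOY_UNDEFINED_0:            ch = 0x0000; break;
            case TOY_UNDEFINED_QUESTIONMARK: ch = 0x003F; break;
            case TOY_UNDEFINED_UNDERLINE:    ch = 0x005F; break;
            default:                         ch = 0xFFFD; break;
            }
        }
        /* The output-space check comes before the code is read past. */
        if (out == n_dst) {
            *info |= TOY_INFO_DESTBUFFERTOSMALL;
            break;
        }
        if (b >= 0x80)
            *info |= TOY_INFO_UNDEFINED;
        dst[out++] = ch;
        ++in;
    }
    *consumed = in;
    return out;
}
```

In this model, decoding the bytes 'a', 0xA5, 'b' with the MAPTOPRIVATE policy yields three units with U+F1A5 in the middle, while the ERROR policy stops after reading two bytes and writing one unit.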

In the Unicode-to-text direction, the conversion functions also allow any of the following extra flags (of which an arbitrary number can be specified). In all cases, the usual checks for an exhausted output buffer are made, and otherwise the INFO_UNDEFINED flag is set.

FLAGS_UNDEFINED_REPLACE
Some Unicode characters that have no direct mapping to the destination encoding are mapped to similar single characters in the destination encoding. For example, U+00A0 (NO-BREAK SPACE) could be mapped to 0x20 (SPACE) in ASCII. Expect this to be poorly supported by the current implementation.
FLAGS_UNDEFINED_REPLACESTR
Some Unicode characters that have no direct mapping to the destination encoding are mapped to similar strings of characters in the destination encoding. For example, U+00A9 (COPYRIGHT SIGN) could be mapped to the three-character string “(C)” in ASCII. Expect this to be poorly supported by the current implementation.
FLAGS_PRIVATE_MAPTO0
Private-use characters (U+E000–U+F8FF, U+F0000–U+FFFFD, and U+100000–U+10FFFD) are mapped to an (appropriately encoded) ASCII NUL character (0x00) in the output buffer.
FLAGS_NONSPACING_IGNORE
Certain non-spacing characters, like U+200B (ZERO WIDTH SPACE) and U+FEFF (ZERO WIDTH NO-BREAK SPACE), are ignored. Expect some uncertainty in the current implementation as to which characters are affected.
FLAGS_CONTROL_IGNORE
Control characters (U+0000–U+001F and U+007F–U+009F) are ignored.
FLAGS_PRIVATE_IGNORE
Private-use characters (U+E000–U+F8FF, U+F0000–U+FFFFD, and U+100000–U+10FFFD) are ignored.
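
A few of these extra flags can be sketched with a toy Unicode-to-ASCII encoder. The sketch makes several assumptions that do not come from the rtl code: all toy_ and TOY_ names are hypothetical, only 16-bit input units in the Basic Multilingual Plane are handled, the extra flags are consulted only for characters that have no ASCII mapping, the order in which they are tested is arbitrary, and anything not covered by a flag falls back to "?" (as FLAGS_UNDEFINED_QUESTIONMARK would).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for some of the combinable extra flags. */
enum {
    TOY_FLAGS_UNDEFINED_REPLACE = 1u << 0,
    TOY_FLAGS_UNDEFINED_REPLACESTR = 1u << 1,
    TOY_FLAGS_PRIVATE_MAPTO0 = 1u << 2,
    TOY_FLAGS_CONTROL_IGNORE = 1u << 3
};

/* Hypothetical stand-ins for the INFO_XXX result flags. */
enum {
    TOY_INFO_UNDEFINED = 1u << 0,
    TOY_INFO_DESTBUFFERTOSMALL = 1u << 1
};

/* Encode BMP code units as ASCII. Returns the number of bytes written;
   *consumed receives the number of input units read. */
size_t toy_encode_ascii(const uint16_t *src, size_t n_src,
                        char *dst, size_t n_dst,
                        unsigned flags, unsigned *info, size_t *consumed)
{
    size_t in = 0, out = 0;
    assert(src && dst && info && consumed);
    *info = 0;
    for (; in < n_src; ++in) {
        uint16_t c = src[in];
        char ascii = (char)(c & 0x7F);   /* only used when c < 0x80 */
        const char *rep = "?";           /* fallback for undefined characters */
        size_t rep_len = 1, i;

        if (c < 0x80) {
            rep = &ascii;                /* direct mapping */
        } else if (c == 0x00A0 && (flags & TOY_FLAGS_UNDEFINED_REPLACE)) {
            rep = " ";                   /* NO-BREAK SPACE becomes SPACE */
        } else if (c == 0x00A9 && (flags & TOY_FLAGS_UNDEFINED_REPLACESTR)) {
            rep = "(C)";                 /* COPYRIGHT SIGN becomes a string */
            rep_len = 3;
        } else if (c >= 0xE000 && c <= 0xF8FF && (flags & TOY_FLAGS_PRIVATE_MAPTO0)) {
            rep = "\0";                  /* a single NUL byte */
        } else if (c <= 0x009F && (flags & TOY_FLAGS_CONTROL_IGNORE)) {
            rep_len = 0;                 /* U+0080 to U+009F: write nothing */
        }

        /* The usual check for an exhausted output buffer comes first. */
        if (out + rep_len > n_dst) {
            *info |= TOY_INFO_DESTBUFFERTOSMALL;
            break;
        }
        if (c >= 0x80)
            *info |= TOY_INFO_UNDEFINED;
        for (i = 0; i < rep_len; ++i)
            dst[out++] = rep[i];
    }
    *consumed = in;
    return out;
}
```

Because a replacement string can be longer than one byte, the space check has to use the length of the whole replacement; in this model a three-byte "(C)" is either written completely or not at all.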

There is also a FLAGS_NOCOMPOSITE flag, but I am not sure what it is meant to be used for.

Handling of Invalid Codes

An invalid code is a string of one or more units in the input buffer that is not valid according to the input encoding:

Invalid codes of the second category (that are potentially prefixes of valid strings) are handled specially at the end of the input buffer. If the FLAGS_FLUSH flag is specified, they are handled like all other invalid codes. Otherwise, the INFO_SRCBUFFERTOSMALL flag is set to indicate that the input buffer possibly ended in the middle of an input character (and the prefix is either not yet read, or is stored in the conversion context, or is partly read and partly stored in the conversion context).
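
The effect of FLAGS_FLUSH on a truncated prefix at the end of the input buffer can be sketched with a toy decoder. The assumptions are: all toy_ and TOY_ names are hypothetical, the input is UTF-8 restricted to one- and two-byte sequences (every other byte is simply treated as invalid), invalid codes are replaced by U+FFFD, and an unfinished prefix is left unread rather than stored in a context, which is only one of the possibilities the text above allows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FLAGS_FLUSH. */
enum { TOY_FLAGS_FLUSH = 1u << 0 };

/* Hypothetical stand-ins for the INFO_XXX result flags. */
enum {
    TOY_INFO_INVALID = 1u << 0,
    TOY_INFO_SRCBUFFERTOSMALL = 1u << 1,
    TOY_INFO_DESTBUFFERTOSMALL = 1u << 2
};

/* Decode one- and two-byte UTF-8 sequences. Returns the number of 16-bit
   units written; *consumed receives the number of input bytes read. */
size_t toy_utf8_decode(const unsigned char *src, size_t n_src,
                       uint16_t *dst, size_t n_dst,
                       unsigned flags, unsigned *info, size_t *consumed)
{
    size_t in = 0, out = 0;
    assert(src && dst && info && consumed);
    *info = 0;
    while (in < n_src) {
        unsigned char b = src[in];
        uint16_t ch = 0xFFFD;            /* written for invalid input */
        size_t len = 1;
        int invalid = 0;

        if (b < 0x80) {
            ch = b;
        } else if (b >= 0xC2 && b <= 0xDF && in + 1 == n_src) {
            /* A lead byte at the very end: potentially a valid prefix. */
            if (!(flags & TOY_FLAGS_FLUSH)) {
                *info |= TOY_INFO_SRCBUFFERTOSMALL;
                break;                   /* leave the prefix unread */
            }
            invalid = 1;                 /* with FLUSH: like any invalid code */
        } else if (b >= 0xC2 && b <= 0xDF && (src[in + 1] & 0xC0) == 0x80) {
            ch = (uint16_t)(((b & 0x1F) << 6) | (src[in + 1] & 0x3F));
            len = 2;
        } else {
            invalid = 1;
        }

        if (out == n_dst) {
            *info |= TOY_INFO_DESTBUFFERTOSMALL;
            break;
        }
        if (invalid)
            *info |= TOY_INFO_INVALID;
        dst[out++] = ch;
        in += len;
    }
    *consumed = in;
    return out;
}
```

A caller feeding data in chunks would omit the flush flag for all but the last chunk and, on seeing the SRCBUFFERTOSMALL indication, present the unread bytes again together with the next chunk. With the real functions and a null context there is nowhere to keep a partly read prefix, which is why the flush flag is recommended in that case.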

When encountering an invalid code (other than the special cases at the end of the input buffer), the conversion functions allow any of the following behaviours (which are mutually exclusive):

FLAGS_INVALID_ERROR
Read past the invalid code in the input buffer, set both the INFO_INVALID and the INFO_ERROR flags, and immediately quit the conversion (ignoring any FLAGS_FLUSH flag).
FLAGS_INVALID_IGNORE
Read past the invalid code in the input buffer, set the INFO_INVALID flag, and continue with the conversion.
FLAGS_INVALID_0
If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the invalid code in the input buffer, set the INFO_INVALID flag, write an (appropriately encoded) ASCII NUL character (0x00) into the output buffer, and continue with the conversion.
FLAGS_INVALID_QUESTIONMARK
If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the invalid code in the input buffer, set the INFO_INVALID flag, write an (appropriately encoded) ASCII “?” character (0x3F) into the output buffer, and continue with the conversion.
FLAGS_INVALID_UNDERLINE
If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the invalid code in the input buffer, set the INFO_INVALID flag, write an (appropriately encoded) ASCII “_” character (0x5F) into the output buffer, and continue with the conversion.
FLAGS_INVALID_DEFAULT
If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the invalid code in the input buffer, set the INFO_INVALID flag, write some output-encoding–specific character (currently U+FFFD for Unicode and “?” for all other encodings) into the output buffer, and continue with the conversion.

Handling of Destination Buffer Exhaustion

If, in the course of conversion, there is not enough space left in the output buffer (either for a normal character mapping or for a special mapping of undefined or invalid codes), the INFO_DESTBUFFERTOSMALL flag is set, and the conversion is quit immediately (ignoring any FLAGS_FLUSH flag). It is unspecified whether the input units that would overflow the output buffer are already read (and stored in the conversion context) or not, but the number of processed input buffer units returned by the conversion function will be correct in either case.
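
Since the number of processed input units is reliable even when the output buffer runs out, a caller can convert through a small fixed buffer and resume where the previous call stopped. The following is a minimal sketch of that loop, not the rtl API: the toy_ and TOY_ names are hypothetical, and the converter merely widens ISO 8859-1 bytes to 16-bit units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_INFO_DESTBUFFERTOSMALL = 1u << 0 };  /* hypothetical result flag */
enum { TOY_CHUNK = 4 };                         /* deliberately tiny buffer */

/* Widen ISO 8859-1 bytes to 16-bit units. Returns the number of units
   written; *consumed receives the number of input bytes read. */
size_t toy_widen(const unsigned char *src, size_t n_src,
                 uint16_t *dst, size_t n_dst,
                 unsigned *info, size_t *consumed)
{
    size_t i = 0;
    assert(src && dst && info && consumed);
    *info = 0;
    while (i < n_src) {
        if (i == n_dst) {               /* no room for the next character */
            *info |= TOY_INFO_DESTBUFFERTOSMALL;
            break;
        }
        dst[i] = src[i];
        ++i;
    }
    *consumed = i;
    return i;
}

/* Convert all of src through a TOY_CHUNK-sized buffer, appending the result
   to out (which must have room for n_src units). Returns the number of
   converter calls that were needed. */
size_t toy_convert_in_chunks(const unsigned char *src, size_t n_src,
                             uint16_t *out)
{
    uint16_t chunk[TOY_CHUNK];
    size_t done = 0, total = 0, calls = 0;
    assert(src && out);
    for (;;) {
        unsigned info;
        size_t used, i;
        size_t n = toy_widen(src + done, n_src - done,
                             chunk, TOY_CHUNK, &info, &used);
        for (i = 0; i < n; ++i)
            out[total++] = chunk[i];
        done += used;                   /* resume after the processed input */
        ++calls;
        if (!(info & TOY_INFO_DESTBUFFERTOSMALL))
            break;                      /* everything has been converted */
    }
    return calls;
}
```

With a stateful encoding the same loop would also have to pass a conversion context from call to call, because input that was read but not yet written may be held there.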

Author: Stephan Bergmann (Last modification $Date: 2004/12/08 14:22:01 $).
Copyright 2001 OpenOffice.org Foundation. All Rights Reserved.
