Translation Tables
InterSystems IRIS® data platform uses translation tables (also known as I/O tables) to convert characters. Some API calls, as well as the $ZCONVERT function, accept a translation table as an argument. This page provides reference information on the available translation tables.
Introduction
There are two general scenarios in which translation tables are used to convert characters:
-
In many contexts (such as in URLs, in HTML, in JSON, and so on), specific characters are disallowed and must be represented by escape sequences. In this case, it is necessary to convert the characters to or from the allowed set of characters.
-
If you are reading from a source outside the database or writing to a destination outside the database, that entity may expect a different character set than InterSystems IRIS uses. In this case, it is necessary to convert the character encoding.
The “translation table” for a given context is actually a pair of tables. One table specifies how to convert from the default character set to the foreign character set (or to the foreign context), and the other specifies how to convert in the opposite direction. In InterSystems IRIS, the convention is to refer to this pair of tables as a single unit that has an input mode and an output mode. Thus, there is an HTML translation table for managing conversions to and from HTML, and there is a CP1250 translation table for managing conversions to and from the CP1250 character set.
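The two modes of a single table can be sketched with $ZCONVERT, which takes the mode ("O" for output, "I" for input) as its second argument and the table name as its third. This is a minimal illustration, not a definitive usage pattern; the sample string and variable names are invented for the example.

```objectscript
  // Output mode ("O"): convert from the internal representation to the HTML context
  SET plain = "<p>Fish & chips</p>"
  SET escaped = $ZCONVERT(plain,"O","HTML")
  WRITE "output mode: ",escaped,!
  // Input mode ("I"): convert from the HTML context back to the internal representation
  SET restored = $ZCONVERT(escaped,"I","HTML")
  WRITE "input mode:  ",restored,!
  // Writes 1 if the round trip reproduced the original string
  WRITE "round trip matches: ",(restored = plain),!
```

The same table name is used in both calls; only the mode argument selects which half of the pair is applied.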
List of Tables
The following is a list of the InterSystems IRIS translation tables:
-
RAW: On Windows, InterSystems IRIS performs no translation for 8-bit characters or 16-bit Latin-1 characters (Unicode characters in which the high-order byte has the value 00).
On UNIX®, if the LANG environment variable specifies an encoding (for example, "UTF-8" in "LANG=en_US.UTF-8") and that encoding corresponds to a known translation table, the default system call translation is set to that table. Otherwise, InterSystems IRIS performs no translation, as on Windows.
RAW translation should not be used for InterSystems IRIS systems using non-Latin-1 locales, such as rusw.
-
SAME: Translates 8-bit characters to the corresponding Unicode characters.
-
HTML: Adds (output mode) or removes (input mode) HTML escape characters to a string. See the Output Escaping table.
-
JS: Uses a supplied JavaScript translation table to escape characters in the string for use within JavaScript. For output translations, see the Output Escaping table. For a comparison of JS and JSML, see JS and JSML, JSON, and JSONML Conversions. For input translations, “\0”, “\000”, “\x00”, and “\u0000” are all valid escape sequences for NULL.
-
JSON: Uses a supplied translation table to convert to JSON format. For output translations, see the Output Escaping table. For a comparison of JSON and JSONML, see JS and JSML, JSON, and JSONML Conversions. For input translations, “\0”, “\000”, “\x00”, and “\u0000” are all valid escape sequences for NULL.
-
URI: Adds (output mode) or removes (input mode) URI parameter escape characters to a string. URI encodes the space character and the characters !"#$%&'()*+,/:;<=>?@[]^`{|} as follows: %20%21%22%23%24%25%26%27%28%29%2A%2B%2C%2F%3A%3B%3C%3D%3E%3F%40%5B%5D%5E%60%7B%7C%7D.
The space character is encoded as %20.
The double quote character (which must be escaped by doubling when included in a quoted string such as "My ""perfect"" code") is encoded as %22.
URI does not encode the tilde (~) character. See the Output Escaping table.
URI encodes characters higher than $CHAR(255) (Unicode characters) as UTF-8 and then percent-encodes the UTF-8 values in hexadecimal notation.
Also see URL and URI Conversions.
-
URL: Adds (output mode) or removes (input mode) URL parameter escape characters to a string. URL encodes the space character and the characters "#%&+,:;<=>?@[]^`{|}~ as follows: %20%22%23%25%26%2B%2C%3A%3B%3C%3D%3E%3F%40%5B%5D%5E%60%7B%7C%7D%7E.
The space character is encoded as %20.
The double quote character (which must be escaped by doubling when included in a quoted string such as "My ""perfect"" code") is encoded as %22.
See the Output Escaping table. Characters higher than $CHAR(255) are represented in Unicode hexadecimal notation: $CHAR(256) = %u0100.
Also see URL and URI Conversions.
-
UTF8: UTF-8 encoding. In output mode, this converts 16-bit Unicode characters to a series of 8-bit characters. An ASCII 16-bit Unicode character translates to a single 8-bit character; for example, hex 0041 (the letter “A”) translates to the 8-bit character hex 41. A non-ASCII Unicode character is converted to two or three 8-bit characters.
Unicode hex 0080 through 07FF convert to two 8-bit characters; these include the Latin-1 Supplement and Latin Extended characters and the Greek, Cyrillic, Hebrew, and Arabic alphabets.
Unicode hex 0800 through FFFF convert to three 8-bit characters; these comprise the rest of the Unicode Basic Multilingual Plane. Thus, the ASCII characters $CHAR(0) through $CHAR(127) are the same in RAW and UTF8 mode; characters $CHAR(128) and above are converted.
Input mode reverses this conversion. Refer to Unicode for further details.
-
XML: Adds (output mode) or removes (input mode) XML escape characters to a string. See the Output Escaping table.
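The UTF-8 ranges described in the list above can be checked with a short sketch that converts one character from each range and reports how many 8-bit characters result. The three code points (65, 233, and 8364) are chosen here for illustration and are not part of the reference itself.

```objectscript
  // One code point from each range: ASCII, 0080-07FF, and 0800-FFFF
  FOR code = 65, 233, 8364 {
    SET utf = $ZCONVERT($CHAR(code),"O","UTF8")
    // $LENGTH counts the 8-bit characters produced by the conversion
    WRITE "hex ",$ZHEX(code),": ",$LENGTH(utf)," 8-bit character(s)",!
  }
```

Given the ranges listed above, the three conversions should produce one, two, and three 8-bit characters respectively.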
The rest of the translation tables are specific to character set conversion, and each is named after the character set that it handles. These tables include the following:
-
UnicodeLittle
-
UnicodeBig
-
CP1250
-
CP1251
-
CP1252
-
CP1253
-
CP1255
-
CP437
-
CP850
-
CP852
-
CP866
-
CP874
-
EBCDIC
-
Latin2
-
Latin9
-
LatinC
-
LatinG
-
LatinH
-
LatinT
See Related APIs, which includes a way to list the current translation tables.
Output Escaping
This section indicates how specific translation tables convert characters in output mode:
Character | HTML | JS | JSON | URI | URL | XML |
---|---|---|---|---|---|---|
null $CHAR(0) | | \x00 | \u0000 | %00 | %00 | |
$CHAR(1) through $CHAR(7) | | \x01 through \x07 | \u0001 through \u0007 | %01 through %07 | %01 through %07 | |
backspace $CHAR(8) | | \b | \b | %08 | %08 | |
horizontal tab $CHAR(9) | | \t | \t | %09 | %09 | |
line feed $CHAR(10) | | \n | \n | %0A | %0A | |
vertical tab $CHAR(11) | | \v | \u000B | %0B | %0B | |
form feed $CHAR(12) | | \f | \f | %0C | %0C | |
carriage return $CHAR(13) | | \r | \r | %0D | %0D | |
$CHAR(14) through $CHAR(31) | | | \u000E through \u001F | %0E through %1F | %0E through %1F | |
$CHAR(32) | | | | %20 | %20 | |
" (doubled) | &quot; | \" | \" | %22 | %22 | &quot; |
# | | | | %23 | %23 | |
$ | | | | %24 | | |
% | | | | %25 | %25 | |
& | &amp; | | | %26 | %26 | &amp; |
' (apostrophe) $CHAR(39) | &#39; | \' | | %27 | | &apos; |
( | | | | %28 | | |
) | | | | %29 | | |
* | | | | %2A | | |
+ | | | | %2B | %2B | |
, | | | | %2C | %2C | |
/ (slash) $CHAR(47) | | \/ | | %2F | | |
: | | | | %3A | %3A | |
; | | | | %3B | %3B | |
< | &lt; | | | %3C | %3C | &lt; |
= | | | | %3D | %3D | |
> | &gt; | | | %3E | %3E | &gt; |
? | | | | %3F | %3F | |
@ | | | | %40 | %40 | |
[ | | | | %5B | %5B | |
\ | | \\ | \\ | %5C | %5C | |
] | | | | %5D | %5D | |
^ | | | | %5E | %5E | |
` | | | | %60 | %60 | |
{ | | | | %7B | %7B | |
vertical bar $CHAR(124) | | | | %7C | %7C | |
} | | | | %7D | %7D | |
~ | | | | | %7E | |
$CHAR(127) | | | | %7F | %7F | |
$CHAR(128) through $CHAR(159) | | | | %C2%80 through %C2%9F | %80 through %9F | |
$CHAR(160) | &nbsp; | | | %C2%A0 | %A0 | |
$CHAR(161) through $CHAR(191) | | | | %C2%A1 through %C2%BF | %A1 through %BF | |
$CHAR(192) through $CHAR(255) | | | | %C3%80 through %C3%BF | %C0 through %FF | |
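To see the rows of this table applied to actual data, the following sketch runs one sample string through each of the six tables in output mode. The sample string is invented for illustration; it mixes characters that only some tables escape.

```objectscript
  // Sample containing <, &, a doubled quote, ~, and / so that the tables differ
  SET sample = "a<b & ""c"" ~ d/e"
  FOR table = "HTML","JS","JSON","URI","URL","XML" {
    // Output mode ("O") applies the escaping listed in the table's column
    WRITE table,": ",$ZCONVERT(sample,"O",table),!
  }
```

Comparing the six output lines against the corresponding columns is a quick way to confirm which characters a given table leaves untouched.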
URL and URI Conversions
A URL or URI can only contain certain 8-bit ASCII characters. All other characters must be represented by an escape sequence beginning with %. If you wish to convert a string containing Unicode characters to a URL or URI, you must first convert your local representation to an 8-bit intermediate representation, using UTF-8 encoding. You then convert the UTF-8 results to URL encoding. To convert a URL back to its original Unicode string, you perform the reverse operation. This is shown in the following example:
  SET ustring="US$ to "_$CHAR(8364)_" échange"
  WRITE "initial string is: ",ustring,!
ConvertUnicodeToURL
  // Step 1: encode the Unicode string as UTF-8 (8-bit characters)
  SET utfo = $ZCONVERT(ustring,"O","UTF8")
  // Step 2: apply URL escaping to the UTF-8 result
  SET urlo = $ZCONVERT(utfo,"O","URL")
  WRITE "Unicode to URL conversion: ",urlo,!
ConvertURLtoUnicode
  // Reverse the two steps in the opposite order
  SET urli = $ZCONVERT(urlo,"I","URL")
  SET utfi = $ZCONVERT(urli,"I","UTF8")
  WRITE "URL to Unicode conversion: ",utfi
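Because the URI table performs the UTF-8 step itself (see the URI entry in the List of Tables), a single URI conversion can be compared with the two-step UTF8 and URL route shown above. The following is a sketch for comparison only; the variable names are invented, and the two results are not expected to be identical, since URI and URL escape different sets of ASCII characters (for example, $ and ~).

```objectscript
  // Same content as the example above; $CHAR(233) is the accented "e"
  SET ustring = "US$ to "_$CHAR(8364)_" "_$CHAR(233)_"change"
  // Two-step route: UTF8 first, then URL
  SET twoStep = $ZCONVERT($ZCONVERT(ustring,"O","UTF8"),"O","URL")
  // One-step route: URI encodes non-ASCII characters as UTF-8 before percent-encoding
  SET oneStep = $ZCONVERT(ustring,"O","URI")
  WRITE "UTF8 + URL: ",twoStep,!
  WRITE "URI:        ",oneStep,!
```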
JS and JSML, JSON, and JSONML Conversions
The JS and JSON translations use UTF-8 encoding for Unicode characters, whereas the JSML and JSONML translations render Unicode characters without encoding. For ASCII characters ($CHAR(0) through $CHAR(127)), the JS and JSML encodings are identical, as are the JSON and JSONML encodings.
The following example compares the translation of JS and JSML characters:
  FOR i=1:1:256 {
    SET x=$ZCVT($C(i),"O","JS")
    SET y=$ZCVT($C(i),"O","JSML")
    IF x=y {
      WRITE "."
    } ELSE {
      WRITE !!,$ZHEX(i),!,"JS: "
      ZZDUMP x
      WRITE !,"JSML: "
      ZZDUMP y
    }
  }
Related APIs
For the currently available list of translation tables, refer to the XLTTables property of %SYS.NLS.Locale, as shown in the following example:
SET nlsoref=##class(%SYS.NLS.Locale).%New()
WRITE $LISTTOSTRING(nlsoref.XLTTables,", ")
Also, you can use %Net.Charset to represent character sets within InterSystems IRIS. This class includes the following class methods:
-
GetDefaultCharset() returns the default character set for the current InterSystems IRIS locale (see next heading).
-
GetTranslateTable() returns the name of the InterSystems IRIS translation table for a given input character set.
-
TranslateTableExists() indicates whether the translation table for the given character set has been loaded.
For method signatures, see the class documentation for %Net.Charset.
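The three class methods listed above might be combined as in the following sketch. It assumes that GetTranslateTable() and TranslateTableExists() each take the character set name as their first argument; that assumption, and the "UTF-8" character set name, are illustrative and should be checked against the class documentation.

```objectscript
  // "UTF-8" is an illustrative character set name, not taken from this page
  SET charset = "UTF-8"
  WRITE "default charset: ",##class(%Net.Charset).GetDefaultCharset(),!
  // Assumption: both methods accept the character set name as the first argument
  IF ##class(%Net.Charset).TranslateTableExists(charset) {
    WRITE "table for ",charset,": ",##class(%Net.Charset).GetTranslateTable(charset),!
  } ELSE {
    WRITE "no loaded translation table for ",charset,!
  }
```

Checking that the table exists before using its name avoids passing an unavailable table to $ZCONVERT or to a device.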