HealthShare Health Connect 2024.3

Translation Tables

InterSystems IRIS® data platform uses translation tables (also known as I/O tables) for the task of converting characters. Some API calls (and the $zconvert function) can accept a translation table as an argument. This page provides reference information on the available translation tables.

Introduction

There are two general scenarios in which translation tables are used to convert characters:

  • In many contexts (such as in URLs, in HTML, in JSON, and so on), specific characters are disallowed and must be represented by escape sequences. In this case, it is necessary to convert the characters to or from the allowed set of characters.

  • If you are reading from a source outside the database or writing to a destination outside the database, that entity may expect a different character set than InterSystems IRIS uses. In this case, it is necessary to convert the character encoding.

The “translation table” for a given context is actually a pair of tables. One table specifies how to convert from the default character set to the foreign character set (or to the foreign context), and the other specifies how to convert in the opposite direction. In InterSystems IRIS, the convention is to refer to this pair of tables as a single unit that has an input mode and an output mode. Thus, there is an HTML translation table for managing conversions to and from HTML, and there is a CP1250 translation table for managing conversions to and from the CP1250 character set.
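
As a minimal sketch of the two modes, the following uses the HTML table with the $ZCONVERT mode codes "O" (output) and "I" (input) that appear elsewhere on this page. The variable names are illustrative, and the results in the comments are derived from the Output Escaping table on this page rather than captured from a session.

```objectscript
    // Output mode: convert from the local form to the foreign context (HTML)
    set raw = "<b>Tom & Jerry</b>"
    set escaped = $ZCONVERT(raw,"O","HTML")
    write escaped,!        // expected: &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;

    // Input mode: the same table name, used in the opposite direction
    set restored = $ZCONVERT(escaped,"I","HTML")
    write restored,!       // expected: <b>Tom & Jerry</b>
```

The same table name serves both directions; only the mode argument changes.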

List of Tables

The following is a list of the InterSystems IRIS translation tables:

RAW

On Windows, InterSystems IRIS performs no translation for 8-bit characters or 16-bit Latin-1 characters (Unicode characters in which the high-order byte has the value 00).

On UNIX®, if the LANG environment variable specifies an encoding (e.g. "UTF-8" as in "LANG=en_US.UTF-8"), and if that encoding corresponds to a known translation table, then the default system call translation will be set to that table. (Otherwise, InterSystems IRIS performs no translation, as in the Windows case.)

RAW translation should not be used for InterSystems IRIS systems using non-Latin-1 locales, such as rusw.

SAME

Translates 8-bit characters to the corresponding Unicode characters.

HTML

Adds HTML escape sequences to a string (output mode) or removes them from a string (input mode). See the Output Escaping table.

JS or JSML

Uses a supplied JavaScript translation table to escape characters in the string for use within JavaScript. For output translations, see the Output Escaping table. For input translations, “\0”, “\000”, “\x00”, and “\u0000” are all valid escape sequences for NULL.

JSON or JSONML

Uses a supplied translation table to convert to JSON format. For output translations, see the Output Escaping table. For input translations, “\0”, “\000”, “\x00”, and “\u0000” are all valid escape sequences for NULL.

URI

Adds URI parameter escape sequences to a string (output mode) or removes them from a string (input mode). URI encodes the space character and the characters !"#$%&'()*+,/:;<=>?@[]^`{|} as follows, in the same order: %20%21%22%23%24%25%26%27%28%29%2A%2B%2C%2F%3A%3B%3C%3D%3E%3F%40%5B%5D%5E%60%7B%7C%7D.

The space character is encoded as %20.

The double quote character (which must be escaped by doubling when included in a quoted string such as "My ""perfect"" code") is encoded as %22.

URI does not encode the tilde (~) character. See the Output Escaping table.

URI encodes characters higher than $CHAR(255) (Unicode characters) as UTF-8 and then % encodes the UTF-8 values in hexadecimal notation.

Also see Sequential Character Conversion and Character Escaping.

URL

Adds URL parameter escape sequences to a string (output mode) or removes them from a string (input mode). URL encodes the space character and the characters "#%&+,:;<=>?@[]^`{|}~ as follows, in the same order: %20%22%23%25%26%2B%2C%3A%3B%3C%3D%3E%3F%40%5B%5D%5E%60%7B%7C%7D%7E.

The space character is encoded as %20.

The double quote character (which must be escaped by doubling when included in a quoted string such as "My ""perfect"" code") is encoded as %22.

Refer to the Output Escaping table. Characters higher than $CHAR(255) are represented in Unicode hexadecimal notation: $CHAR(256) = %u0100.

Also see Sequential Character Conversion and Character Escaping.

UTF8

UTF-8 encoding. In output mode, this converts 16-bit Unicode characters to a series of 8-bit characters. An ASCII 16-bit Unicode character translates to a single 8-bit character; for example, hex 0041 (the letter “A”) translates to the 8-bit character hex 41. A non-ASCII Unicode character is converted to two or three 8-bit characters.

Unicode hex 0080 through 07FF convert to two 8-bit characters; these include the Latin-1 Supplement and Latin Extended characters and the Greek, Cyrillic, Hebrew, and Arabic alphabets.

Unicode hex 0800 through FFFF convert to three 8-bit characters; these comprise the rest of the Unicode Basic Multilingual Plane. Thus, the ASCII characters $CHAR(0) through $CHAR(127) are the same in RAW and UTF8 mode; characters $CHAR(128) and above are converted.

Input mode reverses this conversion. Refer to Unicode for further details.
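
The byte counts described above can be checked with a short sketch such as the following. The sample characters (the letter A, Greek small lambda at U+03BB, and the euro sign at U+20AC) are chosen here for illustration and are not from the original text; the values in the comments follow from the UTF-8 encoding rules.

```objectscript
    // One character from each range: ASCII, 0080-07FF, and 0800-FFFF
    set a = $ZCONVERT("A","O","UTF8")
    set lambda = $ZCONVERT($CHAR(955),"O","UTF8")    // U+03BB
    set euro = $ZCONVERT($CHAR(8364),"O","UTF8")     // U+20AC

    // Each resulting string holds one 8-bit character per UTF-8 byte
    write $LENGTH(a)," ",$LENGTH(lambda)," ",$LENGTH(euro),!    // expected: 1 2 3

    // The two bytes of lambda are hex CE BB
    write $ASCII(lambda,1)," ",$ASCII(lambda,2),!               // expected: 206 187

    // Input mode reverses the conversion
    write ($ZCONVERT(lambda,"I","UTF8") = $CHAR(955)),!         // expected: 1
```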

XML

Adds XML escape sequences to a string (output mode) or removes them from a string (input mode). See the Output Escaping table.

Other tables

The rest of the translation tables are specific to character set conversion, and these tables have the same name as those character sets. The tables include the following:

  • UnicodeLittle

  • UnicodeBig

  • CP1250

  • CP1251

  • CP1252

  • CP1253

  • CP1255

  • CP437

  • CP850

  • CP852

  • CP866

  • CP874

  • EBCDIC

  • Latin2

  • Latin9

  • LatinC

  • LatinG

  • LatinH

  • LatinT

See Related APIs, which includes a way to list the current translation tables.

Output Escaping

This section indicates how specific translation tables convert characters in output mode:

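| Character | HTML | JS or JSML | JSON or JSONML | URI | URL | XML |
|---|---|---|---|---|---|---|
| null $CHAR(0) | | \x00 | \u0000 | %00 | %00 | A null character is prohibited in XML |
| $CHAR(1) through $CHAR(7) | | \x01 through \x07 | \u0001 through \u0007 | %01 through %07 | %01 through %07 | |
| backspace $CHAR(8) | | \b | \b | %08 | %08 | |
| horizontal tab $CHAR(9) | | \t | \t | %09 | %09 | |
| line feed $CHAR(10) | | \n | \n | %0A | %0A | |
| vertical tab $CHAR(11) | | \v | \u000B | %0B | %0B | |
| form feed $CHAR(12) | | \f | \f | %0C | %0C | |
| carriage return $CHAR(13) | | \r | \r | %0D | %0D | |
| $CHAR(14) through $CHAR(31) | | | \u000E through \u001F | %0E through %1F | %0E through %1F | |
| $CHAR(32) | | | | %20 | %20 | |
| " (doubled) | &quot; | \" | \" | %22 | %22 | &quot; |
| # | | | | %23 | %23 | |
| $ | | | | %24 | | |
| % | | | | %25 | %25 | |
| & | &amp; | | | %26 | %26 | &amp; |
| ' (apostrophe) $CHAR(39) | &#39; | \' | | %27 | | &apos; |
| ( | | | | %28 | | |
| ) | | | | %29 | | |
| * | | | | %2A | | |
| + | | | | %2B | %2B | |
| , | | | | %2C | %2C | |
| / (slash) $CHAR(47) | | \/ | | %2F | | |
| : | | | | %3A | %3A | |
| ; | | | | %3B | %3B | |
| < | &lt; | | | %3C | %3C | &lt; |
| = | | | | %3D | %3D | |
| > | &gt; | | | %3E | %3E | &gt; |
| ? | | | | %3F | %3F | |
| @ | | | | %40 | %40 | |
| [ | | | | %5B | %5B | |
| \ | | \\ | \\ | %5C | %5C | |
| ] | | | | %5D | %5D | |
| ^ | | | | %5E | %5E | |
| ` | | | | %60 | %60 | |
| { | | | | %7B | %7B | |
| vertical bar $CHAR(124) | | | | %7C | %7C | |
| } | | | | %7D | %7D | |
| ~ | | | | This character is not permitted in a URI | %7E | |
| $CHAR(127) | | | | %7F | %7F | |
| $CHAR(128) through $CHAR(159) | | | | %C2%80 through %C2%9F | %80 through %9F | |
| $CHAR(160) | &nbsp; | | | %C2%A0 | %A0 | |
| $CHAR(161) through $CHAR(191) | | | | %C2%A1 through %C2%BF | %A1 through %BF | |
| $CHAR(192) through $CHAR(255) | | | | %C3%80 through %C3%BF | %C0 through %FF | |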
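
A few rows of this table can be spot-checked with a sketch like the one below. The sample strings are illustrative, and the expected results in the comments are read off the table above rather than taken from a recorded session, so treat them as what the table predicts.

```objectscript
    // The apostrophe is escaped differently by the HTML and XML tables
    write $ZCONVERT("it's","O","HTML"),!      // expected: it&#39;s
    write $ZCONVERT("it's","O","XML"),!       // expected: it&apos;s

    // URL escaping of the space, ampersand, and equals sign
    write $ZCONVERT("a b&c=d","O","URL"),!    // expected: a%20b%26c%3Dd

    // JSON escaping of a horizontal tab, $CHAR(9)
    write $ZCONVERT("tab"_$CHAR(9)_"end","O","JSON"),!    // expected: tab\tend
```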

For Unicode characters (characters above $CHAR(255)):

  • The JSML and JSONML translation tables perform escaping, not described here.

  • The URL translation table performs escaping, not described here.

  • The URI translation table is irrelevant because URIs cannot contain characters above $CHAR(255). If you attempt to use the URI translation table with such characters, the result is an <ILLEGAL VALUE> error. For example:

    USER>set x=$char(955)
     
    USER>w $ZCVT(x,"O","URI")
     
    W $ZCVT(x,"O","URI")
    ^
    <ILLEGAL VALUE>
    
  • The HTML and XML translation tables do not perform escaping.

Sequential Character Conversion and Character Escaping

In some scenarios, you may want to perform two conversions: one to convert to a different character set, and another to perform character escaping. In such cases, the order of operations is important, and generally you need to convert to the applicable character set and then perform the escaping. In the reverse direction, it is necessary to perform the reverse conversions in the reverse order.

An example best demonstrates this. Suppose that we start with a string that uses our local character set, and suppose that this string could potentially include Unicode characters. Suppose that we need to use this string within a URI. A URI can contain only ASCII characters, and within that set of characters, there are specific escape sequences for some characters. In this case, we can convert our string for use in a URI in two steps:

  1. First convert the local representation to UTF-8, so that the string consists only of 8-bit characters. For example, given our input string origstring:

     set utf8string = $ZCONVERT(origstring,"O","UTF8")
    
  2. Then apply the character escaping:

     set final = $ZCONVERT(utf8string,"O","URI")
    

    The string final is safe to use within a URI.

To convert a URI back to our local character set, you perform the reverse operation:

  1. Unescape the escaped characters:

     set unescaped=$ZCONVERT(uristring,"I","URI")
    
  2. Convert from UTF-8 to your local representation:

     set local=$ZCONVERT(unescaped,"I","UTF8")
    

As explained above in the entry for the URI translation table, you can also convert directly, skipping the character set conversions; in this case, the $ZCONVERT function converts the character set for you.

Related APIs

For the currently available list of translation tables, refer to the XLTTables property of %SYS.NLS.Locale, as shown in the following example (with line breaks added):

USER>SET nlsoref=##class(%SYS.NLS.Locale).%New()
 
USER>WRITE $LISTTOSTRING(nlsoref.XLTTables,", ")
Unicode, RAW, BIN, SAME, UTF8, UnicodeLittle, UnicodeBig, URL, JS, JSML, JSON, JSONML, HTML, 
XML, XMLA, XMLC, CP1250, CP1251, CP1252, CP1253, CP1255, CP437, CP850, CP852, CP866, CP874, 
EBCDIC, Latin2, Latin9, LatinC, LatinG, LatinH, LatinT

Also, you can use %Net.Charset to represent character sets within InterSystems IRIS. This class includes the following class methods:

  • GetDefaultCharset() returns the default character set for the current InterSystems IRIS locale (see next heading).

  • GetTranslateTable() returns the name of the InterSystems IRIS translation table for a given input character set. For example:

    USER>w ##class(%Net.Charset).GetTranslateTable("UTF8")
    UTF8
    
  • TranslateTableExists() indicates whether the translation table for the given character set has been loaded.

For method signatures, see the class documentation for %Net.Charset.
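
The following sketch shows one way these methods might be combined to pick a translation table for an externally supplied character set name. It assumes that TranslateTableExists() and GetTranslateTable() can each be called with the character set name as their only argument; confirm this against the class documentation. The charset value and the variable names are hypothetical.

```objectscript
    // Hypothetical input, for example a charset name taken from an HTTP header
    set charset = "iso-8859-2"

    // Assumption: both methods accept the character set name as a single argument
    if ##class(%Net.Charset).TranslateTableExists(charset) {
        set table = ##class(%Net.Charset).GetTranslateTable(charset)
        write "Translation table for ",charset,": ",table,!
    } else {
        write "No loaded translation table for ",charset,!
    }

    // The default character set for the current locale
    write "Default charset: ",##class(%Net.Charset).GetDefaultCharset(),!
```

The table name returned by GetTranslateTable() can then be passed as the third argument of $ZCONVERT.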

See Also
