
Translation Tables

InterSystems IRIS® data platform uses translation tables (also known as I/O tables) to convert characters. Some API calls, as well as the $ZCONVERT function, accept a translation table as an argument. This page provides reference information on the available translation tables.

Introduction

There are two general scenarios in which translation tables are used to convert characters:

  • In many contexts (such as in URLs, in HTML, in JSON, and so on), specific characters are disallowed and must be represented by escape sequences. In this case, it is necessary to convert the characters to or from the allowed set of characters.

  • If you are reading from a source outside the database or writing to a destination outside the database, that entity may expect a different character set than InterSystems IRIS uses. In this case, it is necessary to convert the character encoding.

The “translation table” for a given context is actually a pair of tables. One table specifies how to convert from the default character set to the foreign character set (or to the foreign context), and the other specifies how to convert in the opposite direction. In InterSystems IRIS, the convention is to refer to this pair of tables as a single unit that has an input mode and an output mode. Thus, there is an HTML translation table for managing conversions to and from HTML, and there is a CP1250 translation table for managing conversions to and from the CP1250 character set.
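As a minimal sketch of the two modes, the following passes the same table name (HTML) to $ZCONVERT, first with the "O" (output) flag and then with the "I" (input) flag. The variable names and the sample string are illustrative only.

```objectscript
  // Output mode: convert from the internal form to the HTML context
  SET plain = "<b>Tom & Jerry</b>"
  SET escaped = $ZCONVERT(plain,"O","HTML")
  WRITE "output mode: ",escaped,!
  // Input mode: the same HTML table, applied in the opposite direction
  SET restored = $ZCONVERT(escaped,"I","HTML")
  WRITE "input mode:  ",restored,!
  // Writes 1 if the round trip returned the original string, 0 otherwise
  WRITE "round trip preserved the string: ",(restored = plain),!
```

Only the direction flag changes between the two calls; the table name stays the same, which is why the pair is treated as a single unit.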

List of Tables

The following is a list of the InterSystems IRIS translation tables:

RAW

Performs no translation for 8-bit characters or 16-bit Latin-1 characters (Unicode characters in which the high-order byte has the value 00).

RAW translation should not be used for InterSystems IRIS systems using non-Latin-1 locales, such as rusw.

SAME

Translates 8-bit characters to the corresponding Unicode characters.

HTML

Adds HTML escape sequences to a string (output mode) or removes them (input mode). See the Output Escaping table.

JS or JSML

Uses a supplied JavaScript translation table to escape characters in the string for use within JavaScript. For output translations, see the Output Escaping table. For comparison of JS and JSML, see JS and JSML, JSON and JSONML Conversions. For input translations, “\0”, “\000”, “\x00”, and “\u0000” are all valid escape sequences for NULL.

JSON or JSONML

Uses a supplied translation table to convert to JSON format. For output translations, see the Output Escaping table. For comparison of JSON and JSONML, see JS and JSML, JSON and JSONML Conversions. For input translations, “\0”, “\000”, “\x00”, and “\u0000” are all valid escape sequences for NULL.
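To check the input-mode behavior described above, a short sketch such as the following could be run. It converts each of the listed NULL escape sequences with the JSON table and reports the length and character code of the result. If the conversions behave as described, each result is a single character with code 0. The variable names are illustrative.

```objectscript
  // Each of these strings is described above as a valid escape sequence for NULL
  FOR seq="\0","\000","\x00","\u0000" {
      SET char = $ZCONVERT(seq,"I","JSON")
      WRITE seq," -> length ",$LENGTH(char),", code ",$ASCII(char),!
  }
```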

URI

Adds URI parameter escape sequences to a string (output mode) or removes them (input mode). URI encodes the space character and the characters !"#$%&'()*+,/:;<=>?@[]^`{|} as follows, in that order: %20%21%22%23%24%25%26%27%28%29%2A%2B%2C%2F%3A%3B%3C%3D%3E%3F%40%5B%5D%5E%60%7B%7C%7D.

The space character is encoded as %20.

The double quote character (which must be escaped by doubling when included in a quoted string such as "My ""perfect"" code") is encoded as %22.

URI does not encode the tilde (~) character. See the Output Escaping table.

URI encodes characters higher than $CHAR(255) (Unicode characters) as UTF-8 and then % encodes the UTF-8 values in hexadecimal notation.

Also see URL and URI Conversions.

URL

Adds URL parameter escape sequences to a string (output mode) or removes them (input mode). URL encodes the space character and the characters "#%&+,:;<=>?@[]^`{|}~ as follows, in that order: %20%22%23%25%26%2B%2C%3A%3B%3C%3D%3E%3F%40%5B%5D%5E%60%7B%7C%7D%7E.

The space character is encoded as %20.

The double quote character (which must be escaped by doubling when included in a quoted string such as "My ""perfect"" code") is encoded as %22.

Refer to the Output Escaping table. Characters higher than $CHAR(255) are represented in Unicode hexadecimal notation: $CHAR(256) = %u0100.

Also see URL and URI Conversions.
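The differences between the URI and URL tables described above can be observed with a short sketch like the following. The sample strings are illustrative. Per the descriptions, the tilde should be left as-is by URI and encoded by URL, and $CHAR(256) should appear as percent-encoded UTF-8 under URI and in %u notation under URL.

```objectscript
  // A space and a tilde: both tables encode the space, only URL encodes the tilde
  SET sample = "a b~c"
  WRITE "URI: ",$ZCONVERT(sample,"O","URI"),!
  WRITE "URL: ",$ZCONVERT(sample,"O","URL"),!
  // A character above $CHAR(255): the two tables use different notations
  SET wide = $CHAR(256)
  WRITE "URI: ",$ZCONVERT(wide,"O","URI"),!
  WRITE "URL: ",$ZCONVERT(wide,"O","URL"),!
```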

UTF8

UTF-8 encoding. This converts (output mode) 16-bit Unicode characters to a series of 8-bit characters. An ASCII 16-bit Unicode character translates to a single 8-bit character; for example, hex 0041 (the letter “A”) translates to the 8-bit character hex 41. A non-ASCII Unicode character is converted to two or three 8-bit characters.

Unicode hex 0080 through 07FF convert to two 8-bit characters; these include the Latin-1 Supplement and Latin Extended characters and the Greek, Cyrillic, Hebrew, and Arabic alphabets.

Unicode hex 0800 through FFFF convert to three 8-bit characters; these comprise the rest of the Unicode Basic Multilingual Plane. Thus, the ASCII characters $CHAR(0) through $CHAR(127) are the same in RAW and UTF8 mode; characters $CHAR(128) and above are converted.

Input mode reverses this conversion. Refer to Unicode for further details.
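The ranges above can be illustrated with one character from each: $CHAR(65) (hex 0041), $CHAR(233) (hex 00E9), and $CHAR(8364) (hex 20AC). The following sketch converts each one in output mode and reports how many 8-bit characters result. According to the ranges above, the counts should be one, two, and three respectively. The choice of sample characters is illustrative.

```objectscript
  FOR code=65,233,8364 {
      SET utf = $ZCONVERT($CHAR(code),"O","UTF8")
      WRITE "hex ",$ZHEX(code),": ",$LENGTH(utf)," 8-bit character(s)"
      // Show the individual 8-bit values
      ZZDUMP utf
      WRITE !
  }
```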

XML

Adds XML escape sequences to a string (output mode) or removes them (input mode). See the Output Escaping table.

Other tables

The rest of the translation tables are specific to character set conversion, and these tables are named the same as those character sets. The tables include the following:

  • UnicodeLittle

  • UnicodeBig

  • CP1250

  • CP1251

  • CP1252

  • CP1253

  • CP1255

  • CP437

  • CP850

  • CP852

  • CP866

  • CP874

  • EBCDIC

  • Latin2

  • Latin9

  • LatinC

  • LatinG

  • LatinH

  • LatinT

See Related APIs, which includes a way to list the current translation tables.
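Character set tables are typically applied when reading from or writing to something outside the database. The following is a minimal sketch of that scenario. It assumes the %Stream.FileCharacter class with its TranslateTable property, and it uses a hypothetical file path and the UTF8 table. Another table name could be substituted if that table is loaded on your system.

```objectscript
  // Hypothetical file path; adjust for your system
  SET path = "/tmp/translation-demo.txt"
  SET stream = ##class(%Stream.FileCharacter).%New()
  SET sc = stream.LinkToFile(path)
  IF $SYSTEM.Status.IsError(sc) {
      DO $SYSTEM.Status.DisplayError(sc)
  } ELSE {
      // Name of the translation table to apply to the file contents
      SET stream.TranslateTable = "UTF8"
      DO stream.WriteLine("price: "_$CHAR(8364)_"10")
      SET sc = stream.%Save()
      IF $SYSTEM.Status.IsError(sc) { DO $SYSTEM.Status.DisplayError(sc) }
  }
```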

Output Escaping

This section indicates how specific translation tables convert characters in output mode:

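| Character | HTML | JS | JSON | URI | URL | XML |
|---|---|---|---|---|---|---|
| null $CHAR(0) | | \x00 | \u0000 | %00 | %00 | |
| $CHAR(1) through $CHAR(7) | | \x01 through \x07 | \u0001 through \u0007 | %01 through %07 | %01 through %07 | |
| backspace $CHAR(8) | | \b | \b | %08 | %08 | |
| horizontal tab $CHAR(9) | | \t | \t | %09 | %09 | |
| line feed $CHAR(10) | | \n | \n | %0A | %0A | |
| vertical tab $CHAR(11) | | \v | \u000B | %0B | %0B | |
| form feed $CHAR(12) | | \f | \f | %0C | %0C | |
| carriage return $CHAR(13) | | \r | \r | %0D | %0D | |
| $CHAR(14) through $CHAR(31) | | | \u000E through \u001F | %0E through %1F | %0E through %1F | |
| $CHAR(32) | | | | %20 | %20 | |
| " (doubled) | &quot; | \" | \" | %22 | %22 | &quot; |
| # | | | | %23 | %23 | |
| $ | | | | %24 | | |
| % | | | | %25 | %25 | |
| & | &amp; | | | %26 | %26 | &amp; |
| ' (apostrophe) $CHAR(39) | &#39; | \' | | %27 | | &apos; |
| ( | | | | %28 | | |
| ) | | | | %29 | | |
| * | | | | %2A | | |
| + | | | | %2B | %2B | |
| , | | | | %2C | %2C | |
| / (slash) $CHAR(47) | | \/ | | %2F | | |
| : | | | | %3A | %3A | |
| ; | | | | %3B | %3B | |
| < | &lt; | | | %3C | %3C | &lt; |
| = | | | | %3D | %3D | |
| > | &gt; | | | %3E | %3E | &gt; |
| ? | | | | %3F | %3F | |
| @ | | | | %40 | %40 | |
| [ | | | | %5B | %5B | |
| \ | | \\ | \\ | %5C | %5C | |
| ] | | | | %5D | %5D | |
| ^ | | | | %5E | %5E | |
| ` | | | | %60 | %60 | |
| { | | | | %7B | %7B | |
| vertical bar $CHAR(124) | | | | %7C | %7C | |
| } | | | | %7D | %7D | |
| ~ | | | | | %7E | |
| $CHAR(127) | | | | %7F | %7F | |
| $CHAR(128) through $CHAR(159) | | | | %C2%80 through %C2%9F | %80 through %9F | |
| $CHAR(160) | &nbsp; | | | %C2%A0 | %A0 | |
| $CHAR(161) through $CHAR(191) | | | | %C2%A1 through %C2%BF | %A1 through %BF | |
| $CHAR(192) through $CHAR(255) | | | | %C3%80 through %C3%BF | %C0 through %FF | |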
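To reproduce individual rows of this table on your own system, a sketch along the following lines could be used. It converts a handful of sample characters in output mode with each of the six tables. The selection of character codes (tab, double quote, ampersand, less-than, tilde) is illustrative.

```objectscript
  SET tables = $LISTBUILD("HTML","JS","JSON","URI","URL","XML")
  FOR code=9,34,38,60,126 {
      WRITE "$CHAR(",code,"):"
      // Convert the same character with each table in turn
      FOR t=1:1:$LISTLENGTH(tables) {
          SET name = $LIST(tables,t)
          WRITE " ",name,"=",$ZCONVERT($CHAR(code),"O",name)
      }
      WRITE !
  }
```

A table that does not escape a given character writes that character unchanged, which corresponds to an empty cell in the table.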

URL and URI Conversions

A URL or URI can only contain certain 8-bit ASCII characters. All other characters must be represented by an escape sequence beginning with %. If you wish to convert a string containing Unicode characters to a URL or URI, you must first convert your local representation to an 8-bit intermediate representation, using UTF-8 encoding. You then convert the UTF-8 results to URL encoding. To convert a URL back to its original Unicode string, you perform the reverse operation. This is shown in the following example:

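  // Build a string that mixes ASCII, the euro sign ($CHAR(8364)), and a Latin-1 character
  SET ustring="US$ to "_$CHAR(8364)_" échange"
  WRITE "initial string is: ",ustring,!
ConvertUnicodeToURL
  // First encode as UTF-8, then apply URL escaping to the 8-bit result
  SET utfo = $ZCONVERT(ustring,"O","UTF8")
  SET urlo = $ZCONVERT(utfo,"O","URL")
  WRITE "Unicode to URL conversion: ",urlo,!
ConvertURLtoUnicode
  // Reverse the steps: remove URL escaping, then decode the UTF-8
  SET urli = $ZCONVERT(urlo,"I","URL")
  SET utfi = $ZCONVERT(urli,"I","UTF8")
  WRITE "URL to Unicode conversion: ",utfi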

JS and JSML, JSON, and JSONML Conversions

The JS and JSON translations use UTF-8 encoding for Unicode characters. The JSML and JSONML translations render Unicode characters without encoding. For ASCII characters ($CHAR(0) through $CHAR(127)), the JS and JSML encodings are identical, as are the JSON and JSONML encodings.

The following example compares the JS and JSML translations of each character from $CHAR(1) through $CHAR(256):

  FOR i=1:1:256 {
      SET x=$ZCVT($C(i),"O","JS")
      SET y=$ZCVT($C(i),"O","JSML") 
      IF x=y {
        WRITE "."
      } ELSE {
        WRITE !!,$ZHEX(i),!,"JS: " ZZDUMP x WRITE !,"JSML: " ZZDUMP y 
      }
  }

Related APIs

For the currently available list of translation tables, refer to the XLTTables property of %SYS.NLS.Locale, as shown in the following example:

  SET nlsoref=##class(%SYS.NLS.Locale).%New()
  WRITE $LISTTOSTRING(nlsoref.XLTTables,", ")

Also, you can use %Net.Charset to represent character sets within InterSystems IRIS. This class includes the following class methods:

  • GetDefaultCharset() returns the default character set for the current InterSystems IRIS locale (see next heading).

  • GetTranslateTable() returns the name of the InterSystems IRIS translation table for a given input character set.

  • TranslateTableExists() indicates whether the translation table for the given character set has been loaded.

For method signatures, see the class documentation for %Net.Charset.
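As a rough sketch of how these class methods might be combined, the following looks up the translation table for a sample character set name. It assumes that GetTranslateTable() and TranslateTableExists() can each be called with the character set name as the only argument; confirm this against the class documentation. The name "utf-8" is only a sample value.

```objectscript
  // "utf-8" is used here only as a sample character set name
  SET charset = "utf-8"
  WRITE "default charset: ",##class(%Net.Charset).GetDefaultCharset(),!
  IF ##class(%Net.Charset).TranslateTableExists(charset) {
      WRITE charset," uses table: ",##class(%Net.Charset).GetTranslateTable(charset),!
  } ELSE {
      WRITE "no translation table loaded for ",charset,!
  }
```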
