Regular Expressions

Caché supports regular expressions for use with the ObjectScript functions $LOCATE and $MATCH, and with the methods of the %Regex.Matcher class.

All other Caché substring matching operations use the Caché Pattern Matching operators.
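
As a minimal sketch of how these entry points differ (the sample strings and the variable name matcher are illustrative only): $MATCH tests whether the entire string matches the regular expression, $LOCATE returns the position of the first match, and a %Regex.Matcher object can step through successive matches with its Locate() method:

  WRITE $MATCH("ABCDEFG","ABC"),!      // 0: the regexp must match the entire string
  WRITE $MATCH("ABCDEFG","ABC.*"),!    // 1: .* accounts for the remaining characters
  WRITE $LOCATE("ABCDEFG","DEF"),!     // 4: position of the first match
  // step through every run of digits in an illustrative string
  SET matcher=##class(%Regex.Matcher).%New("\d+","order 66, aisle 7")
  WHILE matcher.Locate() {
      WRITE "found ",matcher.Group," at position ",matcher.Start,!
  }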

This chapter describes the following features of regular expressions:

Wildcard and Quantifiers. Example: .* matches any number of characters of any type.
Literals and Character Ranges. Example: [A-Z] matches a single uppercase character in the range A through Z.
Character Type Meta-Characters are sequences that match a group of characters:
- Single-letter Character Types. Example: \d matches any digit character.
- Unicode Property Character Types. Example: \p{LL} matches any lowercase letter.
- POSIX Character Types. Example: [:print:] matches any printable character.
Grouping Construct uses parentheses to repeatedly apply a regular expression. Example: (\p{LL})+ checks each character to determine if it is a lowercase letter.
Anchors that limit where a match can occur. Example: \b(day) matches only those occurrences of “day” that occur at a word boundary.
Logical Operators. Example: [[:upper:]&&[:greek:]] matches uppercase Greek letters.
Character Representation Meta-Characters are sequences that match a single character.
- Hexadecimal, Octal, and Unicode Representation. Example: \x5A is the hexadecimal representation for the letter Z.
- Control Character Representation. Example: \cM is the carriage return control character.
- Symbol Name Representation. Example: \N{equals sign} is the = character.
Modes. Example: (?i) makes all subsequent matches not case-sensitive.
Comments. Example: (?# date and 24–hour time) inserts this comment into the regular expression string.
Error Messages.

The Caché implementation of regular expressions is based on the International Components for Unicode (ICU) standard for regular expressions. Users familiar with Perl regular expressions will find many similarities to the Caché implementation.

Wildcard and Quantifiers

.	Wildcard. Matches any single character of any type, except the line spacing characters $CHAR(10), $CHAR(11), $CHAR(12), $CHAR(13), and $CHAR(133). This exclusion of line spacing characters can be overridden by specifying (?s) single-line mode (as described later in this reference page). Can be used alone “..” = any two characters, or in combination “\d..” = a digit character followed by any two characters of any type. Can be combined with suffixes (with the same line spacing characters restriction): .? = zero or one character of any type. .* = zero or more characters of any type. .+ = one or more characters of any type. .{3} = exactly 3 characters of any type. To end a wildcard sequence, you escape the next literal by using the backslash (\) prefix. For example, the regexp ".*\H\d{2}" matches a string of any characters of any type that ends with the letter “H” followed by a two-digit number.
?	Single-character suffix (0 or 1). Applies regexp 1 or 0 times to string. The regular expressions “\d?”, “[0-9]?”, or “[[:digit:]]?” all match either a single digit or the empty string. The regular expression “.?(log)” can match “blog” (1 occurrence) or “log” (0 occurrences). The regular expression “abc?” can match either “abc” or “ab”.
+	Repetition suffix (1 or more). Applies regexp one or more times to string. For example, “A+” matches the string “AAAAA”. “.+” matches a string of any length of any character type, but does not match the empty string. The regular expressions “\d+”, “[0-9]+”, or “[[:digit:]]+” all match a string of numbers of any length. You can use parentheses for complex repeating patterns. For example, “(AB)+” matches the string “ABABABAB”; “(\d\d\d\s)+” matches a sequence of any length of three numbers alternating with a single blank space.
*	Repetition suffix (0 or more). Applies regexp zero, one, or more than one times to string. For example, “A*” matches the strings “A”, “AAAAA”, and the empty string. “.*” matches a string of any length of any character type, including the empty string. The regular expressions “\d*”, “[0-9]*”, or “[[:digit:]]*” all match a string of numbers of any length or the empty string. You can use parentheses for complex repeating patterns. For example, “(AB)*” matches the string “ABABABAB”; “(\d\d\d\s)*” matches a sequence of any length of three numbers alternating with a single blank space.
{n}	Quantification suffix (n times). The {n} suffix applies regexp exactly n number of times. For example, “\d{5}” matches any number with five digits.
{n,}	Quantification suffix (at least n times). The {n,} suffix applies regexp n or more times. For example, “\d{5,}” matches any number with five or more digits.
{n,m}	Quantification suffix (range). The {n,m} suffix applies regexp a minimum of n times and a maximum of m times (inclusive). For example, “\d{7,10}” matches any number of at least 7 digits but not more than 10 digits.
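
The suffixes in this table can be compared side by side with $MATCH. The following is a small sketch using illustrative strings; because $MATCH tests the entire string, each result reflects whether the quantifier accepts the full input:

  WRITE $MATCH("blog",".?(log)"),!        // 1: one character before "log"
  WRITE $MATCH("log",".?(log)"),!         // 1: zero characters before "log"
  WRITE $MATCH("","\d*"),!                // 1: * accepts the empty string
  WRITE $MATCH("","\d+"),!                // 0: + requires at least one digit
  WRITE $MATCH("02138","\d{5}"),!         // 1: exactly five digits
  WRITE $MATCH("021","\d{5,}"),!          // 0: fewer than five digits
  WRITE $MATCH("12345678","\d{7,10}"),!   // 1: eight digits falls within 7 through 10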

Literals and Character Ranges

Most literal characters can simply be included in a regular expression. For example, the regular expression ".*G.*" specifies that the string must contain the letter G.

Some literal characters are also used as regular expression meta-characters. You must use the escape prefix (the backslash character) before a meta-character that is to be treated as a literal character. The following literal characters require an escape prefix: dollar sign \$; asterisk \*; plus sign \+; period \.; question mark \?; backslash \\; caret \^; vertical bar \|; open and close parentheses \( \); open and close square brackets \[ \]; open and close curly braces \{ \}. The close square bracket ] does not always require an escape prefix; the escape prefix should be used for clarity and consistency.

The quote character does not take an escape prefix; to specify a literal quote character, double it "".
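
The following sketch, using illustrative strings, contrasts escaped and unescaped meta-characters and shows a doubled quote character used as a literal:

  WRITE $MATCH("$9.99","\$\d\.\d\d"),!    // 1: \$ and \. are treated as literals
  WRITE $MATCH("$9x99","\$\d\.\d\d"),!    // 0: \. matches only a period
  WRITE $MATCH("$9x99","\$\d.\d\d"),!     // 1: an unescaped . is the wildcard
  WRITE $LOCATE("say ""hi""",""""),!      // 5: the regexp is a single literal quote character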

The following are ways to specify more than one regular expression match for a literal:

[x]

A specified character or list of characters. Thus [A] means that only the uppercase letter character “A” is a match, and [ACE] matches any one of the letters A, C, or E. Characters may be listed in any sequence. Repeated characters are permitted. You can use a caret (^) to specify the inverse; for example, [^A] means that any character except “A” is a match; [^XYZ] means that any character except X, Y, or Z is a match. By default, these character matches are case-sensitive. You can make character matching not case-sensitive by preceding it with the (?i) mode modifier.

To specify a caret (^) as a literal match character it cannot be the first character in the list. To specify a hyphen ($CHAR(45)) as a literal match character it must be the first or last character in the list. To specify a close bracket (]) as a literal match character it must be the first character in the list. (First character can mean the first character after the ^ inverse operator). Backslash escape prefix literals can also be used; for example [\\AB\[CD] matches backslash (\), open bracket ([), and the letters A, B, C, and D.

[x-z]

A range of specified characters beginning with x and ending with z (inclusive). Though commonly used for letters or numbers, any ascending ASCII sequence can be used as a range. Thus [A-Z] is the range for all uppercase letters. [A-z] is a range that includes not only all uppercase and lowercase letters, but the six ASCII punctuation characters between the alphabets. Specifying a range that is not in ascending ASCII sequence generates a <REGULAR EXPRESSION> error. You can also specify multiple ranges. Thus [A-Za-z] is the range for all uppercase and lowercase letters. You can use a caret (^) as the first character after the open bracket to specify the inverse; for example, [^A-F] means any character except A through F. The caret specifies the inversion of all of the specified ranges; thus [^A-Za-z] means any character except a letter. Ranges of characters and lists of single characters can be combined in any sequence. Thus [ABCa-fXYZ0-9] matches the characters specified and the characters within the specified ranges.

(str)

(str1|str2)

A specified string or a list of strings separated by the OR logical operator (|). Thus (William) matches this exact substring in string, and (William|Willy|Wm\.|Bill) matches any of these substrings. You can use the escape prefix \| to specify a vertical bar as a literal within a string. By default, these substring matches are case-sensitive. You can make a substring match not case-sensitive by preceding it with the (?i) mode modifier. By default, these substring matches can occur anywhere in string. You can restrict substring matching to occurrences at a word boundary by preceding it with \b.
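
The list, range, and string forms described above can be tried as follows. This is a sketch with illustrative strings, not an exhaustive test:

  WRITE $MATCH("C","[ACE]"),!            // 1: C is in the list
  WRITE $MATCH("B","[ACE]"),!            // 0: B is not in the list
  WRITE $MATCH("q","[^A-Za-f]"),!        // 1: q is outside both inverted ranges
  WRITE $MATCH("-","[-AB]"),!            // 1: a hyphen listed first is a literal
  WRITE $MATCH("Bill","(William|Willy|Wm\.|Bill)"),!             // 1: one of the listed strings
  WRITE $LOCATE("Call Wm. Jones","(William|Willy|Wm\.|Bill)"),!  // 6: position of "Wm."
  WRITE $MATCH("bill","(?i)(William|Bill)"),!                    // 1: (?i) disregards letter case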

Character Type Meta-Characters

Caché regular expressions support three sets of character type meta-characters:

Single-letter character types. For example: \d
Unicode property character types. For example: \p{LL}
POSIX character types. For example: [:alpha:]

These character type meta-characters can be used in any regular expression in any combination.

Single-letter Character Types

A single-letter character type meta-character is indicated by the backslash (\) character, followed by a letter. The character type is specified by a lowercase letter (\d = a digit: 0 through 9). For those character types that support inversion, an uppercase letter specifies the inverse of the character type (\D = any character except a digit).

\a	A bell character $CHAR(7). No inverse is supported.
\d	A digit character. The numbers 0 through 9. The inverse is \D.
\e	An escape character $CHAR(27). No inverse is supported.
\f	A form feed character $CHAR(12). No inverse is supported.
\n	A newline character $CHAR(10). No inverse is supported.
\r	A carriage return character $CHAR(13). No inverse is supported.
\s	A spacing character. A blank space, a tab, or a line spacing character, including the following characters: $CHAR(9), $CHAR(10), $CHAR(11), $CHAR(12), $CHAR(13), $CHAR(32), $CHAR(133), and $CHAR(160). The inverse is \S.
\t	A tab character $CHAR(9). No inverse is supported.
\w	A word character. A word character can be a letter, a number, or the underscore character. Valid letters include uppercase and lowercase letters, including Unicode letters. They include the following extended ASCII characters: $CHAR(170), $CHAR(181), $CHAR(186), $CHAR(192) through $CHAR(214), $CHAR(216) through $CHAR(246), $CHAR(248) through $CHAR(256). The inverse is \W.

The \d, \s, and \w meta-characters also match appropriate Unicode characters beyond $CHAR(256).

For meta-character sequences for other individual control characters, see Control Character Representation.
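
A brief sketch of the single-letter character types and their inverses, using illustrative strings:

  WRITE $MATCH("A_1","\w{3}"),!                    // 1: letter, underscore, and number are word characters
  WRITE $MATCH("a"_$CHAR(9)_"b","\w\s\w"),!        // 1: a tab is a spacing character
  WRITE $MATCH("7","\D"),!                         // 0: 7 is a digit, so the inverse fails
  WRITE $MATCH("-","\W"),!                         // 1: a hyphen is not a word character
  WRITE $LOCATE("line one"_$CHAR(13,10),"\r\n"),!  // 9: carriage return followed by newline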

Unicode Property Character Types

Unicode property character type matching matches a single character to a character type specified using the following syntax:

\p{prop}

For example, \p{LL} matches any lowercase letter. A prop keyword consists of one or two letter characters; prop keywords are not case-sensitive. The single-letter prop keywords are the most inclusive; two-letter prop keywords specify a subset.

The inverse is \P{prop}. For example, \P{LL} matches any character that is not a lowercase letter.

The following table shows the characters that match each prop keyword for the first 256 characters (an example Unicode character is provided for the prop keywords that do not match any of the 256 characters):

C: control and miscellaneous characters 0–31, 127–159, 173	CC: control characters 0–31, 127–159	CF: formatting characters 173	CN: unassigned code points (for example, 888)	CO: private use characters (for example, 57344)	CS: surrogates (for example, 55296)
L: letters 65-90, 97–122, 170, 181, 186, 192–214, 216–246, 248–255	LL: lowercase letters 97–122, 170, 181, 186, 223–246, 248–255	LM: modifier letters (for example, 688)	LO: other letters not LL, LU, LT, or LM (for example, 443)	LT: titlecase letters (for example 453)	LU: uppercase letters 65-90, 192–214, 216–222
M: marks (for example, 768)	MC: modification characters (for example, 2307)	ME: marks that enclose (for example, 1160)	MN: accent marks (for example, 768)
N: numbers 48–57, 178–179, 185, 188–190	ND: decimal numbers 48–57	NL: letters representing numbers (for example, 5870)	NO: number subscripts and fractions 178–179, 185, 188–190
P: punctuation 33–35, 37–42, 44–47, 58–59, 63–64, 91–93, 95, 123, 125, 161, 171, 183, 187, 191	PC: connecting punctuation 95	PD: dashes 45	PE: closing punctuation 41, 93, 125	PS: opening punctuation 40, 91, 123	PI: initial punctuation 171	PF: final punctuation 187	PO: other punctuation 33–35, 37–39, 42, 44, 46–47, 58–59, 63–64, 92, 161, 183, 191
S: symbols 36, 43, 60–62, 94, 96, 124, 126, 162–169, 172, 174–177, 180, 182, 184, 215, 247	SC: currency symbols 36, 162–165	SK: combining symbols 94, 96, 168, 175, 180, 184	SM: math symbols 43, 60–62, 124, 126, 172, 177, 215, 247	SO: other symbols 166–167, 169, 174, 176, 182
Z: separators 32, 160	ZL: line separators (for example, 8232)	ZP: paragraph separators (for example, 8233)	ZS: space characters 32, 160

You can use the following code to determine which characters match with a prop keyword:

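  READ prop#2:10          // prop keyword: one or two letters
  READ rangefrom:10       // first character code to test
  READ rangeto:10         // last character code to test
  FOR i=rangefrom:1:rangeto {
      IF $MATCH($CHAR(i),"\p{"_prop_"}")=1 {
         WRITE i,"=",$CHAR(i),!
      }
  }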

POSIX Character Types

POSIX syntax matches a single character to a character type specified by a ptype keyword using either of the following syntax forms:

\p{ptype}
[:ptype:]

For example, [:lower:] or \p{lower} matches any lowercase letter. You can specify the inverse (match anything except a lowercase letter) as follows: [:^lower:] or \P{lower}.

The ptype keywords are not case-sensitive. The general ptype keywords are:

alnum — letters and numbers.
alpha — letters.
blank — the tab $CHAR(9) or space $CHAR(32), $CHAR(160).
cntrl — control characters: $CHAR(0) through $CHAR(31), $CHAR(127) through $CHAR(159).
digit — the numbers 0 through 9.
graph — printable characters, excluding the space character: $CHAR(33) through $CHAR(126), $CHAR(161) through $CHAR(255).
lower — lowercase letters.
math — mathematics characters (a subset of symbol). Includes the following characters: +<=>^|~¬±×÷
print — printable characters, including the space character: $CHAR(32) through $CHAR(126), $CHAR(160) through $CHAR(255).
punct — punctuation characters (excludes symbol characters). Includes the following characters: !"#%&'()*,-./:;?@[\]_{}¡«·»¿
space — spacing characters, including the blank space, tab, and line spacing characters, including the following characters: $CHAR(9), $CHAR(10), $CHAR(11), $CHAR(12), $CHAR(13), $CHAR(32), $CHAR(133), and $CHAR(160).
symbol — symbol characters (excludes punctuation characters). Includes the following characters: $+<=>^`|~¢£¤¥¦§¨©¬®¯°±´¶¸×÷
upper — uppercase letters.
xdigit — hexadecimal digits: the numbers 0 through 9, the uppercase letters A through F, the lowercase letters a through f.

In addition, you can use ptype to specify a Unicode category. For example, [:greek:] matches any character in the Unicode Greek category (this includes the Greek letters which are found in the range $CHAR(900) through $CHAR(974)). A partial list of these POSIX Unicode categories includes: [:arabic:], [:cyrillic:], [:greek:], [:hebrew:], [:hiragana:], [:katakana:], [:latin:], [:thai:]. These Unicode categories can also be represented as [:script=greek:], for example.
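
As a short sketch of the two syntax forms and of a Unicode category keyword: if the keywords behave as described above, each of the following lines should write 1. The $CHAR(945) test (Greek lowercase alpha) assumes a Unicode installation of Caché:

  WRITE $MATCH("a","\p{lower}"),!            // lowercase letter, \p{ptype} form
  WRITE $MATCH("A","\P{lower}"),!            // inverse form: not a lowercase letter
  WRITE $MATCH("3F","[[:xdigit:]]+"),!       // one or more hexadecimal digits
  WRITE $MATCH($CHAR(945),"[[:greek:]]"),!   // a letter in the Greek category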

The following example uses POSIX matching to compare the [:letter:] character set and the [:latin:] character set in the first 256 characters. They differ by a single character, $CHAR(181):

   FOR i=0:1:255 {
     SET letr="foo"
     IF 1=$MATCH($CHAR(i),"[:letter:]") {
       SET letr=$CHAR(i)
     }
     IF 1=$MATCH($CHAR(i),"[:latin:]") {
       SET lat=$CHAR(i)
     }
     ELSE {
       SET lat="foo"
     }
     IF letr '= lat { WRITE i," ",$CHAR(i),! }
   }

Grouping Construct

You can use parentheses to specify a literal or meta-character sequence that is applied repeatedly. For example, the regular expression ([0-9])+ tests each successive character in a string to determine if it is a number.

This usage is shown in the following examples:

  WRITE $MATCH("4567683285759","([0-9])+"),!
      // test for all numbers, no empty string
  WRITE $MATCH("4567683285759","([0-9])*"),!
      // test for all numbers or for empty string
  WRITE $MATCH("Now is the time","\p{LU}(\p{L}|\s)+"),!
      // test for initial uppercase letter, then all letters or spaces
  WRITE $MATCH("MAboston-9a","\p{LU}{2}(\p{LL}|\d|\-)*"),!
      // test for 2 uppercase letters, then all lowercase, numbers, dashes, or ""
  WRITE $MATCH("1^23^456^789","([0-9]+\^?)+"),!
      // test for one or more numbers followed by 0 or 1 ^ characters, apply test repeatedly
  WRITE $MATCH("$1,234,567,890.99","\$([0-9]+,?)+\.\d\d")
      // test for $, then numbers followed by 0 or 1 comma, then decimal point, then 2 fractional digits

Note:

Because grouping constructs apply a regular expression repeatedly, it is possible to create a matching operation that takes a long time to complete.

The following cautionary example shows how the execution time for a repeatedly applied grouping construct increases rapidly depending on the position of the pattern match error in the string. The more permutations that must be tested before declaring a non-match, the longer the execution time:

  SET a=$ZHOROLOG
    WRITE $MATCH("1111111111,2222222222,3333333333","([0-9]+,?)+")
    SET b=$ZHOROLOG-a
    WRITE " duration: ",b,!
  SET a=$ZHOROLOG
    WRITE $MATCH("11111x11111,2222222222,3333333333","([0-9]+,?)+")
    SET b=$ZHOROLOG-a
    WRITE " duration: ",b,!
  SET a=$ZHOROLOG
    WRITE $MATCH("1111111111,22x22222222,3333333333","([0-9]+,?)+")
    SET b=$ZHOROLOG-a
    WRITE " duration: ",b,!
  SET a=$ZHOROLOG
    WRITE $MATCH("1111111111,2222222x222,3333333333","([0-9]+,?)+")
    SET b=$ZHOROLOG-a
    WRITE " duration: ",b,!
  SET a=$ZHOROLOG
    WRITE $MATCH("1111111111,22222222x22,3333333333","([0-9]+,?)+")
    SET b=$ZHOROLOG-a
    WRITE " duration: ",b

Anchor Meta-Characters

An anchor is a meta-character that limits the regular expression match associated with it to a particular place in the match string. For example, a match can only occur at the beginning or end of the string, or after a space character in the string.

String Beginning or End

These anchors limit matching to the beginning or end of the string.

^ \A	Beginning of string anchor prefix. Indicates that the regular expression match must occur at the beginning of the string.
$	End of string anchor suffix. Indicates that the regular expression match must occur at the end of the string. End-of-line characters (ASCII 10, 11, 12, or 13) are ignored. Same as \Z.
\Z	End of string anchor suffix. Indicates that the regular expression match must occur at the end of the string. End-of-line characters (ASCII 10, 11, 12, or 13) are ignored. Same as $.
\z	End of string anchor suffix. Indicates that the regular expression match must occur at the end of the string. End-of-line characters (ASCII 10, 11, 12, or 13) are treated as string characters for matching.

The following example shows how a beginning of string anchor limits a $LOCATE match:

   SET str="ABCDEFG"
   WRITE $LOCATE(str,"A"),!   // returns 1
   WRITE $LOCATE(str,"D"),!   // returns 4
   WRITE $LOCATE(str,"^A"),!  // returns 1
   WRITE $LOCATE(str,"^D"),!  // returns 0 (no match)

The following example shows how an end of string anchor limits a $LOCATE match:

   SET str="ABCDABCD"
   WRITE $LOCATE(str,"(ABC)"),!   // returns 1
   WRITE $LOCATE(str,"D"),!       // returns 4
   WRITE $LOCATE(str,"(ABC)$"),!  // returns 0 (no match)
   WRITE $LOCATE(str,"(ABCD)$"),! // returns 5
   WRITE $LOCATE(str,"D$"),!      // returns 8

The following example shows how end-of-string anchors handle a line feed character:

   SET str="ABCDEFG"_$CHAR(10)

   WRITE $LOCATE(str,"G$"),!                   // returns 7
   WRITE $LOCATE(str,"G"_$CHAR(10)_"$"),!      // returns 7
   WRITE $LOCATE(str,$CHAR(10)_"$"),!!         // returns 8

   WRITE $LOCATE(str,"G\Z"),!                  // returns 7
   WRITE $LOCATE(str,"G"_$CHAR(10)_"\Z"),!     // returns 7
   WRITE $LOCATE(str,$CHAR(10)_"\Z"),!!        // returns 8

   WRITE $LOCATE(str,"G\z"),!                  // returns 0
   WRITE $LOCATE(str,"G"_$CHAR(10)_"\z"),!     // returns 7
   WRITE $LOCATE(str,$CHAR(10)_"\z"),!         // returns 8

Word Boundary

You can limit matching to occurrences at a word boundary. A word boundary is identified by a word character next to a non-word character, or a word character at the beginning of the string. Word characters are those that match the \w character type: letters, numbers, and the underscore character. Commonly, this is the first letter(s) of a word at the beginning of string or following a space character or other punctuation. The regular expression syntax for a word boundary is:

\b matches an occurrence at a non-word character/word character boundary, or a word character at the beginning of a string.
\B (the inverse) matches an occurrence at a word character/word character boundary, or at a non-word character/non-word character boundary.

The following example uses \b to match word boundaries that begin with the substring “in” or “un”:

  SET str(1)="unlucky"          // match: "un" is at start of string
  SET str(2)="highly unlikely"  // match: "un" follows a space character
  SET str(3)="fall in place"    // match: "in" can be followed by a space
  SET str(4)="the %integer"     // match: % is a non-word character
  SET str(5)="down-under"       // match: - is a non-word character
  SET str(6)="winning"          // no match: "in" preceded by word character
  SET str(7)="the 4instances"   // no match: a number is a word character
  SET str(8)="down_under"       // no match: an underscore is a word character
  FOR i=1:1:8 {
      WRITE $MATCH(str(i),".*\b[iu]n.*")," string",i,!
      }

The following example uses \B to locate the regular expression when it is not at a word boundary:

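   SET str(1)="the thirteenth item"
   WRITE $LOCATE(str(1),"\Bth"),!   // returns 13 ("th" preceded by a word character)
   SET str(2)="the^thirteenth^item"
   WRITE $LOCATE(str(2),"\Bth")     // returns 13 (^ is a non-word character, so the "th" at position 5 is at a word boundary)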

The following example shows how \b and \B can be used in a regular expression that does not specify a word character:

   SET str(1)="this##item"
   WRITE $LOCATE(str(1),"\b#"),!   // returns 5 (the first # at a word boundary)
   WRITE $LOCATE(str(1),"\B#")     // returns 6 (the first # not at a word boundary)

Logical Operators

You can represent compound character types by combining values with logical AND (&&), logical OR (|), and subtract (--) operators. A compound character type must be enclosed in square brackets.

Implicit OR: You can use square brackets without logical operators to specify lists or ranges of matching characters, one of which must be true. The following examples match all uppercase letters and the numbers 1234: [\p{LU}1234] or [[:upper:]1234], [\p{LU}1-4] or [[:upper:]1-4].

AND (&&): You can use logical AND to specify multiple character type meta-characters, both of which must be true. For example, to limit a match to only uppercase Greek letters, you could specify: [\p{LU}&&\p{greek}] or [[:upper:]&&[:greek:]].

OR (|): You can use logical OR to specify multiple character type meta-characters, either of which must be true. For example, to limit a match to either numbers or Greek letters, you could specify: [\p{N}|\p{greek}] or [[:digit:]|[:greek:]]. Note that this use of an explicit OR is optional; a list of character types without logical operators is interpreted as logical OR.

SUBTRACT (--): You can use logical subtract to specify multiple character type meta-characters, the first of which must be true and the second of which must be false. For example, to limit a match to all uppercase letters except Greek letters, you could specify: [\p{LU}--\p{greek}] or [[:upper:]--[:greek:]].
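
The following sketch exercises the AND, subtract, and implicit OR forms. The variable name delta is illustrative, and $CHAR(916) (Greek uppercase delta) assumes a Unicode installation of Caché. The comments state what each compound type is intended to test, based on the descriptions above:

  SET delta=$CHAR(916)                           // Greek uppercase delta
  WRITE $MATCH(delta,"[\p{LU}&&\p{greek}]"),!    // uppercase AND Greek: both conditions hold
  WRITE $MATCH("A","[\p{LU}&&\p{greek}]"),!      // uppercase but not Greek: AND fails
  WRITE $MATCH("A","[\p{LU}--\p{greek}]"),!      // uppercase minus Greek: Latin A remains
  WRITE $MATCH(delta,"[\p{LU}--\p{greek}]"),!    // Greek letters are subtracted
  WRITE $MATCH("7","[\p{N}\p{greek}]"),!         // implicit OR: a number or a Greek character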

Character Representation Meta-Characters

The following are meta-character representations of individual characters. Each sequence matches with a single character.

Note that a few individual control characters ($CHAR(7), $CHAR(9), $CHAR(10), $CHAR(12), $CHAR(13), and $CHAR(27)) can also be represented using a single-letter character type.

Hexadecimal, Octal, and Unicode Representation

\xnn

\x{nnn}

Hexadecimal representation. For example, \x5A is the letter ‘Z’. Note that the hex letters A through F are not case-sensitive. Leading zeros can be included or omitted.

\xnn can be used for one-digit or two-digit hexadecimal numbers. For hexadecimal numbers with more digits you must use the \x{nnn} curly brace syntax, where nnn can be from 1 to 7 hex digits, with a maximum value of 010FFFF. For example, \x{005A} is the letter ‘Z’, \x{396} is the Greek letter zeta.

\0nnn

Octal representation. The nnn value is an octal value of two, three, or four digits; however, the leftmost digit must be a zero. For example, the carriage return character $CHAR(13) can be represented by \015 or \0015. The maximum value is \0377, which is $CHAR(255).

\unnnn

Unicode representation. The nnnn value is a four-digit hexadecimal number corresponding to the Unicode character. For example, \u005A is the letter ‘Z’ ($CHAR(90)); \u03BB is the Greek lowercase lambda ($CHAR(955)).
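
As a sketch, the following lines apply each representation to the same two characters used in the descriptions above, the letter Z and the carriage return:

  WRITE $MATCH("Z","\x5A"),!          // two-digit hexadecimal
  WRITE $MATCH("Z","\x{005A}"),!      // curly brace hexadecimal with leading zeros
  WRITE $MATCH("Z","\u005A"),!        // four-digit Unicode
  WRITE $MATCH($CHAR(13),"\015"),!    // octal for the carriage return
  WRITE $MATCH($CHAR(13),"\x0D"),!    // hexadecimal for the carriage return

Each line should write 1, because every sequence represents the single character being tested.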

Control Character Representation

Control characters are the non-printing ASCII characters $CHAR(0) through $CHAR(31). They can be represented using the following syntax:

\cX

where X is a letter or symbol that corresponds to an ASCII control character (characters 0 through 31). Letters correspond to $CHAR(1) through $CHAR(26). For example, \cH is $CHAR(8), the backspace character. An X letter is not case-sensitive. The non-letter control characters follow the same ASCII character set sequence, as follows: $CHAR(0) = \c@ or \c`, $CHAR(27) = \c{ or \c[, $CHAR(28) = \c| or \c\, $CHAR(29) = \c} or \c], $CHAR(30) = \c^ or \c~, $CHAR(31) = \c_.
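
A brief sketch of the \cX syntax; if the sequences behave as described above, each line should write 1:

  WRITE $MATCH($CHAR(8),"\cH"),!     // backspace
  WRITE $MATCH($CHAR(8),"\ch"),!     // the X letter is not case-sensitive
  WRITE $MATCH($CHAR(13),"\cM"),!    // carriage return
  WRITE $MATCH($CHAR(13),"\r"),!     // the same character as a single-letter character type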

Symbol Name Representation

This character type can be used to match single printable punctuation, space, and symbol characters. The syntax is as follows:

\N{charname}

For example, \N{comma} matches a comma. Note that the meta-character \N must be an uppercase letter.

The supported character names include: acute accent (´), ampersand (&), apostrophe ('), asterisk (*), breve (˘), cedilla (¸), colon (:), comma (,), dagger (†), degree sign (°), division sign (÷), dollar sign ($), double dagger (‡), em dash (—), en dash (–), exclamation mark (!), equals sign (=), full stop (.), grave accent (`), infinity (∞), left curly bracket ({), left parenthesis ((), left square bracket ([), macron (¯), multiplication sign (×), plus sign (+), pound sign (#), prime (′), question mark (?), right curly bracket (}), right parenthesis ()), right square bracket (]), semicolon (;), space ( ), square root (√), tilde (~), vertical line (|). Also supported are subscript zero through subscript nine and superscript zero through superscript nine.
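
A short sketch using two of the character names listed above with illustrative strings; if the names resolve as described, each line should write 1:

  WRITE $MATCH("a=b","\w\N{equals sign}\w"),!    // word character, =, word character
  WRITE $MATCH("1,234","\d\N{comma}\d{3}"),!     // digit, comma, three digits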

Modes

A mode changes the interpretation of the character matches that follow it. The mode is specified by a single lowercase letter. There are two ways to use modes:

Mode for a regular expression sequence. For example: (?i)
Mode for a specified literal within a regular expression. For example: (?i:(fred|ginger))

The following mode characters are supported:

(?i)	Case mode. When active, letter case is disregarded when matching uppercase and lowercase letters to a regular expression.
(?m)	Multi-line mode. Affects the behavior of ^ (beginning of string) and $ (end of string) anchors, when applied to a multi-line string. By default these anchors apply to the entire string. When multi-line mode is active, these anchors apply to the beginning and end of each line within a multi-line string. A line can be begun by any of the newline characters: 10, 11, 12, 13, 133 (and Unicode 8232 and 8233).
(?s)	Single-line mode. When off, the dot (.) wildcard does not match the newline characters: 10, 11, 12, 13, 133 (and Unicode 8232 and 8233). When on, the dot (.) wildcard matches all characters, including newline characters. Note that the pair of characters carriage return ($CHAR(13)) and line feed ($CHAR(10)), when specified in that order, are counted in a regular expression as a single character.
(?x)	Free-spacing mode. Allows for whitespace and trailing comments in a regular expression.

Mode for a Regular Expression Sequence

A regexp mode governs regular expression interpretation from the point where it is applied to the end of the regular expression, or until explicitly turned off. The syntax is as follows:

(?n)  to turn mode on
(?-n) to turn mode off

Where n is a single lowercase letter that specifies the mode type.

The following example shows case mode (?i):

  WRITE $MATCH("A","(?i)[abc]"),!
  WRITE $MATCH("a","(?i)[abc]")
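
As a small sketch of turning a mode off partway through a regular expression (the strings are illustrative), case mode applies here to "ab" but not to the final "c":

  WRITE $MATCH("ABc","(?i)ab(?-i)c"),!   // 1: "AB" matches without regard to case, "c" matches exactly
  WRITE $MATCH("ABC","(?i)ab(?-i)c"),!   // 0: after (?-i) an uppercase "C" does not match "c"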

The following example shows case mode (?i). The first regular expression is case-sensitive. The second regular expression begins with the case mode modifier (?i), which makes the regular expression not case-sensitive:

  SET name(1)="Smith,John"
  SET name(2)="dePaul,Lucius"
  SET name(3)="smith,john"
  SET name(4)="John Smith"
  SET name(5)="Smith,J"
  SET name(6)="R2D2,CP30"
  SET n=1
  WHILE $DATA(name(n)) {
    IF $MATCH(name(n),"\p{LU}\p{LL}+,\p{LU}\p{LL}+")
      { WRITE name(n)," : case match",! }
    ELSEIF $MATCH(name(n),"(?i)\p{LU}\p{LL}+,\p{LU}\p{LL}+")
      { WRITE name(n)," : non-case match",! }
    ELSE { WRITE name(n)," : not a valid name",! }
    SET n=n+1 }

The following example shows single-line mode (?s), which allows ".*" to match a string containing newline characters:

  SET line(1)="This is a string without line breaks."
  SET line(2)="This is a string with"_$CHAR(10)_"one line break."
  SET line(3)="This is a string"_$CHAR(11)_"with"_$CHAR(12)_"two line breaks."
  SET i=1
  WHILE $DATA(line(i)) {
    IF $MATCH(line(i),".*") {WRITE "line(",i,") is a single line string",! }
    ELSEIF $MATCH(line(i),"(?s).*") {WRITE "line(",i,") is a multiline string",! }
    ELSE {WRITE "string error",! }
    SET i=i+1 }

The following example shows that, in single-line mode (?s), the carriage return/line feed pair (in that order) is counted in a regular expression as one character:

  SET str(1)="one"_$CHAR(13)_$CHAR(10)_"two"   // CR/LF
  SET str(2)="one"_$CHAR(10)_$CHAR(13)_"two"   // LF/CR
  SET i=1
  WHILE $DATA(str(i)) {
     WRITE $LENGTH(str(i))," is the length of string ",i,!
     IF $MATCH(str(i),"(?s).{7}") { WRITE "string ",i," matches 7 chars",! }
     ELSEIF $MATCH(str(i),"(?s).{8}") { WRITE "string ",i," matches 8 chars",! }
     ELSE { WRITE "string match error",! }
     SET i=i+1
   }

The following example shows multi-line mode (?m). It locates the substring identified by the end anchor ($). In single-line mode, this end substring is always “break”, the last substring in the string. In multi-line mode the end substring can be any of the substrings that end a line within a multi-line string:

  SET line(1)="String without line break"
  SET line(2)="String with"_$CHAR(10)_" one line break"
  SET line(3)="String"_$CHAR(11)_" with"_$CHAR(12)_" two line break"
  SET i=1
  WHILE $DATA(line(i)) {
    WRITE $LOCATE(line(i),"(String|with|break)$")," line(",i,") in single-line mode",! 
    WRITE $LOCATE(line(i),"(?m)(String|with|break)$")," line(",i,") in multi-line mode",!!
    SET i=i+1 }

Mode for a Literal

You can also apply a mode modifier to a literal (or a set of literals), using the syntax:

(?mode:literal)

This mode modification applies just to the literal(s) within the parentheses.

The following case mode (?i) example matches last names (lname) that begin with de, del, dela, or della, regardless of the capitalization of this prefix. The rest of lname must begin with a capital letter, followed by at least one lowercase letter:

  SET lname(1)="deTour"
  SET lname(2)="DeMarco"
  SET lname(3)="DeLaRenta"
  SET lname(4)="DelCarmine"
  SET lname(5)="dellaRobbia"
  SET i=1
  WHILE $DATA(lname(i)) {
     WRITE $MATCH(lname(i),"(?i:de|del|dela|della)\p{LU}\p{LL}+")," = ",lname(i),!
     SET i=i+1 }

Comments

Within a regular expression you can specify two types of comments:

Embedded comments
Line end comment (in (?x) mode only)

Embedded Comments

You can include embedded comments within a regular expression by using the following syntax:

(?# comment)

The following example shows the use of comments within a regular expression to document that this format match is for an American format date (MM/DD/YYYY), not a European format date (DD/MM/YYYY):

   WRITE $MATCH("04/28/2012","^[01]\d(?# months)/[0123]\d(?# days)/\d\d\d\d$")

Line End Comment

When free-spacing mode (?x) is in effect, you can include a comment at the end of a regular expression using the following syntax:

# comment

The following example shows an end comment in free-spacing mode:

   WRITE $MATCH("04/28/2012","^[01]\d/[0123]\d/\d\d\d\d$")," no comment",!
   WRITE $MATCH("04/28/2012","^[01]\d/[0123]\d/\d\d\d\d$# date test")," comment no (?x) mode",!
   WRITE $MATCH("04/28/2012","(?x)^[01]\d/[0123]\d/\d\d\d\d$# date test")," comment in (?x) mode",!

In free-spacing mode, whitespace can be included within the regular expression.
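
The following sketch shows the same date test written with spaces between its parts. Because whitespace in the regular expression is disregarded in this mode, the second line uses the \x20 hexadecimal representation where a literal space must be matched (the string "a b" is illustrative):

  WRITE $MATCH("04/28/2012","(?x) ^ [01]\d / [0123]\d / \d\d\d\d $  # date test"),!   // 1
  WRITE $MATCH("a b","(?x) a \x20 b"),!                                               // 1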

Error Messages

An improperly specified regexp generates a <REGULAR EXPRESSION> error. To determine the type of error, you can invoke the LastStatus() method of the %Regex.Matcher class, as shown in the following example:

  TRY {
    WRITE "TRY block:",!
    WRITE $MATCH("A","\p{LU}"),!  // good regexp
    WRITE $MATCH("A","\p{}"),!    // bad regexp
  }
  CATCH exp {
    WRITE !,"CATCH block exception handler:",!
    IF 1=exp.%IsA("%Exception.SystemException") {
      WRITE "System exception",!
      WRITE "Name: ",$ZCVT(exp.Name,"O","HTML"),!
      WRITE "Location: ",exp.Location,!
      WRITE "Code: ",exp.Code,!! }
    ELSE {WRITE "Unexpected exception type",!  RETURN }
    WRITE "%Regex.Matcher status:"
    DO $SYSTEM.Status.DisplayError(##class(%Regex.Matcher).LastStatus())
    RETURN
  }

For a list of these errors, refer to General Error Messages 8300 through 8352 in the Caché Error Reference.