%Regex.Matcher

class %Regex.Matcher extends %Library.RegisteredObject, %SYSTEM.Help

The %Regex.Matcher class creates an object that performs pattern matching using regular expressions. The regular expression support comes from the International Components for Unicode (ICU) library. The ICU project maintains web pages at https://icu.unicode.org.

The definition and features of the ICU regular expression package are described at https://unicode-org.github.io/icu/userguide/strings/regexp.html.

On most platforms, installing InterSystems IRIS also installs an appropriate version of the ICU libraries. On platforms that do not have an ICU library available, evaluating any regular expression function or method results in an <UNIMPLEMENTED> error.

A %Regex.Matcher object can be created by evaluating
##class(%Regex.Matcher).%New(pattern) or
##class(%Regex.Matcher).%New(pattern,text).
The first parameter to %New() becomes the initial value of the property Pattern. The optional second parameter to %New() becomes the initial value of the property Text. Setting the property Pattern to a regular expression pattern string causes that pattern to be compiled into the Matcher object, where it can be used for multiple matching operations without being recompiled. The property Text contains the subject text string that is searched by a regular expression match. Note that an empty string is considered an illegal regular expression, so the first parameter to %New() can be neither missing nor the empty string.
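
The compile-once behavior described above can be sketched as follows. This is a minimal illustration rather than part of the class: the routine name CountDigits and its $LISTBUILD argument are hypothetical, and every list element is assumed to be defined.

```objectscript
CountDigits(list) PUBLIC {
    ; Hypothetical helper: list is a $LISTBUILD of strings.
    ; The pattern is compiled once by %New() and reused for every item.
    set m = ##class(%Regex.Matcher).%New("\d")
    set total = 0
    for i=1:1:$listlength(list) {
        ; Assigning Text resets the match state but keeps the compiled pattern
        set m.Text = $list(list,i)
        while m.Locate() { set total = total + 1 }
    }
    quit total
}
```

Called with $listbuild("a1b22","x","333"), this sketch counts six digits in total.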

If x is a %Regex.Matcher object then the built-in method %ConstructClone() can be used to copy x (Set xnew = x.%ConstructClone()). The state of the most recent match and any error value in the Status property are not cloned. The %ConstructClone() method can be faster than creating a new Matcher with the same Pattern, because it can copy the instructions for the matching engine rather than recompiling the original pattern string. On 8-bit systems %ConstructClone() can copy the Unicode versions of the Pattern and Text properties without needing to do the character-by-character conversion from the NLS 8-bit character set into Unicode.
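
A short sketch of cloning, assuming only the behavior described above. The routine name CloneDemo and the sample strings are hypothetical.

```objectscript
CloneDemo() PUBLIC {
    set x = ##class(%Regex.Matcher).%New("\b\w+\b","alpha beta")
    ; xnew receives copies of Pattern and Text; match state is not copied
    set xnew = x.%ConstructClone()
    ; The clone is independent, so changing its Text leaves x untouched
    set xnew.Text = "gamma delta"
    if x.Locate() { write x.Group,! }
    if xnew.Locate() { write xnew.Group,! }
}
```

Each matcher searches its own Text, so the first line written should come from "alpha beta" and the second from "gamma delta".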

None of the methods or operations in the %Regex.Matcher package return a %Status value. When an error is detected, these operations always throw the system exception thrown by the kernel code that interfaces to the ICU library. If a program wants to recover from a regular expression error then it is recommended that the code doing regular expression operations be surrounded with a TRY {...} block and that the error recovery be done in the corresponding CATCH {...} block. Note that a TRY block imposes no run-time performance overhead in situations where no error occurs.

The methods and operations in a %Regex.Matcher object will catch any <REGULAR EXPRESSION> system error and will generate a %Status value that may better describe that error. That %Status value is stored in the Status property of the %Regex.Matcher object and in the variable %objlasterror. After saving the %Status value, the original unmodified <REGULAR EXPRESSION> system exception is rethrown. You may examine that %Status value by executing the following InterSystems IRIS ObjectScript command:
do $system.Status.DisplayError(%objlasterror)
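
The recommended TRY/CATCH recovery might look like the sketch below. The routine name BadPattern is hypothetical, and the pattern "(" is chosen only because an unbalanced parenthesis is expected to be rejected when the pattern is compiled.

```objectscript
BadPattern() PUBLIC {
    try {
        ; "(" is an unbalanced parenthesis, so compiling it is expected to fail
        set m = ##class(%Regex.Matcher).%New("(")
        write "Pattern compiled",!
    } catch ex {
        ; ex.Name holds the name of the system error that was thrown
        write "Caught ",ex.Name,!
        ; %New() did not return an oref, so look at %objlasterror instead
        if $data(%objlasterror) { do $system.Status.DisplayError(%objlasterror) }
    }
}
```

When the error comes from an existing matcher object rather than from %New(), the same details should also be available in that object's Status property.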

Some other system errors, like <STRING STACK>, are passed through the %Regex.Matcher methods without modification.

Note that some ICU operation errors are not considered errors by the %Regex.Matcher package. One example is evaluating the Start and End properties when the previous matching operation failed. In that case Start and End have the value -2 as a character position rather than throwing an error.

Examples:

Regular expression that finds titles M., Mr., Mrs. and Ms. in a string: "\bMr?s?\."
"\b" matches a break at the beginning (or ending) of a word
"M" matches an upper-case letter-M
"r?" matches 0 or 1 occurrences of a lower-case letter-r
"s?" matches 0 or 1 occurrences of a lower-case letter-s
"\." matches a period character

USER>set matcher=##class(%Regex.Matcher).%New("\bMr?s?\.")                             
USER>set matcher.Text="Mrs. Sally Jones, Mr. Mike McMurry, Ms. Amy Johnson, M. Maurice LaFrance"
USER>while matcher.Locate() {write "Found ",matcher.Group," at position ",matcher.Start,!}      
Found Mrs. at position 1
Found Mr. at position 19
Found Ms. at position 37
Found M. at position 54
USER>write matcher.ReplaceAll("Dr.")
Dr. Sally Jones, Dr. Mike McMurry, Dr. Amy Johnson, Dr. Maurice LaFrance
USER>write matcher.ReplaceFirst("Dr.")
Dr. Sally Jones, Mr. Mike McMurry, Ms. Amy Johnson, M. Maurice LaFrance

Regular expression that matches phone numbers of the form "(aaa) bbb-cccc" or of the form "aaa-bbb-cccc": (\((\d{3})\)\s*|\b(\d{3})-)(\d{3})-(\d{4})\b

(\((\d{3})\)\s*|\b(\d{3})-) matches either prefix "(aaa) " or prefix "aaa-". The outer parentheses capture this entire prefix as Group(1) and limit the scope of the | operator to the two alternative prefix subpatterns.

\((\d{3})\)\s* matches prefix "(aaa) "
\( and \) and \s* match "(" and ")" and zero or more spaces, respectively
\d{3} matches exactly 3 digits
(\d{3}) the parentheses capture these 3 digits as Group(2)

\b(\d{3})- matches prefix "aaa-"
\b this "break" allows no other digit or letter immediately before the 3 digits
(\d{3}) captures these 3 digits as Group(3)

(\d{3})- matches "bbb-" and captures these 3 digits as Group(4)

(\d{4}) matches "cccc" and captures these 4 digits as Group(5)

\b this final "break" makes sure the match is not immediately followed by another digit or a letter

ListPhones(s,a) PUBLIC {
    ; a is a reference variable.  On return
    ; a contains the number of phone numbers in string s
    ; a(i) contains just the digits of the i'th phone number
    kill a
    set a = 0
    set m=##class(%Regex.Matcher).%New("(\((\d{3})\)\s*|\b(\d{3})-)(\d{3})-(\d{4})\b")
    set m.Text = s
    while m.Locate() {
        ; Get first three digits from Group(2) or Group(3)
        if m.Start(2)>0 { set n=m.Group(2) }
        else { set n=m.Group(3) }
        ; Concatenate middle 3 digits and final 4 digits
        set n = n_m.Group(4) _ m.Group(5)
        ; Insert digit string into array a
        set a($increment(a)) = n
    }
}

ListPhones2(s,a) PUBLIC {
    ; a is a reference variable.  On return
    ; a contains the number of phone numbers in string s
    ; a(i) is i'th phone number formatted as "(aaa)bbb-cccc"
    ; Note, no blank after "(aaa)"
    kill a
    set a = 0
    set m=##class(%Regex.Matcher).%New("(\((\d{3})\)\s*|\b(\d{3})-)(\d{3})-(\d{4})\b")
    set m.Text = s
    while m.Locate() {
        ; Digits are concatenation of capture groups 2,3,4,5
        ; One of group 2 or 3 is the empty string when group is not used
        set a($increment(a)) = m.SubstituteIn("($2$3)$4-$5")
    }
}

USER>write ^t2
Call 617-555-1212 about item number 61773-333-4569
USER>do ListPhones^ListPhones(^t2,.a)
USER>zwrite a
a=1
a(1)=6175551212

USER>write ^t3
Phone (212) 334-5397, (321)770-2121 and 603-646-0110
USER>do ListPhones^ListPhones(^t3,.a)
USER>zwrite a
a=3
a(1)=2123345397
a(2)=3217702121
a(3)=6036460110

USER>write ^t3
Phone (212) 334-5397, (321)770-2121 and 603-646-0110
USER>do ListPhones2^ListPhones(^t3,.a)
USER>zwrite a                         
a=3
a(1)="(212)334-5397"
a(2)="(321)770-2121"
a(3)="(603)646-0110"

Properties

property End as %Integer [ MultiDimensional , ReadOnly ];
The property End without a subscript contains the character position in property Text one beyond the final character of the string found by the last match.

The value of End(i) when subscripted with an integer i between 1 and GroupCount is the character position one beyond the last character of the last string successfully captured by capture group i.

The value of End(i) is -1 if capture group i did not participate in the last match. The values of End and End(i) are -2 if the last match attempt failed.

Note: In addition to integer subscripts between 1 and GroupCount, the value of End(0) is identical to the value of End without a subscript. When the property End(...) is subscripted with a value not described above, the result of evaluating End(...) is undefined.

Property methods: EndDisplayToLogical(), EndIsValid(), EndLogicalToDisplay(), EndNormalize()
property Group as %String [ MultiDimensional , ReadOnly ];
The property Group without a subscript contains the string found by the last match.

The value of Group(i) when subscripted with an integer i between 1 and GroupCount is the last string successfully captured by capture group i.

If the last match operation was unsuccessful or if the specified capture group was not used during the last match operation then Group and Group(i) contain the empty string. Note that Start, Start(i), End and End(i) have negative values when the last match operation did not use the specified capture group or did not succeed in matching.

Note: In addition to integer subscripts between 1 and GroupCount, the value of Group(0) is identical to the value of Group without a subscript. When the property Group(...) is subscripted with a value not described above, the result of evaluating Group(...) is undefined.

Property methods: GroupDisplayToLogical(), GroupIsValid(), GroupLogicalToDisplay(), GroupLogicalToOdbc(), GroupNormalize()
property GroupCount as %Integer [ ReadOnly ];
The property GroupCount contains the number of capturing groups in the regular expression Pattern.
Property methods: GroupCountDisplayToLogical(), GroupCountIsValid(), GroupCountLogicalToDisplay(), GroupCountNormalize()
property HitEnd as %Boolean [ Calculated , ReadOnly ];
The property HitEnd is true if the most recent matching operation touched the end of property Text at any point during its processing. In this case, appending additional input characters to the Text property could change the result of that match attempt.
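
As an illustration of the description above, the sketch below runs the same pattern against two hypothetical subject strings. In the first, the digits run up to the end of Text, so appending more digits could lengthen the match and HitEnd would be expected to be true. In the second, the match stops before the end. The routine name HitEndDemo is hypothetical.

```objectscript
HitEndDemo() PUBLIC {
    set m = ##class(%Regex.Matcher).%New("\d+")
    for text="id 123","id 123 x" {
        set m.Text = text
        ; Report the matched digits and whether the engine reached the end of Text
        if m.Locate() { write m.Group," HitEnd=",m.HitEnd,! }
    }
}
```
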
Property methods: HitEndDisplayToLogical(), HitEndIsValid(), HitEndLogicalToDisplay(), HitEndNormalize()
property OperationLimit as %Integer;
The property OperationLimit provides a way to limit the time taken by a regular expression match. The default value for OperationLimit is 0 which indicates that there is no limit. Setting OperationLimit to a positive integer will cause a match operation to signal a TimeOut error after the specified number of clusters of steps by the match engine.

Correspondence with actual processor time will depend on the speed of the processor and the details of the specific pattern, but the cluster size is chosen such that each cluster's execution time will typically be on the order of milliseconds.
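
A sketch of guarding against a runaway match. It assumes that the nested quantifiers in the hypothetical pattern "(a+)+b" cause heavy backtracking when the text contains no "b", and the limit of 100 is an arbitrary illustrative value. The routine name LimitDemo is also hypothetical.

```objectscript
LimitDemo() PUBLIC {
    ; Nested quantifiers can backtrack heavily when no "b" follows the a's
    set m = ##class(%Regex.Matcher).%New("(a+)+b")
    set m.OperationLimit = 100
    ; Build a subject string of 40 "a" characters
    set m.Text = $translate($justify("",40)," ","a")
    try {
        write "Match: ",m.Match(),!
    } catch ex {
        ; A limit that is exceeded surfaces as a thrown system error
        write "Stopped: ",ex.Name,!
        do $system.Status.DisplayError(m.Status)
    }
}
```

If the engine finishes within the limit, the TRY block simply reports the match result and the CATCH block never runs.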

Property methods: OperationLimitDisplayToLogical(), OperationLimitGet(), OperationLimitIsValid(), OperationLimitLogicalToDisplay(), OperationLimitNormalize()
property Pattern as %String;
The property Pattern is the string representation of the regular expression of the Matcher. Assigning to Pattern resets all saved state concerning the last matching operation.

On an installation using an NLS 8-bit character set other than Latin-1, you must be careful with patterns that use a character class of the form [x-y] where x or y is a national usage character not in Latin-1. All regular expression matching is done in Unicode, so the characters x and y are converted to Unicode. The character class [x-y] represents all characters between the Unicode translations of x and y, not the NLS 8-bit characters between x and y.

Property methods: PatternDisplayToLogical(), PatternGet(), PatternIsValid(), PatternLogicalToDisplay(), PatternLogicalToOdbc(), PatternNormalize()
property Start as %Integer [ MultiDimensional , ReadOnly ];
The property Start without a subscript contains the character position in property Text of the first character of the string found by the last match. If the matched string is the empty string then Start is the character position one beyond where the empty string was located (and the property Start equals the property End.)

The value of Start(i) when subscripted with an integer i between 1 and GroupCount is the character position of the first character of the last string successfully captured by capture group i. If the captured string is the empty string then Start(i) is the character position one beyond where the empty string was captured (and the property Start(i) equals the property End(i)).

The value of Start(i) is -1 if capture group i did not participate in the last match. The values of Start and Start(i) are -2 if the last match attempt failed.

Note: In addition to integer subscripts between 1 and GroupCount, the value of Start(0) is identical to the value of Start without a subscript. When the property Start(...) is subscripted with a value not described above, the result of evaluating Start(...) is undefined.
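
The subscripted forms of Group, Start and End can be inspected together, as in this sketch. The routine name ShowGroups, the pattern and the text are hypothetical. Only one of the two alternatives can participate in a given match, so one capture group is expected to report -1.

```objectscript
ShowGroups() PUBLIC {
    set m = ##class(%Regex.Matcher).%New("(a)|(b)","xb")
    if m.Locate() {
        ; Subscript 0 refers to the whole match, 1..GroupCount to capture groups
        for i=0:1:m.GroupCount {
            write "Group(",i,")=""",m.Group(i),""" Start=",m.Start(i)," End=",m.End(i),!
        }
    }
}
```

Following the descriptions above, the match of "b" should give Start 2 and End 3 for subscripts 0 and 2, while group 1 should show an empty string with Start and End of -1.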

Property methods: StartDisplayToLogical(), StartIsValid(), StartLogicalToDisplay(), StartNormalize()
property Status as %Status;
The property Status contains a %Status value which may provide more information about the last system exception thrown by this object. It is initially $$$OK. Its value remains unchanged by any successful operation. The Status property is changed only when an error is thrown by the kernel functions implementing %Regex.Matcher or by an ObjectScript Set assignment to the Status property done by the user.
Property methods: StatusGet(), StatusIsValid(), StatusLogicalToOdbc(), StatusSet()
property Text as %String;
The property Text is the string to which the regular expression will be applied. Assigning to Text resets all saved state resulting from the most recent match operation. On installations using an 8-bit character code, the internal representation of Text is converted to Unicode. Therefore, on an installation using 8-bit characters the maximum length of the Text property is only half the maximum string length supported by that installation.
Property methods: TextDisplayToLogical(), TextGet(), TextIsValid(), TextLogicalToDisplay(), TextLogicalToOdbc(), TextNormalize()

Methods

method EndGet(group As %Integer = 0) as %Integer
The EndGet method implements the End property.
method GroupCountGet() as %Integer
The GroupCountGet method implements the GroupCount property.
method GroupGet(group As %Integer = 0) as %String
The GroupGet method implements the Group property.
method HitEndGet() as %Boolean
The HitEndGet method implements the HitEnd property.
classmethod LastStatus() as %Status
The class method LastStatus returns the %Status value containing additional details about the most recent <REGULAR EXPRESSION> system error. If a %Regex.Matcher object encounters a <REGULAR EXPRESSION> error then this status is already available in the Status property of the object. Executing
Do $SYSTEM.Status.DisplayError(##class(%Regex.Matcher).LastStatus())
is useful when debugging a <REGULAR EXPRESSION> error following a call on $MATCH, $LOCATE or ##class(%Regex.Matcher).%New(x) where a %Regex.Matcher oref value is not available.
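
For instance, a failing $MATCH call could be diagnosed as in the sketch below. The routine name MatchDebug is hypothetical, and the pattern "[a-" is used only because an unterminated character class is expected to be rejected.

```objectscript
MatchDebug() PUBLIC {
    try {
        ; "[a-" is an unterminated character class, so an error is expected
        set ok = $match("abc","[a-")
        write "Result: ",ok,!
    } catch ex {
        write "Caught ",ex.Name,!
        ; No matcher oref exists here, so ask the class for the saved status
        do $SYSTEM.Status.DisplayError(##class(%Regex.Matcher).LastStatus())
    }
}
```
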
method Locate(position As %Integer) as %Boolean
The method Locate finds a match for the regular expression Pattern in the text string Text.

If the optional argument position is defined as an integer 1 or greater then the search for a match begins at that character position of Text.

If the argument position is not defined then the search for the match begins at the character position following the previous match.

Locate returns 1 if the match is found; 0 otherwise.
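
The two ways of calling Locate can be combined, as in this sketch. The routine name LocateDemo and the sample text are hypothetical.

```objectscript
LocateDemo() PUBLIC {
    set m = ##class(%Regex.Matcher).%New("\d+","a1 b22 c333")
    ; Start searching at character position 4, which skips the "1"
    if m.Locate(4) { write m.Group," at ",m.Start,! }
    ; With no argument, continue after the previous match
    if m.Locate() { write m.Group," at ",m.Start,! }
}
```

This should write "22 at 5" followed by "333 at 9".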

method LookingAt(position As %Integer = 1) as %Boolean
The method LookingAt attempts to find a match in the property Text that must start at a particular character position. The match need not extend to the end of Text.

The argument position gives the starting character position of the attempted match.

LookingAt returns 1 if the match is found; 0 otherwise.

method Match(text As %String) as %Boolean
The method Match returns true if the entire string Text is matched by Pattern; it returns false if it does not match.

The argument text is optional. If the argument text is defined then the property Text is set to its value before the match is executed.
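
The difference between Match and LookingAt can be seen by applying one pattern to several strings. This is a sketch with a hypothetical routine name and hypothetical inputs.

```objectscript
AnchorDemo() PUBLIC {
    set m = ##class(%Regex.Matcher).%New("\d{3}")
    write m.Match("123"),!     ; the whole text is exactly three digits
    write m.Match("1234"),!    ; a fourth digit is left over
    write m.LookingAt(1),!     ; Text is still "1234"; the match need not reach the end
    set m.Text = "ab123"
    write m.LookingAt(1),!     ; position 1 holds "a", not a digit
    write m.LookingAt(3),!     ; three digits begin at position 3
}
```

The values written should be 1, 0, 1, 0 and 1, in that order.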

method OperationLimitSet(limit) as %Status
The OperationLimitSet method implements the side effects of doing a Set assignment to change the value of the OperationLimit property.
method PatternSet(pattern As %String) as %Status
The PatternSet method implements Set assignments to the Pattern property.
method ReplaceAll(replacement As %String) as %String
The method ReplaceAll returns a modified copy of the property Text. It replaces every substring of Text that matches the Pattern with a replacement string. Portions of Text that are not matched are copied without change. The value of ReplaceAll is the resulting string. The property Text is not modified.

The argument replacement supplies the string to replace each matched region. The replacement string may contain references to capture groups which take the form of $1, $2, etc. The replacement string may reference the entire matched region with $0.

method ReplaceFirst(replacement As %String) as %String
The method ReplaceFirst returns a modified copy of the property Text. It replaces the first substring of Text that matches the Pattern with a replacement string. Portions of Text that are not matched are copied without change. The value of ReplaceFirst is the resulting string. The property Text is not modified.

The argument replacement supplies the string to replace the matched region. The replacement string may contain references to capture groups which take the form of $1, $2, etc. The replacement string may reference the entire matched region with $0.
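
A brief sketch of both replacement methods using capture group references. The routine name ReplaceDemo, the date pattern and the sample text are hypothetical.

```objectscript
ReplaceDemo() PUBLIC {
    set m = ##class(%Regex.Matcher).%New("(\d{4})-(\d{2})-(\d{2})")
    set m.Text = "Due 2024-12-31, renew 2025-01-15"
    ; Reorder year-month-day into day/month/year for every match
    write m.ReplaceAll("$3/$2/$1"),!
    ; Wrap only the first match in brackets; $0 is the whole matched region
    write m.ReplaceFirst("[$0]"),!
    ; Text itself is unchanged by either call
    write m.Text,!
}
```

The three lines written should be "Due 31/12/2024, renew 15/01/2025", then "Due [2024-12-31], renew 2025-01-15", then the original text.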

method RequiredPrefixGet() as %String
The RequiredPrefixGet method implements the RequiredPrefix property.
method ResetPosition(position As %Integer = 1)
The method ResetPosition resets any saved state from the previous match. It also causes the next call to the method Locate() without an argument to begin at the specified character position.

The argument position is the character position from which the next call to Locate() without an argument will begin match attempts.
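
For example, a second pass over part of the text can be started with ResetPosition, as in this sketch. The routine name ResetDemo and the sample text are hypothetical.

```objectscript
ResetDemo() PUBLIC {
    set m = ##class(%Regex.Matcher).%New("\w+","one two three")
    ; First pass visits every word
    while m.Locate() { write m.Group,! }
    ; Rewind so that the next argument-less Locate() starts at position 5
    do m.ResetPosition(5)
    ; Second pass starts at "two"
    while m.Locate() { write m.Group,! }
}
```

The first loop should write one, two and three. The second should write only two and three.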

method StartGet(group As %Integer = 0) as %Integer
The StartGet method implements the Start property.
method SubstituteIn(text As %String) as %String
The method SubstituteIn returns the string that results from substituting capture groups from the most recent regular expression match into the argument text. The result of this method is undefined if the most recent regular expression match operation was not successful.

This method can be used as a low-level step in regular expression replacement. It does not modify the property Text. For example, the method ..ReplaceFirst(x) is equivalent to:

Quit:'..Locate(1) ..Text
Quit $Extract(..Text,1,..Start-1)_..SubstituteIn(x)_$Extract(..Text,..End,*)

The argument text supplies the template string into which the capture groups of the matched region are substituted before it is returned. The string may contain references to capture groups which take the form of $1, $2, etc. The string may reference the entire matched region with $0.
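
A small sketch of SubstituteIn used on its own. The routine name SubstituteDemo, the pattern and the text are hypothetical. The call is guarded by the result of Locate() because the method is undefined after an unsuccessful match.

```objectscript
SubstituteDemo() PUBLIC {
    set m = ##class(%Regex.Matcher).%New("(\w+)@(\w+)\.com","mail joe@example.com today")
    ; Only substitute when the preceding match succeeded
    if m.Locate() { write m.SubstituteIn("user=$1 host=$2 all=$0"),! }
}
```

This should write "user=joe host=example all=joe@example.com", and the property Text is left unchanged.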

method TextSet(text As %String) as %Status
The TextSet method implements Set assignments to the Text property.
