%Regex.Matcher
class %Regex.Matcher extends %Library.RegisteredObject, %SYSTEM.Help
The class %Regex.Matcher creates an object that performs pattern matching using regular expressions. The regular expression engine comes from the International Components for Unicode (ICU). The ICU project maintains web pages at https://icu.unicode.org. The definition and features of the ICU regular expression package are described at https://unicode-org.github.io/icu/userguide/strings/regexp.html.
On most platforms, installing InterSystems IRIS will also install an appropriate version of the ICU libraries. On platforms that do not have an ICU library available, evaluating any regular expression function or method will result in an <UNIMPLEMENTED> error.
A %Regex.Matcher object can be created by evaluating
##class(%Regex.Matcher).%New(pattern) or
##class(%Regex.Matcher).%New(pattern,text).
The first parameter to %New() becomes the initial value
of the property Pattern. The optional second parameter
to %New() becomes the initial value of the property Text. Setting
property Pattern to a regular expression pattern string
causes that pattern to be compiled into the
Matcher object, where it can be used for multiple matching operations
without being recompiled. The property Text contains the
subject text string that is searched by a regular expression match.
Note that an empty string is considered an illegal regular
expression, so the first parameter to %New() can be neither missing nor the
empty string.
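As a minimal sketch of the two construction forms described above (the variable names m1 and m2 and the sample strings are illustrative only):

```objectscript
 ; Form 1: supply only the pattern, then assign the subject text afterwards
 set m1 = ##class(%Regex.Matcher).%New("\d+")
 set m1.Text = "order 66, aisle 7"
 while m1.Locate() { write m1.Group,! }

 ; Form 2: supply the pattern and the subject text together
 set m2 = ##class(%Regex.Matcher).%New("\d+","order 66, aisle 7")
 while m2.Locate() { write m2.Group,! }
```

Both loops should visit the same runs of digits, since the two forms differ only in when the property Text is assigned.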
If x is a %Regex.Matcher object then the built-in method %ConstructClone() can be used to copy x (Set xnew = x.%ConstructClone()). The state of the most recent match and any error value in the Status property are not cloned. The %ConstructClone() method can be faster than creating a new Matcher with the same Pattern because it can copy the compiled instructions for the matching engine rather than recompiling the original pattern string. On 8-bit systems, %ConstructClone() can also copy the Unicode versions of the Pattern and Text properties without repeating the character-by-character conversion from the NLS 8-bit character set into Unicode.
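The following sketch shows one way a clone might be used: the pattern is compiled once and the copy is pointed at a different subject string. The sample strings are illustrative assumptions.

```objectscript
 ; Compile the pattern once
 set x = ##class(%Regex.Matcher).%New("\bMr?s?\.","Mr. Smith")

 ; Copy the matcher instead of compiling the same pattern again
 set xnew = x.%ConstructClone()

 ; Give the clone its own subject text; x keeps its original Text
 set xnew.Text = "Ms. Jones"

 if x.Locate() { write "x found ",x.Group,! }
 if xnew.Locate() { write "xnew found ",xnew.Group,! }
```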
None of the methods or operations in the %Regex.Matcher package return a %Status value. When an error is detected, these operations always throw the system exception thrown by the kernel code that interfaces to the ICU library. If a program wants to recover from a regular expression error then it is recommended that the code doing regular expression operations be surrounded with a TRY {...} block and that the error recovery be done in the corresponding CATCH {...} block. Note that a TRY block imposes no run-time performance overhead in situations where no error occurs.
The methods and operations in a %Regex.Matcher object will catch any
<REGULAR EXPRESSION> system error and will generate a %Status value
that may better describe that error. That %Status value will be stored
in the Status property of the %Regex.Matcher object and in the
variable %objlasterror. After saving the %Status value, the
original unmodified
<REGULAR EXPRESSION> system exception will be rethrown. You may
examine that %Status value by executing the following InterSystems IRIS
ObjectScript command:
do $system.Status.DisplayError(%objlasterror)
Some other system errors, like <STRING STACK>, are passed through the %Regex.Matcher methods without modification.
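A minimal sketch of the recommended TRY/CATCH pattern follows. The pattern "(" is deliberately malformed (an unbalanced parenthesis) and is an illustrative assumption; the variable name ex is arbitrary.

```objectscript
 try {
     ; The unbalanced parenthesis should make pattern compilation fail
     set m = ##class(%Regex.Matcher).%New("(")
     write "Pattern compiled",!
 } catch ex {
     ; ex is the rethrown system exception
     write "Caught ",ex.Name,!
     ; %objlasterror holds the more descriptive %Status saved before the rethrow
     do $system.Status.DisplayError(%objlasterror)
 }
```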
Note that some ICU operation errors are not considered errors by the %Regex.Matcher package. Examples are evaluating the Start and End properties when the previous matching operation failed. In these cases Start and End have value -2 as a character position rather than throwing an error.
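As a small illustration of this behavior (the pattern and text are arbitrary), reading Start and End after a failed Locate() does not throw:

```objectscript
 ; "xyz" does not occur in "abc", so Locate() returns 0
 set m = ##class(%Regex.Matcher).%New("xyz","abc")
 if 'm.Locate() {
     ; Per the description above, both values should be -2 rather than an error
     write "Start=",m.Start," End=",m.End,!
 }
```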
Examples:
Regular expression that finds titles M., Mr., Mrs. and Ms. in a string: "\bMr?s?\."
"\b" matches a break at the beginning (or ending) of a word
"M" matches an upper-case letter-M
"r?" matches 0 or 1 occurrences of a lower-case letter-r
"s?" matches 0 or 1 occurrences of a lower-case letter-s
"\." matches a period character
USER>set matcher=##class(%Regex.Matcher).%New("\bMr?s?\.")
USER>set matcher.Text="Mrs. Sally Jones, Mr. Mike McMurry, Ms. Amy Johnson, M. Maurice LaFrance"
USER>while matcher.Locate() {write "Found ",matcher.Group," at position ",matcher.Start,!}
Found Mrs. at position 1
Found Mr. at position 19
Found Ms. at position 37
Found M. at position 54
USER>write matcher.ReplaceAll("Dr.")
Dr. Sally Jones, Dr. Mike McMurry, Dr. Amy Johnson, Dr. Maurice LaFrance
USER>write matcher.ReplaceFirst("Dr.")
Dr. Sally Jones, Mr. Mike McMurry, Ms. Amy Johnson, M. Maurice LaFrance
Regular expression that matches phone numbers of the form "(aaa) bbb-cccc"
or of the form "aaa-bbb-cccc": (\((\d{3})\)\s*|\b(\d{3})-)(\d{3})-(\d{4})\b
(\((\d{3})\)\s*|\b(\d{3})-) matches either
prefix "(aaa) " or prefix "aaa-". The outer
parentheses capture this entire prefix as Group(1) and limit the range of
the two prefix subpatterns joined in alternation by the | operator.
\((\d{3})\)\s* matches prefix "(aaa) "
\( and \) and \s* match "(" and ")" and zero or more spaces, respectively
\d{3} matches exactly 3 digits
(\d{3}) the parentheses capture these 3 digits as Group(2)
\b(\d{3})- matches prefix "aaa-"
\b this "break" allows no other digit or letter immediately before the 3 digits
(\d{3}) captures these 3 digits as Group(3)
(\d{3})- matches "bbb-" and captures these 3 digits as Group(4)
(\d{4}) matches "cccc" and captures these 4 digits as Group(5)
\b this final "break" makes sure the match is not immediately followed
by another digit or a letter
ListPhones(s,a) PUBLIC {
    ; a is a reference variable. On return
    ;   a contains the number of phone numbers in string s
    ;   a(i) contains just the digits of the i'th phone number
    kill a
    set a = 0
    set m=##class(%Regex.Matcher).%New("(\((\d{3})\)\s*|\b(\d{3})-)(\d{3})-(\d{4})\b")
    set m.Text = s
    while m.Locate() {
        ; Get first three digits from Group(2) or Group(3)
        if m.Start(2)>0 { set n=m.Group(2) } else { set n=m.Group(3) }
        ; Concatenate middle 3 digits and final 4 digits
        set n = n_m.Group(4) _ m.Group(5)
        ; Insert digit string into array a
        set a($increment(a)) = n
    }
}
ListPhones2(s,a) PUBLIC {
    ; a is a reference variable. On return
    ;   a contains the number of phone numbers in string s
    ;   a(i) is i'th phone number formatted as "(aaa)bbb-cccc"
    ;   Note, no blank after "(aaa)"
    kill a
    set a = 0
    set m=##class(%Regex.Matcher).%New("(\((\d{3})\)\s*|\b(\d{3})-)(\d{3})-(\d{4})\b")
    set m.Text = s
    while m.Locate() {
        ; Digits are concatenation of capture groups 2,3,4,5
        ; One of group 2 or 3 is the empty string when group is not used
        set a($increment(a)) = m.SubstituteIn("($2$3)$4-$5")
    }
}
USER>write ^t2
Call 617-555-1212 about item number 61773-333-4569
USER>do ListPhones^ListPhones(^t2,.a)
USER>zwrite a
a=1
a(1)=6175551212
USER>write ^t3
Phone (212) 334-5397, (321)770-2121 and 603-646-0110
USER>do ListPhones^ListPhones(^t3,.a)
USER>zwrite a
a=3
a(1)=2123345397
a(2)=3217702121
a(3)=6036460110
USER>write ^t3
Phone (212) 334-5397, (321)770-2121 and 603-646-0110
USER>do ListPhones2^ListPhones(^t3,.a)
USER>zwrite a
a=3
a(1)="(212)334-5397"
a(2)="(321)770-2121"
a(3)="(603)646-0110"
Property Inventory
Method Inventory
- EndGet()
- GroupCountGet()
- GroupGet()
- HitEndGet()
- LastStatus()
- Locate()
- LookingAt()
- Match()
- OperationLimitSet()
- PatternSet()
- ReplaceAll()
- ReplaceFirst()
- RequiredPrefixGet()
- ResetPosition()
- StartGet()
- SubstituteIn()
- TextSet()
Properties
The value of End(i) when subscripted with an integer i between 1 and GroupCount is the character position one beyond the last character of the last string successfully captured by capture group i.
The value of End(i) is -1 if capture group i did not participate in the last match. The values of End and End(i) are -2 if the last match attempt failed.
Note: In addition to integer subscripts between 1 and GroupCount, the value of End(0) is identical to the value of End without a subscript. When the property End(...) is subscripted with values not described above then the attempt to evaluate the property End(...) is undefined.
The value of Group(i) when subscripted with an integer i between 1 and GroupCount is the last string successfully captured by capture group i.
If the last match operation was unsuccessful or if the specified capture group was not used during the last match operation then Group and Group(i) contain the empty string. Note that End and End(i) have negative values when the last match operation did not use the specified capture group or did not succeed in matching.
Note: In addition to integer subscripts between 1 and GroupCount, the value of Group(0) is identical to the value of Group without a subscript. When the property Group(...) is subscripted with values not described above then the attempt to evaluate the property Group(...) is undefined.
Correspondence with actual processor time will depend on the speed of the processor and the details of the specific pattern, but cluster size is chosen such that each cluster's execution time will typically be on the order of milliseconds.
On an installation using an NLS 8-bit character set different from Latin-1, you must be careful with patterns using a character class of the form [x-y] where x or y are national usage characters not in Latin-1. All regular expression matching is done in Unicode, so characters x and y are converted to Unicode. The character class [x-y] represents all characters between the Unicode translations of x and y, not the NLS 8-bit characters between x and y.
The value of Start(i) when subscripted with an integer i between 1 and GroupCount is the character position of the first character of the last string successfully captured by capture group i. If the captured string is the empty string then Start(i) is the character position one beyond where the empty string was captured (and the property Start(i) equals the property End(i)).
The value of Start(i) is -1 if capture group i did not participate in the last match. The values of Start and Start(i) are -2 if the last match attempt failed.
Note: In addition to integer subscripts between 1 and GroupCount, the value of Start(0) is identical to the value of Start without a subscript. When the property Start(...) is subscripted with values not described above then the attempt to evaluate the property Start(...) is undefined.
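The subscripted properties can be inspected together after a successful match. The sketch below is illustrative only: the date-like pattern and the sample text are assumptions chosen so that the optional capture groups 3 and 4 do not participate in the match.

```objectscript
 ; Groups 3 and 4 belong to an optional "-dd" suffix that is absent from the text
 set m = ##class(%Regex.Matcher).%New("(\d{4})-(\d{2})(-(\d{2}))?","Due 2024-03")
 if m.Locate() {
     ; Unsubscripted properties describe the entire matched region
     write "Match=""",m.Group,""" Start=",m.Start," End=",m.End,!
     for i=1:1:m.GroupCount {
         ; A group that did not participate should report Start and End of -1
         ; and an empty Group(i), per the property descriptions above
         write "Group(",i,")=""",m.Group(i),""" Start=",m.Start(i)," End=",m.End(i),!
     }
 }
```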
Methods
Do $SYSTEM.Status.DisplayError(##class(%Regex.Matcher).LastStatus())
is useful when debugging a <REGULAR EXPRESSION> error following a call on $MATCH, $LOCATE or ##class(%Regex.Matcher).%New(x) where a %Regex.Matcher oref value is not available.
If the optional argument position is defined as an integer 1 or greater then the search for a match begins at that character position of Text.
If the argument position is not defined then the search for the match begins at the character position following the previous match.
Locate returns 1 if the match is found; 0 otherwise.
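A brief sketch of the two calling forms, using an arbitrary sample string:

```objectscript
 set m = ##class(%Regex.Matcher).%New("\d+","a1 b22 c333")

 ; Begin the search at character position 4, which skips the "1" at position 2
 if m.Locate(4) { write "Found ",m.Group," at position ",m.Start,! }

 ; With no argument, the search resumes after the previous match
 if m.Locate() { write "Found ",m.Group," at position ",m.Start,! }
```

Under these assumptions the first call should find "22" and the second should find "333".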
The argument position gives the starting character position of the attempted match.
LookingAt returns 1 if the match is found; 0 otherwise.
The argument text is optional. If the argument text is defined then the property Text is set to its value before the match is executed.
The argument replacement supplies the string to replace each matched region. The replacement string may contain references to capture groups which take the form of $1, $2, etc. The replacement string may reference the entire matched region with $0.
The argument replacement supplies the string to replace the matched region. The replacement string may contain references to capture groups which take the form of $1, $2, etc. The replacement string may reference the entire matched region with $0.
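To illustrate capture group references in a replacement string, the following sketch swaps "last, first" name pairs. The pattern and the sample text are illustrative assumptions.

```objectscript
 ; Group 1 captures the word before the comma, group 2 the word after it
 set m = ##class(%Regex.Matcher).%New("(\w+), (\w+)")
 set m.Text = "Jones, Sally; McMurry, Mike"

 ; Replace every matched region with "first last"
 write m.ReplaceAll("$2 $1"),!

 ; Replace only the first matched region
 write m.ReplaceFirst("$2 $1"),!
```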
The argument position is the character position from which the next call to Locate() without an argument will begin match attempts.
This method can be used as a low level step in regular expression replacement. It does not modify the property Text. For example, the method call ..ReplaceFirst(x) is equivalent to:
Quit:'..Locate(1) ..Text
Quit $Extract(..Text,1,..Start-1)_..SubstituteIn(x)_
$Extract(..Text,..End,*)
The argument Text supplies a template string. References to capture groups in that string are replaced with the corresponding text from the matched region, and the resulting string is returned. Capture group references take the form of $1, $2, etc. The string may reference the entire matched region with $0.
Inherited Members
Inherited Methods
- %AddToSaveSet()
- %ClassIsLatestVersion()
- %ClassName()
- %ConstructClone()
- %DispatchClassMethod()
- %DispatchGetModified()
- %DispatchGetProperty()
- %DispatchMethod()
- %DispatchSetModified()
- %DispatchSetMultidimProperty()
- %DispatchSetProperty()
- %Extends()
- %GetParameter()
- %IsA()
- %IsModified()
- %New()
- %NormalizeObject()
- %ObjectModified()
- %OriginalNamespace()
- %PackageName()
- %RemoveFromSaveSet()
- %SerializeObject()
- %SetModified()
- %ValidateObject()
- Help()