Returns a number indicating the similarity of a field value to a text.
||A data column whose values are being compared with the document text. Must be of data type %TEXT. Cannot be a view field.
||An alphabetic string to match with values in field. A document consists of a series of words separated by a delimiter (commonly, the space character).
function returns a numeric value indicating the similarity between each value of field
and the text specified in document
. The fractional values returned range from 0 (no similarity at all) to 1 (identical). The returned value is of type NUMERIC with a precision of 19 and a scale of 18.
can be used on a %Text string or a character stream field.
Property MySentences As %Text(LANGUAGECLASS = "%Text.English",MAXLEN = 1000);
Specifying a MAXLEN value (in bytes) is required for %Text
Property MyTextStream As %Stream.GlobalCharacterSearchable(LANGUAGECLASS = "%Text.English");
The available languages are English, French, German, Italian, Japanese, Portuguese, and Spanish. See the %Text
package class documentation (in %SYS) in the InterSystems Class Reference
for further details.
is neither data type %Text nor %Stream.GlobalCharacterSearchable, the system generates an SQLCODE -309 error.
has both an indexed and a non-indexed implementation. For both %Text and %Stream.GlobalCharacterSearchable, you can, optionally, set the SIMILARITYINDEX property parameter. If no SIMILARITYINDEX is specified, InterSystems IRIS uses a non-indexed (and much slower) implementation that takes the maximum similarity of any 32k chunk of the document. Since the similarity metric takes document length into account, the similarity calculated in this way is different (and usually larger) than it would be for the document as a whole. Also, since chunks do not overlap, similar terms that appear across the chunk boundary do not contribute as much to similarity as they would for the document as a whole, which acts to reduce the similarity value. In contrast, a %SIMILARITY
value that is based on a SIMILARITYINDEX is not chunked, and is therefore based on the document as a whole. For both performance and consistency it is recommended that you should set up a similarity index if you need to use %SIMILARITY
If text is represented as a stream that is greater than the maximum length of a string and a search on the text uses %SIMILARITY
on a non-indexed field, the document is broken up into chunks of characters. If non-indexed fields span boundaries between chunks, they may not be properly referenced. To avoid this issue, only use %SIMILARITY
on indexed fields. For further details, refer to Maximum String Length
in the Data Types and Values
chapter of Using ObjectScript
A returned value of equality (1.00000) means that the field
value and document
string consist of the same words. Two words are considered identical if they have the same stem form; for example, dog=dogs and jump=jumped=jumping. The words in field
may be in a completely different order. By default, word comparison is not case-sensitive.
The similarity of two identical strings may be very slightly less than or very slightly more than exactly 1.
A returned value of highly similar generally means that most or all of the words (or other delimited data items) in field
are also found in document
, though not necessarily in the same stem form or order. The document
text may also contain words not present in field
. Extra words in document
that are not present in field
have less effect on similarity than words missing from document
that are present in field
. One or two duplicates in document
of a word present in field
generally add to the degree of significance, but large numbers of duplicates in document
diminish significance. One-letter and two-letter words have less effect on significance than longer words.
comparison is governed by the class parameters of the %Text.Text
system class, found in the %SYS namespace. These parameters allow you to specify, among other things, whether comparison is to be case-sensitive or not case-sensitive, and the treatment of numbers, punctuation characters, and multi-word phrases.
InterSystems IRIS can use specific language analysis rules, including common word analysis (noise word lists) and stemming rules, to determine similarity. The available languages are English, French, German, Italian, Japanese, Portuguese, and Spanish.
The following example returns the top 10 records that match the %SIMILARITY
string value. Because the most similar matches have the highest similarity value, the ORDER BY clause here is DESC (in descending order):
SELECT TOP 10 MySentences FROM Sample.MyTexts ORDER BY %SIMILARITY(MySentences,'the quick brown fox jumped') DESC