|field||A data column whose values are being compared with the document text. Must be of data type %Text or a character stream of type %Stream.GlobalCharacterSearchable. Cannot be a view field.|
|document||An alphabetic string to match with values in field. A document consists of a series of words separated by a delimiter (commonly, the space character).|
The %SIMILARITY function returns a numeric value indicating the similarity between each value of field and the text specified in document. The fractional values returned range from 0 (no similarity at all) to 1 (identical). The returned value is of type NUMERIC with a precision of 19 and a scale of 18.
You can use %SIMILARITY as a select-item or in a predicate in a WHERE clause. You can also use %SIMILARITY to rank results, for example in an ORDER BY clause.
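As a minimal sketch of the first two usages, the following query returns the similarity score as a select-item and also filters on it in a WHERE clause. It reuses the Sample.MyTexts table and MySentences field from the example on this page; the Score alias and the 0.5 threshold are illustrative assumptions, not recommended values.

```sql
-- Return each sentence with its similarity score (select-item usage),
-- keeping only rows whose score exceeds an arbitrary, illustrative threshold.
SELECT MySentences,
       %SIMILARITY(MySentences,'the quick brown fox jumped') AS Score
FROM Sample.MyTexts
WHERE %SIMILARITY(MySentences,'the quick brown fox jumped') > 0.5
```

A suitable threshold depends on your data, so inspect the scores returned for known matches before choosing one.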
%SIMILARITY can be used on a %Text string or a character stream field.
To use %SIMILARITY on a string, change the %String property to %Text, and set LANGUAGECLASS and MAXLEN property parameters. For example:
Property MySentences As %Text(LANGUAGECLASS = "%Text.English",MAXLEN = 1000);
Specifying a MAXLEN value (in bytes) is required for %Text properties.
To use %SIMILARITY on a character stream field, the stream field must be defined as type %Stream.GlobalCharacterSearchable. For example:
Property MyTextStream As %Stream.GlobalCharacterSearchable(LANGUAGECLASS = "%Text.English");
The available languages are English, French, German, Italian, Japanese, Portuguese, and Spanish. See the %Text package class documentation (in %SYS) in the InterSystems Class Reference for further details.
If field is neither data type %Text nor %Stream.GlobalCharacterSearchable, the system generates an SQLCODE -309 error.
%SIMILARITY has both an indexed and a non-indexed implementation. For both %Text and %Stream.GlobalCharacterSearchable, you can optionally set the SIMILARITYINDEX property parameter. If no SIMILARITYINDEX is specified, Caché uses a non-indexed (and much slower) implementation that takes the maximum similarity of any 32k chunk of the document. Because the similarity metric takes document length into account, a similarity calculated this way differs from (and is usually larger than) the value for the document as a whole. In addition, because chunks do not overlap, similar terms that appear across a chunk boundary contribute less to similarity than they would for the document as a whole, which reduces the similarity value. In contrast, a %SIMILARITY value based on a SIMILARITYINDEX is not chunked, and is therefore based on the document as a whole. For both performance and consistency, set up a similarity index if you need to use %SIMILARITY on streams.
If text is represented as a stream that is greater than the maximum length of a string and a search on the text uses %SIMILARITY on a non-indexed field, the document is broken up into chunks of characters. If non-indexed fields span boundaries between chunks, they may not be properly referenced. To avoid this issue, only use %SIMILARITY on indexed fields. For information on the maximum length of a string, see the section “Support for Long String Operations” in the chapter “Server Configuration Options” in the Caché Programming Orientation Guide.
For further details on SIMILARITYINDEX see the %Library.Text class.
A returned value of equality (1.00000) means that the field value and document string consist of the same words. Two words are considered identical if they have the same stem form; for example, dog=dogs and jump=jumped=jumping. The words in field and document may be in a completely different order. By default, word comparison is not case-sensitive.
The similarity of two identical strings may be very slightly less than or very slightly more than exactly 1.
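Because of this slight deviation, an exact test against 1 may miss rows that are identical to the document. One possible approach, sketched here, is to test for a narrow range around 1. The tolerance of 0.001 is an assumed, illustrative value, and the table and field names are those used in the example on this page.

```sql
-- Treat scores within an assumed tolerance of 1 as "identical";
-- an exact test (= 1) could miss rows because of the slight deviation.
SELECT MySentences
FROM Sample.MyTexts
WHERE %SIMILARITY(MySentences,'the quick brown fox jumped') BETWEEN 0.999 AND 1.001
```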
A returned value that indicates high similarity generally means that most or all of the words (or other delimited data items) in field are also found in document, though not necessarily in the same stem form or order. The document text may also contain words not present in field. Extra words in document that are not present in field have less effect on similarity than words missing from document that are present in field. One or two duplicates in document of a word present in field generally add to the degree of significance, but large numbers of duplicates in document diminish significance. One-letter and two-letter words have less effect on significance than longer words.
%SIMILARITY comparison is governed by the class parameters of the %Text.Text system class, found in the %SYS namespace. These parameters allow you to specify, among other things, whether comparison is to be case-sensitive or not case-sensitive, and the treatment of numbers, punctuation characters, and multi-word phrases.
Caché can use specific language analysis rules, including common word analysis (“noise word” lists) and stemming rules, to determine similarity. The available languages are English, French, German, Italian, Japanese, Portuguese, and Spanish.
For a much more detailed treatment of %SIMILARITY and %Text, refer to the %Text package class documentation in the InterSystems Class Reference.
iKnow and iFind
The Caché iKnow text analysis tool and iFind text search tool also provide similarity analysis. These facilities are entirely separate from %Text classes. They provide a substantially different and significantly more sophisticated form of textual analysis.
The following example returns the 10 records whose MySentences values are most similar to the specified document string. Because the most similar matches have the highest similarity value, the ORDER BY clause here is DESC (in descending order):
SELECT TOP 10 MySentences FROM Sample.MyTexts ORDER BY %SIMILARITY(MySentences,'the quick brown fox jumped') DESC