Using InterSystems SQL Search
InterSystems SQL Search Tool

This chapter describes the InterSystems SQL Search facility, a tool for performing context-aware text search operations. To use InterSystems SQL Search you must define an SQL Search index for each column containing text that you wish to search. You can then search the text records using a standard SQL query with a WHERE clause containing InterSystems SQL Search syntax. The query will return all records that contain the specified search item or items. You can also highlight the matching text in the returned records.
Before You Begin
You need a running InterSystems IRIS Data Platform™ instance with an active InterSystems IRIS license key that provides access to the InterSystems IRIS Natural Language Processor (NLP). (You can view the license key from the Management Portal: select System Administration then Licensing.)
Examples in this documentation use data from the Aviation.Event SQL table. If you wish to use this sample data, it is available at https://github.com/intersystems/Samples-Aviation. (You do not need to know anything about GitHub or have a GitHub account.) To install these samples, InterSystems recommends that you create a dedicated namespace called (for example) TESTSAMPLES and then load the samples into that namespace (or you can use an existing namespace; however, you cannot use the %SYS namespace). To create a namespace, use the Management Portal options System Administration -> Configuration -> System Configuration -> Namespaces. For the general process of downloading from GitHub, see Downloading Samples for Use with InterSystems IRIS. After you download a sample, be sure to open the README file and follow the setup instructions.
Indexing Sources for SQL Search
You can use SQL Search to search text in %String data type or %Stream.GlobalCharacter (character stream) data type.
To perform an SQL search, the column to be searched must have a defined SQL Search bitmap index. There are three levels of SQL Search indices. These levels are defined in nested subclasses. Each index level provides all of the features of the previous level, plus additional SQL Search features specific to that level. The SQL Search index types, from the lowest level to the highest, are Basic (%iFind.Index.Basic), Semantic (%iFind.Index.Semantic), and Analytic.
Each index level supports all of the parameters of the previous level, and adds one or more additional parameters. Unspecified parameters take default values.
The following Class Definition example creates a table with a Semantic index on the Narrative property (column). The indexed property can be of data type %String or %Stream.GlobalCharacter:
   Class Aviation.TestSQLSrch Extends %Persistent [ ClassType=persistent,
      DdlAllowed,Owner={UnknownUser},ProcedureBlock,SqlRowIdPrivate,
      SqlTableName=TestSQLSrch ]
  { 
  Property UniqueNum As %Integer;
  Property CrashDate As %TimeStamp [ SqlColumnNumber=2 ];
  Property Narrative As %String(MAXLEN=100000) [ SqlColumnNumber=3 ];
  Index NarrSemanticIdx On (Narrative) As %iFind.Index.Semantic(INDEXOPTION=0,
     LANGUAGE="en",LOWER=1);
  Index UniqueNumIdx On UniqueNum [ Type=index,Unique ];
  }
An SQL Search index of any type supports a common set of parameters, including the INDEXOPTION, LANGUAGE, and LOWER parameters used in the preceding class definition.
For a full list of supported parameters refer to %iFind.Index.Basic in the InterSystems Class Reference.
A Semantic index (%iFind.Index.Semantic) also supports the following optional parameter:
Indexing a JSON Object
You can create an InterSystems SQL Search index for text stored in a JSON object. This index specifies the starting position in the JSON structure. SQL Search recursively indexes all text at that level and at all nested levels below it. Specify $ to index the entire JSON object. Specify $.key2 to index the JSON values at key2 and below.
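The recursive behavior described above can be pictured with a short Python sketch. This is only an illustration of "all text at that level and below"; it is not how InterSystems IRIS implements JSON indexing, and the function names and sample document are hypothetical.

```python
import json

def collect_text(node):
    """Recursively gather every string value at this node and below it."""
    if isinstance(node, str):
        return [node]
    if isinstance(node, dict):
        return [t for value in node.values() for t in collect_text(value)]
    if isinstance(node, list):
        return [t for item in node for t in collect_text(item)]
    return []  # numbers, booleans, and null carry no text

def text_under(document, path):
    """Resolve a simple '$' or '$.key.subkey' path, then collect the text below it."""
    node = json.loads(document)
    if path != "$":
        for key in path[2:].split("."):
            node = node[key]
    return collect_text(node)

# Hypothetical sample document
doc = ('{"key1": "engine failure", '
       '"key2": {"a": "spark plug", "b": ["fouled", {"c": "electrode"}], "n": 3}}')

print(text_under(doc, "$"))       # every string in the object
print(text_under(doc, "$.key2"))  # only the strings at key2 and below
```

Starting at $.key2 skips the text stored under key1, which is the practical effect of choosing a starting position other than $.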
Populating a Table
Like any SQL index, a defined SQL Search index (by default) is built when you populate a new table, and maintained when you subsequently insert, update, or delete data. You can defer building of an index when populating a table using %NOINDEX, and then use the %Build() method to build the index. You can add an index to a table that is already populated with data and then build that index. Refer to Defining and Building Indices for further details.
The following example populates the Aviation.TestSQLSrch table from the Aviation.Event table. Any defined SQL Search indices will automatically be built. This example inserts a large amount of text, so running it may take a minute or so:
INSERT OR UPDATE INTO Aviation.TestSQLSrch (UniqueNum,CrashDate,Narrative) 
    SELECT %ID,EventDate,NarrativeFull FROM Aviation.Event
This example uses INSERT OR UPDATE with a field defined with a unique key to prevent repeated execution from creating duplicate records.
Performing SQL Search
You use SQL Search syntax in an SQL query WHERE clause to perform a text search for one or more text items. These text items may be words or sequences of words (Basic index) or NLP semantic entities (Semantic index). Multiple text items are an implicit AND search; all of the specified items must appear in the text, in any order. The syntax for SQL Search is as follows:
WHERE %ID %FIND search_index(indexname,'search_items',search_option,'language','synonym_tables')
When performing a Basic index search, SQL Search identifies words by the presence of one or more space characters. Sentence punctuation (a period, comma, semicolon, or colon followed by a space) is ignored. SQL Search treats all other punctuation as literals. For example, SQL Search treats “touch-and-go” as a single word, and a hyphen or a decimal point within a number is treated as a literal. Quote characters and apostrophes must be specified. You specify a single quote character by doubling it.
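As a rough illustration of these word-boundary rules, the following Python sketch splits text on spaces and drops only sentence punctuation. It is a simplified model, not the SQL Search tokenizer; treating punctuation at the very end of the text the same as punctuation followed by a space is an assumption of this sketch.

```python
import re

def basic_words(text):
    """Split text into words the way the rules above describe (simplified)."""
    # Drop a period, comma, semicolon, or colon only when a space follows it.
    # Assumption: the end of the text counts like a following space.
    cleaned = re.sub(r"[.,;:](?=\s|$)", "", text)
    # Words are delimited by one or more space characters.
    return cleaned.split()

print(basic_words("The touch-and-go landing, at 3.5 miles: aborted."))
```

The hyphens in “touch-and-go” and the decimal point in “3.5” survive because neither is followed by a space.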
You can perform any Basic index search (word, co-occurrence, or positional phrase) with a Semantic index. Attempting to perform a Semantic index search with a Basic index results in an SQLCODE -149 error.
SQL search_items Syntax
Basic index search_items can contain the following syntax:
Word Search:
word1 word2 word3
Specifies that these exact words must appear (in any order) somewhere in the text (logical AND). You can specify a single word, or any number of words separated by spaces.
word1 OR word2 NOT word3
word1 OR (word2 AND word3)
search_items can contain AND, OR, and NOT logical operators. AND is the same as separating words with spaces (implicit AND). NOT is logically equivalent to AND NOT. search_items can also use parentheses to group logical operators. Explicit AND is needed when specifying multiple words in grouping parentheses: (word2 AND word3). If the explicit AND were omitted, (word2 word3) would be interpreted as a positional phrase. You can use the \ escape character to specify AND, OR, or NOT as a literal rather than a logical operator: \and
*word
word*
*word*
w*d
An asterisk wildcard specifies 0 or more non-space characters of any type. An asterisk can be used as a prefix, as a suffix, or within a word. You can use the \ escape character to specify * as a literal: \*
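One way to reason about the asterisk wildcard is to translate a search word into a regular expression, as in the Python sketch below. This is an explanatory model only; the helper names are hypothetical, and case handling (which depends on the index definition) is deliberately left out.

```python
import re

def wildcard_to_regex(term):
    """Translate a search word with * wildcards into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "\\" and i + 1 < len(term):
            parts.append(re.escape(term[i + 1]))  # escaped character is a literal
            i += 2
        elif ch == "*":
            parts.append(r"\S*")  # 0 or more non-space characters
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))

def matches(term, word):
    """True if the whole word satisfies the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None

print(matches("*seal*", "previously-sealed"))
print(matches("w*d", "wd"))
print(matches("word*", "sword"))
```

Because the asterisk can stand for zero characters, w*d matches “wd” as well as “word”.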
Co-occurrence Word Search:
[word1,word2,...,range]
Co-occurrence search. Specifies that these exact words must appear (in any order) within the proximity window specified by range. You can specify any number of words or multi-word phrases. A multi-word phrase is specified as words separated by spaces with no delimiting punctuation. Words (or positional phrases) are separated by commas; the last comma-separated element is an optional numeric range. Words can specify asterisk wildcards.
A range can be specified as min-max or simply as max with a default min of 1. For example, 1-5 or 5. range is optional; if omitted, it defaults to 1-20. A range count is inclusive of all of the specified words.
Co-occurrence search cannot be used with search_option=4 (Regular Expressions).
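The structure of a co-occurrence item, including the range defaults described above, can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that a purely numeric final element is always the range rather than a search word.

```python
import re

def parse_cooccurrence(item):
    """Split '[term1,term2,...,range]' into (terms, min, max)."""
    if not (item.startswith("[") and item.endswith("]")):
        raise ValueError("co-occurrence items are delimited by square brackets")
    elements = [e.strip() for e in item[1:-1].split(",")]
    low, high = 1, 20  # default range when none is specified
    # Assumption: a final element of the form max or min-max is the range.
    m = re.fullmatch(r"(?:(\d+)-)?(\d+)", elements[-1])
    if m:
        elements.pop()
        low = int(m.group(1)) if m.group(1) else 1  # min defaults to 1
        high = int(m.group(2))
    return elements, low, high

print(parse_cooccurrence("[electrode*,spark plug,1-6]"))
print(parse_cooccurrence("[electrode*,spark plug]"))
```

Note that “spark plug” stays a single element: within the brackets, only commas separate terms.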
Positional Phrase Search:
Note:
You can use double quotes "word1 word2 word3" or parentheses (word1 word2 word3) to delimit a positional phrase. Because parentheses are also used to group logical operators, the use of double quotes is preferred.
"word1 word2 word3"
These exact words must appear sequentially in the specified order. Words are separated by spaces. Note that no semantic analysis is performed; for example, the words in a “phrase” may be the final word of a sentence and the beginning words of the next sentence. Asterisk wildcards can be applied to individual words in a phrase. A literal parenthesis character in the search_items must be enclosed in quotes.
"word1 ? word3"
"word1 ? ? ? word5"
A question mark indicates that exactly one word is found between the specified words in a phrase. You can specify multiple single question marks, each separated by spaces.
"word1 ?? word6"
A double question mark (with no space between) indicates that from 0 to 6 words are found between the specified words in a phrase.
"word1 [1-3] word5"
Square brackets indicate an interval number of words between the specified words in a phrase: min-max. This interval is specified as a variable range, in this case from 1 to 3 missing words.
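The three gap notations (?, ??, and [min-max]) can be compared side by side with a small Python matcher. This is a sketch of the documented semantics only, with hypothetical function names; it compares lowercased, space-delimited words and does not handle asterisk wildcards or sentence punctuation.

```python
import re

def gap_bounds(token):
    """Return (min, max) skipped words for a gap token, or None for a literal word."""
    if token == "?":
        return 1, 1          # exactly one word
    if token == "??":
        return 0, 6          # from 0 to 6 words
    m = re.fullmatch(r"\[(\d+)-(\d+)\]", token)
    if m:
        return int(m.group(1)), int(m.group(2))  # explicit interval
    return None

def match_at(pattern, words, start):
    """True if the pattern matches the words beginning exactly at index start."""
    if not pattern:
        return True
    head, rest = pattern[0], pattern[1:]
    bounds = gap_bounds(head)
    if bounds is None:
        return (start < len(words) and words[start] == head
                and match_at(rest, words, start + 1))
    low, high = bounds
    return any(match_at(rest, words, start + skip)
               for skip in range(low, high + 1)
               if start + skip <= len(words))

def phrase_matches(phrase, text):
    """True if the positional phrase occurs anywhere in the text."""
    pattern, words = phrase.lower().split(), text.lower().split()
    return any(match_at(pattern, words, i) for i in range(len(words)))

text = "Wind from the south at 25 knots"
print(phrase_matches("wind from ? ? at ? knots", text))
print(phrase_matches("wind ?? knots", text))
print(phrase_matches("wind [1-3] knots", text))
```

In the sample text, five words separate “wind” and “knots”, so the ?? form (0 to 6 words) matches while the [1-3] interval does not.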
Semantic index search_items can contain the following NLP entity search syntax in addition to the Basic index syntax:
Full Entity and Partial Entity Search:
{entity}
Specifies the exact wording of an NLP entity. Asterisk wildcards can be applied to individual words in an entity.
<{entity}
A less-than sign prefix specifies an NLP entity ending with the specified word(s). There must be one or more words in the entity appearing before the specified word(s).
{entity}>
A greater-than sign suffix specifies an NLP entity beginning with the specified word(s). There must be one or more words in the entity appearing after the specified word(s).
Multiple search_items can be specified, separated by spaces. This is an implicit AND test. For example:
SELECT Narrative FROM Aviation.TestSQLSrch WHERE %ID %FIND 
search_index(NarrSemanticIdx,'<{plug electrode} "flight plan" instruct*',0,'en')
means that a Narrative text must include one or more SQL Search entities that end with “plug electrode”, AND the positional phrase “flight plan”, AND the word “instruct” with a wildcard suffix, allowing for “instructor”, “instructors”, “instruction”, “instructed”, etc. These items can appear in any order in the text.
Validating an SQL search-items String
You can use the %iFind.Utils.TestSearchString() method to validate a search_items string. This method enables you to detect syntax errors and ambiguous use of logical operators. For example, "word1 AND word2 OR word3" fails validation because it is logically ambiguous. Adding parentheses clarifies this string to either "word1 AND (word2 OR word3)" or "(word1 AND word2) OR word3".
The following example invokes this SQL Search utility as an SQL function:
SELECT %iFind.TestSearchString('orange AND (lemon OR lime)')
TestSearchString() returns a %Status value: A valid search_items string returns a status of 1. An invalid search_items string returns an object expression that begins with 0, followed by encoded error information.
Fuzzy Search
InterSystems SQL Search supports fuzzy search to match records containing elements (words or entities) that “almost” match the search string. Fuzzy search can be used to account for small variations in writing (color vs. colour), misspellings (collor vs color), and different grammatical forms (color vs. colors).
SQL Search evaluates fuzzy matches by computing the Levenshtein distance between the two words. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. The maximum number of single-character edits permitted for a match is known as the maximum edit distance. The InterSystems SQL Search maximum edit distance defaults to 2 characters. The maximum edit distance is separately applied to each element in the search string. For an SQL Search Basic index, it is applied to each word in the search string. For an SQL Search Semantic index, it is applied to each NLP entity in the search string. (The examples that follow assume an SQL Search Basic index.)
For example, the phrase “analyse programme behaviour” is a fuzzy search match for “analyze program behavior” when maximum edit distance=2, because each word in the search string differs by an edit distance of (at most) 2 characters: analyse=analyze (1 substitution), programme=program (2 deletions), behaviour=behavior (1 deletion).
A word whose length is less than or equal to the maximum edit distance is a fuzzy search match for any word with an equal or lesser number of characters. For example, if the edit distance is 2, the word “ab” would match any two-letter word (2 substitutions), any one-letter word (1 substitution, 1 deletion), most three-letter words containing either “a” or “b” (1 substitution, 1 insertion), and any four-letter word containing both “a” and “b” in that order (2 insertions).
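The edit distances quoted above can be checked with a standard dynamic-programming implementation of Levenshtein distance. The Python sketch below is a textbook version, not the SQL Search implementation; fuzzy_phrase_match is a hypothetical helper that applies the maximum edit distance to each word separately, as described for a Basic index.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def fuzzy_phrase_match(search, candidate, max_distance=2):
    """Apply the maximum edit distance to each word pair independently."""
    s, c = search.split(), candidate.split()
    return len(s) == len(c) and all(
        levenshtein(x, y) <= max_distance for x, y in zip(s, c))

for pair in [("analyse", "analyze"), ("programme", "program"), ("behaviour", "behavior")]:
    print(pair, levenshtein(*pair))

print(fuzzy_phrase_match("analyse programme behaviour", "analyze program behavior"))
```

Lowering max_distance to 1 makes the phrase comparison fail, because “programme” and “program” differ by two deletions.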
To activate fuzzy search for search_index() specify search_option as 3 for fuzzy search with the default edit distance of 2, or 3:n for fuzzy search with an edit distance specified as n characters. The following example shows SQL Search with fuzzy search with an edit distance of 4:
SELECT Narrative FROM Aviation.TestSQLSrch WHERE %ID %FIND 
search_index(NarrBasicIdx,'"color code" program','3:4','en')
Setting 3:1 sets the edit distance=1, which in English is appropriate for matching most (but not all) singular and plural words. Setting 3:0 sets the edit distance=0, which is the same as SQL Search without fuzzy search.
To specify fuzzy search for SQL Search methods, set the pSearchOption = $$$IFSEARCHFUZZY.
Stemming and Decompounding
Basic index, Semantic index, and Analytic index can all support stemming and decompounding. Stemming and decompounding are word-based, not NLP entity-based operations. You must enable stemming and decompounding when you define an SQL Search index. To enable an index for stem-aware searches, specify INDEXOPTION=1; to enable both stem-aware searches and decompounding-aware searches, specify INDEXOPTION=2.
If an SQL Search index was defined to support stemming (1) or stemming and decompounding (2), you can use these facilities in a search_index() query by setting the search_option value.
Stemming
Stemming identifies the Stem Form of each word. The stem form unifies multiple grammatical forms of the same word. When using search_option=1 at query time, SQL Search performs search and match operations using the Stem Form of a word, rather than the actual text form. By using search_option=0, you can use the same index for regular (non-stemmed) searches.
Decompounding
Decompounding divides compound words into their constituent words. SQL Search always combines decompounding with stemming; once a word is divided into its constituent parts, each of these parts is automatically stemmed. When using a decompounding search (search_option=2), SQL Search compares the decompounded stems of the search term with the decompounded stems of the words in the indexed text field. SQL Search matches a decompounded word only when the stems of its constituent words match all constituent words of the search term.
For example, the search terms “thunder”, “storm”, or “storms” would all match the word “thunderstorms”. However, the search term “thunderstorms” would not match the word “thunder”, because its other constituent word (“storm”) is not matched.
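The matching rule in this example can be modeled in a few lines of Python. Everything here is a toy stand-in: the two-word constituent dictionary and the plural-stripping stemmer are hypothetical, and the real decompounding dictionary and stemmer are language-specific components of InterSystems IRIS.

```python
CONSTITUENTS = {"thunder", "storm"}  # hypothetical decompounding dictionary

def stem(word):
    """Toy stemmer: strip a plural 's' (a real stemmer is language-specific)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def decompound(word):
    """Return the set of stemmed constituents; an unsplittable word is its own constituent."""
    base = stem(word.lower())

    def split(rest):
        if not rest:
            return []
        # Try the longest dictionary prefix first, then split the remainder.
        for end in range(len(rest), 0, -1):
            if rest[:end] in CONSTITUENTS:
                tail = split(rest[end:])
                if tail is not None:
                    return [rest[:end]] + tail
        return None

    return set(split(base) or [base])

def decompound_match(search_term, indexed_word):
    """Match only when every constituent of the search term appears in the indexed word."""
    return decompound(search_term) <= decompound(indexed_word)

for term in ("thunder", "storm", "storms"):
    print(term, decompound_match(term, "thunderstorms"))
print("thunderstorms", decompound_match("thunderstorms", "thunder"))
```

The subset test captures the asymmetry in the example: every constituent of the search term must be present in the indexed word, but not the reverse.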
The InterSystems SQL Search decompounding algorithm uses a language-specific dictionary that identifies possible constituent words. This dictionary should be populated through the %iKnow.Stemming.DecompoundUtils class, for example by pointing it to a text column prior to indexing. You may also wish to exempt specific words from decompounding. You can exempt individual words, character sequences, and training data word lists from decompounding using %iKnow.Stemming.DecompoundUtils.
Languages Not Supported by the InterSystems IRIS Natural Language Processor
You can use SQL Search Basic indices to index and search texts in languages for which there is no corresponding NLP language model.
Because stemming is not dependent on NLP semantic indexing, you can also perform Basic index word searches on stem forms of words, if a stemmer is available. You must specify INDEXOPTION=1 or INDEXOPTION=2 to perform stem searches. For example, Italian is not an NLP-supported language, but InterSystems IRIS provides a %Text stemmer for Italian.
The following limitations and caveats apply to SQL Search with languages not supported by NLP:
Synonym Tables
To implement a synonym table, define the table as a persistent class that extends the %iFind.Synonym abstract class.
This class defines two properties, FromTerm and ToTerm. A pair of FromTerm and ToTerm properties defines ToTerm as a synonym for FromTerm. SQL Search uses ToTerm to expand the query if the query contains FromTerm.
The query uses the GetMatch() method of this class to search the synonyms in the synonym table against the query terms.
During query execution, SQL Search checks if any synonyms exist for a single word unit or a positional search phrase. For example, two synonym pairs ("persons","people") and ("walk","run") are defined in a synonym table. A SQL Search query is executed on the phrase "persons walk". If the synonym table is associated with the query, SQL Search returns not only documents matching the original query, it also returns documents matching any one of the queries: "persons run", "people walk" and "people run".
However, if the search_items string is '"persons walk"', query expansion would not happen, because SQL Search does not expand any word in a positional phrase search. The positional phrase itself is the minimum unit for query expansion. If, however, you define a synonym pair like ("persons walk","persons walk and run"), SQL Search would expand the query '"persons walk"' to '"persons walk and run"'. SQL Search treats a ToTerm as a positional phrase if it contains multiple words. A ToTerm can be any valid positional phrase; it can contain * or ? wildcards.
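The expansion behavior described above can be sketched in Python. The dictionary below is a hypothetical stand-in for the FromTerm/ToTerm rows of a synonym table; the sketch does not use the synonym class or its GetMatch() method, and it only illustrates that a single word or a whole positional phrase is the unit of expansion.

```python
import re
from itertools import product

def units(search_items):
    """Split search_items into single words and double-quoted positional phrases."""
    return re.findall(r'"[^"]*"|\S+', search_items)

def expand(search_items, synonyms):
    """Return the original query plus every synonym-expanded variant."""
    choices = []
    for unit in units(search_items):
        key = unit.strip('"')  # a quoted phrase is looked up as a whole
        alternatives = [unit]
        for to_term in synonyms.get(key, []):
            # A multi-word ToTerm is treated as a positional phrase.
            alternatives.append(f'"{to_term}"' if " " in to_term else to_term)
        choices.append(alternatives)
    return [" ".join(combo) for combo in product(*choices)]

word_pairs = {"persons": ["people"], "walk": ["run"]}        # hypothetical rows
phrase_pairs = {"persons walk": ["persons walk and run"]}    # hypothetical rows

print(expand("persons walk", word_pairs))
print(expand('"persons walk"', word_pairs))
print(expand('"persons walk"', phrase_pairs))
```

With the word-level pairs, the quoted phrase is returned unchanged, because no FromTerm equals the whole phrase.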
Note:
Synonym tables cannot be used with Regular Expression search (search_option=4).
Highlighting
You can highlight words in a returned text using the search_items syntax. Highlighting syntax is:
(text,search_items,search_option)
search_items: Highlighting uses the same search_items syntax as searching. This allows you to use the same search_items value for both returning records and highlighting the strings within those records that caused them to be returned. This also allows you to use the TestSearchString() method to validate highlighting search_items syntax. However, because highlighting is applied to every instance of every match, highlighting ignores the search_items syntax AND, OR, and NOT logical operators in a search_items string.
search_option: The optional search_option can be 0 (the default) or 4 (Regular Expressions).
You can apply highlighting using either of the following:
By default, highlighting inserts the <b> and </b> (bold) XML tags at the appropriate places in the string. By default, highlighting is not case sensitive.
Highlighting can be used with any search_option, including Regular Expression search (search_option=4), as shown in the following example:
  SET x="Time flies like an arrow. other stuff. Fruit flies like a banana."
  WRITE ##class(%iFind.Utils).Highlight(x,"\p{LU}(\p{L}|\s)+",4)
When used with Word Search, this method highlights separately each occurrence of each specified word.
When used with Positional Phrase Search, this method highlights each occurrence of the positional phrase.
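To make the default behavior concrete, the following Python sketch mimics highlighting for plain words and double-quoted positional phrases: it wraps every occurrence in <b> and </b>, ignores case, and skips the AND, OR, and NOT operators. It is a simplified model with a hypothetical function name, not the Highlight() implementation; wildcards, grouping parentheses, and NLP entity syntax are not handled.

```python
import re

def highlight(text, search_items, open_tag="<b>", close_tag="</b>"):
    """Wrap each occurrence of each word or quoted phrase in the given tags."""
    units = re.findall(r'"[^"]*"|\S+', search_items)
    # Logical operators are ignored: every instance of every item is highlighted.
    targets = [u.strip('"') for u in units if u not in ("AND", "OR", "NOT")]
    if not targets:
        return text
    # Longer items first, so that a phrase wins over a word it contains.
    ordered = sorted(targets, key=len, reverse=True)
    pattern = "|".join(r"\b" + re.escape(t) + r"\b" for t in ordered)
    return re.sub(pattern,
                  lambda m: open_tag + m.group(0) + close_tag,
                  text, flags=re.IGNORECASE)

x = "Time flies like an arrow. Fruit flies like a banana."
print(highlight(x, "flies AND like"))
print(highlight(x, '"fruit flies"'))
```

The first call tags each occurrence of each word separately; the second tags the phrase as a single unit.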
SQL Search Examples
In the following examples, SQL Search Basic index syntax can be used with any type of SQL Search index. SQL Search Semantic index syntax requires a Semantic or Analytic index.
These examples require that you have created and populated the Aviation.TestSQLSrch table, as described in Indexing Sources for SQL Search earlier in this chapter.
For simplicity of display, these examples return record counts rather than the record text itself. These counts are the number of records that match the search criteria, not the number of matches found in the records. A record may contain multiple matches, but is only counted once.
Basic Search Examples
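The following examples use Basic index search to search the Aviation.TestSQLSrch table. They reference an index named NarrBasicIdx; the class definition shown earlier in this chapter defines only NarrSemanticIdx, so these examples assume that a Basic index named NarrBasicIdx is also defined on the Narrative column.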
Search for records that contain at least one instance of the words “electrode”, “plug”, and “spark” (in any order):
SELECT COUNT(Narrative) FROM Aviation.TestSQLSrch 
WHERE %ID %FIND search_index(NarrBasicIdx,'electrode plug spark',0)
Note that this is word search, not string search. Therefore, the following example may return different results, and may actually return more results than the previous example:
SELECT COUNT(Narrative) FROM Aviation.TestSQLSrch 
WHERE %ID %FIND search_index(NarrBasicIdx,'electrodes plug spark',0)
Search for records that contain at least one instance of a word beginning with “electrode” (electrode, electrodes), and the word phrase “spark plug” (in any order):
SELECT COUNT(Narrative) FROM Aviation.TestSQLSrch 
WHERE %ID %FIND search_index(NarrBasicIdx,'electrode* "spark plug"',0)
Search for records that contain a word beginning with “electrode” (electrode, electrodes), and the word phrase “spark plug” (in any order) within a co-occurrence proximity window of 6 words. Note the punctuation used to specify words and word phrases in a co-occurrence search:
SELECT COUNT(Narrative) FROM Aviation.TestSQLSrch 
WHERE %ID %FIND search_index(NarrBasicIdx,'[electrode*,spark plug,1-6]',0)
Search for records that contain the two different word phrases normal wear and "normal" wear:
SELECT COUNT(Narrative) FROM Aviation.TestSQLSrch 
WHERE %ID %FIND search_index(NarrBasicIdx,'"normal wear"',0)
SELECT COUNT(Narrative) FROM Aviation.TestSQLSrch 
WHERE %ID %FIND search_index(NarrBasicIdx,'"\"normal\" wear"',0)
Search for records that contain at least one word containing the string seal (seal, seals, unseal, sealant, sealed, previously-sealed), and the word phrase “spark plug”:
SELECT COUNT(Narrative) FROM Aviation.TestSQLSrch 
WHERE %ID %FIND search_index(NarrBasicIdx,'*seal* "spark plug"',0)
Search for records that contain the wildcard phrase “wind from ? ? at ? knots.” Possible values might include “wind from the south at 25 knots” and “wind from 300 degrees at ten knots.” Note that if there is a space between two sequential question marks (? ?) the wildcard represents exactly two words; if there is no space between the two question marks (??) the wildcard represents from 0 to 6 words:
SELECT COUNT(Narrative) FROM Aviation.TestSQLSrch 
WHERE %ID %FIND search_index(NarrBasicIdx,'"wind from ? ? at ? knots"',0)
The following example uses a Basic index with Regular Expression search (search_option=4). It searches for records that contain strings specifying dates between “January 10” and “January 29” inclusive:
SELECT COUNT(Narrative) FROM Aviation.TestSQLSrch 
WHERE %ID %FIND search_index(NarrBasicIdx,'"January [1-2][0-9]"',4)
For further details, refer to Regular Expressions in Using ObjectScript.
Semantic Syntax Examples
The following examples use Semantic index search to search the Aviation.TestSQLSrch table.
Search for records that contain the NLP entity “spark plug electrodes”:
SELECT COUNT(Narrative) FROM Aviation.TestSQLSrch 
WHERE %ID %FIND search_index(NarrSemanticIdx,'{spark plug electrodes}',0)
Search for records that contain an NLP entity ending with “spark plug” or “spark plugs”:
SELECT COUNT(Narrative) FROM Aviation.TestSQLSrch 
WHERE %ID %FIND search_index(NarrSemanticIdx,'<{spark plug*}',0)
Search for records that contain both an NLP entity ending with “spark plugs” and the NLP entity “spark plugs”:
SELECT COUNT(Narrative) FROM Aviation.TestSQLSrch 
WHERE %ID %FIND search_index(NarrSemanticIdx,'<{spark plugs} {spark plugs}',0)