Using iKnow
iFind Search Tool

This chapter describes the iFind search facility. iFind is an SQL facility for performing text search operations. To use iFind you must define an iFind index for each column containing text that you wish to search. You can then search the text records using a standard SQL query with a WHERE clause containing iFind index syntax.

Note:
To use iFind you must have a Caché license that provides access to iKnow.
Indexing Sources for iFind Search
To perform an iFind search, the column to be searched must have a defined iFind bitmap index. There are three levels of iFind indexes. These levels are defined in nested subclasses. Each index level provides all of the features of the previous level, plus additional iFind features specific to that level. You can create any of the following iFind index types: Basic (%iFind.Index.Basic), Semantic (%iFind.Index.Semantic), or Analytic (%iFind.Index.Analytic).
Each index level supports all of the parameters of the previous level, and adds one or more additional parameters. Unspecified parameters take default values.
By default, iFind text search is not case-sensitive. The LOWER parameter determines whether the query search string is not case-sensitive (LOWER=1, the default) or is case-sensitive (LOWER=0). For example, if LOWER=0, the query search string ‘Pacific’ returns no values when all letters in the texts to be searched are lowercase, whereas the query search string ‘pacific’ returns all match values. If LOWER=1, both ‘pacific’ and ‘Pacific’ return all match values in the texts to be searched.
The following example creates a table with Basic and Semantic indexes on the Narrative column. Please copy this Caché Class Definition example (in the Samples namespace) into Caché Studio and compile it there.
   Class Aviation.TestIFind Extends %Persistent [ ClassType=persistent,
      DdlAllowed,Owner={UnknownUser},ProcedureBlock,SqlRowIdPrivate,
      SqlTableName=TestIFind ]
  { 
  Property UniqueNum As %Integer;
  Property CrashDate As %TimeStamp [ SqlColumnNumber=2 ];
  Property Narrative As %String(MAXLEN=100000) [ SqlColumnNumber=3 ];
  Index NarrBasicIdx On (Narrative) As %iFind.Index.Basic(INDEXOPTION=0,
     LANGUAGE="en",LOWER=1);
  Index NarrSemanticIdx On (Narrative) As %iFind.Index.Semantic(IFINDATTRIBUTES=1);
  Index UniqueNumIdx On UniqueNum [ Type=index,Unique ];
  }
The following example populates the Aviation.TestIFind table from the Aviation.Event table in the Samples namespace. This example inserts a large amount of text, so running it may take a minute or so:
INSERT OR UPDATE INTO Aviation.TestIFind (UniqueNum,CrashDate,Narrative) 
    SELECT %ID,EventDate,NarrativeFull FROM Aviation.Event
 
This example uses INSERT OR UPDATE with a field defined with a unique key to prevent repeated execution from creating duplicate records.
Performing iFind Search
You use iFind syntax in an SQL query WHERE clause to perform a text search for one or more text items. These text items may be words or sequences of words (Basic index) or iKnow semantic entities (Semantic index). Multiple text items are an implicit AND search; all of the specified items must appear in the text, in any order. The syntax for iFind search is as follows:
WHERE %ID %FIND search_index(indexname,'search_items',search_option,language)
indexname is the name of a defined iFind index for a specific column.
search_items is the list of text items (either words or iKnow entities) to search for, enclosed in quotes. Text items are separated by spaces. An item consists of an alphanumeric string and optional wildcard syntax characters. Text items are not case-sensitive.
search_option is the index option integer that specifies the type of search to perform. Available values include 0 (syntax search), 1 (syntax search with stemming), 2 (syntax search with decompounding and stemming), 3 (syntax search with fuzzy search), and 4 (syntax search with regular expressions). If search_option=4, search_items is assumed to contain a single Regular Expression string. For further details, refer to Regular Expressions in Using Caché ObjectScript.
language is the iKnow-supported language model to apply, specified as a two-character string. For example, 'en' specifies English.
When performing a Basic index search, iFind identifies words by the presence of one or more space characters. Sentence punctuation (a period, comma, semicolon, or colon followed by a space) is ignored. iFind treats all other punctuation as literals. For example, iFind treats “touch-and-go” as a single word. Punctuation such as a hyphen or a decimal point in a number is treated as a literal. Quote characters and apostrophes must be specified. You specify a single quote character by doubling it.
Basic index search_items can contain the following syntax:
word1 word2 word3
Specifies that these exact words must appear (in any order) somewhere in the text. (Logical AND)
word1 OR word2 NOT word3
word1 OR (word2 AND word3)
search_items can contain AND, OR, and NOT logical operators. AND is the same as separating words with spaces (implicit AND). NOT is logically equivalent to AND NOT. search_items can also use parentheses to group logical operators. Explicit AND is needed when specifying multiple words in grouping parentheses: (word2 AND word3). If the explicit AND were omitted, (word2 word3) would be interpreted as a phrase. You can use the \ escape character to specify AND, OR, NOT as literals rather than logical operators: \and
*word
word*
*word*
An asterisk wildcard specifies 0 or more non-space characters of any type as a suffix or prefix to the word. An asterisk wildcard cannot be specified within a word. You can use the \ escape character to specify * as a literal: \*
(word1 word2 word3)
Parentheses indicate that the enclosed words form a “phrase”. These exact words must appear sequentially in the specified order. Note that no syntactic analysis is performed; for example, the words in a “phrase” may be the final word of a sentence and the beginning words of the next sentence. An asterisk cannot be used in combination with a phrase.
(word1 ? word3)
A question mark indicates that exactly one word is found between the specified words in a phrase. You can specify multiple question marks, each separated by spaces.
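The phrase and question-mark rules can be illustrated outside of Caché with a short Python sketch. This is not iFind's implementation: the tokenize() and phrase_matches() helpers are hypothetical names introduced here, and the tokenizer only approximates the Basic index rules (split on spaces, drop trailing sentence punctuation, compare in lowercase).

```python
# Illustrative sketch only; tokenize() and phrase_matches() are hypothetical
# helpers, not part of iFind.

SENTENCE_PUNCTUATION = ".,;:"


def tokenize(text):
    """Split on whitespace, drop trailing sentence punctuation, lowercase."""
    words = (w.rstrip(SENTENCE_PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def phrase_matches(phrase, text):
    """Return True if the phrase occurs in text as a run of consecutive words.

    Each "?" in the phrase stands for exactly one word of any value.
    """
    pattern = tokenize(phrase)
    words = tokenize(text)
    if not pattern:
        return False
    size = len(pattern)
    # Slide a window of the phrase length across the words of the text.
    for start in range(len(words) - size + 1):
        window = words[start:start + size]
        if all(p == "?" or p == w for p, w in zip(pattern, window)):
            return True
    return False


if __name__ == "__main__":
    phrase = "wind from ? ? at ? knots"
    print(phrase_matches(phrase, "The wind from the south at 25 knots."))
    print(phrase_matches(phrase, "wind from south at 25 knots"))
```

In this sketch the second call does not match, because only one word appears where the phrase requires exactly two.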
Semantic index search_items can contain the following syntax in addition to the Basic index syntax:
{entity}
Specifies the exact wording of an iKnow entity.
<{entity}
A less-than sign prefix specifies an entity ending with the specified word(s). There must be one or more words in the entity appearing before the specified word(s).
{entity}>
A greater-than sign suffix specifies an entity beginning with the specified word(s). There must be one or more words in the entity appearing after the specified word(s).
Multiple search_items can be specified, separated by spaces. This is an implicit AND test. For example, search_index(NarrSemanticIdx,'<{plug electrodes} (flight plan) instruct*',0) means that a text must include one or more iKnow entities that end with “plug electrodes”, AND the literal phrase “flight plan”, AND the word “instruct” with a wildcard suffix, allowing for “instructor”, “instructors”, “instruction”, “instructed”, etc.
Validating a search_items String
You can use the %iFind.Utils.TestSearchString() method to validate a search_items string. This method enables you to detect syntax errors and ambiguous use of logical operators. For example, "word1 AND word2 OR word3" fails validation because it is logically ambiguous. Adding parentheses clarifies this string to either "word1 AND (word2 OR word3)" or "(word1 AND word2) OR word3". TestSearchString() returns 1 to indicate a valid search_items string.
The following example invokes this iFind utility as an SQL function:
SELECT %iFind.TestSearchString('orange AND (lemon OR lime)')
 
Fuzzy Search
iFind supports fuzzy search to match records containing elements (words or entities) that “almost” match the search string. Fuzzy search can be used to account for small variations in writing (color vs. colour), misspellings (collor vs color), and different grammatical forms (color vs. colors).
iFind evaluates fuzzy matches by comparing the Levenshtein distance between the two words. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. The maximum number of single-character edits allowed for a match is known as the maximum edit distance. The iFind maximum edit distance defaults to 2 characters. The maximum edit distance is separately applied to each element in the search string. For a Basic iFind index, it is applied to each word in the search string. For a Semantic iFind index, it is applied to each iKnow entity in the search string. (The examples that follow assume a Basic iFind index.)
For example, the phrase “analyse programme behaviour” is a fuzzy search match for “analyze program behavior” when maximum edit distance=2, because each word in the search string differs by an edit distance of (at most) 2 characters: analyse=analyze (1 substitution), programme=program (2 deletions), behaviour=behavior (1 deletion).
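The edit distances quoted above can be checked with a minimal Python sketch of the standard Levenshtein dynamic-programming algorithm. This is not iFind's internal code. The levenshtein() and fuzzy_match() names are hypothetical, and fuzzy_match() makes the simplifying assumption that the two strings are compared word by word, in the same order.

```python
# Illustrative sketch only; levenshtein() and fuzzy_match() are hypothetical
# helpers, not iFind functions.

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_match(search, text, max_edit_distance=2):
    """Assumption: compare two strings word by word, applying the maximum
    edit distance separately to each pair of words."""
    search_words = search.lower().split()
    text_words = text.lower().split()
    if len(search_words) != len(text_words):
        return False
    return all(levenshtein(s, t) <= max_edit_distance
               for s, t in zip(search_words, text_words))


if __name__ == "__main__":
    for s, t in [("analyze", "analyse"), ("program", "programme"),
                 ("behavior", "behaviour")]:
        print(s, t, levenshtein(s, t))  # distances 1, 2, 1
    print(fuzzy_match("analyze program behavior",
                      "analyse programme behaviour"))
```

With max_edit_distance=1 the same comparison fails in this sketch, because “program” and “programme” differ by two edits.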
A word whose length is less than or equal to the maximum edit distance is a fuzzy search match for any word with an equal or lesser number of characters. For example, if the edit distance is 2, the word “ab” would match any two-letter word (2 substitutions), any one-letter word (1 substitution, 1 deletion), any three-letter word that has “a” as one of its first two letters or “b” as one of its last two letters (1 substitution, 1 insertion), and any four-letter word containing both “a” and “b” in that order (2 insertions).
To specify fuzzy search for search_index(), specify search_option as 3 for fuzzy search with the default edit distance of 2, or 3:n for fuzzy search with an edit distance specified as n characters. For example, setting 3:1 sets the edit distance=1, which in English is appropriate for matching most (but not all) singular and plural words. Setting 3:0 sets the edit distance=0, which is the same as iFind search without fuzzy search.
To specify fuzzy search for iFind methods, set the pSearchOption = $$$IFSEARCHFUZZY.
Stemming and Decompounding
Basic index, Semantic index, and Analytic index can all support stemming and decompounding. Stemming and decompounding are word-based, not iKnow entity-based operations. To enable an index for stem-aware searches, specify INDEXOPTION=1; to enable both stem-aware searches and decompounding-aware searches, specify INDEXOPTION=2.
Stemming identifies the Stem Form of each word. The stem form unifies multiple grammatical forms of the same word. When using search_option=1 at query time, iFind performs search and match operations using the Stem Form of a word, rather than the actual text form. By using search_option=0, you can use the same index for regular (non-stemmed) searches.
Decompounding divides up compound words into their constituent words. iFind always combines decompounding with stemming; once a word is divided into its constituent parts, each of these parts is automatically stemmed. When using a decompounding search (search_option=2), iFind compares the decompounded stems of the search term with the decompounded stems of the words in the indexed text field. iFind matches a decompounded word only when the stems of its constituent words include the stems of all constituent words of the search term.
For example, the search terms “thunder”, “storm”, or “storms” would all match the word “thunderstorms”. However, the search term “thunderstorms” would not match the word “thunder”, because its other constituent word (“storm”) is not matched.
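The matching rule in this example can be sketched in Python as a subset test on constituent stems. Everything here is a toy stand-in under stated assumptions: the CONSTITUENTS dictionary, the plural-stripping stem() function, and the greedy decompound() splitter are hypothetical and far simpler than iFind's language-specific stemming and decompounding.

```python
# Illustrative sketch only; all names below are hypothetical, not iFind APIs.

# Assumption: a tiny dictionary of known constituent words.
CONSTITUENTS = {"thunder", "storm"}


def stem(word):
    """Toy stemmer: lowercase and strip a trailing plural 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def decompound(word):
    """Return the set of constituent stems of a word.

    Greedily splits the stemmed word into dictionary entries, longest
    prefix first; a word that cannot be fully split stays whole.
    """
    stemmed = stem(word)
    parts = []
    rest = stemmed
    while rest:
        for size in range(len(rest), 0, -1):
            if rest[:size] in CONSTITUENTS:
                parts.append(rest[:size])
                rest = rest[size:]
                break
        else:
            return {stemmed}  # no dictionary prefix: treat as one word
    return set(parts)


def decompound_match(search_term, indexed_word):
    """True when every constituent stem of the search term is also a
    constituent stem of the indexed word."""
    return decompound(search_term) <= decompound(indexed_word)


if __name__ == "__main__":
    for term in ("thunder", "storm", "storms"):
        print(term, decompound_match(term, "thunderstorms"))
    print(decompound_match("thunderstorms", "thunder"))
```

The subset test is what makes the match asymmetric: the shorter search term is covered by the compound, but the compound is not covered by the shorter word.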
The iFind decompounding algorithm uses a language-specific dictionary that identifies possible constituent words. This dictionary should be populated through the %iKnow.Stemming.DecompoundUtils class, for example, by pointing it to a text column prior to indexing. You may also wish to exempt specific words from decompounding. You can exempt individual words, character sequences, and training data word lists from decompounding using %iKnow.Stemming.DecompoundUtils.
Languages Not Supported by the iKnow Engine
You can use iFind Basic indices to index and search texts in languages for which there is no corresponding iKnow language model.
Because stemming is not dependent on iKnow semantic indexing, you can also perform Basic index word searches on stem forms of words, if a stemmer is available. You must specify INDEXOPTION=1 or INDEXOPTION=2 to perform stem searches. For example, Italian is not an iKnow supported language, but Caché provides a %Text stemmer for Italian.
The following limitations and caveats apply to iFind searching for languages not supported by iKnow:
iFind Examples
In the following examples, Basic index iFind syntax can be used with any type of iFind index. Semantic index iFind syntax requires a Semantic or Analytic iFind index.
These examples require that you have created and populated the Aviation.TestIFind table, as described in Indexing Sources for iFind Search earlier in this chapter.
For simplicity of display, these examples return record counts rather than the record text itself. These counts are the number of records that match the search criteria, not the number of matches found in the records. A record may contain multiple matches, but is only counted once.
Basic Syntax Examples
The following example uses Basic index iFind search to search the Aviation.TestIFind table for records that contain at least one instance of each of the words “electrode”, “plug”, and “spark” (in any order):
SELECT COUNT(Narrative) FROM Aviation.TestIFind 
WHERE %ID %FIND search_index(NarrBasicIdx,'electrode plug spark',0)
 
Note that this is a word search, not a string search. Therefore, the following example may return different results, and may actually return more results than the previous example:
SELECT COUNT(Narrative) FROM Aviation.TestIFind 
WHERE %ID %FIND search_index(NarrBasicIdx,'electrodes plug spark',0)
 
The following example uses Basic index iFind to search the Aviation.TestIFind table for records that contain at least one word beginning with “electrode” (electrode, electrodes), and the word phrase “spark plug”:
SELECT COUNT(Narrative) FROM Aviation.TestIFind 
WHERE %ID %FIND search_index(NarrBasicIdx,'electrode* (spark plug)',0)
 
The following two examples use Basic index iFind to search the Aviation.TestIFind table for records that contain the two different word phrases “normal wear” and “"normal" wear”:
SELECT COUNT(Narrative) FROM Aviation.TestIFind 
WHERE %ID %FIND search_index(NarrBasicIdx,'(normal wear)',0)
 
SELECT COUNT(Narrative) FROM Aviation.TestIFind 
WHERE %ID %FIND search_index(NarrBasicIdx,'("normal" wear)',0)
 
The following example uses Basic index iFind to search the Aviation.TestIFind table for records that contain at least one word containing the string “seal” (seal, seals, unseal, sealant, sealed, previously-sealed), and the word phrase “spark plug”:
SELECT COUNT(Narrative) FROM Aviation.TestIFind 
WHERE %ID %FIND search_index(NarrBasicIdx,'*seal* (spark plug)',0)
 
The following example uses Basic index iFind to search the Aviation.TestIFind table for records that contain the wildcard phrase “wind from ? ? at ? knots.” Possible values might include “wind from the south at 25 knots” and “wind from 300 degrees at ten knots.” Note that there must be a space between ? wildcards:
SELECT COUNT(Narrative) FROM Aviation.TestIFind 
WHERE %ID %FIND search_index(NarrBasicIdx,'(wind from ? ? at ? knots)',0)
 
The following example uses Basic index iFind with Regular Expression search (search_option=4). It searches for records that contain occurrences of strings specifying dates between “January 10” and “January 29” inclusive:
SELECT COUNT(Narrative) FROM Aviation.TestIFind 
WHERE %ID %FIND search_index(NarrBasicIdx,'January [1-2][0-9]',4)
 
For further details, refer to Regular Expressions in Using Caché ObjectScript.
Semantic Syntax Examples
The following example uses Semantic index iFind search. It searches the Aviation.TestIFind table for records that contain the iKnow entity “spark plug electrodes”:
SELECT COUNT(Narrative) FROM Aviation.TestIFind 
WHERE %ID %FIND search_index(NarrSemanticIdx,'{spark plug electrodes}',0)
 
The following example uses Semantic index iFind search. It searches the Aviation.TestIFind table for records that contain an iKnow entity ending with “spark plugs”:
SELECT COUNT(Narrative) FROM Aviation.TestIFind 
WHERE %ID %FIND search_index(NarrSemanticIdx,'<{spark plugs}',0)
 
The following example uses Semantic index iFind search. It searches the Aviation.TestIFind table for records that contain both an iKnow entity ending with “spark plugs” and the iKnow entity “spark plugs”:
SELECT COUNT(Narrative) FROM Aviation.TestIFind 
WHERE %ID %FIND search_index(NarrSemanticIdx,'<{spark plugs} {spark plugs}',0)