iKnow Implementation

The iKnow semantic analysis engine is a fully-integrated component of Caché. No separate installation is required. No configuration changes are needed.

Your ability to use iKnow is governed by the Caché license. Most standard Caché licenses provide access to iKnow. All Caché licenses that support iKnow provide full and unlimited use of all components of iKnow.

iKnow is provided as a collection of APIs containing class object methods and properties which may be invoked from ObjectScript programs. APIs are provided to invoke iKnow operations from Caché (API classes). Equivalent APIs are provided to invoke iKnow operations from SQL (QAPI classes) and SOAP web services (WSAPI classes). These APIs are described in the %iKnow package in the InterSystems Class Reference. iKnow is a core Caché technology and therefore does not have application-like interfaces. However, iKnow does provide a few generic, sample output interfaces in the %iKnow.UI package.

To use iKnow you must define an iKnow domain within a Caché namespace. A Caché namespace can contain multiple iKnow domains. All iKnow processing occurs within a specified domain. A set of iKnow indexed text sources is created within a domain. All iKnow queries and other text processing must specify the domain in which to access this data.

A Note on Program Examples

Many of the program examples in this manual begin by deleting a domain (or its data) and then loading all data from the original text files into an empty domain. For the purpose of these examples, this guarantees that the iKnow indexed source data is an exact match with the contents of the file(s) or SQL table(s) from which it was loaded.

This delete/reload methodology is not recommended for real-world applications processing large numbers of text sources. Instead, you should perform the time-consuming load of all sources for a domain once. You can then add or delete individual sources to keep the indexed source data current with the contents of the original text files or SQL tables.

A Note on %Persistent Object Methods

iKnow supports standard %Persistent object methods for creating and deleting object instances such as domains, configurations, and so forth. These %Persistent method names begin with a % character, such as %New(). Use of %Persistent object methods is preferable to using older non-persistent methods, such as Create(). Users are encouraged to use the %Persistent object methods for new code. Program examples throughout this documentation have been revised to use these preferred %Persistent methods.

Note that an object instance created with the %New() persistent method must then be saved by calling the %Save() method. The older Create() method does not require a separate save operation.

A Note on %iKnow and %SYSTEM.iKnow

Throughout this documentation, all classes referred to are located in the %iKnow package. However, the %SYSTEM.iKnow class also contains a number of iKnow utilities that can be used to simplify coding of common iKnow operations. These utilities are provided as shortcuts; all of the operations performed by %SYSTEM.iKnow class methods can also be performed by %iKnow package APIs.

You can display information about %SYSTEM.iKnow class methods by using the Help() method. To display information about all %SYSTEM.iKnow methods, invoke %SYSTEM.iKnow.Help(""). To display information about a specific %SYSTEM.iKnow method, supply the method name to the Help() method, as shown in the following example:

  DO ##class(%SYSTEM.iKnow).Help("IndexDirectory")

For further details refer to the InterSystems Class Reference.

Space Requirements and iKnow Globals

iKnow globals in a namespace have the following prefix: ^ISC.IK:

^ISC.IK.* are the final globals: the permanent globals that contain iKnow data. This iKnow data is roughly 20 times the size of the original source texts.
^ISC.IKS.* are the staging globals. During data loading these can grow to 16 times the size of the original source texts. Staging globals should be mapped to a non-journaled database. iKnow automatically deletes these staging globals once source loading and processing is completed.
^ISC.IKT.* are the temp globals. During data loading these can grow to 4 times the size of the original source texts. Temp globals should be mapped to a non-journaled database. iKnow automatically deletes these temp globals once source loading and processing is completed.
^ISC.IKL.* are logging globals. These are optional and their size is negligible.

Caution:

These globals are for internal use only. Under no circumstances should iKnow users attempt to directly interact with iKnow globals.

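For example, if you are loading 30GB of source documents, you will need 600GB (20 times 30GB) of permanent iKnow data storage. During data loading you will need 1,200GB (about 1.17TB) of available space: 600GB for permanent globals, 480GB for staging globals, and 120GB for temp globals. The 600GB used by the staging and temp globals is automatically released once iKnow indexing completes.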

In addition, the cachetemp subdirectory in the Mgr directory may grow to 4 times the size of the original source texts for the duration of file loading and indexing.
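The multipliers given in this section can be turned into a quick sizing estimate. The following Python sketch is illustrative only: the function and constant names are hypothetical and not part of iKnow, and the multipliers are the approximate figures stated above, so treat the results as rough upper bounds rather than exact values.

```python
# Multipliers taken from the text above; all sizes are in GB.
# The names below are illustrative and are not iKnow identifiers.
PERMANENT_FACTOR = 20   # ^ISC.IK.*  (kept after loading)
STAGING_FACTOR = 16     # ^ISC.IKS.* (deleted after loading)
TEMP_FACTOR = 4         # ^ISC.IKT.* (deleted after loading)
CACHETEMP_FACTOR = 4    # possible cachetemp growth during loading


def estimate_space_gb(source_gb):
    """Return estimated space needs (in GB) for a given source size."""
    permanent = source_gb * PERMANENT_FACTOR
    staging = source_gb * STAGING_FACTOR
    temp = source_gb * TEMP_FACTOR
    return {
        "permanent": permanent,
        "released_after_load": staging + temp,
        "peak_during_load": permanent + staging + temp,
        "cachetemp_max": source_gb * CACHETEMP_FACTOR,
    }


estimate = estimate_space_gb(30)
for name, size in estimate.items():
    print(f"{name}: {size} GB")
```

For 30GB of source text this reproduces the figures in the example above: 600GB permanent, 1,200GB (about 1.17TB) at peak, and 600GB released after loading, with up to 120GB of additional cachetemp growth.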

These space requirements apply when you create a domain in Caché 2012.2 and load iKnow data, or when you upgrade a domain created in Caché 2012.1 (iKnow version 1) to support Caché 2012.2 (iKnow version 2) features. They do not apply to existing domains created and loaded with iKnow data in Caché 2012.1, or to any data added to a 2012.1 domain in 2012.2. Caché 2012.1 domains have smaller space requirements (and support fewer features), as described in Upgrading iKnow Data in the “iKnow Tools” chapter.

Caché 2013.1 (iKnow version 3) data space requirements are not significantly larger than those for Caché 2012.2.

You should increase the size of the Caché global buffer, based on the size of the original source texts. Refer to the “Performance Considerations when Loading Texts” chapter in this manual.

Batch Load Space Allocation

Caché allocates 256MB of additional memory for each iKnow job to handle batch loading of source texts. By default, iKnow allocates one job for each processor core on your system. The $$$IKPJOBS domain parameter establishes the number of iKnow jobs; generally the default setting gives optimal results. However, it is recommended that the maximum number of iKnow jobs should be either 16 or the number of processor cores, whichever is smaller.
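As a rough planning aid, the job count and batch-load memory described above can be worked out as follows. This Python sketch is an illustration under stated assumptions: the function names are hypothetical, it does not read or set the $$$IKPJOBS parameter, and os.cpu_count() reports logical processors, which may differ from the core count that iKnow uses.

```python
import os

MEMORY_PER_JOB_MB = 256    # per-job batch-load allocation stated above
RECOMMENDED_MAX_JOBS = 16  # recommended ceiling stated above


def recommended_jobs(cores):
    """Smaller of the core count and the recommended ceiling of 16."""
    return min(cores, RECOMMENDED_MAX_JOBS)


def batch_load_memory_mb(jobs):
    """Additional memory (in MB) allocated for the given number of jobs."""
    return jobs * MEMORY_PER_JOB_MB


# Assumption: logical CPU count stands in for the number of cores.
cores = os.cpu_count() or 1
jobs = recommended_jobs(cores)
print(f"{cores} cores -> {jobs} jobs -> {batch_load_memory_mb(jobs)} MB")
```

For instance, a 32-core system capped at 16 jobs would need 16 times 256MB, or 4,096MB, of additional memory for batch loading.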

Input Data

iKnow is used to analyze unstructured data. Commonly, this data consists of multiple text sources, often a large number of texts. A text source can be of any type, including the following:

A file on disk that contains unstructured text data. For example, a .txt file.
A record in an SQL result set with one or more fields that contain unstructured text data.
An RSS web feed containing unstructured text data.
A Caché global containing unstructured text data.

iKnow does not modify the original text sources, nor does it create a copy of these text sources. Instead, iKnow stores its analysis of the original text source as normalized and indexed items, assigning an Id to each item that permits iKnow to reference its source. Separate Ids are assigned to items at each level: source, sentence, path, CRC, and entity.

iKnow supports texts in the following languages: Dutch (nl), English (en), French (fr), German (de), Japanese (ja), Portuguese (pt), Russian (ru), Spanish (es), Swedish (sv), and Ukrainian (uk). You do not have to specify what language your texts contain, nor must all of your texts or all of the sentences in an individual text be in the same language. iKnow can automatically identify the language of each sentence of each text and apply the appropriate language model to that sentence. You can define an iKnow configuration that specifies the language(s) that your texts contain, and whether or not to perform automatic language identification. Use of an iKnow configuration can significantly improve iKnow performance.

You do not have to specify a genre for text content (such as medical notes or newspaper articles); iKnow automatically handles texts of any content type.

File Formats

iKnow accepts source files of any format and with any extension (suffix). By default, iKnow assumes that a source text file consists of unformatted text (for example, a .txt file). It will process source files with other formatting (for example, .rtf, .doc) but may treat some formatting elements as text. To avoid this, you can either convert your source files to .txt files and load these .txt files, or you can create an iKnow converter to remove formatting from source text during iKnow loading.

You specify the list of file extensions as a Lister parameter. Only files with these extensions will be loaded. For example, this list of file extensions can be specified as an AddListToBatch() method parameter.

SQL Record Format

iKnow accepts records from an SQL result set as sources. iKnow generates a unique integer value for each record as the iKnow source Id. iKnow allows you to specify an SQL field containing unique values which iKnow uses to construct the external Id for the source records. Note that the iKnow source Id is assigned by iKnow; it is not the external Id, though frequently both are unique integers. Commonly, the iKnow source text is taken from only some of the fields of the result set record, often from a single field containing unstructured text data. iKnow can ignore the other fields in the record, or use their values as metadata to filter (include or exclude) or to annotate the source.

Text Normalization

iKnow maintains links to the original source text. This enables it to return a sentence with its original capitalization, punctuation, and so forth. Within iKnow, normalization operations are performed on entities to facilitate matching:

Capitalization is ignored. iKnow matching is not case-sensitive. Entity values are returned in all lowercase letters.
Extra spaces are ignored. iKnow treats all words as being separated by a single space.
Multiple periods (...) are reduced to a single period, which iKnow treats as a sentence termination character.
Most punctuation is used by the language model to identify sentences, concepts and relations, then discarded from further analysis. Punctuation is generally not preserved within entities. Most punctuation is only preserved in an entity when there are no spaces before or after the punctuation mark. However, the slash (/) and the at sign (@) are preserved in an entity with or without surrounding spaces.
Certain language-specific letter characters are normalized. For example, the German eszett (“ß”) character is normalized as “ss”.

The iKnow engine automatically performs text normalization when a source text is indexed. iKnow also automatically performs text normalization of dictionary terms and items.

You can also perform iKnow text normalization on a string, independent of any iKnow data loading, by using the Normalize() or NormalizeWithParams() methods. This is shown in the following example:

   SET mystring="Stately plump Buck Mulligan   ascended the StairHead,  bearing a shaving bowl"
   SET normstring=##class(%iKnow.Configuration).NormalizeWithParams(mystring)
   WRITE normstring
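For intuition about the normalization rules listed earlier in this section, the following Python sketch approximates four of them: case folding, collapsing extra spaces, reducing multiple periods to one, and rewriting the German eszett. It is not iKnow's algorithm and does not call any iKnow API; the function name is hypothetical, and the punctuation rules and language model are deliberately left out, so its output can differ from what the Normalize() methods return.

```python
import re


def approximate_normalize(text):
    """Rough approximation of four of the normalization rules.

    Not iKnow's implementation: punctuation handling and the
    language model are deliberately omitted.
    """
    text = text.lower()                       # capitalization is ignored
    text = text.replace("ß", "ss")            # eszett is rewritten as "ss"
    text = re.sub(r"\.{2,}", ".", text)       # multiple periods become one
    text = re.sub(r"\s+", " ", text).strip()  # extra spaces collapse to one
    return text


print(approximate_normalize("Stately plump Buck Mulligan   ascended"))
```

Run on its own, the sketch lowercases the sample phrase and reduces the run of three spaces to a single space.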

User-defined Source Normalization

The user can define several types of tools for source normalization:

Converters process source text to remove formatting tags and other non-text content during loading.
UserDictionary enables the user to specify how to rewrite or use specific input text content elements during loading. For example, UserDictionary can specify substitutions for known abbreviations and acronyms. It is commonly used to standardize text by eliminating variants and synonyms. It can also be used to specify text-specific exceptions to standard iKnow processing of punctuation.

Output Structures

iKnow creates global structures to store the results of its operations. These global structures are intended for use by iKnow class APIs only. They are not user-visible and should not be modified by the user.

iKnow indexed data is stored as Caché list structures. Each iKnow list structure contains a generated ID for that item, a unique integer value. iKnow entities can be accessed either by value or by integer ID.

iKnow preserves the relationships among indexed entities, so that each entity can reference the entities related to it, the path of that sequence of entities, the original sentence that contains that path, and the location of that sentence within its source text. The original source text is always available for access from iKnow. iKnow operations never change the original source text.

Constants

iKnow defines constant values in the %IKPublic.inc file. After specifying this include file, you can invoke these constants using the $$$ macro invocation, as shown in the following example:

#include %IKPublic
  WRITE "The $$$FILTERONLY constant=",$$$FILTERONLY

These constants include domain parameter names, query parameter values, and other constants.

Error Codes

The General Error Codes 8000-8099 are reserved for use by iKnow. For further details, refer to General Error Messages in the Caché Error Reference.