NLP Implementation

Important:

InterSystems has deprecated InterSystems IRIS® Natural Language Processing (NLP). It may be removed from future versions of InterSystems products. The following documentation is provided as reference for existing users only. Existing users who would like assistance identifying an alternative solution should contact the WRC.

The NLP semantic analysis engine is a fully-integrated component of InterSystems IRIS data platform. No separate installation is required. No configuration changes are needed.

Your ability to use NLP is governed by the InterSystems IRIS license. Most standard InterSystems IRIS licenses provide access to NLP. All InterSystems IRIS licenses that support NLP provide full and unlimited use of all components of NLP.

NLP is provided as a collection of APIs containing class object methods and properties which may be invoked from ObjectScript programs. APIs are provided to invoke NLP operations from InterSystems IRIS (API classes). Equivalent APIs are provided to invoke NLP operations from SQL (QAPI classes) and SOAP web services (WSAPI classes). These APIs are described in the %iKnow package in the InterSystems Class Reference. NLP is a core InterSystems IRIS technology and therefore does not have application-like interfaces. However, NLP does provide a few generic, sample output interfaces in the %iKnow.UI package.

To analyze text sources using InterSystems IRIS NLP you must define an NLP domain within an InterSystems IRIS namespace and load the texts into the domain. You can create multiple domains (for multiple contexts) within a single namespace. When you build the domain, NLP generates indexes for the set of text sources within it. All NLP queries and other text processing must specify the domain in which to access this data.

A Note on Program Examples

Many examples in this documentation use data from the Aviation.Event SQL table. If you wish to use this sample data, it is available at https://github.com/intersystems/Samples-Aviation. (You do not need to know anything about GitHub or have a GitHub account.) Locate the contents of the README.md file, which appears below the filenames and directories included in the GitHub repository. Scroll to the “Setup instructions” section at the bottom of the README.md file and complete the steps.

Many of the program examples in this manual begin by deleting a domain (or its data) and then loading all data from the original text files into an empty domain. For the purpose of these examples, this guarantees that the NLP indexed source data is an exact match with the contents of the file(s) or SQL table(s) from which it was loaded.

This delete/reload methodology is not recommended for real-world applications processing large numbers of text sources. Instead, you should perform the time-consuming load of all sources for a domain once. You can then add or delete individual sources to keep the indexed source data current with the contents of the original text files or SQL tables.

A Note on %Persistent Object Methods

NLP supports the standard %Persistent object methods for creating and deleting object instances such as domains and configurations. The names of these %Persistent methods begin with a % character, such as %New(). The %Persistent object methods are preferred over the older non-persistent methods, such as Create(), and should be used for new code. Program examples throughout this documentation have been revised to use these preferred %Persistent methods.

Note that the %New() persistent method requires a %Save() method. The older Create() method does not require a separate save operation.
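As a minimal sketch of the %New() and %Save() pattern, the following creates a domain only if no domain with that name already exists. The domain name "mydomain" is a placeholder chosen for illustration, and the error handling shown is one reasonable choice rather than a required form:

```objectscript
  SET domn="mydomain"   // hypothetical domain name
  IF (##class(%iKnow.Domain).NameIndexExists(domn)) {
     // Reuse the existing domain
     SET domo=##class(%iKnow.Domain).NameIndexOpen(domn)
  }
  ELSE {
     // %New() only creates the object in memory; %Save() persists it
     SET domo=##class(%iKnow.Domain).%New(domn)
     SET stat=domo.%Save()
     IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
  }
  WRITE "Domain Id: ",domo.Id,!
```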

A Note on %iKnow and %SYSTEM.iKnow

Throughout this documentation, all NLP classes referred to are located in the %iKnow package. However, the %SYSTEM.iKnow class also contains a number of NLP utilities that can be used to simplify coding of common NLP operations. These utilities are provided as shortcuts; all of the operations performed by %SYSTEM.iKnow class methods can also be performed by %iKnow package APIs.

You can display information about %SYSTEM.iKnow class methods by using the Help() method. To display information about all %SYSTEM.iKnow methods, invoke %SYSTEM.iKnow.Help(""); to display information about a specific %SYSTEM.iKnow method, supply the method name to the Help() method, as shown in the following example:

  DO ##class(%SYSTEM.iKnow).Help("IndexDirectory")

For further details refer to the InterSystems Class Reference.

Space Requirements and NLP Globals

NLP globals in a namespace have the following prefix: ^IRIS.IK:

  • ^IRIS.IK.* are the final, permanent globals that contain NLP data. This NLP data is roughly 20 times the size of the original source texts.

  • ^IRIS.IKS.* are the staging globals. During data loading these can grow to 16 times the size of the original source texts. Staging globals should be mapped to a non-journaled database. NLP automatically deletes these staging globals once source loading and processing is completed.

  • ^IRIS.IKT.* are the temp globals. During data loading these can grow to 4 times the size of the original source texts. Temp globals should be mapped to a non-journaled database. NLP automatically deletes these temp globals once source loading and processing is completed.

  • ^IRIS.IKL.* are logging globals. These are optional and their size is negligible.

Caution:

These globals are for internal use only. Under no circumstances should NLP users attempt to directly interact with NLP globals.

For example, if you are loading 30GB of source documents, you will need 600GB of permanent NLP data storage. During data loading you will need 1.17TB of available space (600GB permanent, 480GB staging, and 120GB temp), 600GB of which will be automatically released once NLP indexing completes.
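The same estimate can be worked through for any source size. The following sketch applies the multipliers listed above (20, 16, and 4 times the source size); the variable names are illustrative only, and the multipliers are approximations rather than guarantees:

```objectscript
  SET srcGB=30                  // size of the original source texts, in GB
  SET permGB=srcGB*20           // ^IRIS.IK.* permanent globals
  SET stageGB=srcGB*16          // ^IRIS.IKS.* staging globals (peak during load)
  SET tempGB=srcGB*4            // ^IRIS.IKT.* temp globals (peak during load)
  SET peakGB=permGB+stageGB+tempGB
  WRITE "Permanent storage: ",permGB," GB",!
  WRITE "Peak during load: ",peakGB," GB (",$FNUMBER(peakGB/1024,"",2)," TB)",!
  WRITE "Released after load: ",stageGB+tempGB," GB",!
```

For a 30GB source set this reports 600 GB permanent, a 1200 GB (1.17 TB) peak, and 600 GB released.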

In addition, the iristemp subdirectory in the Mgr directory may grow to 4 times the size of the original source texts for the duration of file loading and indexing.

You should increase the size of the InterSystems IRIS global buffer, based on the size of the original source texts. Refer to the “Performance Considerations when Loading Texts” chapter in this manual.

Batch Load Space Allocation

InterSystems IRIS allocates 256MB of additional memory for each NLP job to handle batch loading of source texts. By default, NLP allocates one job for each processor core on your system. The $$$IKPJOBS domain parameter establishes the number of NLP jobs; generally the default setting gives optimal results. However, the recommended maximum number of NLP jobs is either 16 or the number of processor cores, whichever is smaller.
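As a rough illustration of this guideline, the sketch below caps the job count at 16 and estimates the additional batch-load memory. The core count is a hypothetical value supplied by hand, not one read from the system:

```objectscript
  SET cores=24                            // hypothetical number of processor cores
  SET jobs=$SELECT(cores<16:cores,1:16)   // the smaller of 16 and the core count
  SET memMB=jobs*256                      // 256MB of additional memory per NLP job
  WRITE "NLP jobs: ",jobs,!
  WRITE "Additional batch-load memory: ",memMB," MB",!
```

With 24 cores this gives 16 jobs and 4096 MB of additional memory.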

Input Data

NLP is used to analyze unstructured data. Commonly, this data consists of multiple text sources, often a large number of texts. A text source can be of any type, including the following:

  • A file on disk that contains unstructured text data. For example, a .txt file.

  • A record in an SQL result set with one or more fields that contain unstructured text data.

  • An RSS web feed containing unstructured text data.

  • An InterSystems IRIS global containing unstructured text data.

NLP does not modify the original text sources, nor does it create a copy of these text sources. Instead, NLP stores its analysis of the original text source as normalized and indexed items, assigning an Id to each item that permits NLP to reference its source. Separate Ids are assigned to items at each level: source, sentence, path, CRC, and entity.

NLP supports texts in the following languages: Czech (cs), Dutch (nl), English (en), French (fr), German (de), Japanese (ja), Portuguese (pt), Russian (ru), Spanish (es), Swedish (sv), and Ukrainian (uk). You do not have to specify what language your texts contain, nor must all of your texts or all of the sentences in an individual text be in the same language. NLP can automatically identify the language of each sentence of each text and applies the appropriate language model to that sentence. You can define an NLP configuration that specifies the language(s) that your texts contain, and whether or not to perform automatic language identification. Use of an NLP configuration can significantly improve NLP performance.
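A configuration of this kind might be defined as sketched below. The configuration name "EnFrConfig" is hypothetical, and the sketch assumes that the %iKnow.Configuration constructor takes the configuration name, an automatic language identification flag, and a list of language codes, in that order; confirm the signature in the InterSystems Class Reference before relying on it:

```objectscript
  // Assumed argument order: name, automatic language identification flag, languages
  SET cfg=##class(%iKnow.Configuration).%New("EnFrConfig",1,$LB("en","fr"))
  SET stat=cfg.%Save()
  IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
```

Restricting the language list to the languages the texts actually contain limits the language models that automatic identification has to consider.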

You do not have to specify a genre for text content (such as medical notes or newspaper articles); NLP automatically handles texts of any content type.

File Formats

NLP accepts source files of any format and with any extension (suffix). By default, NLP assumes that a source text file consists of unformatted text (for example, a .txt file). It will process source files with other formatting (for example, .rtf, .doc) but may treat some formatting elements as text. To avoid this, you can either convert your source files to .txt files and load these .txt files, or you can create an NLP converter to remove formatting from source text during NLP loading.

You specify the list of file extensions as a Lister parameter. Only files with these extensions will be loaded. For example, this list of file extensions can be specified as an AddListToBatch() method parameter.
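The following sketch shows one way the extension list might be passed when loading files. It assumes that a domain named "mydomain" already exists and that the directory C:\mytextfiles contains the source files; both names are placeholders:

```objectscript
  SET domId=##class(%iKnow.Domain).NameIndexOpen("mydomain").Id   // assumes the domain exists
  SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
  // Second argument: list of file extensions to load (here, only .txt files)
  SET stat=flister.AddListToBatch("C:\mytextfiles",$LB("txt"),0,"")
  IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
  SET stat=myloader.ProcessBatch()
  IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
```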

SQL Record Format

NLP accepts records from an SQL result set as sources. NLP generates a unique integer value for each record as the NLP source Id. NLP allows you to specify an SQL field containing unique values which NLP uses to construct the external Id for the source records. Note that the NLP source Id is assigned by NLP; it is not the external Id, though frequently both are unique integers. Commonly, the NLP source text is taken from only some of the fields of the result set record, often from a single field containing unstructured text data. NLP can ignore the other fields in the record, or use their values as metadata to filter (include or exclude) or to annotate the source.
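As a sketch of loading SQL records, the example below takes its text from a single field of the Aviation.Event sample table. It assumes an existing domain named "mydomain" (a placeholder) and assumes that the table contains the Type and NarrativeFull fields:

```objectscript
  SET domId=##class(%iKnow.Domain).NameIndexOpen("mydomain").Id   // assumes the domain exists
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
  SET myquery="SELECT TOP 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
  SET idfld="UniqueVal"              // unique values used to build the external Id
  SET grpfld="Type"                  // grouping field
  SET dataflds=$LB("NarrativeFull")  // field(s) holding the unstructured text
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
  IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
  SET stat=myloader.ProcessBatch()
  IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
```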

Text Normalization

NLP maintains links to the original source text. This enables it to return a sentence with its original capitalization, punctuation, and so forth. Within NLP, normalization operations are performed on entities to facilitate matching:

  • Capitalization is ignored. NLP matching is not case-sensitive. Entity values are returned in all lowercase letters.

  • Extra spaces are ignored. NLP treats all words as being separated by a single space.

  • Multiple periods (...) are reduced to a single period, which NLP treats as a sentence termination character.

  • Most punctuation is used by the language model to identify sentences, concepts, and relations, then discarded from further analysis. Punctuation is generally not preserved within entities. Most punctuation is only preserved in an entity when there are no spaces before or after the punctuation mark. However, the slash (/) and the at sign (@) are preserved in an entity with or without surrounding spaces.

  • Certain language-specific letter characters are normalized. For example, the German eszett (“ß”) character is normalized as “ss”.

The NLP engine automatically performs text normalization when a source text is indexed. NLP also automatically performs text normalization of dictionary terms and items.

You can also perform NLP text normalization on a string, independent of any NLP data loading, by using the Normalize() or NormalizeWithParams() methods. This is shown in the following example:

   // Input with mixed capitalization and extra spaces
   SET mystring="Stately plump Buck Mulligan   ascended the StairHead,  bearing a shaving bowl"
   // Normalize the string without loading it into a domain
   SET normstring=##class(%iKnow.Configuration).NormalizeWithParams(mystring)
   WRITE normstring

User-defined Source Normalization

The user can define several types of tools for source normalization:

  • Converters process source text to remove formatting tags and other non-text content during loading.

  • UserDictionary enables the user to specify how to rewrite or use specific input text content elements during loading. For example, UserDictionary can specify substitutions for known abbreviations and acronyms. It is commonly used to standardize text by eliminating variants and synonyms. It can also be used to specify text-specific exceptions to standard NLP processing of punctuation.

Output Structures

NLP creates global structures to store the results of its operations. These global structures are intended for use by NLP class APIs only. They are not user-visible and should not be modified by the user.

NLP indexed data is stored as InterSystems IRIS list structures. Each NLP list structure contains a generated ID for that item, a unique integer value. NLP entities can be accessed either by value or by integer ID.

NLP preserves the relationships among indexed entities, so that each entity can reference the entities related to it, the path of that sequence of entities, the original sentence that contains that path, and the location of that sentence within its source text. The original source text is always available for access from NLP. NLP operations never change the original source text.

Constants

NLP defines constant values in the %IKPublic.inc file. After specifying this include file, you can invoke these constants using the $$$ macro invocation, as shown in the following example:

#include %IKPublic
  WRITE "The $$$FILTERONLY constant=",$$$FILTERONLY

These constants include domain parameter names, query parameter values, and other constants.

Error Codes

The General Error Codes 8000-8099 are reserved for use by NLP. For further details, refer to General Error Messages in the InterSystems IRIS Error Reference.
