Using InterSystems IRIS Natural Language Processing (NLP)
The NLP semantic analysis engine is a fully-integrated component of InterSystems IRIS Data Platform™. No separate installation is required. No configuration changes are needed.
Your ability to use NLP is governed by the InterSystems IRIS license. Most standard InterSystems IRIS licenses provide access to NLP. All InterSystems IRIS licenses that support NLP provide full and unlimited use of all components of NLP.
NLP is provided as a collection of APIs containing class object methods and properties which may be invoked from ObjectScript programs. APIs are provided to invoke NLP operations from InterSystems IRIS (API classes). Equivalent APIs are provided to invoke NLP operations from SQL (QAPI classes) and SOAP web services (WSAPI classes). These APIs are described in the %iKnow package in the InterSystems Class Reference.
NLP is a core InterSystems IRIS technology and therefore does not have application-like interfaces. However, NLP does provide a few generic, sample output interfaces in the %iKnow.UI package.
To use NLP, you must define an NLP domain within an InterSystems IRIS namespace. A namespace can contain multiple NLP domains. All NLP processing occurs within a specified domain. A set of NLP indexed text sources is created within a domain. All NLP queries and other text processing must specify the domain in which to access this data.
Many of the program examples in this manual begin by deleting a domain (or its data) and then loading all data from the original text files into an empty domain. For the purpose of these examples, this guarantees that the NLP indexed source data is an exact match with the contents of the file(s) or SQL table(s) from which it was loaded.
This delete/reload methodology is not recommended for real-world applications processing large numbers of text sources. Instead, you should perform the time-consuming load of all sources for a domain once. You can then add or delete individual sources to keep the indexed source data current with the contents of the original text files or SQL tables.
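The incremental approach described above amounts to comparing what is already indexed with what currently exists at the origin. The following is a minimal Python sketch of that comparison only. The function name plan_sync and the use of plain string identifiers for sources are assumptions made for illustration; they are not part of the NLP API, and the actual adding and deleting of sources would still be done through NLP methods.

```python
def plan_sync(indexed_ids, current_ids):
    """Compare indexed sources with the sources that currently exist.

    indexed_ids: identifiers of sources already indexed in the domain
    current_ids: identifiers of the original files or table rows today
    Returns (to_add, to_delete) as sorted lists.
    """
    indexed = set(indexed_ids)
    current = set(current_ids)
    to_add = sorted(current - indexed)     # new at the origin, not yet indexed
    to_delete = sorted(indexed - current)  # indexed, but gone from the origin
    return to_add, to_delete


# Hypothetical identifiers, for illustration only.
to_add, to_delete = plan_sync(["a.txt", "b.txt"], ["b.txt", "c.txt"])
print("add:", to_add)
print("delete:", to_delete)
```

Sources whose content changed but whose identifier did not are not detected by this sketch; handling them would require comparing a timestamp or checksum as well.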
NLP supports standard %Persistent object methods for creating and deleting object instances such as domains, configurations, and so forth. These %Persistent method names begin with a % character, such as %New(). Use of %Persistent object methods is preferable to using older non-persistent methods, such as Create(), and is encouraged for all new code. Program examples throughout this documentation have been revised to use these preferred %Persistent methods.
Note that an object created with the %New() persistent method must then be saved by calling the %Save() method. The older Create() method does not require a separate save operation.
Throughout this documentation, all NLP classes referred to are located in the %iKnow package. However, the %SYSTEM.iKnow class also contains a number of NLP utilities that can be used to simplify coding of common NLP operations. These utilities are provided as shortcuts; all of the operations performed by %SYSTEM.iKnow class methods can also be performed by %iKnow package class methods.
NLP globals in a namespace have the following prefix: ^IRIS.IK:
^IRIS.IK.* are the final, permanent globals that contain NLP data. This NLP data is roughly 20 times the size of the original source texts.
^IRIS.IKS.* are the staging globals. During data loading these can grow to 16 times the size of the original source texts. Staging globals should be mapped to a non-journaled database. NLP automatically deletes these staging globals once source loading and processing is completed.
^IRIS.IKT.* are the temp globals. During data loading these can grow to 4 times the size of the original source texts. Temp globals should be mapped to a non-journaled database. NLP automatically deletes these temp globals once source loading and processing is completed.
^IRIS.IKL.* are logging globals. These are optional and their size is negligible.
These globals are for internal use only. Under no circumstances should NLP users attempt to directly interact with NLP globals.
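For example, if you are loading 30GB of source documents, you will need roughly 600GB of permanent NLP data storage. During data loading you will need up to 1.17TB (1200GB) of available space: 600GB for permanent data, 480GB for staging globals, and 120GB for temp globals. The 600GB used by the staging and temp globals is automatically released once NLP indexing completes.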
In addition, the iristemp subdirectory in the Mgr directory may grow to 4 times the size of the original source texts for the duration of file loading and indexing.
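The sizing arithmetic above can be sketched as a small Python calculation. The multipliers are the ones stated in the text; the function and key names are hypothetical, and the results are rough upper bounds for planning rather than exact predictions.

```python
# Multipliers relative to the size of the original source texts,
# as stated in the text above.
PERMANENT_FACTOR = 20  # ^IRIS.IK.*  final globals (approximate)
STAGING_FACTOR = 16    # ^IRIS.IKS.* staging globals (maximum during load)
TEMP_FACTOR = 4        # ^IRIS.IKT.* temp globals (maximum during load)
IRISTEMP_FACTOR = 4    # iristemp subdirectory (maximum during load)


def estimate_storage_gb(source_gb):
    """Estimate NLP disk space needs, in GB, for source_gb of source text."""
    permanent = source_gb * PERMANENT_FACTOR
    staging = source_gb * STAGING_FACTOR
    temp = source_gb * TEMP_FACTOR
    return {
        "permanent_gb": permanent,
        "released_after_load_gb": staging + temp,
        "peak_globals_gb": permanent + staging + temp,
        "iristemp_gb": source_gb * IRISTEMP_FACTOR,
    }


estimate = estimate_storage_gb(30)
print(estimate)
print(round(estimate["peak_globals_gb"] / 1024, 2), "TB peak for globals")
```

For 30GB of source text this reproduces the figures in the example: 600GB permanent, 1200GB (about 1.17TB) at peak, 600GB released after loading, plus up to 120GB in iristemp.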
InterSystems IRIS allocates 256MB of additional memory for each NLP job to handle batch loading of source texts. By default, NLP allocates one job for each processor core on your system. The $$$IKPJOBS domain parameter establishes the number of NLP jobs; generally the default setting gives optimal results. However, it is recommended that the maximum number of NLP jobs should be either 16 or the number of processor cores, whichever is smaller.
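As a rough illustration of the guidance above, the following Python sketch computes the recommended upper bound on the number of NLP jobs and the additional memory those jobs would use for batch loading. The function names are hypothetical; the 256MB per job and the limit of 16 come from the text.

```python
MEMORY_PER_JOB_MB = 256    # additional memory allocated per NLP job
MAX_RECOMMENDED_JOBS = 16  # recommended ceiling, regardless of core count


def recommended_jobs(cores):
    """Return the smaller of 16 and the number of processor cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return min(cores, MAX_RECOMMENDED_JOBS)


def batch_load_memory_mb(jobs):
    """Return the additional memory, in MB, used by the given number of jobs."""
    return jobs * MEMORY_PER_JOB_MB


for cores in (4, 16, 32):
    jobs = recommended_jobs(cores)
    print(cores, "cores:", jobs, "jobs,", batch_load_memory_mb(jobs), "MB")
```

On a 32-core system, for example, the recommended ceiling is 16 jobs, which corresponds to 4096MB of additional memory during batch loading.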
NLP is used to analyze unstructured data. Commonly, this data consists of multiple text sources, often a large number of texts. A text source can be of any type, including the following:
A file on disk that contains unstructured text data. For example, a .txt file.
A record in an SQL result set with one or more fields that contain unstructured text data.
An RSS web feed containing unstructured text data.
An InterSystems IRIS global containing unstructured text data.
NLP does not modify the original text sources, nor does it create a copy of these text sources. Instead, NLP stores its analysis of the original text source as normalized and indexed items, assigning an Id to each item that permits NLP to reference its source. Separate Ids are assigned to items at each level: source, sentence, path, CRC, and entity.
NLP supports texts in the following languages: Dutch (nl), English (en), French (fr), German (de), Japanese (ja), Portuguese (pt), Russian (ru), Spanish (es), Swedish (sv), and Ukrainian (uk). You do not have to specify what language your texts contain, nor must all of your texts or all of the sentences in an individual text be in the same language. NLP can automatically identify the language of each sentence of each text and apply the appropriate language model to that sentence. You can define an NLP configuration that specifies the language(s) that your texts contain, and whether or not to perform automatic language identification. Use of an NLP configuration can significantly improve NLP performance.
You do not have to specify a genre for text content (such as medical notes or newspaper articles); NLP automatically handles texts of any content type.
NLP accepts source files of any format and with any extension (suffix). By default, NLP assumes that a source text file consists of unformatted text (for example, a .txt file). It will process source files with other formatting (for example, .rtf, .doc) but may treat some formatting elements as text. To avoid this, you can either convert your source files to .txt files and load these .txt files, or you can create an NLP converter to remove formatting from source text during NLP loading.
You specify the list of file extensions as a Lister parameter. Only files with these extensions will be loaded. For example, this list of file extensions can be specified as an AddListToBatch() method parameter.
NLP accepts records from an SQL result set as sources. NLP generates a unique integer value for each record as the NLP source Id. NLP allows you to specify an SQL field containing unique values which NLP uses to construct the external Id for the source records. Note that the NLP source Id is assigned by NLP; it is not the external Id, though frequently both are unique integers. Commonly, the NLP source text is taken from only some of the fields of the result set record, often from a single field containing unstructured text data. NLP can ignore the other fields in the record, or use their values as metadata to filter (include or exclude) or to annotate the source.
NLP maintains links to the original source text. This enables it to return a sentence with its original capitalization, punctuation, and so forth. Within NLP, normalization operations are performed on entities to facilitate matching:
Capitalization is ignored. NLP matching is not case-sensitive. Entity values are returned in all lowercase letters.
Extra spaces are ignored. NLP treats all words as being separated by a single space.
Multiple periods (...) are reduced to a single period, which NLP treats as a sentence termination character.
Most punctuation is used by the language model to identify sentences, concepts, and relations, then discarded from further analysis. Punctuation is generally not preserved within entities. Most punctuation is only preserved in an entity when there are no spaces before or after the punctuation mark. However, the slash (/) and the at sign (@) are preserved in an entity with or without surrounding spaces.
Certain language-specific letter characters are normalized. For example, the German eszett (ß) character is normalized as ss.
The NLP engine automatically performs text normalization when a source text is indexed. NLP also automatically performs text normalization of dictionary terms and items.
You can also perform NLP text normalization on a string, independent of any NLP data loading, by using the Normalize() methods. This is shown in the following example:
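SET mystring="Stately plump Buck Mulligan ascended the StairHead, bearing a shaving bowl"
WRITE ##class(%iKnow.Configuration).NormalizeWithParams(mystring)  // assumes the NormalizeWithParams() class method of %iKnow.Configuration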
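To make the normalization rules listed above more concrete, the following Python sketch approximates a few of them outside of InterSystems IRIS: lowercasing, collapsing extra spaces, reducing multiple periods to one, and replacing the German eszett with ss. It is an illustration only, not the NLP engine's algorithm; in particular, it does not attempt the language-model handling of punctuation, and the function name approximate_normalize is hypothetical.

```python
import re


def approximate_normalize(text):
    """Apply a simplified subset of the normalization rules described above."""
    text = text.lower()                       # matching is not case-sensitive
    text = text.replace("ß", "ss")            # German eszett is normalized as ss
    text = re.sub(r"\.{2,}", ".", text)       # multiple periods become one
    text = re.sub(r"\s+", " ", text).strip()  # words separated by a single space
    return text


print(approximate_normalize("Stately plump Buck  Mulligan ascended the StairHead"))
print(approximate_normalize("Wait... Straße"))
```

The results of the actual Normalize() methods may differ from this sketch, because the engine also applies language-specific processing of punctuation and entities.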
The user can define several types of tools for source normalization:
Converters process source text to remove formatting tags and other non-text content during loading.
UserDictionary enables the user to specify how to rewrite or use specific input text content elements during loading. For example, UserDictionary can specify substitutions for known abbreviations and acronyms. It is commonly used to standardize text by eliminating variants and synonyms. It can also be used to specify text-specific exceptions to standard NLP processing of punctuation.
NLP creates global structures to store the results of its operations. These global structures are intended for use by NLP class APIs only. They are not user-visible and should not be modified by the user.
NLP indexed data is stored as InterSystems IRIS list structures. Each NLP list structure contains a generated ID for that item, a unique integer value. NLP entities can be accessed either by value or by integer ID.
NLP preserves the relationships among indexed entities, so that each entity can reference the entities related to it, the path of that sequence of entities, the original sentence that contains that path, and the location of that sentence within its source text. The original source text is always available for access from NLP. NLP operations never change the original source text.
NLP defines constant values in the %IKPublic.inc file. After specifying this include file, you can invoke these constants using the $$$ macro invocation, as shown in the following example:
#include %IKPublic
WRITE "The $$$FILTERONLY constant=",$$$FILTERONLY
These constants include domain parameter names, query parameter values, and other constants.