Using InterSystems Natural Language Processing (NLP)
NLP Implementation
The NLP semantic analysis engine is a fully integrated component of InterSystems IRIS Data Platform™. No separate installation is required, and no configuration changes are needed.
Your ability to use NLP is governed by the InterSystems IRIS license. Most standard InterSystems IRIS licenses provide access to NLP. All InterSystems IRIS licenses that support NLP provide full and unlimited use of all components of NLP.
NLP is provided as a collection of APIs containing class object methods and properties which may be invoked from ObjectScript programs. APIs are provided to invoke NLP operations from InterSystems IRIS (API classes). Equivalent APIs are provided to invoke NLP operations from SQL (QAPI classes) and SOAP web services (WSAPI classes). These APIs are described in the %iKnow package in the InterSystems Class Reference. NLP is a core InterSystems IRIS technology and therefore does not have application-like interfaces. However, NLP does provide a few generic, sample output interfaces in the %iKnow.UI package.
To use NLP, you must define an NLP domain within an InterSystems IRIS namespace. A namespace can contain multiple NLP domains. All NLP processing occurs within a specified domain. A set of NLP indexed text sources is created within a domain. All NLP queries and other text processing must specify the domain in which to access this data.
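As a minimal sketch of defining a domain, the following code opens a domain if one with the given name already exists in the current namespace, and otherwise creates and saves it using the %iKnow.Domain class. The domain name "mydomain" is hypothetical, and the code assumes it is run in a namespace other than %SYS.

```objectscript
  // "mydomain" is a hypothetical domain name used for illustration
  SET dname="mydomain"
  IF ##class(%iKnow.Domain).NameIndexExists(dname) {
    // Reuse the domain already defined in the current namespace
    SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
  }
  ELSE {
    // %New() creates the object instance; %Save() makes it persistent
    SET domoref=##class(%iKnow.Domain).%New(dname)
    SET stat=domoref.%Save()
    IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
  }
  // The domain Id identifies the domain in subsequent NLP operations
  SET domId=domoref.Id
  WRITE "Domain ",dname," has Id ",domId,!
```

Subsequent loading and query operations would then pass domId to identify the domain.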
A Note on Program Examples
Many examples in this documentation use data from the Aviation.Event SQL table. If you wish to use this sample data, it is available at https://github.com/intersystems/Samples-Aviation. (You do not need to know anything about GitHub or have a GitHub account.) To install these samples, InterSystems recommends that you create a dedicated namespace called (for example) TESTSAMPLES and then load the samples into that namespace (or you can use an existing namespace; however, you cannot use the %SYS namespace). To create a namespace, use the Management Portal options System Administration -> Configuration -> System Configuration -> Namespaces. For the general process of downloading from GitHub, see Downloading Samples for Use with InterSystems IRIS. After you download a sample, be sure to open the README file and follow the setup instructions.
Many of the program examples in this manual begin by deleting a domain (or its data) and then loading all data from the original text files into an empty domain. For the purpose of these examples, this guarantees that the NLP indexed source data is an exact match with the contents of the file(s) or SQL table(s) from which it was loaded.
This delete/reload methodology is not recommended for real-world applications processing large numbers of text sources. Instead, you should perform the time-consuming load of all sources for a domain once. You can then add or delete individual sources to keep the indexed source data current with the contents of the original text files or SQL tables.
A Note on %Persistent Object Methods
NLP supports standard %Persistent object methods for creating and deleting object instances such as domains, configurations, and so forth. These %Persistent method names begin with a % character, such as %New(). Use of %Persistent object methods is preferable to using older non-persistent methods, such as Create(). Users are encouraged to use the %Persistent object methods for new code. Program examples throughout this documentation have been revised to use these preferred %Persistent methods.
Note that the %New() persistent method requires a %Save() method. The older Create() method does not require a separate save operation.
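Deleting an object instance follows the same %Persistent conventions. The following sketch deletes a domain by Id using the standard %DeleteId() class method. The domain name "mydomain" is hypothetical; deleting a domain also removes its indexed data, so this is intended for test domains only.

```objectscript
  // "mydomain" is a hypothetical domain name used for illustration
  SET dname="mydomain"
  IF ##class(%iKnow.Domain).NameIndexExists(dname) {
    SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
    SET domId=domoref.Id
    // %DeleteId() is the standard %Persistent class method for deleting by Id
    SET stat=##class(%iKnow.Domain).%DeleteId(domId)
    IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
    ELSE { WRITE "Deleted domain ",dname,! }
  }
  ELSE { WRITE "Domain ",dname," does not exist",! }
```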
A Note on %iKnow and %SYSTEM.iKnow
Throughout this documentation, all NLP classes referred to are located in the %iKnow package. However, the %SYSTEM.iKnow class also contains a number of NLP utilities that can be used to simplify coding of common NLP operations. These utilities are provided as shortcuts; all of the operations performed by %SYSTEM.iKnow class methods can also be performed by %iKnow package APIs.
You can display information about %SYSTEM.iKnow class methods by using the Help() method. To display information about all %SYSTEM.iKnow methods, invoke %SYSTEM.iKnow.Help(""); to display information about a specific %SYSTEM.iKnow method, supply the method name to the Help() method, as shown in the following example:
  DO ##class(%SYSTEM.iKnow).Help("IndexDirectory")
For further details refer to the InterSystems Class Reference.
Space Requirements and NLP Globals
NLP globals in a namespace have the following prefix: ^IRIS.IK:
Caution:
These globals are for internal use only. Under no circumstances should NLP users attempt to directly interact with NLP globals.
For example, if you are loading 30 GB of source documents, you will need 600 GB of permanent NLP data storage (20 times the size of the source documents). During data loading you will need 1.17 TB of available space, 600 GB of which will be automatically released once NLP indexing completes.
In addition, the iristemp subdirectory in the Mgr directory may grow to 4 times the size of the original source texts for the duration of file loading and indexing.
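The arithmetic behind these figures can be sketched as follows. The multipliers are inferred from the 30 GB example above and are assumed to scale linearly with the size of the source texts; the factor of 40 for peak space assumes that 1.17 TB was computed with 1 TB = 1024 GB. Treat the results as rough planning estimates rather than guaranteed values.

```objectscript
  // Multipliers are assumptions inferred from the 30 GB example in the text
  SET srcGB=30              // size of the original source texts, in GB
  SET permGB=srcGB*20       // permanent NLP data storage
  SET loadGB=srcGB*40       // peak space needed during loading (assumed factor)
  SET tempGB=srcGB*4        // possible additional growth of iristemp
  WRITE "Permanent storage:       ",permGB," GB",!
  WRITE "Peak during loading:     ",loadGB," GB (",$FNUMBER(loadGB/1024,"",2)," TB)",!
  WRITE "Released after indexing: ",loadGB-permGB," GB",!
  WRITE "iristemp may grow to:    ",tempGB," GB",!
```

Changing srcGB gives a corresponding estimate for a different volume of source texts.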
You should increase the size of the InterSystems IRIS global buffer, based on the size of the original source texts. Refer to the Performance Considerations when Loading Texts chapter in this manual.
Batch Load Space Allocation
InterSystems IRIS allocates 256MB of additional memory for each NLP job to handle batch loading of source texts. By default, NLP allocates one job for each processor core on your system. The $$$IKPJOBS domain parameter establishes the number of NLP jobs; generally the default setting gives optimal results. However, it is recommended that the maximum number of NLP jobs should be either 16 or the number of processor cores, whichever is smaller.
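As a rough sketch of this sizing rule, the following code determines the number of processor cores, applies the recommended ceiling of 16 jobs, and estimates the additional memory allocated for batch loading. It only computes and displays values; it does not change any domain parameter.

```objectscript
  // Number of processor cores reported for this system
  SET cores=$SYSTEM.Util.NumberOfCPUs()
  // Recommended maximum: 16 or the number of cores, whichever is smaller
  SET jobs=$SELECT(cores<16:cores,1:16)
  // Each NLP job is allocated 256MB of additional memory for batch loading
  SET memMB=jobs*256
  WRITE "Processor cores:              ",cores,!
  WRITE "Suggested maximum NLP jobs:   ",jobs,!
  WRITE "Additional batch-load memory: ",memMB," MB",!
```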
Input Data
NLP is used to analyze unstructured data. Commonly, this data consists of multiple text sources, often a large number of texts. A text source can be of any type, including the following:
NLP does not modify the original text sources, nor does it create a copy of these text sources. Instead, NLP stores its analysis of the original text source as normalized and indexed items, assigning an Id to each item that permits NLP to reference its source. Separate Ids are assigned to items at each level: source, sentence, path, CRC, and entity.
NLP supports texts in the following languages: Dutch (nl), English (en), French (fr), German (de), Japanese (ja), Portuguese (pt), Russian (ru), Spanish (es), Swedish (sv), and Ukrainian (uk). You do not have to specify what language your texts contain, nor must all of your texts or all of the sentences in an individual text be in the same language. NLP can automatically identify the language of each sentence of each text and applies the appropriate language model to that sentence. You can define an NLP configuration that specifies the language(s) that your texts contain, and whether or not to perform automatic language identification. Use of an NLP configuration can significantly improve NLP performance.
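The following is a minimal sketch of defining an NLP configuration that restricts processing to English and French with automatic language identification enabled. The configuration name "EnFrConfig" is hypothetical. The %New() arguments shown are the configuration name, the automatic language identification flag, and the list of languages.

```objectscript
  // "EnFrConfig" is a hypothetical configuration name used for illustration
  SET cname="EnFrConfig"
  IF ##class(%iKnow.Configuration).NameIndexExists(cname) {
    SET cfg=##class(%iKnow.Configuration).NameIndexOpen(cname)
  }
  ELSE {
    // Arguments: name, automatic language identification (1=on), languages
    SET cfg=##class(%iKnow.Configuration).%New(cname,1,$LISTBUILD("en","fr"))
    SET stat=cfg.%Save()
    IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
  }
  WRITE "Configuration ",cfg.Name," supports: ",$LISTTOSTRING(cfg.Languages),!
```

If all texts are known to be in a single language, supplying only that language and setting the flag to 0 avoids the cost of per-sentence language identification.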
You do not have to specify a genre for text content (such as medical notes or newspaper articles); NLP automatically handles texts of any content type.
File Formats
NLP accepts source files of any format and with any extension (suffix). By default, NLP assumes that a source text file consists of unformatted text (for example, a .txt file). It will process source files with other formatting (for example, .rtf, .doc) but may treat some formatting elements as text. To avoid this, you can either convert your source files to .txt files and load these .txt files, or you can create an NLP converter to remove formatting from source text during NLP loading.
You specify the list of file extensions as a Lister parameter. Only files with these extensions will be loaded. For example, this list of file extensions can be specified as an AddListToBatch() method parameter.
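The following sketch shows one way the extension list might be supplied when loading files. It assumes a domain named "mydomain" (hypothetical) that contains no sources yet, and a hypothetical directory C:\mytexts\ containing .txt files. The AddListToBatch() arguments shown are the directory, the list of file extensions, a flag for searching subdirectories, and a filter.

```objectscript
  // "mydomain" and the directory path are hypothetical
  SET dname="mydomain",dirpath="C:\mytexts\"
  IF ##class(%iKnow.Domain).NameIndexExists(dname) { SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname) }
  ELSE { SET domoref=##class(%iKnow.Domain).%New(dname)  DO domoref.%Save() }
  SET domId=domoref.Id
  // The Lister identifies sources; the Loader indexes them
  SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
  // Only files with a .txt extension are listed; subdirectories are not searched
  SET stat=flister.AddListToBatch(dirpath,$LISTBUILD("txt"),0,"")
  IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
  SET stat=myloader.ProcessBatch()
  IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
  WRITE "Sources in domain: ",##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId),!
```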
SQL Record Format
NLP accepts records from an SQL result set as sources. NLP generates a unique integer value for each record as the NLP source Id. NLP allows you to specify an SQL field containing unique values which NLP uses to construct the external Id for the source records. Note that the NLP source Id is assigned by NLP; it is not the external Id, though frequently both are unique integers. Commonly, the NLP source text is taken from only some of the fields of the result set record, often from a single field containing unstructured text data. NLP can ignore the other fields in the record, or use their values as metadata to filter (include or exclude) or to annotate the source.
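As a sketch of loading SQL records, the following code lists 25 records from the Aviation.Event sample table, using the ID field for the external Id and the NarrativeFull field as the source text. It assumes the Aviation samples are installed in the current namespace and that the domain "mydomain" (a hypothetical name) contains no sources yet.

```objectscript
  // "mydomain" is a hypothetical domain name; Aviation.Event is the sample table
  SET dname="mydomain"
  IF ##class(%iKnow.Domain).NameIndexExists(dname) { SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname) }
  ELSE { SET domoref=##class(%iKnow.Domain).%New(dname)  DO domoref.%Save() }
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
  SET myquery="SELECT TOP 25 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
  SET idfld="UniqueVal"                      // unique values used for the external Id
  SET grpfld="Type"                          // group field used in the external Id
  SET dataflds=$LISTBUILD("NarrativeFull")   // field(s) containing the text to index
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
  IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
  SET stat=myloader.ProcessBatch()
  IF $SYSTEM.Status.IsError(stat) { DO $SYSTEM.Status.DisplayError(stat) }
  WRITE "Sources in domain: ",##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId),!
```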
Text Normalization
NLP maintains links to the original source text. This enables it to return a sentence with its original capitalization, punctuation, and so forth. Within NLP, normalization operations are performed on entities to facilitate matching:
The NLP engine automatically performs text normalization when a source text is indexed. NLP also automatically performs text normalization of dictionary terms and items.
You can also perform NLP text normalization on a string, independent of any NLP data loading, by using the Normalize() or NormalizeWithParams() methods. This is shown in the following example:
   SET mystring="Stately plump Buck Mulligan   ascended the StairHead,  bearing a shaving bowl"
   SET normstring=##class(%iKnow.Configuration).NormalizeWithParams(mystring)
   WRITE normstring
User-defined Source Normalization
The user can define several types of tools for source normalization:
Output Structures
NLP creates global structures to store the results of its operations. These global structures are intended for use by NLP class APIs only. They are not user-visible and should not be modified by the user.
NLP indexed data is stored as InterSystems IRIS list structures. Each NLP list structure contains a generated ID for that item, a unique integer value. NLP entities can be accessed either by value or by integer ID.
NLP preserves the relationships among indexed entities, so that each entity can reference the entities related to it, the path of that sequence of entities, the original sentence that contains that path, and the location of that sentence within its source text. The original source text is always available for access from NLP. NLP operations never change the original source text.
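To illustrate that indexed entities carry both a value and an integer ID, the following sketch retrieves the ten most frequent entities in a domain through the query API and displays the ID, value, and frequency of each. It assumes a domain named "mydomain" (hypothetical) that has already been loaded with sources.

```objectscript
  // "mydomain" is a hypothetical domain assumed to contain indexed sources
  SET dname="mydomain"
  IF ##class(%iKnow.Domain).NameIndexExists(dname) {
    SET domId=##class(%iKnow.Domain).NameIndexOpen(dname).Id
    // Page 1 with a page size of 10 returns the top 10 entities
    SET stat=##class(%iKnow.Queries.EntityAPI).GetTop(.result,domId,1,10)
    SET i=""
    FOR {
      SET i=$ORDER(result(i),1,item)
      QUIT:i=""
      // Each result row is a list: entity Id, entity value, frequency, spread
      WRITE "Id=",$LIST(item,1),"  value=",$LIST(item,2),"  frequency=",$LIST(item,3),!
    }
  }
  ELSE { WRITE "Domain ",dname," does not exist",! }
```

The data is read through the query API only; the underlying globals are not accessed directly.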
Constants
NLP defines constant values in the %IKPublic.inc file. After specifying this include file, you can invoke these constants using the $$$ macro invocation, as shown in the following example:
#Include %IKPublic
  WRITE "The $$$FILTERONLY constant=",$$$FILTERONLY
These constants include domain parameter names, query parameter values, and other constants.
Error Codes
The General Error Codes 8000-8099 are reserved for use by NLP. For further details, refer to General Error Messages in the InterSystems IRIS Error Reference.