
Conceptual Overview

Important:

InterSystems has deprecated InterSystems IRIS® Natural Language Processing (NLP). It may be removed from future versions of InterSystems products. The following documentation is provided as reference for existing users only. Existing users who would like assistance identifying an alternative solution should contact the WRC.

The NLP semantic analysis engine analyzes unstructured data on the InterSystems IRIS® data platform: data written as text in a human language such as English or French. By providing the ability to rapidly access and analyze this type of data, NLP allows you to work with all of your data. NLP does not require you to have any prior knowledge of the contents of this data, or even to know what language it is written in, so long as it is one of the languages that NLP supports.

Commonly, unstructured data consists of multiple source texts, often a very large number of texts. A source text is commonly running text divided by punctuation into sentences. A source text can be a file, a record in a SQL result set, or a web source such as a series of blog entries.

A Simple Use Case

To see how NLP handles unstructured data, consider the following sentence:

General Motors builds their Chevrolet Volt in the Detroit-Hamtramck assembly plant.

First, NLP divides a text into sentences. Then it automatically identifies the language for each sentence and analyzes the sentence semantically based on a set of semantic rules which correspond to the language (a language model). It does not need to “understand” or look up the words in the sentence. NLP indexes the resulting semantic entities from this sentence (normalizing letters to lower case):

[general motors] {builds} (their) [chevrolet volt] {in} the [detroit-hamtramck assembly plant]

From the initial sentence NLP has identified the following entities:

  • 3 Concepts: [general motors] [chevrolet volt] [detroit-hamtramck assembly plant].

  • 2 Relations: {builds} {in}.

  • 1 Path-relevant: (their). (Path-relevants are considered in path analysis, but are not indexed.)

  • 1 Non-relevant: the. (Non-relevants are discarded from further NLP analysis.)

Note:

For the purpose of illustration, this example shows each Concept delimited by square brackets, each Relation delimited by curly braces, and each path-relevant delimited by parentheses; within NLP such delimiter characters are not used.

NLP assigns each entity a unique Id. NLP also identifies sequences of these entities that follow the pattern Concept-Relation-Concept (CRC). In this example there are two CRCs:

[general motors] {builds} [chevrolet volt]
[chevrolet volt] {in} [detroit-hamtramck assembly plant]

NLP assigns each CRC a unique Id.

NLP also recognizes that this sentence contains a continuous sequence of entities (in this case, the sequence CRCRC). A sequence of entities within a sentence that together express a single statement is known as a Path. The entire sentence may be a single Path, or may contain multiple Paths. NLP assigns each Path a unique Id.

NLP has now identified the sentences in the text, the relevant Entities in each sentence, which Entities are Concepts and which are Relations, which of these form CRC sequences, and which sequences of entities form a Path. By using these semantic units, NLP can return many types of meaningful information about the contents of the source texts.
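
To make the walkthrough above concrete, the following sketch models the indexed results of the example sentence as plain Python data structures. The field names and layout are illustrative choices for this overview, not the format NLP actually uses to store its index.

    # Illustrative model of the indexed example sentence; not the actual NLP storage format.
    sentence = ("General Motors builds their Chevrolet Volt "
                "in the Detroit-Hamtramck assembly plant.")

    # Each entity is normalized to lower case and classified by the language model.
    entities = [
        {"id": 1, "text": "general motors",                   "type": "Concept"},
        {"id": 2, "text": "builds",                           "type": "Relation"},
        {"id": 3, "text": "their",                            "type": "PathRelevant"},
        {"id": 4, "text": "chevrolet volt",                   "type": "Concept"},
        {"id": 5, "text": "in",                               "type": "Relation"},
        {"id": 6, "text": "the",                              "type": "NonRelevant"},
        {"id": 7, "text": "detroit-hamtramck assembly plant", "type": "Concept"},
    ]

    # The two CRC (Concept-Relation-Concept) sequences identified in the sentence.
    crcs = [
        ("general motors", "builds", "chevrolet volt"),
        ("chevrolet volt", "in", "detroit-hamtramck assembly plant"),
    ]

    # The single Path: the continuous C-R-C-R-C sequence through the sentence.
    path = ["general motors", "builds", "chevrolet volt", "in",
            "detroit-hamtramck assembly plant"]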

What is NLP?

NLP provides access to unstructured data by dividing up text into relational and associated entities and producing an index of these entities. It divides a text into sentences, then divides each sentence into a sequence of Concepts and Relations. It performs this operation by identifying the language of the text (for example, English), then applying the corresponding NLP language model.

  • A Relation is a word or group of words that joins two Concepts by specifying a relationship between them. NLP contains a compact language model that is able to identify the Relations in a sentence.

  • A Concept is a word or group of words that a Relation associates with another Concept. By determining what is a Relation, NLP can identify the associated Concepts. Thus the NLP analysis engine can identify Concepts semantically, without “understanding” their content.

Note:

For the purpose of explanation, verbs are commonly Relations and nouns with their associated adjectives are commonly Concepts. However, the linguistic model of Relations and Concepts is significantly more inclusive and more sophisticated than the distinction between verbs and nouns.

Thus NLP divides a sentence into Concepts (C) and Relations (R). The language model uses a relatively small and fixed dictionary of relationship words and a set of context rules to identify Relations. Anything not identified as a Relation is considered a Concept. (NLP also identifies non-relevant words, such as “the” and “a”, and discards them from further analysis.)

Relations and Concepts are collectively known as Entities. However, a Relation is almost never meaningful without an associated Concept. For this reason, NLP entity analysis emphasizes Concepts and sequences that contain Concepts associated by a Relation.

Because NLP analyzes text using a small and stable language model focused on identifying Relations, NLP can rapidly index texts containing any subject matter. NLP does not need to use a dictionary or ontology to identify Concepts.
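
As a rough illustration of this relation-first approach, the toy sketch below classifies words using a small fixed list of relation words and defaults everything else to Concept. It is a drastic simplification, assuming hypothetical word lists; the real language model also applies context rules and groups adjacent words into multi-word entities.

    # Toy relation-first classifier; the word lists are illustrative only.
    RELATION_WORDS = {"builds", "in", "is", "has", "with"}
    PATH_RELEVANT_WORDS = {"their", "they", "it", "here", "then"}
    NON_RELEVANT_WORDS = {"the", "a", "an"}

    def classify(sentence: str) -> list[tuple[str, str]]:
        """Label each word; anything not otherwise identified is a Concept."""
        labels = []
        for word in sentence.lower().rstrip(".?!;").split():
            if word in NON_RELEVANT_WORDS:
                labels.append((word, "NonRelevant"))   # discarded from further analysis
            elif word in PATH_RELEVANT_WORDS:
                labels.append((word, "PathRelevant"))
            elif word in RELATION_WORDS:
                labels.append((word, "Relation"))
            else:
                labels.append((word, "Concept"))
        # Note: the real model also merges adjacent Concept words
        # ("general" + "motors") into a single multi-word Concept.
        return labels

    print(classify("General Motors builds their Chevrolet Volt in the plant."))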

Once NLP has identified the Concepts and Relations in each sentence in a text, or (more commonly) in many texts, this information can be used to perform the following types of operations:

  • Smart Indexing: provides insight into what’s relevant, what’s related, and what’s representative from a large body of unstructured text.

  • Smart Matching: provides a means to associate entities in the source texts with external items such as lists or dictionaries. These lists can contain words, phrases, or sentences for full (identical) matching and partial matching, and can contain templates for matching by format.

What NLP Isn’t

NLP is not a search tool. Search tools enable you to locate only those things that you already believe are in the text. InterSystems provides InterSystems SQL Search as a search tool for unstructured text data in SQL tables. SQL Search uses many of the features of NLP to provide intelligent text search.

NLP is a content analysis tool. NLP enables you to use the entire contents of the text data, including texts whose content is wholly unknown to you.

NLP is not a dictionary-based content analyzer. Unlike dictionary-based tools, it does not break up sentences into individual words then attempt to “understand” those words and reconstruct context. NLP simply identifies Entities semantically. It does not need to look up these Entities in a dictionary or ontology. For this reason its language model is compact, stable, and general-purpose; you do not have to specify any information about the type of texts being analyzed (medical, legal, etc.), or provide a separate dictionary of relevant terms. While NLP can be extended by associating a dictionary or ontology of terms, its essential functions do not require one. Thus it does not require the creation, customizing, or periodic updating of a dictionary.

NLP supports stemming, but is not by default a stemming tool. By default it does not reduce relations or concepts to stem forms. Instead, it treats each element as a distinct entity, then identifies its degree of similarity to other elements. NLP supports stemming as an optional feature; it is recommended primarily for use with Russian and Ukrainian text sources. InterSystems IRIS also provides a set of classes you can use to perform stemming: the %Text package, as described in the InterSystems Class Reference. %Text and %iKnow are wholly independent of each other and are used for different purposes.

Logical Text Units Identified by NLP

Sentences

NLP uses a language model to divide the source text into sentences. In general, NLP defines a sentence as a unit of text ending with a sentence terminator (usually a punctuation mark) followed by at least one space or line return. The next sentence begins with the next non-whitespace character. Capitalization is not required to indicate the beginning of a sentence.

Sentence terminators are (for most languages) the period (.), question mark (?), exclamation mark (!), and semi-colon (;). A sentence can be terminated by more than one terminator, such as ellipsis (...) or emphatic punctuation (??? or !!!). Any combination of terminators is permitted (...!?). A blank space between sentence terminators indicates a new sentence; therefore, an ellipsis containing spaces (. . .) is actually three sentences. A sentence terminator must be followed by either a whitespace character (space, tab, or line return), or by a single quote or double quote character, followed by a whitespace character. For example, "Why?" he asked. is two sentences, but "Why?", he asked. is a single sentence.

A double line return acts as a sentence terminator, with or without a sentence terminator character. Therefore, a title or a section heading that is followed by a blank line is considered a separate sentence. The end of the file is also treated as a sentence terminator, so the last text in a file is a separate sentence regardless of the presence of a terminator character; consequently, if a source contains any content at all (other than whitespace), it contains at least one sentence.

A period followed by a blank space usually indicates a sentence break, though NLP language models recognize exceptions to this rule. For example, the English language model recognizes common abbreviations, such as “Dr.” and “Mr.” (not case-sensitive) and removes the period rather than performing a sentence break. The English language model recognizes “No.” as an abbreviation, but treats lowercase “no.” as a sentence terminator.

You can use the UserDictionary option of your Configuration to cause or avoid sentence endings in specific cases. For example, the abbreviation “Fr.” (Father or Friar) is not recognized by the English language model, so the period is treated as a sentence break. You can use a UserDictionary either to remove the period or to specify that this use of a period should not cause a sentence break. A UserDictionary is applied as a source is loaded; already loaded sources are not affected.
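
The following toy splitter illustrates the terminator-plus-whitespace rule and the abbreviation exception described above. It is a minimal sketch: the abbreviation list is a tiny illustrative subset, and the real language models also handle quote characters after terminators, double line returns, and UserDictionary entries.

    # Toy sentence splitter; terminators and abbreviations are simplified.
    TERMINATORS = ".?!;"
    ABBREVIATIONS = {"dr.", "mr."}          # illustrative subset (matching is not case-sensitive)

    def split_sentences(text: str) -> list[str]:
        sentences, current = [], []
        for token in text.split():          # whitespace-delimited tokens
            current.append(token)
            if token[-1] in TERMINATORS and token.lower() not in ABBREVIATIONS:
                sentences.append(" ".join(current))
                current = []
        if current:                         # end of text also terminates a sentence
            sentences.append(" ".join(current))
        return sentences

    print(split_sentences("Dr. Smith arrived. Why? Nobody knew; it was late"))
    # -> ['Dr. Smith arrived.', 'Why?', 'Nobody knew;', 'it was late']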

Entities

An entity is a minimal logical unit of text. It is either a word or a group of words that NLP logically groups together into either a concept or a relation. Other logical units, such as a telephone number or an email address, are also considered entities (and are treated as concepts).

Note:

Japanese text cannot be divided into concepts and relations. Instead, NLP analyzes Japanese text as a sequence of entities with associated particles. The definition of an “entity” for Japanese is roughly equivalent to a Concept in other NLP languages. For a description of NLP Japanese support (written in Japanese), refer to NLP Japanese.

NLP normalizes entities so that they may be compared and counted. It removes non-relevant words. It translates entities into lower case letters. It removes most punctuation and some special characters from entities.

By default, NLP restricts its analysis of entities to Concepts; Relations are analyzed only because of their role in linking Concepts together. This default can be overridden, as described in the “Limiting by Position” section of the NLP Queries chapter.
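
The normalization step can be pictured with a small Python sketch like the one below: lower-case the entity, strip most punctuation, and collapse whitespace. The exact characters NLP removes or retains differ by language model; the regular expressions here are illustrative assumptions.

    import re

    # Toy entity normalization: lower case, strip most punctuation, collapse spaces.
    def normalize_entity(raw: str) -> str:
        text = raw.lower()
        text = re.sub(r"[^\w\s-]", "", text)    # keep word characters, spaces, hyphens
        return re.sub(r"\s+", " ", text).strip()

    print(normalize_entity("Detroit-Hamtramck  Assembly Plant,"))
    # -> 'detroit-hamtramck assembly plant'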

Path-relevant Words

NLP identifies certain words in each language as being an essential part of its analysis of sentences and paths, but otherwise not relevant. Outside of the context of a sentence or path, these words have little informational content. The following are typical path-relevant words:

  • Pronouns of all types: definite, indefinite, possessive.

  • Indefinite expressions of time, frequency, or place. For example, “then”, “soon”, “later”, “sometimes”, “all”, “here”.

Path-relevant words are not considered Concepts, nor are they counted in frequency or dominance calculations. Path-relevant words may be negation or time attribute markers; for example, “none”, “nothing”, “nowhere”, “nobody”. Path-relevant words are not stemmed.

Non-relevant Words

NLP identifies certain words in each language as being non-relevant, and excludes these words from NLP indexing. There are several kinds of non-relevant words:

  • Articles (such as “the” and “a”) and other words that the NLP language model identifies as having little or no semantic importance.

  • Prefatory words or phrases at the beginning of a sentence, such as “And”, “Nevertheless”, “However”, “On the other hand”.

  • Character strings over 150 characters that are unbroken by spaces or sentence punctuation. A “word” of this length is highly likely to be a non-text entity, and is thus excluded from NLP indexing. Because in rare cases (such as chemical nomenclature or URL strings) these 150+ character words are semantically relevant, NLP flags them with the attribute “nonsemantic”.

Non-relevant words are excluded from NLP indexing, but are preserved when sentences are displayed.

CRCs and CCs

Once NLP divides a sentence into Concepts (C) and Relations (R), it can determine several types of connections between these fundamental entities.

  • CRC is a Concept-Relation-Concept sequence. A CRC is handled as a Head Concept - Relation - Tail Concept sequence. Whether an entity is a Head, Relation, or Tail is known as its position. In some cases, a CRC may have an empty string value for one of the sequence members (CR or RC); this can occur, for example, when the Relation of the CRC is an intransitive verb: “Robert slept.”

  • CC is a Concept + Concept pair. NLP retains the position of each Concept, but ignores the Relation between the two Concepts. A CC can be handled either as two associated Concepts or as a Head Concept/Tail Concept sequence. Treating CC pairs as associated Concepts, without regard to head/tail positions or the linking Relation, is especially useful when determining a network of Concepts: which Concepts have a connection to which other Concepts. (A sketch of deriving CRCs and CCs from a labeled entity sequence follows the note below.)

Note:

Japanese cannot be analyzed semantically in terms of CRCs or CCs because NLP does not divide Japanese entities into concepts and relations.
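
The sketch below (referenced from the CC bullet above) shows how CRCs and CCs might be derived from a labeled entity sequence such as the one produced by the classification sketch earlier. It assumes non-relevant entities have already been discarded and ignores the CR/RC case with an empty member; the actual NLP engine handles those cases as well.

    # Sketch: derive CRC triples and CC pairs from a labeled entity sequence.
    def extract_crcs(entities):
        """Return (head, relation, tail) triples for each C-R-C sequence."""
        core = [(text, label) for text, label in entities
                if label in ("Concept", "Relation")]     # skip path-relevants here
        crcs = []
        for (a, la), (b, lb), (c, lc) in zip(core, core[1:], core[2:]):
            if (la, lb, lc) == ("Concept", "Relation", "Concept"):
                crcs.append((a, b, c))
        return crcs

    def extract_ccs(entities):
        """Return head/tail Concept pairs, ignoring the linking Relation."""
        return [(head, tail) for head, _, tail in extract_crcs(entities)]

    labeled = [("general motors", "Concept"), ("builds", "Relation"),
               ("their", "PathRelevant"), ("chevrolet volt", "Concept"),
               ("in", "Relation"), ("detroit-hamtramck assembly plant", "Concept")]
    print(extract_crcs(labeled))
    print(extract_ccs(labeled))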

Paths

A Path is a meaningful sequence of Entities through a sentence. In Western languages, Paths are commonly (though not exclusively) based on a continuous sequence of CRCs, so the resulting Path contains its entities (Concepts and Relations) in their original sentence order. For example, in a common path sequence the Tail Concept of one CRC becomes the Head Concept of the next CRC, resulting in a path of five entities: C-R-C-R-C. Other meaningful sequences of Concepts and Relations are also treated as paths, such as a sequence that contains a path-relevant pronoun as a stand-in for a Concept.

In Japanese, Paths cannot be based on the sequence of Entities in the original sentence. NLP nevertheless does identify Paths as meaningful sequences of Entities within Japanese text. NLP semantic analysis of Japanese uses an entity vector algorithm to create Entity Vectors. When NLP converts a Japanese sentence into an Entity Vector it commonly lists the Entities in a different order than the original sentence to indicate which Entities are linked to each other and how strong the link between them is. The resulting Entity Vector is used for Path analysis.

A Path must contain at least two Entities. Not all sentences are paths; some very short sentences may not contain the minimum number of Entities to qualify as a path.

A path is always contained within a single sentence. However, a sentence may contain more than one path. This can occur when NLP identifies a non-continuous sequence within the sentence. Once identified, the entities that comprise a path sequence are demarcated and normalized, and the path is assigned a unique Id. Paths are useful when an analysis of just CRCs is not large enough to identify some meaningfully associated entities. Paths are especially useful when returning some smaller linguistic unit in a wider context.
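
A path of the common C-R-C-R-C form can be pictured as the chaining of overlapping CRCs, as in the minimal sketch below. The real path algorithm recognizes more sequence types (including path-relevant stand-ins and, for Japanese, entity vectors), which this sketch does not attempt to model.

    # Sketch: chain CRCs into a single path while each tail Concept matches
    # the next head Concept (the continuous C-R-C-R-C case described above).
    def chain_crcs(crcs):
        if not crcs:
            return []
        path = list(crcs[0])
        for head, relation, tail in crcs[1:]:
            if head == path[-1]:
                path.extend([relation, tail])
            else:
                break                        # a non-continuous sequence would start a new path
        return path

    crcs = [("general motors", "builds", "chevrolet volt"),
            ("chevrolet volt", "in", "detroit-hamtramck assembly plant")]
    print(chain_crcs(crcs))
    # -> ['general motors', 'builds', 'chevrolet volt', 'in',
    #     'detroit-hamtramck assembly plant']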

Smart Indexing

Smart indexing is the process of translating unstructured text into a relational network of Concepts and Relations. You can index the contents of multiple unstructured texts, then analyze the resulting indexed entities according to user-defined query criteria, such as listing concepts in order of frequency. Each indexed entity can reference its source text, source sentence, and relational entities, such as its position in a CRC sequence. As part of smart indexing, NLP assigns two values to each indexed concept, specifying the total number of appearances of the concept in the texts (its frequency), and the number of different texts in which the concept appears (its spread).
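
Frequency and spread can be illustrated with a short sketch: frequency sums every appearance of a concept across all texts, while spread counts each text at most once. The input structure (one list of indexed concepts per source text) is an assumption made for this example.

    from collections import Counter

    # Sketch: frequency = total appearances; spread = number of texts containing the concept.
    def frequency_and_spread(indexed_texts):
        frequency, spread = Counter(), Counter()
        for concepts in indexed_texts:
            frequency.update(concepts)
            spread.update(set(concepts))     # each text counted at most once
        return frequency, spread

    docs = [["chevrolet volt", "assembly plant", "chevrolet volt"],
            ["chevrolet volt", "general motors"]]
    freq, spread = frequency_and_spread(docs)
    print(freq["chevrolet volt"], spread["chevrolet volt"])   # -> 3 2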

Once you have performed smart indexing on multiple texts, NLP can use this information to analyze the source texts. For example, NLP can perform intelligent content browsing. From any selected NLP indexed item, you can browse to other items based on the degree of similarity between these items. Intelligent browsing can be performed within a source text or across all indexed source texts.

Once texts are indexed, NLP can generate summaries of individual texts. The user specifies the length of the summary as a percentage of the original text. NLP returns a summary text consisting of those sentences of the original text that are most relevant to the whole, based on index statistics. For example, if a text consists of 100 sentences, and the user specifies a 50% summary, NLP generates a summary text consisting of the 50 most relevant sentences from the original.
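
Summarization of this kind can be sketched as ranking sentences by a relevance score, keeping the requested percentage, and restoring the original order. The scoring function passed in here is a placeholder; NLP computes relevance from its own index statistics.

    # Sketch: keep the top `percent` of sentences by score, in original order.
    def summarize(sentences, score, percent):
        keep = max(1, round(len(sentences) * percent / 100))
        ranked = sorted(range(len(sentences)),
                        key=lambda i: score(sentences[i]), reverse=True)
        return [sentences[i] for i in sorted(ranked[:keep])]

    # Example with a placeholder score (sentence length):
    sents = ["Short one.", "A much longer and more detailed sentence.", "Medium sentence here."]
    print(summarize(sents, score=len, percent=50))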

Smart Matching

Once NLP has indexed a collection of texts, it is possible to match items found in the texts with one or more user-defined match lists and to tag these matches. Smart matching performs high-precision tagging of concepts and phrases based on a semantic understanding of the complete context. Thus matches can occur between similar concepts or phrases, as well as full (identical) matches. Because this tagging is based on finding semantic matches, smart matching does not require any understanding of the text contents.

Once tagged, each appearance of a matched phrase in the texts remains associated with the tag text. These phrases can be matched as a single entity, a CRC, or a path. For example, the user could supply a list of the names of countries, so that each appearance of a country name in the texts is tagged for rapid access. You can build a dictionary of company names that you can match against analyst reports, allowing you to quickly find the latest news about the companies you're interested in. You can create a dictionary in which each appearance of a specified medical procedure (phrased in various ways) is matched to a medical diagnostic code.

This dictionary matching is not limited to simple entities, but extends to CRCs and/or paths if the terms in the dictionary span more than one entity themselves. Because NLP indexes dictionary terms in the same way that it indexes source texts, a dictionary entry may be as long as a sentence. It may be useful to match a dictionary entry sentence against sources to locate similar information.
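
Exact-match tagging against a user-defined match list can be sketched as below. Real smart matching also scores partial matches and supports format templates, and it matches CRCs and paths as well as single entities; this sketch covers only the simplest, full-match case with a hypothetical country list.

    # Sketch: tag indexed entities that exactly match a user-defined match list.
    COUNTRY_LIST = {"france": "COUNTRY", "germany": "COUNTRY", "japan": "COUNTRY"}

    def match_entities(entities, match_list):
        """Return (entity, tag) pairs for full (identical) matches."""
        return [(e, match_list[e]) for e in entities if e in match_list]

    print(match_entities(["general motors", "france", "chevrolet volt"], COUNTRY_LIST))
    # -> [('france', 'COUNTRY')]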
