Conceptual Overview

The iKnow semantic analysis engine analyzes unstructured data, that is, data written as text in a human language such as English or French. By providing rapid access to this type of data, iKnow allows you to work with all of your data. iKnow does not require any prior knowledge of the contents of this data, or even of the language it is written in, so long as it is one of the languages that iKnow supports.

Commonly, unstructured data consists of multiple source texts, often a very large number of texts. A source text is commonly running text divided by punctuation into sentences. A source text can be a file, a record in a SQL result set, or a web source such as a series of blog entries.

A Simple Use Case

To see how iKnow handles unstructured data, consider the following sentence:

General Motors builds their Chevrolet Volt in the Detroit-Hamtramck assembly plant.

iKnow first divides a text into sentences; it then analyzes each sentence semantically. It does not need to “understand” or look up the words in the sentence. iKnow then indexes the resulting semantic entities from this sentence (normalizing letters to lower case):

[general motors] {builds} (their) [chevrolet volt] {in} the [detroit-hamtramck assembly plant]

From the initial sentence iKnow has identified the following entities:

3 Concepts: [general motors] [chevrolet volt] [detroit-hamtramck assembly plant].
2 Relations: {builds} {in}.
1 Path-relevant: (their). (Path-relevants are considered in path analysis, but are not indexed.)
1 Non-relevant: the. (Non-relevants are discarded from further iKnow analysis.)

Note:

For the purpose of illustration, this example shows each Concept delimited by square brackets, each Relation delimited by curly braces, and each path-relevant delimited by parentheses; within iKnow such delimiter characters are not used.

iKnow assigns each entity a unique Id. iKnow also identifies sequences of these entities that follow the pattern Concept-Relation-Concept (CRC). In this example there are two CRCs:

[general motors] {builds} [chevrolet volt]
[chevrolet volt] {in} [detroit-hamtramck assembly plant]

iKnow assigns each CRC a unique Id.
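
The entity counts and CRCs listed above can be reproduced with a short Python sketch. It works on a hand-labeled copy of the example sentence and is only an illustration: the Entity tuple, the role codes, and the find_crcs helper are hypothetical names, not part of iKnow, and the labels are assigned manually rather than by a language model.

```python
from collections import Counter
from typing import NamedTuple


class Entity(NamedTuple):
    text: str
    role: str  # "C" concept, "R" relation, "PR" path-relevant, "NR" non-relevant


# Hand-labeled version of the example sentence (labels assigned manually).
SENTENCE = [
    Entity("general motors", "C"),
    Entity("builds", "R"),
    Entity("their", "PR"),
    Entity("chevrolet volt", "C"),
    Entity("in", "R"),
    Entity("the", "NR"),
    Entity("detroit-hamtramck assembly plant", "C"),
]


def find_crcs(entities):
    """Return (concept, relation, concept) triples, ignoring PR and NR words."""
    core = [e for e in entities if e.role in ("C", "R")]
    return [
        (a.text, b.text, c.text)
        for a, b, c in zip(core, core[1:], core[2:])
        if (a.role, b.role, c.role) == ("C", "R", "C")
    ]


role_counts = Counter(e.role for e in SENTENCE)
crcs = find_crcs(SENTENCE)
```

Here role_counts holds the per-role totals and crcs holds the two Concept-Relation-Concept triples of the example.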

iKnow also recognizes that this sentence contains a continuous sequence of entities (in this case, the sequence CRCRC). A sequence of entities within a sentence that together express a single statement is known as a Path. The entire sentence may be a single Path, or may contain multiple Paths. iKnow assigns each Path a unique Id.

iKnow has now identified the sentences in the text, the relevant Entities in each sentence, which Entities are Concepts and which are Relations, which of these form CRC sequences, and which sequences of entities form a Path. By using these semantic units, iKnow can return many types of meaningful information about the contents of the source texts.

What is iKnow?

iKnow provides access to unstructured data by dividing up text into relational and associated entities and producing an index of these entities. It divides a text into sentences, then divides each sentence into a sequence of Concepts and Relations. It performs this operation by identifying the language of the text (for example, English), then applying the corresponding iKnow language model.

A Relation is a word or group of words that join two Concepts by specifying a relationship between them. iKnow contains a compact language model that is able to identify the Relations in a sentence.
A Concept is a word or group of words that is associated by a Relation. By determining what is a Relation, iKnow can identify associated Concepts. Thus the iKnow analysis engine can identify Concepts semantically without “understanding” their content.

Note:

For the purpose of explanation, verbs are commonly Relations and nouns with their associated adjectives are commonly Concepts. However, the linguistic model of Relations and Concepts is significantly more inclusive and more sophisticated than the distinction between verbs and nouns.

Thus iKnow divides a sentence into Concepts (C) and Relations (R). The language model uses a relatively small and fixed dictionary of relationship words and a set of context rules to identify Relations. Anything not identified as a Relation is considered a Concept. (iKnow also identifies non-relevant words, such as “the” and “a”, and discards them from further analysis.)
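
The idea that anything not identified as a Relation is a Concept can be illustrated with a toy Python sketch. The two word sets below are invented for this example, there are no context rules, and the split_entities helper is a hypothetical name; the actual iKnow language model is considerably more sophisticated.

```python
RELATION_WORDS = {"builds", "in"}        # assumed toy relation dictionary
NON_RELEVANT_WORDS = {"the", "a", "an"}  # assumed toy non-relevant list


def split_entities(sentence):
    """Relation words become Relations; runs of other words become Concepts."""
    entities = []
    concept_words = []

    def flush():
        # Close the Concept collected so far, if any.
        if concept_words:
            entities.append(("C", " ".join(concept_words)))
            concept_words.clear()

    for word in sentence.lower().rstrip(".").split():
        if word in RELATION_WORDS:
            flush()
            entities.append(("R", word))
        elif word in NON_RELEVANT_WORDS:
            flush()  # non-relevant words are dropped in this sketch
        else:
            concept_words.append(word)
    flush()
    return entities


result = split_entities(
    "General Motors builds the Chevrolet Volt in the Detroit-Hamtramck assembly plant."
)
```

Note that no dictionary of Concepts is consulted: "chevrolet volt" is grouped as a Concept only because it sits between two Relation words.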

Relations and Concepts are collectively known as Entities. However, a Relation is almost never meaningful without an associated Concept. For this reason, iKnow entity analysis emphasizes Concepts and sequences that contain Concepts associated by a Relation.

Because iKnow analyzes text using a small and stable language model focused on identifying Relations, iKnow can rapidly index texts containing any subject matter. iKnow does not need to use a dictionary or ontology to identify Concepts.

Once iKnow has identified the Concepts and Relations in each sentence in a text, or (more commonly) in many texts, this information can be used to perform the following types of operations:

Smart Indexing: provides insight into what’s relevant, what’s related, and what’s representative from a large body of unstructured text.
Smart Matching: provides a means to associate entities in the source texts with external items such as lists or dictionaries. These lists can contain words, phrases, or sentences for full (identical) matching and partial matching, and can contain templates for matching by format.

What iKnow Isn’t

iKnow is not a search tool. Search tools enable you to locate only those things that you already believe are in the text. InterSystems provides the iFind tool as a search tool for unstructured text data in SQL tables. iFind uses many of the features of iKnow to provide intelligent text search.

iKnow is a content analysis tool. iKnow enables you to use the entire contents of the text data, including texts whose content is wholly unknown to you.

iKnow is not a dictionary-based content analyzer. Unlike dictionary-based tools, it does not break up sentences into individual words then attempt to “understand” those words and reconstruct context. iKnow simply identifies Entities semantically. It does not need to look up these Entities in a dictionary or ontology. For this reason its language model is compact, stable, and general-purpose; you do not have to specify any information about the type of texts being analyzed (medical, legal, etc.), or provide a separate dictionary of relevant terms. While iKnow can be extended by associating a dictionary or ontology of terms, its essential functions do not require one. Thus it does not require the creation, customizing, or periodic updating of a dictionary.

iKnow supports stemming, but is not by default a stemming tool. By default it does not reduce relations or concepts to stem forms. Instead, it treats each element as a distinct entity, then identifies its degree of similarity to other elements. iKnow supports stemming as an optional feature; it is recommended primarily for use with Russian and Ukrainian text sources. Caché also provides a set of classes you can use to perform stemming: the %Text package, as described in the InterSystems Class Reference. %Text and %iKnow are wholly independent of each other and are used for different purposes.

Logical Text Units Identified by iKnow

Sentences

iKnow uses a language model to divide the source text into sentences. In general, iKnow defines a sentence as a unit of text ending with a sentence terminator (usually a punctuation mark) followed by at least one space or line return. The next sentence begins with the next non-whitespace character. Capitalization is not required to indicate the beginning of a sentence.

Sentence terminators are (for most languages) the period (.), question mark (?), exclamation mark (!), and semi-colon (;). A sentence can be terminated by more than one terminator, such as ellipsis (...) or emphatic punctuation (??? or !!!). Any combination of terminators is permitted (...!?). A blank space between sentence terminators indicates a new sentence; therefore, an ellipsis containing spaces (. . .) is actually three sentences. A sentence terminator must be followed by either a whitespace character (space, tab, or line return), or by a single quote or double quote character, followed by a whitespace character. For example, "Why?" he asked. is two sentences, but "Why?", he asked. is a single sentence.

A double line return acts as a sentence terminator with or without a sentence terminator character. Therefore, a title or a section heading is considered to be a separate sentence, if followed by a blank line. The end of the file is also treated as a sentence terminator. Therefore, if a source contains any content at all (other than whitespace) it contains at least one sentence, regardless of the presence of a sentence terminator. Similarly, the last text in a file is treated as a separate sentence, regardless of the presence of a sentence terminator.
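
The general sentence boundary rules described above can be approximated with a regular expression. The following Python sketch is a simplified illustration, not the iKnow implementation: SENTENCE_END and split_sentences are hypothetical names, and the sketch knows nothing about abbreviations or other language-model exceptions.

```python
import re

# A run of terminators, an optional closing quote, then whitespace;
# or a blank line (double line return).
SENTENCE_END = re.compile(r"""[.?!;]+["']?(?=\s)|\n\s*\n""")


def split_sentences(text):
    """Split text at sentence boundaries; the end of the text also ends a sentence."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        piece = text[start:match.end()].strip()
        if piece:
            sentences.append(piece)
        start = match.end()
    tail = text[start:].strip()  # last text in the source, with or without terminator
    if tail:
        sentences.append(tail)
    return sentences
```

For example, split_sentences('"Why?" he asked.') yields two sentences, while split_sentences('"Why?", he asked.') yields one, because the comma after the closing quote prevents the boundary.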

A period followed by a blank space usually indicates a sentence break, though iKnow language models recognize exceptions to this rule. For example, the English language model recognizes common abbreviations, such as “Dr.” and “Mr.” (not case-sensitive) and removes the period rather than performing a sentence break. The English language model recognizes “No.” as an abbreviation, but treats lowercase “no.” as a sentence terminator.

You can use the UserDictionary option of your Configuration to cause or avoid sentence endings in specific cases. For example, the abbreviation “Fr.” (Father or Friar) is not recognized by the English language model. It is treated as a sentence break. You can use a UserDictionary to either remove the period or to specify that this use of a period should not cause a sentence break. A UserDictionary is applied as a source is loaded; already loaded sources are not affected.

Entities

An entity is a minimal logical unit of text. It is either a word or a group of words that iKnow logically groups together into either a concept or a relation. Other logical units, such as a telephone number or an email address, are also considered entities (and are treated as concepts).

Note:

Japanese text cannot be divided into concepts and relations. Instead iKnow analyzes Japanese text as a sequence of entities with associated particles. The definition of an “entity” for Japanese is roughly equivalent to a Concept in other iKnow languages. For a description of iKnow Japanese support (written in Japanese) refer to iKnow Japanese.

iKnow normalizes entities so that they may be compared and counted. It removes non-relevant words. It translates entities into lower case letters. It removes most punctuation and some special characters from entities.
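
A minimal Python sketch of this kind of normalization follows. The text above does not specify exactly which punctuation and special characters are removed, so the KEEP set, the word list, and the normalize_entity helper are assumptions made for illustration only.

```python
import string

NON_RELEVANT_WORDS = {"the", "a", "an"}  # assumed toy list
KEEP = "-'"  # assumption: characters this sketch leaves inside entities


def normalize_entity(raw):
    """Lower-case, strip most punctuation, and drop non-relevant words."""
    lowered = raw.lower()
    strip_chars = "".join(ch for ch in string.punctuation if ch not in KEEP)
    cleaned = lowered.translate(str.maketrans("", "", strip_chars))
    words = [w for w in cleaned.split() if w not in NON_RELEVANT_WORDS]
    return " ".join(words)
```

After normalization, "The Detroit-Hamtramck Assembly Plant," and "detroit-hamtramck assembly plant" compare as equal, which is what makes counting and comparison possible.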

By default, iKnow restricts its analysis of entities to Concepts. By default, Relations are only analyzed because of their role in linking Concepts together. This default can be overridden, as described in the “Limiting by Position” section of the iKnow Queries chapter.

Path-relevant Words

iKnow identifies certain words in each language as being an essential part of its analysis of sentences and paths, but otherwise not relevant. Outside of the context of a sentence or path, these words have little informational content. The following are typical path-relevant words:

Pronouns of all types: definite, indefinite, possessive.
Indefinite expressions of time, frequency, or place. For example, “then”, “soon”, “later”, “sometimes”, “all”, “here”.

A word is only considered a path-relevant if it is not part of a Concept or a Relation. For example: “He said this was his” contains path-relevants; “His teacher said this signature was his name” does not contain path-relevants.

Path-relevant words are not considered Concepts, nor are they counted in frequency or dominance calculations. Path-relevant words may be negation or time attribute markers. Path-relevant words are not stemmed.

Non-relevant Words

iKnow identifies certain words in each language as being non-relevant, and excludes these words from iKnow indexing. There are several kinds of non-relevant words:

Articles (such as “the” and “a”) and other words that the iKnow language model identifies as having little or no semantic importance.
Prefatory words or phrases at the beginning of a sentence, such as “And”, “Nevertheless”, “However”, “On the other hand”.
Character strings over 150 characters that are unbroken by spaces or sentence punctuation. A “word” of this length is highly likely to be a non-text entity, and is thus excluded from iKnow indexing. Because in rare cases (such as chemical nomenclature or URL strings) these 150+ character words are semantically relevant, iKnow flags them with the attribute “nonsemantic”.

Non-relevant words are excluded from iKnow indexing, but are preserved when sentences are displayed.
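
The following Python sketch illustrates the distinction between what is indexed and what is merely preserved for display. The word list and the split_for_indexing helper are hypothetical; only the 150-character threshold comes from the text above.

```python
NON_RELEVANT_WORDS = {"the", "a", "an"}  # assumed toy list
MAX_WORD_LENGTH = 150                    # threshold stated in the text above


def split_for_indexing(sentence):
    """Return (indexed_words, nonsemantic_words); the sentence itself is not modified."""
    indexed, nonsemantic = [], []
    for word in sentence.split():
        if len(word) > MAX_WORD_LENGTH:
            nonsemantic.append(word)  # flagged rather than indexed
        elif word.lower() not in NON_RELEVANT_WORDS:
            indexed.append(word.lower())
    return indexed, nonsemantic


blob = "x" * 151
sentence = f"The report contains {blob} data"
indexed, nonsemantic = split_for_indexing(sentence)
```

The original sentence string still contains "The" and the long token, so it can be displayed unchanged even though neither is indexed.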

CRCs and CCs

Once iKnow divides a sentence into Concepts (C) and Relations (R), it can determine several types of connections between these fundamental entities.

CRC is a Concept-Relation-Concept sequence. A CRC is handled as a Master Concept - Relation - Slave Concept sequence. Whether an entity is a Master, Relation, or Slave is known as its position. In some cases, a CRC may have an empty string value for one of the sequence members (CR or RC); this can occur, for example, when the Relation of the CRC is an intransitive verb: “Robert slept.”
CC is a Concept + Concept pair. iKnow retains the position of each Concept, but ignores the Relation between the two Concepts. A CC can be handled either as two associated Concepts or as a Master Concept/Slave Concept sequence. Handling CC pairs as associated Concepts, without regard to master/slave position or the linking Relation, is especially useful when determining a network of Concepts, that is, which Concepts have a connection to which other Concepts.
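
The relationship between CRCs and CC pairs can be sketched in Python as follows. The CRC tuple and the cc_pairs and neighbours helpers are hypothetical names used for illustration; they do not correspond to iKnow classes or methods.

```python
from typing import NamedTuple


class CRC(NamedTuple):
    master: str
    relation: str
    slave: str


crcs = [
    CRC("general motors", "builds", "chevrolet volt"),
    CRC("chevrolet volt", "in", "detroit-hamtramck assembly plant"),
    CRC("robert", "slept", ""),  # intransitive verb: empty Slave position
]


def cc_pairs(crcs, ordered=True):
    """Drop the Relation; keep (master, slave) order or an unordered pair."""
    pairs = []
    for crc in crcs:
        if not crc.master or not crc.slave:
            continue  # a CR or RC sequence has no Concept pair
        if ordered:
            pairs.append((crc.master, crc.slave))
        else:
            pairs.append(frozenset((crc.master, crc.slave)))
    return pairs


def neighbours(crcs, concept):
    """Concepts connected to `concept`, regardless of position or Relation."""
    found = set()
    for pair in cc_pairs(crcs, ordered=False):
        if concept in pair:
            found |= pair - {concept}
    return found
```

Calling neighbours(crcs, "chevrolet volt") returns both of the other Concepts, which is the kind of lookup a Concept network is built from.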

Note:

Japanese cannot be analyzed semantically in terms of CRCs or CCs because iKnow does not divide Japanese entities into concepts and relations.

Paths

A Path is a meaningful sequence of Entities through a sentence. In Western languages, Paths are commonly based on sequential CRCs, so the resulting Paths keep the entities (Concepts and Relations) in their original sentence order. Commonly, though not exclusively, a Path takes the form of a continuous sequence of CRCs in which the Slave Concept of one CRC becomes the Master Concept of the next CRC. Two CRCs chained in this way produce a path consisting of five entities: C-R-C-R-C. Other meaningful sequences of Concepts and Relations are also treated as paths, such as a sequence that contains a path-relevant pronoun as a stand-in for a Concept.
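
The common case, in which the Slave Concept of one CRC is the Master Concept of the next, can be sketched in Python. The chain_crcs helper is a hypothetical name, and the sketch covers only this one pattern; it does not handle path-relevant words or the Japanese entity vector approach.

```python
def chain_crcs(crcs):
    """Join consecutive CRCs whose Slave equals the next Master into one path."""
    paths = []
    current = []
    for master, relation, slave in crcs:
        if current and current[-1] == master:
            current.extend([relation, slave])  # continue the running path
        else:
            if current:
                paths.append(current)
            current = [master, relation, slave]  # start a new path
    if current:
        paths.append(current)
    return paths


paths = chain_crcs([
    ("general motors", "builds", "chevrolet volt"),
    ("chevrolet volt", "in", "detroit-hamtramck assembly plant"),
])
```

The two CRCs of the example sentence chain into a single five-entity path, while CRCs that do not share a Concept would yield separate paths.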

In Japanese, Paths cannot be based on the sequence of Entities in the original sentence. iKnow nevertheless does identify Paths as meaningful sequences of Entities within Japanese text. iKnow semantic analysis of Japanese uses an entity vector algorithm to create Entity Vectors. When iKnow converts a Japanese sentence into an Entity Vector it commonly lists the Entities in a different order than the original sentence to indicate which Entities are linked to each other and how strong the link between them is. The resulting Entity Vector is used for Path analysis.

A Path must contain at least two Entities. Not all sentences are paths; some very short sentences may not contain the minimum number of Entities to qualify as a path.

A path is always contained within a single sentence. However, a sentence may contain more than one path. This can occur when iKnow identifies a non-continuous sequence within the sentence. Once identified, the entities that comprise a path sequence are demarcated and normalized, and the path is assigned a unique Id. Paths are useful when CRCs alone are too small a unit to capture meaningfully associated entities. Paths are especially useful for returning a smaller linguistic unit within its wider context.

Smart Indexing

Smart indexing is the process of translating unstructured text into a relational network of Concepts and Relations. You can index the contents of multiple unstructured texts, then analyze the resulting indexed entities according to user-defined query criteria, such as listing concepts in order of frequency. Each indexed entity can reference its source text, source sentence, and relational entities, such as its position in a CRC sequence. As part of smart indexing, iKnow assigns two values to each indexed concept, specifying the total number of appearances of the concept in the texts (its frequency), and the number of different texts in which the concept appears (its spread).
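
The two values, frequency and spread, are simple to compute once concepts have been extracted per source. The Python sketch below assumes a hypothetical input shape (a mapping from source id to the list of concepts found in that source) and a hypothetical helper name.

```python
from collections import Counter


def frequency_and_spread(sources):
    """sources maps a source id to the list of concepts found in that source."""
    frequency = Counter()  # total appearances across all sources
    spread = Counter()     # number of distinct sources containing the concept
    for concepts in sources.values():
        frequency.update(concepts)
        spread.update(set(concepts))
    return frequency, spread


sources = {
    "doc1": ["chevrolet volt", "general motors", "chevrolet volt"],
    "doc2": ["chevrolet volt", "assembly plant"],
}
frequency, spread = frequency_and_spread(sources)
```

In this sample, "chevrolet volt" appears three times in total but in only two sources, so its frequency and spread differ.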

Once you have performed smart indexing on multiple texts, iKnow can use this information to analyze the source texts. For example, iKnow can perform intelligent content browsing. From any selected iKnow indexed item, you can browse to other items based on the degree of similarity between these items. Intelligent browsing can be performed within a source text or across all indexed source texts.

Once texts are indexed, iKnow can generate summaries of individual texts. The user specifies the length of the summary as a percentage of the original text. iKnow returns a summary text consisting of those sentences of the original text that are most relevant to the whole, based on index statistics. For example, if a text consists of 100 sentences, and the user specifies a 50% summary, iKnow generates a summary text consisting of the 50 most relevant sentences from the original.
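
The selection step of such a summary can be sketched in Python. How iKnow scores sentence relevance is not described here, so the sketch takes the scores as given; the summarize helper, the rounding up of fractional counts, and the choice to return sentences in their original order are all assumptions.

```python
import math


def summarize(sentences, scores, percent):
    """Keep the top `percent` of sentences by score, in their original order."""
    keep = math.ceil(len(sentences) * percent / 100)  # rounding up is an assumption
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:keep])  # restore original sentence order
    return [sentences[i] for i in chosen]


summary = summarize(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.7], 50)
```

With 100 sentences and percent set to 50, the function returns 50 sentences, matching the example above.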

Smart Matching

Once iKnow has indexed a collection of texts, it is possible to match items found in the texts with one or more user-defined match lists and to tag these matches. Smart matching performs high-precision tagging of concepts and phrases based on a semantic understanding of the complete context. Thus matches can occur between similar concepts or phrases, as well as full (identical) matches. Because this tagging is based on finding semantic matches, smart matching does not require any understanding of the text contents.

Once tagged, each appearance of a matched phrase in the texts remains associated with the tag text. These phrases can be matched as a single entity, a CRC, or a path. For example, the user could supply a list of the names of countries, so that each appearance of a country name in the texts is tagged for rapid access. You can build a dictionary of company names that you can match against analyst reports, allowing you to quickly find the latest news about the companies you're interested in. You can create a dictionary in which each appearance of a specified medical procedure (phrased in various ways) is matched to a medical diagnostic code.

This dictionary matching is not limited to simple entities, but extends to CRCs and/or paths if the terms in the dictionary span more than one entity themselves. Because iKnow indexes dictionary terms in the same way that it indexes source texts, a dictionary entry may be as long as a sentence. It may be useful to match a dictionary entry sentence against sources to locate similar information.
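
The difference between full and partial matching of a single entity can be illustrated with a small Python sketch. The match_entity helper and the sample dictionary are hypothetical, and treating any shared word as a partial match is an assumption made for this example; the matching rules iKnow actually applies, including matching of CRCs, paths, and formats, are not shown.

```python
def match_entity(entity, dictionary):
    """Return (term, tag, kind) matches for one normalized entity.

    'full' means the entity equals the term; 'partial' means they share a word.
    """
    entity_words = set(entity.split())
    matches = []
    for term, tag in dictionary.items():
        if entity == term:
            matches.append((term, tag, "full"))
        elif entity_words & set(term.split()):
            matches.append((term, tag, "partial"))
    return matches


countries = {"france": "COUNTRY", "united states": "COUNTRY"}
full = match_entity("france", countries)
partial = match_entity("united states senate", countries)
```

Each match keeps the dictionary tag, so every tagged appearance in the texts can later be retrieved by that tag.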