NLP Queries

The NLP semantic analysis engine supplies a large number of InterSystems IRIS® data platform query APIs, which return text entities and statistics about those entities. For example, the %iKnow.Queries.CrcAPI.GetTop() method returns the most frequently occurring CRCs in a specified domain. The %iKnow.Queries.CrcAPI.GetCountBySource() method returns the total number of unique CRCs that appear in the specified sources.

Refer to A Note on Program Examples for details on the coding and data used in the examples in this book.

Types of Queries

There are three types of queries provided. They are distinguished by their name suffixes:

API: ObjectScript queries
QAPI: InterSystems SQL queries
WSAPI: SOAP-accessible Web Services queries

For each of these types, NLP provides queries for:

Entities: return all entities in a source or multiple sources; the most frequently occurring entities; entities similar to a supplied string, etc.
CCs: return concept-concept pairs.
CRCs: return concept-relation-concept (head-relation-tail) sequences.
Paths: return chains of concept-relation-concept sequences within a sentence. A path contains a minimum of two CRCs (CRCRC).
Sentences: return sentences that contain a specified CRC, entity, etc.
Sources: return sources that contain a specified CRC, entity, etc.
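
To make the relationship between these units concrete, the following Python sketch models a sentence as an alternating sequence of concepts and relations and derives its CRCs and CCs. It is a conceptual illustration only: the sentence breakdown, the crcs() and ccs() helper names, and the assumption that a CC is the head and tail of a CRC are all hypothetical, and the NLP engine does not store or compute its indexes this way.

```python
# Conceptual model only: a sentence as an alternating sequence of
# concepts ("C") and relations ("R"). The breakdown below is invented.
sentence = [
    ("C", "student pilot"),
    ("R", "landed"),
    ("C", "airplane"),
    ("R", "on"),
    ("C", "runway"),
]

def crcs(seq):
    """Return every concept-relation-concept triple in the sequence."""
    out = []
    for i in range(len(seq) - 2):
        window = seq[i:i + 3]
        if [kind for kind, _ in window] == ["C", "R", "C"]:
            out.append(tuple(value for _, value in window))
    return out

def ccs(seq):
    """Return the (head, tail) concept pair of each CRC (an assumption)."""
    return [(head, tail) for head, _, tail in crcs(seq)]

print(crcs(sentence))
print(ccs(sentence))
# The whole five-element sequence chains two CRCs, which is the
# shortest form a path can take (CRCRC).
```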

Queries Described in this Chapter

This chapter describes and provides examples of many commonly used NLP queries.

Note that the query examples in this chapter use the default Configuration. Queries that you write may require a specified Configuration to establish the language environment.

Query Method Parameters

The following parameters are common to many query methods:

domainid: The domain ID is an integer that identifies the domain.
result: If a query returns an array of values rather than just a single result value, the result set is passed by reference (using the dot prefix operator, for example, .result). You can then use ZWRITE to display the whole result set in raw InterSystems IRIS list format, or use a loop structure and list-to-string conversion to return one row at a time, as shown in the examples in this chapter.
page and pagesize (optional): To prevent methods from retrieving and returning thousands of records, the Query API uses a paging mechanism to allow the user to limit the number of results returned. It divides the results into equal-length pages, with the length of each page specified as the pagesize. For example, if you want the first ten results, you specify page 1 and a pagesize of 10. If you want the next page of results, you specify page 2 and pagesize 10. The default values are page=1 and pagesize=10.
setop (optional): If a query applies more than one selection criterion, the setop logical operator specifies whether the query should return the union or the intersection of the result sets. 1 ($$$UNION) returns results that match any of the supplied selection criteria. 2 ($$$INTERSECT) returns results that match all of the supplied selection criteria. The default is $$$UNION.
entitylist: An InterSystems IRIS list of entities, used in queries that return matches to an entity (for example, GetByEntities(), GetRelated(), GetSimilar()). You can specify entitylist entities in any mix of uppercase and lowercase letters; NLP matches them against indexed entities normalized to lowercase.
vSrcId: The source Id of a virtual source, specified as a negative integer. If specified, only entities in that virtual source are processed by the query. If omitted or specified as 0, only ordinary sources are considered by the query and virtual sources are ignored. The default is 0.
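
The page, pagesize, and setop semantics described above can be modeled in a few lines of Python. This is a sketch of the documented behavior, not the Query API itself: get_page(), combine(), and the sample data are hypothetical names and values, and only the numeric values 1 and 2 for union and intersection come from the parameter description.

```python
UNION, INTERSECT = 1, 2  # mirror the documented setop values

def combine(result_sets, setop=UNION):
    """Combine per-criterion matches by union or by intersection."""
    sets = [set(s) for s in result_sets]
    if not sets:
        return set()
    if setop == UNION:
        return set.union(*sets)
    return set.intersection(*sets)

def get_page(rows, page=1, pagesize=10):
    """Return the rows that fall on the given 1-based page."""
    start = (page - 1) * pagesize
    return rows[start:start + pagesize]

rows = list(range(1, 26))        # 25 hypothetical result rows
print(get_page(rows, 1, 10))     # first page: rows 1 through 10
print(get_page(rows, 3, 10))     # final page is shorter: rows 21 through 25

has_wing = {1, 2, 3}             # hypothetical source IDs matching one entity
has_engine = {2, 3, 4}           # hypothetical source IDs matching another
print(sorted(combine([has_wing, has_engine], UNION)))
print(sorted(combine([has_wing, has_engine], INTERSECT)))
```

A page number past the end of the results simply yields an empty list in this model, which is why the chapter's examples loop until no further row exists rather than assuming a full page.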

Counting Sources and Sentences

To count the number of sources loaded, you can use the GetCountByDomain() method of the %iKnow.Queries.SourceAPI class.

To count the sentences in all of the sources loaded, you can use the GetCountByDomain() method of the %iKnow.Queries.SentenceAPI class. To count the sentences in a single source, you can use the GetCountBySource() method.

The following example uses data loaded from .txt files (such as source1.txt, source2.txt, etc.) in the mytextfiles directory to demonstrate these sentence count methods. The default Configuration is used:

DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
      { SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
        GOTO DeleteOldData }
  ELSE 
     { SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET mylister=##class(%iKnow.Source.File.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
  SET stat=myloader.SetLister(mylister)
  SET install=$SYSTEM.Util.DataDirectory()
  SET dirpath=install_"mgr\Temp\iknow\mytextfiles"
  SET stat=myloader.ProcessList(dirpath,$LB("txt"),0,"")
  IF stat '= 1 { WRITE "Loader error ",$System.Status.DisplayError(stat)
                     QUIT }
SourceSentenceQueries
  SET numSrcD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
  SET numSentD=##class(%iKnow.Queries.SentenceAPI).GetCountByDomain(domId)
  WRITE "These sources contain ",numSentD," sentences",!!
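  // List up to 20 sources; each result row begins with the source ID and external ID
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,20)
  SET i=1
  WHILE $DATA(result(i)) {
     SET srcId = $LISTGET(result(i),1)
     SET extId = $LISTGET(result(i),2)
     SET fullref = $PIECE(extId,":",3,4)
     // The file name is the last backslash-delimited piece of the path
     SET fname = $PIECE(fullref,"\",$LENGTH(fullref,"\"))
     // Pass only the source ID, wrapped in a list, to count its sentences
     SET numSentS = ##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,$LB(srcId))
     WRITE fname," has ",numSentS," sentences",!
     SET i=i+1 }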

The following example uses data loaded from a field of the Aviation.Event SQL table to demonstrate these sentence count methods. In this example only a sample of 10 data records (TOP 10) is loaded:

DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 10 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceSentenceQueries
  SET numSrcD=##class(%iKnow.Queries.SourceQAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
  SET numSentD=##class(%iKnow.Queries.SentenceQAPI).GetCountByDomain(domId)
  WRITE "These sources contain ",numSentD," sentences",!!
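  // List up to 20 sources; each result row begins with the source ID and external ID
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,20)
  SET i=1
  WHILE $DATA(result(i)) {
     SET srcId = $LISTGET(result(i),1)
     SET extId = $LISTGET(result(i),2)
     SET fullref = $PIECE(extId,":",3,4)
     SET fname = $PIECE(fullref,"\",$LENGTH(fullref,"\"))
     // Pass only the source ID, wrapped in a list, to count its sentences
     SET numSentS = ##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,$LB(srcId))
     WRITE fname," has ",numSentS," sentences",!
     SET i=i+1 }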

For details on what NLP considers a sentence, refer to the Logical Text Units Identified by NLP section of the “Conceptual Overview” chapter.

Counting Entities

To count the number of sources that contain one or more occurrences of a specified entity, you can use the GetCountByEntities() method of the %iKnow.Queries.SourceAPI class. In this method you can specify a list of one or more entities to search for in the loaded sources.

Note that here, and throughout NLP, the concept of “entity” differs significantly from the familiar notion of a search term. For example, the entity “dog” does not occur in the sentence “The quick brown fox jumped over the lazy dog.” The entity “lazy dog” does occur in this sentence. An entity can be a concept or a relation; you could, for example, count the number of sources that contain the entity “is” or the entity “jumped over”. However, in these examples and in most real-world cases, NLP matches concepts or concepts associated by a relation.
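
The distinction between an entity and a search term can be illustrated with a short Python sketch. The entity breakdown below is a hypothetical stand-in for what the NLP engine would produce for this sentence, and contains_entity() is an illustrative helper, not part of any API.

```python
# Hypothetical entity breakdown of the example sentence; the real
# segmentation is produced by the NLP engine.
sentence = "The quick brown fox jumped over the lazy dog."
entities = ["quick brown fox", "jumped over", "lazy dog"]

def contains_entity(entity_list, candidate):
    """True only if the candidate equals a whole entity (case-insensitive)."""
    return candidate.lower() in entity_list

print("dog" in sentence.lower())              # a substring search finds "dog"
print(contains_entity(entities, "dog"))       # but "dog" alone is not an entity
print(contains_entity(entities, "lazy dog"))  # "lazy dog" is a whole entity
```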

The following example demonstrates these query count methods:

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  SET numSrcD=##class(%iKnow.Queries.SourceQAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
SingleEntityCounts
  SET ent=$LB("NTSB","National Transportation Safety Board",
    "NTSB investigator-in-charge","NTSB oversight","NTSB's Materials Laboratory",
    "FAA","Federal Aviation Administration","FAA inspector")
  SET entcnt=$LISTLENGTH(ent)
  SET ptr=0
  FOR x=1:1:entcnt {
   SET stat=$LISTNEXT(ent,ptr,val)
   WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByEntities(domId,val)," contain ",val,!
   }
   WRITE "end of listing"

Listing Top Entities

NLP has three query methods you can use to return the “top” entities in the source documents of a domain:

GetTop() lists the most frequently occurring entities, by default in descending order by frequency count. It can also list the most frequently occurring entities by spread. It provides the frequency and spread for each entity. This is the most basic listing of top entities.
GetTopTFIDF() lists the top entities using a frequency-based metric similar to the TFIDF score. It calculates this score by combining an entity’s Term Frequency (TF) with its Inverse Document Frequency (IDF). The Term Frequency counts how often the entity appears in a single source. The Inverse Document Frequency is based on the inverse of the spread (also known as “document frequency”) of an entity. It uses this IDF value to diminish the Term Frequency. Thus an entity that appears multiple times in a small percentage of the sources is given a high TFIDF score; an entity that appears multiple times in a large percentage of the sources is given a low TFIDF score.
GetTopBM25() lists the top entities using a frequency-based metric similar to the Okapi BM25 algorithm, which combines an entity's Term Frequency with its Inverse Document Frequency (IDF), taking into account document length.

All three of these methods return top Concepts by default, but can be used to return top Relations. All three of these methods can apply a filter to limit the scope of sources used.

The GetTop() method ignores entities of less than three characters. The GetTopTFIDF() and GetTopBM25() methods can return 1-character and 2-character entities.
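
The effect of combining Term Frequency with Inverse Document Frequency can be seen in a small Python sketch. It uses the textbook TF-IDF formula as an assumption; the documentation says only that GetTopTFIDF() uses a metric similar to TFIDF, so the exact formula, the tfidf() helper, and the sample counts are illustrative rather than the NLP implementation.

```python
import math

def tfidf(term_frequency, spread, total_sources):
    """Textbook TF-IDF: frequency damped by how widely the term is spread."""
    idf = math.log(total_sources / spread)
    return term_frequency * idf

total = 100  # hypothetical number of sources in the domain

# Same term frequency, very different spread
rare = tfidf(term_frequency=5, spread=4, total_sources=total)
common = tfidf(term_frequency=5, spread=90, total_sources=total)

print(round(rare, 2), round(common, 2))
```

With identical term frequencies, the entity found in only 4 of 100 sources scores far higher than the one found in 90 of them, which matches the behavior described for GetTopTFIDF().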

GetTop(): Most-Frequently-Occurring Entities

An NLP query can return the most frequently occurring entities in the source documents in descending order of frequency or spread. Each entity is returned as a separate record in InterSystems IRIS list format.

The entity record format is as follows:

The entity ID, a unique integer assigned by NLP.
The entity value, specified as a string.
Frequency: an integer count of how many times the entity occurs in the source documents.
Spread: an integer count of how many source documents contain the entity.

The following query returns the most frequent (top) entities in the sources loaded by this program. By default these are Concept entities. It sets the page (1) and pagesize (50) parameters to specify how many entities to return. It returns (at most) the top 50 entities. It uses the domain default sorttype, which is in descending order by frequency:

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  SET numSrcD=##class(%iKnow.Queries.SourceQAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
TopEntitiesQuery
  DO ##class(%iKnow.Queries.EntityAPI).GetTop(.result,domId,1,50)
  SET i=1
  WHILE $DATA(result(i)) {
       SET outstr = $LISTTOSTRING(result(i),",",1)
         SET entity = $PIECE(outstr,",",2)
         SET freq = $PIECE(outstr,",",3)
         SET spread = $PIECE(outstr,",",4)
       WRITE "[",entity,"] appears ",freq," times in ",spread," sources",!
       SET i=i+1 }
  WRITE "Printed the top ",i-1," entities"

The following GetTop() method call returns the top entities by spread:

  DO ##class(%iKnow.Queries.EntityAPI).GetTop(.result,domId,1,50,,,$$$SORTBYSPREAD)
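
The difference between the two sort orders can be seen in a small Python model of the entity record format (ID, value, frequency, spread). The rows below are invented for illustration and are not taken from the Aviation.Event data.

```python
# Each row: (entity ID, value, frequency, spread). Invented sample rows.
records = [
    (1, "airplane", 120, 60),
    (2, "pilot", 90, 75),
    (3, "helicopter", 95, 12),
]

# Descending by frequency (the default ordering described above)
by_frequency = sorted(records, key=lambda r: r[2], reverse=True)

# Descending by spread (what a sort-by-spread request asks for)
by_spread = sorted(records, key=lambda r: r[3], reverse=True)

print([r[1] for r in by_frequency])
print([r[1] for r in by_spread])
```

An entity concentrated in a few sources ("helicopter" in this sample) ranks high by frequency but low by spread.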

GetTopTFIDF() and GetTopBM25()

These two methods return a list of top entities in descending order by a calculated score. By default these are Concept entities. Because they are using different algorithms to assign a score to an entity, the list of “top” entities may differ significantly. For example, the following table shows the relative order of four entities in the Aviation.Event database when analyzed using different methods:

Method	“airplane”	“helicopter”	“flight instructor”	“student pilot”
GetTop()	1st	12th	17th	43rd
GetTopTFIDF()	(not in listing)	1st	4th	22nd
GetTopBM25()	(not in listing)	3rd	2nd	1st

The top 5 entities in the Aviation.Event database returned by GetTop() are: “airplane”, “pilot”, “engine”, “flight”, and “accident”. All of these entities occur at least once in more than half of the sources. While these are frequently-occurring entities, they are of little value in determining the contents of specific sources. An entity that occurs in more than half of the sources is given a negative IDF value. For this reason, none of these entities appear in the GetTopTFIDF() and GetTopBM25() listings.
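
The sign change at the halfway point follows from the standard Okapi BM25 form of IDF, sketched below in Python. Treating this as the formula NLP uses is an assumption (the documentation says only that the metrics are similar to TFIDF and BM25); bm25_idf() and the source counts are illustrative.

```python
import math

def bm25_idf(spread, total_sources):
    """Okapi BM25 inverse document frequency (standard textbook form)."""
    return math.log((total_sources - spread + 0.5) / (spread + 0.5))

total = 100  # hypothetical number of sources in the domain
for spread in (5, 50, 51, 90):
    print(spread, round(bm25_idf(spread, total), 3))
```

In this form the IDF is positive while an entity appears in fewer than half of the sources, zero at exactly half, and negative beyond that, so very widely spread entities cannot reach the top of a score-ordered listing.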

The following example lists the top 50 entities using GetTopTFIDF():

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  SET numSrcD=##class(%iKnow.Queries.SourceQAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
TopEntitiesQuery
  DO ##class(%iKnow.Queries.EntityAPI).GetTopTFIDF(.result,domId,1,50)
  SET i=1
  WHILE $DATA(result(i)) {
       SET outstr = $LISTTOSTRING(result(i),",",1)
         SET entity = $PIECE(outstr,",",2)
         SET score = $PIECE(outstr,",",3)
       WRITE "[",entity,"] has a TFIDF score of ",score,!
       SET i=i+1 }
  WRITE "Printed the top ",i-1," entities"

The following example lists the top 50 entities using GetTopBM25():

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  SET numSrcD=##class(%iKnow.Queries.SourceQAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
TopEntitiesQuery
  DO ##class(%iKnow.Queries.EntityAPI).GetTopBM25(.result,domId,1,50)
  SET i=1
  WHILE $DATA(result(i)) {
       SET outstr = $LISTTOSTRING(result(i),",",1)
         SET entity = $PIECE(outstr,",",2)
         SET score = $PIECE(outstr,",",3)
       WRITE "[",entity,"] has a BM25 score of ",score,!
       SET i=i+1 }
  WRITE "Printed the top ",i-1," entities"

CRC Queries

An NLP query that returns a CRC (Concept-Relation-Concept sequence) returns it in the following format:

The CRC ID, a unique integer assigned by NLP.
The Head Concept, specified as a string.
The Relation, specified as a string.
The Tail Concept, specified as a string.
Frequency: an integer count of how many times the CRC occurs in the source documents.
Spread: an integer count of how many source documents contain the CRC.

Listing CRCs that Contain Entities

One common use of CRCs is to specify an entity (usually a Concept) and return the CRCs that contain that entity. This provides the various contexts in which an entity appears in a source (or sources). Because NLP normalizes all text to lowercase letters, you must specify these matching entities in lowercase.
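
The matching rule (an entity may appear as either the head or the tail of a CRC) can be modeled in Python as follows. The CRCs and the crcs_with_entities() helper are hypothetical; the sketch shows the selection logic only, not how NLP indexes or retrieves CRCs.

```python
# Each CRC: (head, relation, tail). Invented sample data, already lowercase.
crcs = [
    ("left wing", "struck", "tree"),
    ("pilot", "inspected", "leading edge"),
    ("airplane", "departed", "runway"),
]

def crcs_with_entities(crc_list, entities):
    """Keep CRCs whose head or tail concept is one of the entities."""
    wanted = set(entities)
    return [c for c in crc_list if c[0] in wanted or c[2] in wanted]

for crc in crcs_with_entities(crcs, ["left wing", "leading edge"]):
    print(crc)
```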

The following query returns all of the CRCs that contain the specified Concepts (“left wing”, “right wing”, “wings”, “leading edge”, and “trailing edge”) as either the head concept or the tail concept of a CRC. Note that the GetByEntities() method pagesize argument has been set to 25 to return more CRCs; it defaults to 10.

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
CRCQuery
  SET myconcepts=$LB("left wing","right wing","wings","leading edge","trailing edge")
  DO ##class(%iKnow.Queries.CrcAPI).GetByEntities(.result,domId,myconcepts,1,25)
  SET i=1
  WHILE $DATA(result(i)) {
     SET mycrcs=$LISTTOSTRING(result(i),",",1)
     WRITE "[",$PIECE(mycrcs,",",2,4),"]"
     WRITE "  appears ",$PIECE(mycrcs,",",5)," times in "
     WRITE $PIECE(mycrcs,",",6)," sources",!
     SET i=i+1 }
  WRITE !,"End of listing"

Counting Sources that Contain a CRC

The following program example returns the count of sources that contain the specified CRCs. To specify CRCs to the GetCountByCrcs() method, you must specify each CRC as a %List (using $LB), and then group these CRCs together as a %List. This is shown in the following example:

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
CRCCount
  SET numSrcD=##class(%iKnow.Queries.SourceQAPI).GetCountByDomain(domId)
  SET mycrcs=$LB($LB("leading edge","of","wing"),$LB("leading edge","of","right wing"),
             $LB("leading edge","of","left wing"),$LB("leading edges","of","wings"),
             $LB("leading edges","of","both wings"))
  SET numSrc=##class(%iKnow.Queries.SourceAPI).GetCountByCrcs(domId,mycrcs)
  WRITE "From ",numSrcD," indexed sources there are ",!
  WRITE numSrc," sources containing one or more of the following CRCs:",!
  FOR i=1:1:$LISTLENGTH(mycrcs) {
      WRITE $LISTTOSTRING($LIST(mycrcs,i)," "),!
  }

The GetCountByCrcs() method returns the count of sources that contain any of the specified CRCs.
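
The "any of the specified CRCs" semantics can be sketched in Python. The per-source CRC sets and the count_sources_with_any() helper are invented for illustration and do not reflect how NLP stores its indexes.

```python
# source ID -> set of (head, relation, tail) CRCs found in it (invented data)
source_crcs = {
    1: {("leading edge", "of", "wing"), ("pilot", "reported", "vibration")},
    2: {("leading edge", "of", "left wing")},
    3: {("airplane", "struck", "fence")},
}

def count_sources_with_any(sources, crcs):
    """Count sources that contain at least one of the given CRCs."""
    wanted = set(crcs)
    return sum(1 for found in sources.values() if found & wanted)

query = [("leading edge", "of", "wing"), ("leading edge", "of", "left wing")]
print(count_sources_with_any(source_crcs, query))
```

Each source is counted once, no matter how many of the listed CRCs it contains.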

Listing Sources or Sentences that Fulfill a CRC Mask

You can use a CRC mask to specify an entity value for a specific CRC position. Each CRC has three positions: head, relation, and tail. With a CRC mask you can specify either an entity value or a wildcard for each position. A CRC mask enables you to list sources or sentences that contain CRCs that match one or more positional values. Because it specifies both position and entity value, the GetByCrcMask() partial CRC match is a more restrictive match than GetByEntities(), but a less restrictive match than GetByCrcs().
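
A CRC mask can be thought of as a three-slot pattern. The Python sketch below models that idea; the WILDCARD constant is a stand-in for $$$WILDCARD (its real value is not taken from this chapter), and matches_mask() and the sample CRCs are hypothetical.

```python
WILDCARD = "*"  # stand-in for $$$WILDCARD; an assumption for this sketch

def matches_mask(crc, head=WILDCARD, relation=WILDCARD, tail=WILDCARD):
    """True if every non-wildcard mask position equals the CRC's value."""
    mask = (head, relation, tail)
    return all(m == WILDCARD or m == part for m, part in zip(mask, crc))

crcs = [
    ("student pilot", "landed", "airplane"),
    ("instructor", "briefed", "student pilot"),
]

# Only the first CRC has "student pilot" in the head position
print([c for c in crcs if matches_mask(c, head="student pilot")])

# An entity match that ignores position would accept both CRCs
print([c for c in crcs if "student pilot" in (c[0], c[2])])
```

The second listing shows why a mask is more restrictive than a match by entity: position matters as well as value.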

The following example uses a CRC mask that matches the entity “student pilot” in head position, while using wildcards to permit any value in the CRC relation and tail positions. The GetByCrcMask() method matches this mask against every sentence in each source, and returns the sentence Id and the sentence text of those sentences that contain a CRC with “student pilot” in the head position.

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
CRCMaskSentencesBySource
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,100)
  SET i=1
  WHILE $DATA(result(i)) {
     SET srcId = $LISTGET(result(i),1)
     SET extId = $LISTGET(result(i),2)
     SET srcname = $PIECE($PIECE(extId,":",3,4),"\",$LENGTH(extId,"\"))
     SET numSentS = ##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,$LB(srcId))
     WRITE numSentS," sentences in ",srcname,!
     SET stat=##class(%iKnow.Queries.SentenceAPI).GetByCrcMask(.sentresult,domId,"student pilot",
              $$$WILDCARD,$$$WILDCARD,srcId)
     SET i=i+1
     FOR j=1:1:20 {
         IF $DATA(sentresult(j)) {
         SET sent = $LISTTOSTRING(sentresult(j),",",1)
         SET sentId = $PIECE(sent,",",3)
         WRITE "The SentenceId is ",sentId," in source ",srcname,":",!
         WRITE "  ",##class(%iKnow.Queries.SentenceAPI).GetValue(domId,sentId),!
         }
         ELSE { WRITE "Listed ",j-1," sentences that match the CRC mask",!!
         QUIT }
     }
  }

Listing Similar Entities

You can list the unique entities that are similar to a specified string. An entity is similar if one of the following applies:

The string is identical to the entity.
The string is one of the words of the entity.
The string is the first letters of one of the words of the entity.

Similarity returns each unique entity (Head Concept or Tail Concept) with integer counts of its frequency and spread, in descending sort order of these integer counts. Similarity does not match Relations. As is true throughout NLP, matching ignores letter case; all entities are returned in lowercase letters. Similarity does not use stemming logic; “cat” returns both “cats” and “category”.
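
The three similarity rules listed above can be expressed as a short Python predicate. This is a literal reading of those rules for a single-word search string; is_similar() is a hypothetical helper, and how NLP treats multi-word strings is not modeled here.

```python
def is_similar(string, entity):
    """Apply the three documented similarity rules, ignoring letter case."""
    s, e = string.lower(), entity.lower()
    if s == e:                                   # rule 1: identical
        return True
    words = e.split()
    if s in words:                               # rule 2: one of the words
        return True
    return any(w.startswith(s) for w in words)   # rule 3: first letters of a word

for candidate in ("cats", "category", "black cat", "bobcat"):
    print(candidate, is_similar("cat", candidate))
```

Because there is no stemming, "cat" matches "category" just as readily as "cats"; "bobcat" is rejected because the string is not at the start of a word.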

The following example lists the entities that are similar to the string “student pilot”:

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," total sources",!!
SimilarEntityQuery
  WRITE "Entities similar to 'Student Pilot':",!
  DO ##class(%iKnow.Queries.EntityAPI).GetSimilar(.simresult,domId,"student pilot",1,50)
  SET j=1
  WHILE $DATA(simresult(j)) {
       SET outstr = $LISTTOSTRING(simresult(j),",",1)
         SET entity = $PIECE(outstr,",",2)
         SET freq = $PIECE(outstr,",",3)
         SET spread = $PIECE(outstr,",",4)
       WRITE "(",entity,")  appears ",freq," times in ",spread," sources",!
      SET j=j+1 }

The default domain parameter setting governing entity similarity is EnableNgrams, a boolean value.

Parts and N-grams

The GetSimilar() and GetSimilarCounts() methods have a mode parameter that specifies where to search for similarity. There are two available values:

$$$USEPARTS causes NLP to match the beginning of each part (word) for similarity. For texts in English and most other languages this is generally the preferred setting. $$$USEPARTS is the default.
$$$USENGRAMS causes NLP to match words and linguistic units within words (n-grams) for similarity. This mode is used when the source text language compounds words. For example, $$$USENGRAMS would commonly be used with German, a language which regularly forms compound words; it would not normally be used with English, which does not. $$$USENGRAMS can only be used in a domain that has the EnableNgrams domain parameter set.
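The difference between the two modes can be pictured with ordinary string functions. The following is a minimal sketch, not the NLP engine's implementation: the entity strings are invented, and n-gram matching is approximated here as a plain "occurs anywhere in the entity" test, which is looser than matching on true linguistic units.

```objectscript
PartsVersusNgrams
  ; Illustration only: the entity strings are invented and the two tests
  ; below approximate the modes with plain string functions
  SET search="flug"
  SET entities=$LB("flughafen","nachtflug berlin","flugzeug motor")
  FOR i=1:1:$LISTLENGTH(entities) {
     SET ent=$LIST(entities,i)
     ; parts-style test: does the string begin any space-delimited word?
     SET partmatch=0
     FOR w=1:1:$LENGTH(ent," ") {
        IF $EXTRACT($PIECE(ent," ",w),1,$LENGTH(search))=search { SET partmatch=1 }
     }
     ; n-gram-style test, approximated as a substring search
     SET ngrammatch=(ent[search)
     WRITE ent,": parts=",partmatch,", ngrams=",ngrammatch,!
  }
```

In this sketch "nachtflug berlin" is found only by the substring test, because "flug" does not begin either of its words. That is the kind of compound-word match that the n-gram mode is intended to make possible.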

Listing Related Entities

An entity is related to another entity if both occur in a CRC. By default, the related entity can be either a head concept or a tail concept. (Refer to “Limiting by Position” (below) to override this default.)

The following example shows how NLP returns related entities. It first determines how many CRCs contain the entity “student pilot” and lists these CRCs. (In this small example, you can simply read all the CRCs to see what is related to “student pilot”; in a much larger collection of sources this would not be practical.) The program example then lists all of the entities that are related to “student pilot” as either tail or head (you can confirm these relations by matching these entities against the CRCs listed earlier):

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," total sources",!!
ContainCRCQuery
  SET crccount = ##class(%iKnow.Queries.CrcAPI).GetCountByEntities(domId,$LB("student pilot"))
        WRITE crccount," CRCs contain 'student pilot'",!
  DO ##class(%iKnow.Queries.CrcAPI).GetByEntities(.result,domId,$LB("student pilot"),1,crccount)
  SET i=1
  WHILE $DATA(result(i)) {
    WRITE $LISTTOSTRING(result(i),",",1),!
    SET i=i+1 }
  SET relcount = ##class(%iKnow.Queries.EntityAPI).GetRelatedCount(domId,$LB("student pilot"))
  WRITE !,relcount," entities are related to 'student pilot':",!
RelatedEntityQuery
  DO ##class(%iKnow.Queries.EntityAPI).GetRelated(.rresult,domId,$LB("student pilot"),1,relcount)
  SET j=1
  WHILE $DATA(rresult(j)) {
      WRITE $LISTTOSTRING(rresult(j),",",1),!
    SET j=j+1 }

Limiting by Position

The position of an entity can be Head Concept, Relation, or Tail Concept. By default, the GetRelated() method returns all related concepts regardless of position and does not return relations. You can change this default by specifying a macro constant for the 8th parameter (positiontomatch). The available constants are as follows:

Constant	Value	Meaning
$$$USEPOSM	1	Head Concepts
$$$USEPOSR	2	Relations
$$$USEPOSMR	3	Head Concepts and Relations
$$$USEPOSS	4	Tail Concepts
$$$USEPOSMS (the default)	5	Head Concepts and Tail Concepts
$$$USEPOSRS	6	Relations and Tail Concepts
$$$USEPOSALL	7	Head Concepts, Relations, and Tail Concepts
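The numeric values in this table are consistent with a bit-flag scheme (1 for head, 2 for relation, 4 for tail), with the combined constants being sums of those flags. That reading is an inference from the listed values rather than a documented guarantee, so the sketch below uses local variables, not the macros themselves:

```objectscript
PositionFlags
  ; Values copied from the table; treating them as bit flags is an inference
  SET head=1,rel=2,tail=4
  WRITE "head+tail = ",head+tail,!           ; 5, the value listed for $$$USEPOSMS
  WRITE "head+rel+tail = ",head+rel+tail,!   ; 7, the value listed for $$$USEPOSALL
  ; $ZBOOLEAN with operation code 1 is a bitwise AND
  SET pos=5
  WRITE "position 5 includes relations? ",($ZBOOLEAN(pos,rel,1)'=0),!
```

In practice, pass the named macro constants rather than raw integers so that the code stays readable.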

The following example separates the related head concepts and the related tail concepts. (Note that $$$USEPOSM means that the supplied string is the head concept in the CRC, and the related entities are the tail concepts.)

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," total sources",!!
ContainCRCQuery
  SET crccount = ##class(%iKnow.Queries.CrcAPI).GetCountByEntities(domId,$LB("student pilot"))
        WRITE crccount," CRCs contain 'student pilot'",!
  DO ##class(%iKnow.Queries.CrcAPI).GetByEntities(.result,domId,$LB("student pilot"),1,crccount)
  SET i=1
  WHILE $DATA(result(i)) {
    WRITE $LISTTOSTRING(result(i),",",1),!
    SET i=i+1 }
  SET relcount = ##class(%iKnow.Queries.EntityAPI).GetRelatedCount(domId,$LB("student pilot"))
  WRITE !,relcount," entities are related to 'student pilot':",!

ListRelatedHeadsQuery
  DO ##class(%iKnow.Queries.EntityAPI).GetRelated(.mresult,domId,$LB("student pilot"),1,relcount,"","",$$$USEPOSM)
   WRITE !,"The following have 'student pilot' as a head:",!
   SET j=1
   WHILE $DATA(mresult(j)) {
      WRITE $LISTTOSTRING(mresult(j),",",1),!
      SET j=j+1 }
ListRelatedTailsQuery
  DO ##class(%iKnow.Queries.EntityAPI).GetRelated(.sresult,domId,$LB("student pilot"),1,relcount,"","",$$$USEPOSS)
   WRITE !,"The following have 'student pilot' as a tail:",!
   SET k=1
   WHILE $DATA(sresult(k)) {
      WRITE $LISTTOSTRING(sresult(k),",",1),!
      SET k=k+1 }

Counting Paths

The following example shows the count of paths and the count of sentences for each of 50 sources. A source commonly contains more paths than sentences, but some sources may contain more sentences than paths.

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 50 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," total sources",!!
PathCountBySource
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,50)
  SET i=1
  WHILE $DATA(result(i)) {
     SET srcId = $LISTGET(result(i),1)
     SET extId = $LISTGET(result(i),2)
     SET fullref = $PIECE(extId,":",3,4)
     SET fname = $PIECE(fullref,"\",$LENGTH(extId,"\"))
     SET numPathS = ##class(%iKnow.Queries.PathAPI).GetCountBySource(domId,$LB(srcId))
     SET numSentS = ##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,$LB(srcId))
     WRITE numPathS," paths and ",numSentS," sentences in ",fname,!
     SET i=i+1 }

Listing Similar Sources

The NLP semantic analysis engine can list which sources are similar to a specified source. Similarity between sources is determined by the number of entities that appear in both sources (the overlap), and the percentage of the source contents that contain overlap.
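To make the idea of overlap concrete, the following sketch counts the entities that two sources have in common. It is only an illustration under stated assumptions: the entity lists are invented, and the count shown is not the engine's similarity score.

```objectscript
OverlapSketch
  ; Invented entity lists; this is not the engine's scoring formula
  SET ref=$LB("engine","pilot","runway","fuel")
  SET other=$LB("pilot","fuel","weather")
  KILL seen
  ; index the reference source's entities in a local array
  FOR i=1:1:$LISTLENGTH(ref) { SET seen($LIST(ref,i))="" }
  SET common=0
  FOR j=1:1:$LISTLENGTH(other) {
     IF $DATA(seen($LIST(other,j))) { SET common=common+1 }
  }
  WRITE common," of ",$LISTLENGTH(other)," entities also occur in the reference source",!
```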

The GetSimilar() method can calculate similarity of sources to a specified source. Because of the potentially large number of similar sources, this method is commonly used with a filter to limit the set of sources considered. GetSimilar() can use your choice of two algorithms, each of which takes an algorithm parameter:

Basic similarity of items ($$$SIMSRCSIMPLE, the default). Available algorithm parameters are “ent” (entity similarity, the default), “crc” (Concept-Relation-Concept sequence), or “cc” (Concept + Concept pair).
Using semantic dominance calculations ($$$SIMSRCDOMENTS). The algorithm parameter is a boolean flag that specifies limiting similarity to sources that contain a dominant entity that is also a dominant entity in the specified source.

For each similar source, NLP returns a list of elements with the following format:

srcId,extId,percentageMatched,percentageNew,nbOfEntsInRefSrc,nbOfEntsInCommon,nbOfEntsInSimSrc,score

Element	Description
srcId	The source ID, an integer assigned by NLP.
extId	The external ID for the source, a string value.
percentageMatched	The percentage of the contents of the source that is the same as the match source.
percentageNew	The percentage of the contents of the source that is new. New contents are those that do not match with the match source.
nbOfEntsInRefSrc	The number of unique entities in the source being referenced (matched against this source).
nbOfEntsInCommon	The number of unique entities that are found in both sources.
nbOfEntsInSimSrc	The number of unique entities in this source.
score	The similarity score, expressed as a fractional number. An identical source would have a similarity score of 1.
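Because each result row is an InterSystems IRIS list, individual elements can be read by position with $LISTGET instead of converting the row to a delimited string. The row below is built by hand with invented values purely for illustration; real rows come from GetSimilar().

```objectscript
ReadSimilarRow
  ; Hypothetical row with invented values, in the element order listed above
  SET row=$LB(42,":SQL:Accident:42",.5,.5,120,60,120,.5)
  SET srcId=$LISTGET(row,1)
  SET common=$LISTGET(row,6)    ; nbOfEntsInCommon
  SET score=$LISTGET(row,8)     ; score
  WRITE "source ",srcId," shares ",common," entities, score ",score,!
```

Reading by position avoids any ambiguity that could arise if an external ID happened to contain the delimiter character.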

The following example demonstrates the listing of similar sources. It first limits the set of test sources to those that may describe an engine failure incident, by using GetByEntities() to select for a list of appropriate entities. It then uses GetSimilar() to find sources similar to these test sources, which may indicate a pattern of similar incidents. GetSimilar() takes the default similarity algorithm ($$$SIMSRCSIMPLE) and its default algorithm parameter (“ent”). The program displays only those similar sources with a high similarity score (>.33). The similarity display omits the source external IDs:

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  SET totsrc = ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE totsrc," total sources",!
SimilarSourcesQuery
  SET engineents = $LB("engine","engine failure","engine power","loss of power","carburetor","crankshaft","piston")
  DO ##class(%iKnow.Queries.SourceAPI).GetByEntities(.result,domId,engineents,1,totsrc)
  SET i=1
  WHILE $DATA(result(i)) {
      SET src = $LISTTOSTRING(result(i),",",1)
      SET srcId = $PIECE(src,",",1)
      WRITE "Source ",srcId," contains an engine incident",!
      DO ##class(%iKnow.Queries.SourceAPI).GetSimilar(.sim,domId,srcId,1,50,"",$$$SIMSRCSIMPLE,$LB("ent"))
      SET j=1
      WHILE $DATA(sim(j)) {
          SET simlist=$LISTTOSTRING(sim(j))
          IF $PIECE(simlist,",",8) > .33 {
              WRITE "   similar to source ",$PIECE(simlist,",",1),": "
              WRITE $PIECE(simlist,",",3,8),! }
          SET j=j+1 }
  SET i=i+1 }

Summarizing a Source

The NLP semantic analysis engine can summarize a source text by returning the most relevant sentences. It returns a user-specified number of sentences in the original sentence order, selecting those sentences that have the highest similarity to the overall content of the source text. NLP determines relevance by calculating an internal relevancy score for each sentence. Sentences that contain concepts that appear many times in the source text are more likely to be included in the summary than those that contain concepts that only appear once in the source text. NLP considers the overall frequency of each concept, the similarity of each concept to the most frequent concepts in the source, and other factors.

Summarizing a source is only available if the Summarize property was set to 1 in the Configuration when loading the source. The default Configuration specifies Summarize=1.

The accuracy of a summary therefore depends on two factors:

The source text must be large enough to permit meaningful frequency analysis, but not too large. NLP summarization works best on texts the length of a chapter or article. A book-length text should be summarized chapter-by-chapter.
The number of sentences in the summary should be a large enough subset of the original for the returned sentences to form a readable summary text. The minimum summary percentage is between 25% and 33%, depending on the contents of the text.

NLP provides three summary methods:

GetSummary() which returns each sentence of the summary text as a separate result. The sentence Id is returned as the first element of each returned sentence.
GetSummaryDirect() which returns the summary text as a single string. By default, the sentences within this string are separated by an ellipsis: a space followed by three periods, followed by a space. For example, “This is sentence one. ... This is sentence two.” You may specify a different sentence separator, if desired. Because this method concatenates multiple sentences into a single string, it may attempt to create a string longer than the InterSystems IRIS string length limit. When the maximum string length is reached, InterSystems IRIS sets this method’s isTruncated boolean output parameter to 1, and truncates the remaining text.
GetSummaryForText() which supports compiling a summary of a user-supplied string directly, rather than by supplying a specific source ID.
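If you have a single-string summary that uses the default separator, you can split it back into sentences with $PIECE. This is a minimal sketch that uses a hand-written string in place of actual GetSummaryDirect() output:

```objectscript
SplitSummaryString
  ; Hand-written stand-in for a single-string summary
  SET summary="This is sentence one. ... This is sentence two. ... This is sentence three."
  SET sep=" ... "
  FOR i=1:1:$LENGTH(summary,sep) {
     WRITE "[",i,"] ",$PIECE(summary,sep,i),!
  }
```

Note that this approach would split incorrectly if a sentence itself contained the separator string, which is one reason you might specify a different separator.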

For details on what NLP considers a sentence, refer to the Logical Text Units Identified by NLP section of the “Conceptual Overview” chapter.

The following example goes through the source texts in a domain until it finds one that contains more than 100 sentences. It then uses GetSummary() to summarize that source to half of its original sentences:

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceSentenceTotals
  SET numSrcD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
  SET numSentD=##class(%iKnow.Queries.SentenceAPI).GetCountByDomain(domId)
  WRITE "These sources contain ",numSentD," sentences",!!
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,numSrcD)
SentenceCounts
  FOR i=1:1:numSrcD {
     SET srcId = $LISTGET(result(i),1)
     SET extId = $LISTGET(result(i),2)
     SET fullref = $PIECE(extId,":",3,4)
     SET fname = $PIECE(fullref,"\",$LENGTH(extId,"\"))
     SET numSentS = ##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,$LB(srcId))
       IF numSentS > 100 {WRITE fname," has ",numSentS," sentences",!
                          GOTO SummarizeASource }
     }
     QUIT
SummarizeASource
   SET sumlen=$NUMBER(numSentS/2,0)
   WRITE "total sentences=",numSentS," summary=",sumlen," sentences",!!
   DO ##class(%iKnow.Queries.SourceAPI).GetSummary(.sumresult,domId,srcId,sumlen)
   FOR j=1:1:sumlen { WRITE "[S",j,"]: ",$LISTGET(sumresult(j),2),! }
   WRITE !,"END OF ",fname," SUMMARY",!!
   QUIT

Note that $NUMBER is used to assure that the specified summary sentence count is an integer. $LISTGET is used to remove the sentence Id and return just the sentence text.

The following example uses GetSummaryDirect() to return the same summary as a single concatenated string. It then uses $EXTRACT to divide the string into 38-character lines for display purposes:

#include %IKPublic
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
SourceSentenceTotals
  SET numSrcD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
  SET numSentD=##class(%iKnow.Queries.SentenceAPI).GetCountByDomain(domId)
  WRITE "These sources contain ",numSentD," sentences",!!
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,numSrcD)
SentenceCounts
  FOR i=1:1:numSrcD {
     SET srcId = $LISTGET(result(i),1)
     SET extId = $LISTGET(result(i),2)
     SET fullref = $PIECE(extId,":",3,4)
     SET fname = $PIECE(fullref,"\",$LENGTH(extId,"\"))
     SET numSentS = ##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,$LB(srcId))
       IF numSentS > 100 {WRITE fname," has ",numSentS," sentences",!
                          GOTO SummarizeASource }
     }
     QUIT
SummarizeASource
   SET sumlen=$NUMBER(numSentS/2,0)
   WRITE "total sentences=",numSentS," summary=",sumlen," sentences",!!
   SET summary = ##class(%iKnow.Queries.SourceAPI).GetSummaryDirect(domId,srcId,sumlen)
FormatSummaryDisplay
   SET width=38
   FOR x=1:width:$LENGTH(summary) {
      WRITE $EXTRACT(summary,x,x+width-1),! }
   WRITE !,"END OF ",fname," SUMMARY"

Custom Summaries

NLP permits you to generate custom summaries for sources by specifying a summaryConfig parameter string. Custom summaries are provided for those who want to tune the content of NLP-generated summaries to their specific needs. A custom summary lets you always include, preferentially include, or always exclude particular sentences. You can, for example, include or exclude standard components of sources that always appear at the same location, such as a title, byline, copyright, abstract, or summary. You can also always or preferentially include, or always exclude, sentences that contain a specified word.

The source summarization operation first gives each sentence a numeric summary weight, and then creates the summary by selecting the appropriate number of sentences with the highest weights. You can influence this ranking by specifying a summaryConfig parameter to the summary method.

The summaryConfig parameter value is a string consisting of one or more specifications. Each specification consists of three elements separated by vertical bars. For example, "s|2|false". You can concatenate multiple specifications using a vertical bar. For example, "s|1|true|s|2|false". The summaryConfig parameter default is the empty string.

You can configure the summary to select sentences according to the following:

Always (or never) include a sentence.
- By sentence number: "s|1|true" means that sentence (s) number 1 is always included in the summary (true). For example, if the sources always contain a title, it can be beneficial to always include the first sentence (the title) in the summary. If the second line of each source is an author byline, and you never want this included in the summary, you can specify this as "s|2|false". You can also specify that the last sentence should never be included in the summary: "s|-1|false"; this might be appropriate if all of the sources end with a transcription reference or journal citation. Sentences in a source are numbered forward from 1, or numbered backwards from the end of the source as -1, -2, and so forth.
- By word: "w|requirement|true" means that any sentence containing the word (w) requirement is always included in the summary (true). You can also exclude sentences containing a specific word. For example, "w|foreign|false" excludes all sentences that contain the word foreign from the summary. A “word” can be one or more whole words: it can consist of a string of multiple words separated by spaces; it cannot consist of a partial word string. Note that words are normalized, and thus must be specified in all lowercase letters.
Give a sentence more summary weight. This increases the chances that the sentence will appear in the summary. Available weight values are the integers 0 through 9.
- By sentence number: "s|1|3" means that sentence (s) number 1 has its summary weight increased by a factor of 3. For example, the title (first sentence) of sources should be included if it is somewhat descriptive of the contents, but not included if it is not directly descriptive (for example, a literary quotation). Sentences in a source are numbered forward from 1, or numbered backwards from the end of the source as -1, -2, and so forth.
- By word: "w|requirement|2" means that any sentence containing the word (w) requirement has its summary weight increased by a factor of 2. A “word” can be one or more whole words: it can consist of a string of multiple words separated by spaces; it cannot consist of a partial word string. Note that words are normalized, and thus must be specified in all lowercase letters.

You can specify multiple summary customizations by concatenation. For example: "s|1|true|s|2|false|w|surgery|3|w|hypnosis|false" (always include the first sentence, never include the second sentence, increase the summary weight of all sentences containing the word “surgery”, exclude all sentences containing the word “hypnosis”).
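One way to check a summaryConfig string before using it is to walk through it three elements at a time and print what each specification asks for. The following sketch does only that; it is a reading aid based on the format described above, not part of the NLP API, and the labels it prints are illustrative.

```objectscript
ParseSummaryConfig
  ; Walks a summaryConfig string in groups of three "|"-separated elements
  SET config="s|1|true|s|2|false|w|surgery|3|w|hypnosis|false"
  FOR i=1:3:$LENGTH(config,"|") {
     SET type=$PIECE(config,"|",i)
     SET target=$PIECE(config,"|",i+1)
     SET action=$PIECE(config,"|",i+2)
     SET kind=$CASE(type,"s":"sentence","w":"word",:"unknown")
     IF action="true" { SET effect="always include" }
     ELSEIF action="false" { SET effect="never include" }
     ELSE { SET effect="weight factor "_action }
     WRITE kind," ",target,": ",effect,!
  }
```

For the string shown, this lists four specifications: two by sentence number and two by word.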

Thus the user can give more or less importance to specific words and/or sentences. The weight of sentences affected by more than one of the specifications in the summaryConfig will be resolved by the Custom Summaries algorithm. This algorithm also applies when there is a conflict between specifications that apply to the same sentence:

If there is a conflict between a sentence (s) specification and a word (w) specification, the sentence specification wins.
If there is a conflict between an include (true) and an exclude (false) involving two s specifications, or two w specifications, the include specification wins.
If there is a conflict between the specified summary length and the number of sentences that must be included or number of sentences that must be excluded, the summary length is ignored.
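The first two rules can be sketched as a small decision procedure for a single sentence that is matched by several include or exclude specifications. This is a hypothetical illustration of the stated rules, not the Custom Summaries algorithm itself; it ignores weights and the summary-length rule.

```objectscript
ResolveConflict
  ; Hypothetical specifications that all apply to the same sentence,
  ; each written as $LB(type,action)
  SET specs=$LB($LB("w","true"),$LB("s","false"),$LB("w","false"))
  SET sdecision="",wdecision=""
  FOR i=1:1:$LISTLENGTH(specs) {
     SET type=$LIST($LIST(specs,i),1),action=$LIST($LIST(specs,i),2)
     ; within one type, an include (true) is never overridden by an exclude
     IF type="s" { IF sdecision'="true" { SET sdecision=action } }
     ELSE { IF wdecision'="true" { SET wdecision=action } }
  }
  ; a sentence specification, if any, wins over a word specification
  SET final=$SELECT(sdecision'="":sdecision,1:wdecision)
  WRITE "include sentence? ",final,!
```

Here the word specifications resolve to include, but the sentence specification says exclude, so the sketch reports false.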

The options for custom summaries can be set by means of the summaryConfig parameter in the %iKnow.Queries.SourceAPI.GetSummary() and %iKnow.Queries.SourceAPI.GetSummaryForText() methods.

Querying a Subset of the Sources

NLP provides filters that allow you to include or exclude sources from a query. You can include or exclude sources based on:

Random sampling of sources.
Source contents: specified entities or entities that match a dictionary.
Source characteristics (metadata): including the source Id, sentence count, and the source’s indexed date (the date the source was loaded into NLP).
User-defined metadata characteristics of the sources.

NLP supports the combining of multiple filters through logical AND and logical OR operators. For further details, refer to the Filtering chapter of this manual.
