NLP has three query methods you can use to return the “top” entities in the source documents of a domain: GetTop(), GetTopTFIDF(), and GetTopBM25().
By default, all three methods return top Concepts, but each can instead return top Relations. Each can also apply a filter to limit the scope of sources used.
GetTop(): Most-Frequently-Occurring Entities
An NLP query can return the most frequently occurring entities in the source documents in descending order of frequency or spread. Each entity is returned as a separate record in InterSystems IRIS list format.
The entity record format is as follows:
- The entity ID, a unique integer assigned by NLP.
- The entity value, specified as a string.
- Frequency: an integer count of how many times the entity occurs in the source documents.
- Spread: an integer count of how many source documents contain the entity.
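To make the four fields concrete, here is a minimal Python sketch. It is not NLP code: the toy sources and the entity_records helper are hypothetical, and assigning IDs in order of first appearance is an assumption of the sketch, not a description of how NLP assigns entity IDs. It only shows how frequency (total occurrences) differs from spread (number of sources containing the entity).

```python
from collections import Counter

def entity_records(sources):
    """Build (id, value, frequency, spread) records.

    `sources` is a list of sources, each given as a list of
    already-extracted entity values (hypothetical input format).
    """
    frequency = Counter()   # total occurrences across all sources
    spread = Counter()      # number of sources containing the entity
    ids = {}                # entity value -> illustrative integer ID
    for entities in sources:
        for value in entities:
            ids.setdefault(value, len(ids) + 1)
            frequency[value] += 1
        for value in set(entities):
            spread[value] += 1
    return [(ids[v], v, frequency[v], spread[v]) for v in ids]

sources = [
    ["airplane", "pilot", "airplane"],
    ["helicopter", "pilot"],
    ["airplane", "engine"],
]
records = entity_records(sources)
for record in records:
    print(record)
```

In this toy data “airplane” has a frequency of 3 but a spread of 2, because two of its occurrences are in the same source.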
The following query returns the most frequent (top) entities in the sources loaded by this program. By default these are Concept entities. The page (1) and pagesize (50) parameters limit the result to at most the top 50 entities. The query uses the domain default sorttype, which is descending order by frequency:
#include %IKPublic
DomainCreateOrOpen
  // Open the domain if it already exists; otherwise create it
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  // Drop data left over from an earlier run of this program
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error: "
            DO $System.Status.DisplayError(stat)
            QUIT }
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
  // Each row of the SQL result becomes one source
  SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
  SET idfld="UniqueVal"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
  IF stat '= 1 {WRITE "The lister failed: " DO $System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
  IF stat '= 1 {WRITE "The loader failed: " DO $System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  SET numSrcD=##class(%iKnow.Queries.SourceQAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
TopEntitiesQuery
  // Page 1 with a page size of 50: at most the top 50 entities
  DO ##class(%iKnow.Queries.EntityAPI).GetTop(.result,domId,1,50)
  SET i=1
  WHILE $DATA(result(i)) {
       // Record layout: entity ID, entity value, frequency, spread
       SET outstr = $LISTTOSTRING(result(i),",",1)
       SET entity = $PIECE(outstr,",",2)
       SET freq = $PIECE(outstr,",",3)
       SET spread = $PIECE(outstr,",",4)
       WRITE "[",entity,"] appears ",freq," times in ",spread," sources",!
       SET i=i+1 }
  WRITE "Printed the top ",i-1," entities"
The following GetTop() call returns the top entities in descending order by spread:
DO ##class(%iKnow.Queries.EntityAPI).GetTop(.result,domId,1,50,,,$$$SORTBYSPREAD)
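The effect of the sort type and of the page and pagesize parameters can be sketched outside NLP as follows. This Python example is only an illustration: the top_entities helper and the sample records are hypothetical, and it assumes that page n begins at record (n-1)*pagesize+1 of the sorted list.

```python
def top_entities(records, page=1, pagesize=50, sort_by="frequency"):
    """Return one page of (id, value, frequency, spread) records, highest first."""
    index = {"frequency": 2, "spread": 3}[sort_by]
    ordered = sorted(records, key=lambda r: r[index], reverse=True)
    start = (page - 1) * pagesize   # assumed paging rule
    return ordered[start:start + pagesize]

# Hypothetical records in (id, value, frequency, spread) order
records = [
    (1, "airplane", 9, 3),
    (2, "pilot", 7, 5),
    (3, "engine", 4, 4),
    (4, "helicopter", 6, 1),
]

by_frequency = [r[1] for r in top_entities(records, 1, 3)]
by_spread = [r[1] for r in top_entities(records, 1, 3, sort_by="spread")]
second_page = [r[1] for r in top_entities(records, 2, 3)]
print(by_frequency)
print(by_spread)
print(second_page)
```

Here “helicopter” makes the top three by frequency because it occurs often in a single source, but drops out of the top three by spread.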
GetTopTFIDF() and GetTopBM25()
These two methods return a list of top entities in descending order by a calculated score. By default these are Concept entities. Because the two methods use different algorithms to assign a score to an entity, their lists of “top” entities may differ significantly from each other and from the GetTop() list. For example, the following table shows the relative order of four entities in the Aviation.Event database when analyzed using each method:
| Method        | “airplane”       | “helicopter” | “flight instructor” | “student pilot” |
| GetTop()      | 1st              | 12th         | 17th                | 43rd            |
| GetTopTFIDF() | (not in listing) | 1st          | 4th                 | 22nd            |
| GetTopBM25()  | (not in listing) | 3rd          | 2nd                 | 1st             |
The top 5 entities in the Aviation.Event database returned by GetTop() are: “airplane”, “pilot”, “engine”, “flight”, and “accident”. All of these entities occur at least once in more than half of the sources. While these are frequently-occurring entities, they are of little value in determining the contents of specific sources. An entity that occurs in more than half of the sources is given a negative IDF value. For this reason, none of these entities appear in the GetTopTFIDF() and GetTopBM25() listings.
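The sign change at the halfway point can be checked with a short calculation. The sketch below assumes the probabilistic IDF form log((N - n + 0.5) / (n + 0.5)) that is commonly paired with BM25, where N is the number of sources and n is the number of sources containing the entity. The text above does not state the exact formula NLP uses, so treat the function as illustrative only.

```python
import math

def probabilistic_idf(total_sources, sources_with_entity):
    """Assumed IDF form: log((N - n + 0.5) / (n + 0.5))."""
    n = sources_with_entity
    return math.log((total_sources - n + 0.5) / (n + 0.5))

rare = probabilistic_idf(100, 10)      # entity in 10 of 100 sources
half = probabilistic_idf(100, 50)      # entity in exactly half the sources
common = probabilistic_idf(100, 60)    # entity in more than half the sources

print(rare > 0, abs(half) < 1e-12, common < 0)
```

Under this assumed form, an entity found in exactly half of the sources scores zero, and anything more widespread scores below zero.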
The following example lists the top 50 entities using GetTopTFIDF():
#include %IKPublic
DomainCreateOrOpen
  // Open the domain if it already exists; otherwise create it
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  // Drop data left over from an earlier run of this program
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error: "
            DO $System.Status.DisplayError(stat)
            QUIT }
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
  // Each row of the SQL result becomes one source
  SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
  SET idfld="UniqueVal"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
  IF stat '= 1 {WRITE "The lister failed: " DO $System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
  IF stat '= 1 {WRITE "The loader failed: " DO $System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  SET numSrcD=##class(%iKnow.Queries.SourceQAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
TopEntitiesQuery
  // Page 1 with a page size of 50: at most the top 50 entities
  DO ##class(%iKnow.Queries.EntityAPI).GetTopTFIDF(.result,domId,1,50)
  SET i=1
  WHILE $DATA(result(i)) {
       // The second element is the entity value, the third is its score
       SET outstr = $LISTTOSTRING(result(i),",",1)
       SET entity = $PIECE(outstr,",",2)
       SET score = $PIECE(outstr,",",3)
       WRITE "[",entity,"] has a TFIDF score of ",score,!
       SET i=i+1 }
  WRITE "Printed the top ",i-1," entities"
The following example lists the top 50 entities using GetTopBM25():
#include %IKPublic
DomainCreateOrOpen
  // Open the domain if it already exists; otherwise create it
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO ListerAndLoader }
DeleteOldData
  // Drop data left over from an earlier run of this program
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO ListerAndLoader }
  ELSE    { WRITE "DropData error: "
            DO $System.Status.DisplayError(stat)
            QUIT }
ListerAndLoader
  SET domId=domoref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
  // Each row of the SQL result becomes one source
  SET myquery="SELECT TOP 100 ID AS UniqueVal,Type,NarrativeFull FROM Aviation.Event"
  SET idfld="UniqueVal"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
  IF stat '= 1 {WRITE "The lister failed: " DO $System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
  IF stat '= 1 {WRITE "The loader failed: " DO $System.Status.DisplayError(stat) QUIT }
SourceCountQuery
  SET numSrcD=##class(%iKnow.Queries.SourceQAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
TopEntitiesQuery
  // Page 1 with a page size of 50: at most the top 50 entities
  DO ##class(%iKnow.Queries.EntityAPI).GetTopBM25(.result,domId,1,50)
  SET i=1
  WHILE $DATA(result(i)) {
       // The second element is the entity value, the third is its score
       SET outstr = $LISTTOSTRING(result(i),",",1)
       SET entity = $PIECE(outstr,",",2)
       SET score = $PIECE(outstr,",",3)
       WRITE "[",entity,"] has a BM25 score of ",score,!
       SET i=i+1 }
  WRITE "Printed the top ",i-1," entities"
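As noted earlier, the two scoring methods can rank the same entities differently. The following Python sketch shows one way this can happen, using textbook forms of TF-IDF and BM25 on a synthetic corpus. It makes several assumptions that the text above does not confirm: the IDF formula, the BM25 constants K1 and B, and summing per-source scores into a single domain-wide score. The corpus, the filler entities, and all function names are hypothetical, and the result does not reproduce the Aviation.Event rankings.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # common textbook BM25 constants (assumed)

def idf(total, containing):
    # Assumed probabilistic IDF form
    return math.log((total - containing + 0.5) / (containing + 0.5))

def tfidf(tf, idf_value, length, avg_len):
    # Score grows linearly with the in-source frequency
    return tf * idf_value

def bm25(tf, idf_value, length, avg_len):
    # Score saturates as the in-source frequency grows
    norm = 1 - B + B * length / avg_len
    return idf_value * tf * (K1 + 1) / (tf + K1 * norm)

def rank(sources, scorer):
    """Sum a per-source score for every entity and sort highest first."""
    total = len(sources)
    avg_len = sum(len(s) for s in sources) / total
    spread = Counter(e for s in sources for e in set(s))
    scores = Counter()
    for s in sources:
        for entity, tf in Counter(s).items():
            scores[entity] += scorer(tf, idf(total, spread[entity]), len(s), avg_len)
    return [e for e, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)]

# Synthetic corpus: 100 sources of 6 entities each.
# One source repeats a single entity 6 times; three sources mention
# another entity once each; everything else is unique filler.
sources = [["flight instructor"] * 6]
sources += [["student pilot"] + [f"filler {i}-{j}" for j in range(5)] for i in range(1, 4)]
sources += [[f"filler {i}-{j}" for j in range(6)] for i in range(4, 100)]

tfidf_top = rank(sources, tfidf)[0]
bm25_top = rank(sources, bm25)[0]
print(tfidf_top, "|", bm25_top)
```

In this toy corpus TF-IDF rewards the entity that is repeated many times in one source, while BM25 dampens repeated occurrences within a source and so favors the entity spread across three sources.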