Loading Text Data Programmatically

Before NLP can analyze text data, the data sources must be loaded into a domain. This can be done in three ways:

  • Using the Domain Architect to specify the data locations of source texts for a domain. The Build button loads the specified sources into the domain.

  • Creating an %iKnow.DomainDefinition subclass that specifies the data locations of source texts for a domain. This generates a %Build() method in a dependent class that contains the logic to load this data.

  • Specifying a Loader and Lister programmatically to load the specified sources into a domain, as described in this chapter.

To make text data available for NLP analysis, the domain must invoke an instance of a Loader and a Lister. The Loader supervises NLP processing of text sources, using the Lister and a Processor. The Lister identifies the text sources to be used by the Loader. NLP provides a variety of Listers for different types of source text data. Each Lister, by default, automatically invokes the corresponding Processor with default parameters. There is one Loader used for data sources of all types.

Note that the Loader and Lister objects can be created in any order, but both must have been created before you invoke the Lister AddListToBatch() instance method and then the Loader ProcessBatch() instance method (or other equivalent Lister and Loader methods).

Loader

The Loader (%iKnow.Source.Loader) is the main class coordinating the loading process. You must create a new loader object for the domain. To create a loader object:

  SET domo=##class(%iKnow.Domain).NameIndexOpen("mydomain")
  SET domId=domo.Id
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)

After creating a loader and a lister, you issue instance methods to list and process the sources. For example, when performing a batch load you issue the Lister AddListToBatch() instance method to list the text sources. You then issue the Loader ProcessBatch() instance method to process the listed sources. This Loader method calls the Lister to scan the locations marked by AddListToBatch(), then calls the Processor to read those documents and push them to the NLP engine. Finally, it invokes the ^%iKnow.BuildGlobals routine to process the staging globals loaded by the NLP engine.

Loader Error Logging

If a load operation completes, but encounters errors in loading one or more sources, these errors are recorded in an error log. Errors of varying severity can be retrieved using the GetErrors(), GetWarnings(), and GetFailed() methods. For example, a failed load error (GetFailed()) occurs if you attempt to load a source file that has no contents. A warning load error (GetWarnings()) occurs if there is an error in the source metadata.

You can use the ClearLogs() method to clear the error log of error messages at any or all of these severity levels.

Loader Reset()

If a load operation didn't complete in an expected fashion and you want to start from scratch, you should invoke the Reset() method for the loader instance, as follows:

  DO myloader.Reset()

Lister

The Lister identifies the text files, records, or other sources of unstructured data that you wish NLP to index; that is, all text that will eventually end up as a Source in the domain. The unit of content in NLP is a Source, which can represent any unit of text you wish to analyze, such as a text file, a record in an SQL table, an RSS posting, or other text source.

Usually a Source is a text containing multiple sentences. However, a source can contain content of any type. For example, a file containing the number 123 is treated as a Source containing one sentence. A file with no contents is not listed as a Source.

All listers are subclasses of %iKnow.Source.Lister, and each scans its own specific type of source. For example, the subclass %iKnow.Source.File.Lister scans a file system and the subclass %iKnow.Source.RSS.Lister scans RSS web feeds, such as blog postings, in XML file format. NLP provides seven listers for different types of sources. You can also create your own custom lister.

Most text sources require a Lister. However, text that is directly specified as a string does not require a Lister.

Through the AddListToBatch() method you can instruct the Lister to look into a specific directory, SQL table, or RSS feed for Sources. The lister parameters depend on the actual Lister class.

Initializing a Lister

You can create a Lister instance for a domain using the %New() method for that type of lister, supplying the domain Id. The following example creates two listers within the specified domain:

  SET domo=##class(%iKnow.Domain).NameIndexOpen("mydomain")
  SET domId=domo.Id
  SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
  WRITE flister,!
  SET rlister=##class(%iKnow.Source.RSS.Lister).%New(domId)
  WRITE rlister

Each lister automatically invokes the corresponding processor, as follows:

  • The File.Lister invokes the File.Processor.

  • The Global.Lister invokes the Global.Processor.

  • The Domain.Lister invokes the Domain.Processor.

  • All other Listers invoke the Temp.Processor. The %iKnow.Source.Temp.Processor has that name because it processes temporary globals that are automatically created and deleted by NLP during the loading process.

Each processor has default processor parameters, which are appropriate for most NLP sources. Therefore, in most cases, you do not need to specify a processor or processor parameters. If you do not specify a processor, NLP uses the default processor, as shown by the DefaultProcessor() method.
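
To check which processor a given lister falls back to, you can print the value returned by DefaultProcessor(). The following is a minimal sketch. It assumes that DefaultProcessor() can be invoked as a class method with no arguments, which you should confirm against the class reference for your version:

```objectscript
  // Assumption: DefaultProcessor() is callable as a class method with no arguments
  // and returns the processor class name as a string.
  WRITE "File lister:   ",##class(%iKnow.Source.File.Lister).DefaultProcessor(),!
  WRITE "Global lister: ",##class(%iKnow.Source.Global.Lister).DefaultProcessor(),!
  WRITE "SQL lister:    ",##class(%iKnow.Source.SQL.Lister).DefaultProcessor(),!
```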

Overriding Lister Instance Defaults

In most cases, the lister instance defaults are appropriate for the processing of your NLP sources.

If you wish to override the lister instance defaults for Configuration, Processor, or Converter objects, you can optionally use the Init() instance method to initialize the Lister instance. If you omit Init(), the defaults are used.

The complete Lister initialization is as follows:

Init(config,processor,processorparams,converter,converterparams)

To specify the default for any of these items, specify the empty string ("") as the Init() parameter value.

You can also initialize these objects separately using the SetConfig(), SetProcessor(), and SetConverter() methods.

  • Configuration (Config): If you do not specify a configuration, NLP uses the default configuration. A configuration specifies what language(s) the text documents contain, and whether or not automatic language identification should be used. A configuration object is not domain-specific; you can use the same configuration for multiple domains. While not required, explicitly specifying a configuration is recommended.

  • Processor: Using lister.Init() you can specify a processor and processor parameters. A processor reads the texts into NLP. Specifying a processor is optional. If you do not specify a processor, NLP uses the default processor and its parameter defaults. If you specify a processor, you can specify the processor parameter values, as shown in the following example:

      SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
      SET processor="%iKnow.Source.File.Processor"
      SET pparams=$LB("Latin1")
      DO flister.Init("",processor,pparams,"","")
    

    If explicitly specified, the processor subclass should be either of the same type as the Lister subclass (for example, %iKnow.Source.File.Lister takes %iKnow.Source.File.Processor) or %iKnow.Source.Temp.Processor if the Lister subclass has no corresponding Processor subclass. You can also create your own custom processor.

    Processor parameters are specified as an InterSystems IRIS list. For %iKnow.Source.File.Processor the first list element is the name of the character set used (for example "Latin1"). The %iKnow.Source.Temp.Processor does not take any processor parameters.

  • Converter: Using lister.Init() you can specify a user-defined converter and converter parameters. A Converter converts formatted source documents to plain text, removing HTML or XML tags, PDF formatting, or other non-text contents. Usually separate converters are used for each source document formatting type. Specifying a converter is optional. The default is to use no converter. If no converter is used, NLP indexes formatting contents as well as text contents.

Lister Assigns IDs to Sources

The lister assigns two unique IDs to each source:

  • Source ID (internal ID): a unique integer assigned by NLP that is used for NLP internal processing.

  • External ID: a unique identifying string or number. The External ID is used as the link for any user-specified application that wishes to use NLP. The External ID has the following structure:

    ListerReference:FullReference
    

    The Lister Reference is either the full class name of the Lister class used to load this source, or a short alias defined by the Lister class itself, prefixed with a colon. The Full Reference is a string for which the format is defined by the Lister class. It contains a Group Name and a Local Reference. It is up to the Lister to provide the implementation to derive the Group Name and Local Reference from this Full Reference, and to rebuild the Full Reference from the Group Name and Local Reference.

    For example, the text file external ID :FILE:c:\mytextfiles\mydoc.txt consists of:

    • ListerReference: the Lister class alias :FILE

    • FullReference: c:\mytextfiles\mydoc.txt, which consists of the Group Name c:\mytextfiles\ and the Local Reference mydoc.txt.

    For data in an SQL table, the ListerReference is :SQL. The Group Name is the groupfield, a field in the record that contains a unique value, and the Local Reference is the row ID.

    For data in a string or global variable, the ListerReference is :TEMP.

    The external ID format described here is the default; external ID format is configurable using the SimpleExtIds domain parameter.
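
The structure of a default external ID can be illustrated with plain string functions. The following sketch takes apart the sample file external ID used above. It assumes the alias form (leading colon) and a Windows-style path; the variable names are illustrative, and the Lister class itself remains the authority for deriving the Group Name and Local Reference:

```objectscript
  // Sample external ID from the text above (alias form, leading colon).
  SET extId=":FILE:c:\mytextfiles\mydoc.txt"
  // Piece 1 is empty (before the leading colon); piece 2 is the alias.
  SET listerRef=":"_$PIECE(extId,":",2)
  // Everything from piece 3 onward is the Full Reference. It can itself
  // contain colons, such as the one after a Windows drive letter.
  SET fullRef=$PIECE(extId,":",3,$LENGTH(extId,":"))
  // Illustrative split for a file path: the Local Reference is the text
  // after the last backslash; the Group Name is everything before it.
  SET localRef=$PIECE(fullRef,"\",$LENGTH(fullRef,"\"))
  SET groupName=$EXTRACT(fullRef,1,$LENGTH(fullRef)-$LENGTH(localRef))
  WRITE "ListerReference: ",listerRef,!
  WRITE "FullReference:   ",fullRef,!
  WRITE "Group Name:      ",groupName,!
  WRITE "Local Reference: ",localRef,!
```

For this sample the four values are :FILE, c:\mytextfiles\mydoc.txt, c:\mytextfiles\, and mydoc.txt. An external ID that uses the full Lister class name instead of an alias has no leading colon and would need different handling.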

You can access a source using either ID. The %iKnow.Queries.SourceAPI class contains methods for accessing these IDs. The GetByDomain() method returns both IDs for each source. Given the source ID, the GetExternalId() method returns the external ID. Given the external ID, the GetSourceId() method returns the source ID.

You can determine the lister class alias using the GetAlias() method of the %iKnow.Source.File.Lister class. If no alias exists, the External ID contains the full Lister class name.
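
The following sketch lists both IDs for the first sources in a domain and converts each ID to the other. It assumes a domain named "mydomain" that already contains loaded sources, that each GetByDomain() result row is a list holding the source ID followed by the external ID, and that GetAlias() can be called as a class method with no arguments:

```objectscript
  // Assumes "mydomain" exists and already contains loaded sources.
  SET domo=##class(%iKnow.Domain).NameIndexOpen("mydomain")
  SET domId=domo.Id
  // Fetch the first page of up to 20 sources.
  SET stat=##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,20)
  IF stat '= 1 { DO $System.Status.DisplayError(stat) QUIT }
  SET i=1
  WHILE $DATA(result(i)) {
     // Assumption: element 1 is the source ID, element 2 is the external ID.
     SET srcId=$LISTGET(result(i),1)
     SET extId=$LISTGET(result(i),2)
     WRITE "Source ID=",srcId,"  External ID=",extId,!
     // Round trip: each lookup should return the other ID of the same source.
     WRITE "  GetExternalId: ",##class(%iKnow.Queries.SourceAPI).GetExternalId(domId,srcId),!
     WRITE "  GetSourceId:   ",##class(%iKnow.Queries.SourceAPI).GetSourceId(domId,extId),!
     SET i=i+1 }
  // Assumption: GetAlias() is callable as a class method with no arguments.
  WRITE "File lister alias: ",##class(%iKnow.Source.File.Lister).GetAlias(),!
```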

Lister Defaults Example

The following is a minimal Lister and Loader example, taking all defaults. It establishes a domain, then creates Lister and Loader instance objects for that domain. It does not invoke lister.Init(), but takes the defaults for configuration, processor, and converter. It then lists and loads a directory of user-defined .txt and .log files:

  SET domo=##class(%iKnow.Domain).NameIndexOpen("mydomain")
  SET domId=domo.Id
SetListerAndLoader
  SET mylister=##class(%iKnow.Source.File.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
UseListerAndLoader
  SET install=$SYSTEM.Util.DataDirectory()
  SET dirpath=install_"mgr\Temp\iris\mytextfiles"
  SET stat=mylister.AddListToBatch(dirpath,$LB("txt","log"),0,"")
      WRITE "The lister status is ",$System.Status.DisplayError(stat),!
  SET stat=myloader.ProcessBatch()
      WRITE "The loader status is ",$System.Status.DisplayError(stat),!

Most examples in this book delete old data before using the Lister and Loader; this old data deletion is for demonstration purposes to allow these examples to be run repeatedly. Most examples in this book do not specify the processor and processor parameters, taking the defaults. Many examples in this book specify values for configuration rather than taking the defaults.

Lister Parameters

When you invoke a method to specify sources, you specify Lister parameters. You specify the same Lister parameters for the AddListToBatch() Lister instance method (for large batch loads of sources) and the ProcessList() Loader instance method (for adding a small number of sources to an existing batch of sources).

There are four Lister parameters that cumulatively define which sources are to be listed for NLP indexing:

  • Path: the location where the sources are located, specified as a string. This parameter is mandatory.

  • Extensions: one or more file extension suffixes that identify which sources are to be listed. Specified as an InterSystems IRIS list data structure, each element of which is a string (refer to $LISTBUILD for details on InterSystems IRIS list data structures). By default the Lister selects all files in the Path directory that contain data, regardless of their file extension suffix. This includes files with no file extension suffix or with a file extension suffix indicating a non-text file (such as .jpg). Empty files are not selected. Directories are not selected. When an extension suffix parameter is specified, the Lister selects only those files in the Path directory with that file extension suffix (or with no file extension suffix) that contain data.

  • Recursive: a boolean value that specifies whether to search subdirectories of the path for sources. If selected, multiple levels of subdirectories are searched for sources. 1 = include subdirectories. 0 = do not include subdirectories. The default is 0.

  • Filter: a string specifying a filter used to limit which sources are to be listed for NLP indexing. For example, a user-designed filter could limit the Lister to only those files that have a specified substring in their file names. The default is to use no filter. (Note that this use of the word “filter” is completely separate from the filters in the %iKnow.Filters class that are used to include or exclude already-indexed sources supplied to an NLP query.)
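
As a sketch of how these four parameters combine for a File Lister, the following calls list one directory with all defaults and a second directory restricted by extension and searched recursively. The directory paths are hypothetical, and the sketch assumes that AddListToBatch() can be called more than once before ProcessBatch() to accumulate lists in the same batch:

```objectscript
  // Assumes "mydomain" exists; the directory paths below are hypothetical.
  SET domo=##class(%iKnow.Domain).NameIndexOpen("mydomain")
  SET domId=domo.Id
  SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
  // Path only: every non-empty file in the directory, any extension,
  // no subdirectories (Recursive defaults to 0), no filter.
  SET stat=flister.AddListToBatch("C:\mytextfiles\notes")
  IF stat '= 1 { DO $System.Status.DisplayError(stat) QUIT }
  // Path + Extensions + Recursive=1: .log files in this directory
  // and in all levels of its subdirectories, no filter.
  SET stat=flister.AddListToBatch("C:\mytextfiles\logs",$LB("log"),1,"")
  IF stat '= 1 { DO $System.Status.DisplayError(stat) QUIT }
  // Process everything listed so far.
  SET stat=myloader.ProcessBatch()
  IF stat '= 1 { DO $System.Status.DisplayError(stat) QUIT }
```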

Batch or List?

NLP provides two ways to load sources of all types: batch loading (ProcessBatch()) or list loading (ProcessList()). Both perform the same processing; they differ in their speed of execution. Which one you use depends primarily on how many sources you are loading. As a general rule, when loading ten or fewer sources, use ProcessList(); when loading one hundred or more sources, use ProcessBatch(). Which to use on intermediate numbers of sources depends on the nature of the specific sources.

Listing and Loading Examples

The examples in this section show the different ways to load sources: from files, from SQL records, from the elements of a subscripted global, and from a string.

You can also load sources as virtual sources using loader.ProcessVirtualList() or loader.ProcessVirtualBuffer(), as described in Loading a Virtual Source.

Loading Files

The following executable example performs a batch load of the source files in the Windows directory dirpath that have the extensions .txt or .log.

DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
      { SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
        GOTO DeleteOldData }
  ELSE 
     { SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       GOTO SetEnvironment }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { GOTO SetEnvironment }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
SetEnvironment
  SET domId=domoref.Id
  IF ##class(%iKnow.Configuration).Exists("myconfig") {
         SET cfg=##class(%iKnow.Configuration).Open("myconfig") }
  ELSE { SET cfg=##class(%iKnow.Configuration).%New("myconfig",0,$LISTBUILD("en"),"",1)
         DO cfg.%Save() }
CreateListerAndLoader
  SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
      DO flister.Init("myconfig","","","","")
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
UseListerAndLoader
  SET install=$SYSTEM.Util.DataDirectory()
  SET dirpath=install_"mgr\Temp\iris\mytextfiles"
  SET stat=flister.AddListToBatch(dirpath,$LB("txt","log"),0,"")
      WRITE "The lister status is ",$System.Status.DisplayError(stat),!
  SET stat=myloader.ProcessBatch()
      WRITE "The loader status is ",$System.Status.DisplayError(stat),!
QueryLoadedSources
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," sources"

This example performs a batch load, appropriate for loading a large number of files. To load a small number of files use the SetLister() and ProcessList() methods.

Loading SQL Records

Refer to A Note on Program Examples for details on the coding and data used in the examples in this book.

The following executable example performs a batch load of records from the Aviation.Event table (the query selects the top 25 records). It loads as a source text the NarrativeFull field value for each record. You can specify a source text field of data type %String or %Stream.GlobalCharacter (character stream data). If there is an error in the SQL query, the Loader returns an error status.

NLP programs that load SQL data must use the %iKnow.Source.SQL.Lister. This lister always invokes the %iKnow.Source.Temp.Processor, which takes no parameters. There is, therefore, no reason to specify the processor, unless you have created your own custom processor.

DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
      SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       WRITE "Created the ",dname," domain with domain ID ",domoref.Id,!
       GOTO SetEnvironment }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            GOTO SetEnvironment }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
SetEnvironment
  SET domId=domoref.Id
  IF ##class(%iKnow.Configuration).Exists("myconfig") {
         SET cfg=##class(%iKnow.Configuration).Open("myconfig") }
  ELSE { SET cfg=##class(%iKnow.Configuration).%New("myconfig",0,$LISTBUILD("en"),"",1)
         DO cfg.%Save() }
CreateListerAndLoader
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
      DO flister.Init("myconfig")
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
QueryBuild
   SET myquery="SELECT Top 25 ID AS UniqueVal,Type,NarrativeFull,EventDate FROM Aviation.Event"
   SET idfld="UniqueVal"
   SET grpfld="Type"
   SET dataflds=$LB("NarrativeFull")
UseLister
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
      IF stat '= 1 {WRITE "The lister failed: ",$System.Status.DisplayError(stat) QUIT }
UseLoader
  SET stat=myloader.ProcessBatch()
      IF stat '= 1 {WRITE "The loader failed: ",$System.Status.DisplayError(stat) QUIT }
QueryLoadedSources
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," sources loaded"

This example performs a batch load, appropriate for loading a large number of SQL records. To load a small number of SQL records use the SetLister() and ProcessList() methods.

You can also use the %SYSTEM.iKnow utility method IndexTable().

Loading Elements of a Subscripted Global

Refer to A Note on Program Examples for details on the coding and data used in the examples in this book.

The following executable example loads the elements of a subscripted global. It uses the %iKnow.Source.Global.Lister and specifies the following Lister parameters to the ProcessList() method: global name, first subscript (inclusive), and last subscript (inclusive). This example uses the ^Aviation.AircraftD global. Because this is a sparse array, only a few of the subscripts between 1 and 50,000 contain data:

DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
      { SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
        GOTO DeleteOldData }
  ELSE 
     { SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       GOTO SetEnvironment }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { GOTO SetEnvironment }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
SetEnvironment
  SET domId=domoref.Id
  IF ##class(%iKnow.Configuration).Exists("myconfig") {
         SET cfg=##class(%iKnow.Configuration).Open("myconfig") }
  ELSE { SET cfg=##class(%iKnow.Configuration).%New("myconfig",0,$LISTBUILD("en"),"",1)
         DO cfg.%Save() }
ListerAndLoader
  SET mylister=##class(%iKnow.Source.Global.Lister).%New(domId)
        DO mylister.Init("myconfig","","","","")
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
  SET stat=myloader.SetLister(mylister)
  IF stat '= 1 { WRITE "SetLister error ",$System.Status.DisplayError(stat)
                 QUIT}
  SET gbl="^Aviation.AircraftD"
  SET stat=myloader.ProcessList(gbl,1,50000)
  IF stat '= 1 { WRITE "ProcessList error ",$System.Status.DisplayError(stat)
                 QUIT }
SourceSentenceQueries
  SET numSrcD=##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  WRITE "The domain contains ",numSrcD," sources",!
  SET numSentD=##class(%iKnow.Queries.SentenceAPI).GetCountByDomain(domId)
  WRITE "These sources contain ",numSentD," sentences"

The ProcessList() method can specify only one subscript level at a time. In order to iterate through multiple subscript levels, you must write code to invoke this method at the desired subscript level. For example, to load the second level subscripts 1 and 2, you would write code such as the following:

  FOR i=1:1:90000 {
    SET gbl="^Aviation.NarrativeS("_i_")"
    SET stat=myloader.ProcessList(gbl,1,2) }

This loads globals such as ^Aviation.NarrativeS(85879,1) and ^Aviation.NarrativeS(85879,2).
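
Because such a global is typically sparse, stepping through every integer visits many subscripts that hold no data. A possible variation, sketched here, uses $ORDER to visit only the first-level subscripts that actually exist. It assumes that myloader already has a %iKnow.Source.Global.Lister attached through SetLister(), and that the first-level subscripts are numeric (string subscripts would need to be quoted when building the global reference):

```objectscript
  // Assumes myloader is a Loader with a Global Lister set via SetLister().
  SET i=""
  FOR { SET i=$ORDER(^Aviation.NarrativeS(i)) QUIT:i=""
    // Build the reference for this existing first-level node only.
    SET gbl="^Aviation.NarrativeS("_i_")"
    // Load second-level subscripts 1 through 2 under this node.
    SET stat=myloader.ProcessList(gbl,1,2)
    IF stat '= 1 { WRITE "ProcessList error at subscript ",i,!
                   DO $System.Status.DisplayError(stat) }
  }
```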

Loading a String

The following executable example loads a single global (or a string literal) as a source. Note that no Lister is required when loading a string. You can specify the Configuration to apply in the ProcessBuffer() method.

ConfigurationCreateOrOpen
  IF ##class(%iKnow.Configuration).Exists("EnFr") {
       SET cfg=##class(%iKnow.Configuration).Open("EnFr") }
  ELSE { SET cfg=##class(%iKnow.Configuration).%New("EnFr",1,$LB("en","fr"))
         DO cfg.%Save() }
DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save() 
       SET domId=domoref.Id
       WRITE "Created the ",dname," domain with domain ID ",domId,!
       GOTO CreateLoader }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            SET domId=domoref.Id
            GOTO CreateLoader }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
CreateLoader
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
UseLoader
  SET ^a="I drove at 70mph then sped up to 100mph when the light changed."
  DO myloader.BufferSource("ref",^a)
  DO myloader.ProcessBuffer("EnFr")
QuerySources
  WRITE "number of sources:",##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)  

The first argument of the BufferSource() method specifies a unique external source Id. The following example creates a separate source for each global subscript:

  SET i=1
  WHILE $DATA(^a(i)) {
     DO myloader.BufferSource("ref"_i,^a(i))
     DO myloader.ProcessBuffer()
     SET i=i+1 }
  WRITE "end of data"

You can also use the %SYSTEM.iKnow utility method IndexString().

Updating the Domain Contents

After you have performed an initial load of sources to a domain, you can change this list of sources by adding sources or by deleting sources. Updating a domain refers to responding to changes in the set of source texts. This should not be confused with upgrading a domain, which refers to responding to changes in the NLP software, commonly after installing a significant new version of InterSystems IRIS.

Adding Sources

After you have performed an initial load of sources to a domain (using the AddListToBatch() and ProcessBatch() methods) you may want to add more files to the list of sources. This is done using the SetLister() and ProcessList() methods. The ProcessList() method takes the same parameters as the AddListToBatch() method.

  • To add one source at a time: SET stat=myloader.ProcessList("C:\mytextfiles\newfile.txt")

  • To add a directory of sources: SET stat=myloader.ProcessList("C:\mytextfiles\logfiles",$LB("log"),0,"")

Adding more sources to a batch load is shown in the following example:

DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
      { SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
        GOTO DeleteOldData }
  ELSE { SET domoref=##class(%iKnow.Domain).%New(dname)
         DO domoref.%Save()
         GOTO SetEnvironment }
DeleteOldData
  SET stat=domoref.DropData()
  IF stat { GOTO SetEnvironment }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
SetEnvironment
  SET domId=domoref.Id
  IF ##class(%iKnow.Configuration).Exists("myconfig") {
         SET cfg=##class(%iKnow.Configuration).Open("myconfig") }
  ELSE { SET cfg=##class(%iKnow.Configuration).%New("myconfig",0,$LISTBUILD("en"),"",1)
         DO cfg.%Save() }
ListerAndLoader
  SET flister=##class(%iKnow.Source.File.Lister).%New(domId)
      DO flister.Init("myconfig","","","","")
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
  SET stat=myloader.SetLister(flister)
SourceBatchLoad
  SET install=$SYSTEM.Util.DataDirectory()
  SET dirpath=install_"mgr\Temp\iris\mytextfiles"
  SET stat=flister.AddListToBatch(dirpath,$LB("txt"),0,"")
  SET stat=myloader.ProcessBatch()
  IF stat '= 1 { WRITE "Loader error ",$System.Status.DisplayError(stat)
                     QUIT }
QueryLoadedSources
  WRITE "Source count is ",##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId),!
ExpandListofSources
  SET elister=##class(%iKnow.Source.File.Lister).%New(domId)
      DO elister.Init("myconfig")
  SET stat=myloader.SetLister(elister)
  SET addpath=install_"dev\IRIS"
  SET stat=myloader.ProcessList(addpath,$LB("txt"),1,"")
  IF stat '= 1 { WRITE "The ProcessList loader status is ",$System.Status.DisplayError(stat)
                 QUIT }
QueryTotalSources
  WRITE "Expanded source count is ",##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)

You can also use the %SYSTEM.iKnow utility methods IndexFile() and IndexDirectory().

Deleting Sources

You can remove a source that has been loaded to a domain using the DeleteSource() method. This method cannot be used to delete a virtual source; a separate DeleteVirtualSource() method is provided for this purpose. Both methods are found in the %SYSTEM.iKnow class.

Loading a Virtual Source

A virtual source is a source that is not static. You might, for example, use a virtual source for a file that is being frequently modified. The srcId of a virtual source is a negative integer. The external Id of a virtual source begins with the ListerReference (the Lister class alias), commonly :TEMP.

Adding a virtual source does not update NLP statistics. For this reason, using a virtual source may be desirable when you wish to temporarily add sources for a specific purpose without incurring the overhead of revising the domain statistics. You should use a virtual source when adding a source that is being continuously modified, such as a source in the process of being written. Because the virtual source Id is a negative number, it is easy to distinguish virtual sources from regular sources. Different methods are used to delete virtual sources and regular sources.

You can load virtual sources using loader.SetLister() and loader.ProcessVirtualList() or loader.BufferSource() and loader.ProcessVirtualBuffer(). The following program loads a virtual source using ProcessVirtualBuffer().

DomainCreateOrOpen
  SET dname="mydomain"
  IF (##class(%iKnow.Domain).NameIndexExists(dname))
     { WRITE "The ",dname," domain already exists",!
       SET domoref=##class(%iKnow.Domain).NameIndexOpen(dname)
       GOTO DeleteOldData }
  ELSE 
     { WRITE "The ",dname," domain does not exist",!
       SET domoref=##class(%iKnow.Domain).%New(dname)
       DO domoref.%Save()
       SET domId=domoref.Id
       WRITE "Created the ",dname," domain with domain ID ",domId,!
       GOTO SetEnvironment }
DeleteOldData /* This DOES NOT delete virtual sources */
  SET stat=domoref.DropData()
  IF stat { WRITE "Deleted the data from the ",dname," domain",!!
            SET domId=domoref.Id
            GOTO SetEnvironment }
  ELSE    { WRITE "DropData error ",$System.Status.DisplayError(stat)
            QUIT}
SetEnvironment
  SET config="VSConfig"
  IF ##class(%iKnow.Configuration).Exists(config) {
         SET cfg=##class(%iKnow.Configuration).Open(config) }
  ELSE { SET cfg=##class(%iKnow.Configuration).%New(config,1)
         DO cfg.%Save() }
CreateLoader
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
VirtualSource
  SET node="",(total,status)=0
  FOR { SET node=$ORDER(^VendorData(node),1,data) QUIT:node="" 
   SET company=$LIST(data,1) QUIT:company=""
   SET address=$LTS($LIST(data,2))
   SET total=total+1
   SET status=myloader.BufferSource("SourceTest"_total,company)
   SET status=myloader.BufferSource("SourceTest"_total,address)
  }
  SET status=myloader.ProcessVirtualBuffer(config)

  SET vsrclist=myloader.GetSourceIds()
  FOR i=1:1:$LL(vsrclist) {
     SET srcid=-$LIST(vsrclist,i)
     WRITE "External Id=",##class(%iKnow.Queries.SourceAPI).GetExternalId(domId,srcid)
     WRITE "  Source Id=",srcid,!
     WRITE "  Sentence Count=",##class(%iKnow.Queries.SentenceAPI).GetCountBySource(domId,$lb(srcid)),!
  }

Note that the %iKnow.Queries.SourceAPI.GetCountByDomain() method does not count virtual sources. You can determine if a virtual source has been loaded by invoking %iKnow.Queries.SourceAPI.GetExternalId(domId,-1). Here -1 is the srcId of the first virtual source loaded.
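
A minimal sketch of this check follows. It assumes that domId holds the Id of the domain in question, and that GetExternalId() returns an empty string when no source with the given Id exists; verify that behavior for your version before relying on it:

```objectscript
  // Assumes domId identifies an existing domain.
  // The ordinary source count excludes virtual sources.
  WRITE "Ordinary sources: ",##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId),!
  // -1 is the srcId of the first virtual source loaded.
  SET vext=##class(%iKnow.Queries.SourceAPI).GetExternalId(domId,-1)
  // Assumption: an empty string means no such source exists.
  IF vext="" { WRITE "No virtual source with srcId -1 was found",! }
  ELSE { WRITE "First virtual source external Id: ",vext,! }
```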

By default, many NLP queries process only ordinary sources and ignore virtual sources. To use these queries to process a virtual source, you must specify a vSrcId parameter value for the query method.

Deleting a Virtual Source

The %iKnow.Source.Loader class provides two methods for deleting virtual sources.

  • DeleteVirtualSource() deletes a single virtual source indexed for a domain. You specify the domain Id (a positive integer) and the virtual source Id (a negative integer). This deletes all NLP entities generated for this source text.

  • DeleteAllVirtualSources() deletes all of the virtual sources indexed for a specified domain. This deletes all NLP entities generated for these source texts.
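
The calls might look like the following sketch. It assumes that both methods are invoked as class methods of %iKnow.Source.Loader and return a status value, and it passes the virtual source Id as a negative integer as described above; confirm the exact signatures in the class reference:

```objectscript
  // Assumes domId identifies a domain that contains virtual sources.
  // Assumption: class-method call form, status return value.
  // Delete the first virtual source loaded (srcId -1).
  SET stat=##class(%iKnow.Source.Loader).DeleteVirtualSource(domId,-1)
  IF stat '= 1 { DO $System.Status.DisplayError(stat) }
  // Delete every remaining virtual source in the domain.
  SET stat=##class(%iKnow.Source.Loader).DeleteAllVirtualSources(domId)
  IF stat '= 1 { DO $System.Status.DisplayError(stat) }
```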

Copying and Re-indexing Loaded Source Data

After you have successfully loaded sources into a domain, you may wish to copy some or all of these sources to another domain. When NLP copies these loaded sources it also re-indexes them. The copied sources therefore have different source Ids and entity Ids; the external Ids are not changed.

Some reasons you might want to copy/re-index from one domain to another:

  • To create a copy of a domain. You may wish to make a backup copy, or to create a copy to serve as a snapshot of the domain at a particular time. For example, when indexing RSS feeds you may wish to create a snapshot because these feeds change over time; at a future date you might no longer have access to the original source data.

  • To create a domain containing a subset of the original set of sources. The new domain can be smaller, more efficient, and easier to work with. You can specify this copied subset of sources by a list of source Ids to copy, or by a filter that limits which sources to copy. For example, you could create a domain consisting of only the newest sources, which you could then query without having to filter by date for each query.

  • To create a domain containing the merged sets of sources from two domains, or to add sources from one domain into a domain that already contains sources.

  • To re-index the sources in a domain after extreme modification of the set of sources. For example, if you very frequently add or delete multiple sources in a domain, the indexing may no longer be optimal. (Normal adding and deleting of sources does not degrade index performance.) By copying the domain, you re-index the current sources that you are copying, making the indexing in the new domain optimal.

  • To apply NLP language model revisions. Release versions of NLP commonly contain improvements to its language models. These may include introduction of support for new languages and improvements to already-supported languages. Copying the set of sources in a domain re-indexes these sources, and therefore applies the most current NLP language models to the copied sources.

You use the %iKnow.Source.Domain.Lister class to copy/re-index from one domain to another. The new domain must already be defined before you can create a Lister instance for this class using the %New() method. Both domains must be in the same namespace.

The following example populates the firstdomain domain, then copies the contents of firstdomain to an empty domain named newdomain, automatically re-indexing the newdomain contents:

EstablishAndPopulateFirstDomain
  SET domOref=##class(%iKnow.Domain).%New("firstdomain")
  DO domOref.%Save()
  SET domId=domOref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
  SET myquery="SELECT TOP 25 ID AS UniqueVal,Type,NarrativeFull,EventDate FROM Aviation.Event"
  SET idfld="UniqueVal"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
  IF stat '= 1 {WRITE "The lister failed: " DO $System.Status.DisplayError(stat) QUIT }
  SET stat=myloader.ProcessBatch()
  IF stat '= 1 {WRITE "The loader failed: " DO $System.Status.DisplayError(stat) QUIT }
TestQueryFirstDomain
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," sources in the from domain",!
CreateSecondDomain
  SET domOref=##class(%iKnow.Domain).%New("newdomain")
  DO domOref.%Save()
  SET domNewId=domOref.Id
CopyAndReindexFromFirstDomainToSecondDomain
  // The Lister and Loader belong to the target domain; the source domain Id is passed to AddListToBatch()
  SET newlister=##class(%iKnow.Source.Domain.Lister).%New(domNewId)
  SET newloader=##class(%iKnow.Source.Loader).%New(domNewId)
  SET stat=newlister.AddListToBatch(domId)
  IF stat '= 1 {WRITE "The copy lister failed: " DO $System.Status.DisplayError(stat) QUIT }
  SET stat=newloader.ProcessBatch()
  IF stat '= 1 {WRITE "The copy loader failed: " DO $System.Status.DisplayError(stat) QUIT }
TestQuerySecondDomain
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domNewId)," sources in the to domain",!
CleanUpForNextTime
  SET stat=##class(%iKnow.Domain).%DeleteId(domId)
  IF stat '= 1 {WRITE "Domain delete error: " DO $System.Status.DisplayError(stat) }
  SET stat=##class(%iKnow.Domain).%DeleteId(domNewId)
  IF stat '= 1 {WRITE "Domain delete error: " DO $System.Status.DisplayError(stat) }

The AddListToBatch() method can take a second lister parameter to specify which sources are to be copied. It can either specify a list of sources (a comma-separated list of source Id integers) or specify a filter. The following example is identical to the previous example, except that it limits which sources are to be copied by specifying a comma-separated list of source Ids.

EstablishAndPopulateFirstDomain
  SET domOref=##class(%iKnow.Domain).%New("firstdomain")
  DO domOref.%Save()
  SET domId=domOref.Id
  SET flister=##class(%iKnow.Source.SQL.Lister).%New(domId)
  SET myloader=##class(%iKnow.Source.Loader).%New(domId)
  SET myquery="SELECT TOP 25 ID AS UniqueVal,Type,NarrativeFull,EventDate FROM Aviation.Event"
  SET idfld="UniqueVal"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")
  SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds)
  IF stat '= 1 {WRITE "The lister failed: " DO $System.Status.DisplayError(stat) QUIT }
  SET stat=myloader.ProcessBatch()
  IF stat '= 1 {WRITE "The loader failed: " DO $System.Status.DisplayError(stat) QUIT }
TestQueryFirstDomain
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)," sources in the from domain",!
CreateSecondDomain
  SET domOref=##class(%iKnow.Domain).%New("newdomain")
  DO domOref.%Save()
  SET domNewId=domOref.Id
SubsetOfSourcesToCopy
  // Comma-separated source Ids (from firstdomain) of the sources to copy
  SET subset="1,3,5,7,9,11,13,15,17,19"
CopyAndReindexFromFirstDomainToSecondDomain
  SET newlister=##class(%iKnow.Source.Domain.Lister).%New(domNewId)
  SET newloader=##class(%iKnow.Source.Loader).%New(domNewId)
  SET stat=newlister.AddListToBatch(domId,subset)
  IF stat '= 1 {WRITE "The copy lister failed: " DO $System.Status.DisplayError(stat) QUIT }
  SET stat=newloader.ProcessBatch()
  IF stat '= 1 {WRITE "The copy loader failed: " DO $System.Status.DisplayError(stat) QUIT }
TestQuerySecondDomain
  WRITE ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domNewId)," sources in the to domain",!
CleanUpForNextTime
  SET stat=##class(%iKnow.Domain).%DeleteId(domId)
  IF stat '= 1 {WRITE "Domain delete error: " DO $System.Status.DisplayError(stat) }
  SET stat=##class(%iKnow.Domain).%DeleteId(domNewId)
  IF stat '= 1 {WRITE "Domain delete error: " DO $System.Status.DisplayError(stat) }

UserDictionary and Copied Sources

A UserDictionary is applied when a source is listed. Therefore, any UserDictionary modifications made to the initially loaded sources appear in the copied sources. However, because the copy operation is also a list operation, you can apply a new UserDictionary to modify the sources as they are copied.

For example, suppose the UserDictionary used when the sources were originally listed substitutes “Doctor” for the abbreviation “Dr.”; this substitution is present in the copied sources. Later, you modify the UserDictionary to also substitute “doctor” for “physician”. This change to your UserDictionary has no effect on the already-loaded sources. When you copy the sources, you apply the revised UserDictionary. The “Dr.” to “Doctor” substitution is not performed again, because it is already present in the initially loaded sources; the “physician” to “doctor” substitution is performed on the copied sources.