Smart Matching: Creating a Dictionary

Important:

InterSystems has deprecated InterSystems IRIS® Natural Language Processing (NLP). It may be removed from future versions of InterSystems products. The following documentation is provided as reference for existing users only. Existing users who would like assistance identifying an alternative solution should contact the WRC.

Smart Matching means combining the results of the NLP Indexing process with some external knowledge you have in the form of a dictionary, taxonomy, or ontology. What makes NLP matching “smart” is that those indexing results help you judge the quality of a match because they identify which words belong together to form concepts and relations. For example, NLP can identify if a match for your dictionary term "flu" is actually referring to the concept "flu" or the concept "bird flu" in your indexed text source. In the latter case, which is called a partial match, it is clear the match should or could be treated differently than the full match where the dictionary term corresponds exactly to the entity in the indexed text source.

To perform Smart Matching, you must create or acquire a dictionary. If you are creating a dictionary, you must then populate it with the items and terms that you wish to use for matching. Once you have a populated dictionary, you can perform matching operations using the contents of the dictionary.

Note:

Dictionary definition is not supported for Japanese at this time.

Introducing Dictionary Structure and Matching

To populate an NLP dictionary you first create an item, then associate one or more terms with that item. Commonly a dictionary consists of multiple items, with each item associated with multiple terms. An item is a word or phrase that is a relevant tag for many entities in the source texts. When an entity in the source texts is determined to be a match, it is tagged with the item. For example, the item “ship” is a relevant tag for “ship”, “boat”, “sail”, “oars”, and so forth.

To perform this matching, you populate each item in the dictionary with match terms. A term can be a single entity (like “motor boat”) or a phrase or sentence (like “boats are rowed with oars or paddles”). NLP indexes each term in the dictionary using the same language model used for the source texts. NLP then matches each term against the same type of content unit in the source texts (a Concept term is matched against a Concept in source text; a CRC term is matched against a CRC in source text). If NLP identifies a match between a term and a unit of source text, NLP tags the source text passage with the associated dictionary item. The term and the source text are frequently not identical, so NLP uses a scoring algorithm to determine whether they are close enough to be tagged as a match.

The NLP dictionary facility supports stemming if stemming has been activated for the current domain. This means that a single dictionary term can match any other form of the same word in a source text.

Terminology

A Dictionary is a way to group different terms that have something to do with one another in a logical way. A dictionary could for example be Cities, ICD10 codes, or French wines. Because the dictionary is the level of aggregation used within the matching APIs, the level of real-world grouping that should correspond to a dictionary depends on your use case. A higher level (such as "all ICD10 codes") will yield better performance and use less disk space, but a lower level (such as "a separate dictionary for each ICD10 category") might offer grouped results with greater granularity. Each dictionary has a name and a description.

A Dictionary Item is a uniquely identifiable item in your dictionary. Examples of a dictionary item could be cities, the individual codes in ICD10 or individual chateaux. Each dictionary typically has many dictionary items (lots of small dictionaries with few items can decrease performance). A dictionary item has a URI, which should be unique within the domain and can be used as an external identifier, and an optional description. This URI can be used when building rules to interpret matching results later on.

A Dictionary Term is a string that could appear somewhere in a text and represent the Dictionary Item it belongs to. For example, "Antwerp", "Anvers" and "Antwerpen" could be different terms associated with the same dictionary item representing the city of Antwerp. Dictionary terms are the free text strings on which the actual matching is based when doing string-based matching and could be different spellings, translations or synonyms of what your Dictionary Item stands for. These strings are passed through the engine and, when containing more than just a single entity, will automatically be transformed into a more complex structure to be able to match across the boundaries of a single concept (CRC or Path). A dictionary term should also have a language associated with it, if it needs to be processed by the engine.

When processing a new dictionary term by passing it through the NLP engine, one or more Dictionary Elements are generated to represent the different entities identified within the term. For example, a dictionary term "failure of the liver" would be translated into the three elements "failure", "of" and "liver", with "the" being discarded as non-relevant. These elements are generated and managed automatically and only figure in some types of output, so you shouldn't worry too much about them.

If you want to identify specifically-formatted dates, numbers or other formatted pieces of string, you can use Dictionary Formats to specify them, and these can then be included in a Dictionary Term, either representing the complete term, or just a single element within a more complex one. A format is a meaningful pattern of characters, such as a date format. You could associate the formats “nn/nn/nnnn” and “nnnn-nn-nn” with the item named Date. NLP tags any occurrence of these formats in the source texts with the Date item.
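As an informal illustration of what these character patterns describe, the following sketch tests a few strings against the two date formats above using the ObjectScript pattern match operator (?). It is not a Dictionary Format class and does not call the NLP API; the sample strings are invented for this example.

```objectscript
  // Sketch only: plain ObjectScript pattern matching, not an NLP Dictionary Format
  FOR str="12/25/2020","2020-12-25","Dec 25" {
    // 2N = two digits, 4N = four digits, 1"/" = one literal slash
    IF str?2N1"/"2N1"/"4N { WRITE str," has the form nn/nn/nnnn",! }
    ELSEIF str?4N1"-"2N1"-"2N { WRITE str," has the form nnnn-nn-nn",! }
    ELSE { WRITE str," has neither form",! }
  }
```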

Note:

NLP provides semantic attributes that flag many common representations of date, time, duration, and measurement. Check the availability and specificity of these attributes in your national language before defining Dictionary Formats.

Creating a Dictionary

To define and populate a dictionary, use the %iKnow.Matching.DictionaryAPI class methods described in this section. You can define a dictionary specifically for a domain, or define a dictionary that is domain-independent and can be used by any domain in the current namespace.

%iKnow.Matching.DictionaryAPI has a number of methods to create a new dictionary and to assign it items, terms, and formats:

  • CreateDictionary() is used to create an NLP dictionary.

    The 1st argument specifies the domain Id as an integer. To assign the dictionary to a domain, specify its domain ID as a positive integer. To define the dictionary as domain-independent, specify 0 as its domain ID. The 2nd argument allows you to specify a meaningful dictionary name. The remaining arguments are optional. The 3rd argument allows you to provide a description of the dictionary, the 4th allows you to specify the language (default is English), and the 5th a custom matching profile. CreateDictionary() returns the dictId, a unique integer. This dictionary ID is used by subsequent smart matching methods. If a dictionary with the specified name already exists, CreateDictionary() returns -1.

  • CreateDictionaryItem() is used to create an item within a dictionary. You specify the dictId. CreateDictionaryItem() returns the dictItemId, a unique integer.

  • CreateDictionaryTerm() is used to associate a term with an existing item. You supply the dictItemId. CreateDictionaryTerm() returns the dictTermId, a unique integer.

  • CreateDictionaryItemAndTerm() is a shortcut for one specific case: it creates an item and a term associated with that item when the item and the term have the same value. For example, the item “flu” might have several associated terms (“influenza”, “la grippe”, “bird flu”, “H1N1”); you can use CreateDictionaryItemAndTerm() to create the item “flu” and assign it the associated term “flu”. You could, of course, perform the same operation using two method calls: CreateDictionaryItem() and CreateDictionaryTerm().

  • CreateDictionaryTermFormat() is used to associate a term that consists of a format with an existing item. You supply the dictItemId. CreateDictionaryTermFormat() returns the dictTermId, a unique integer.

Dictionaries and Domains

Each dictionary you create can either be specific to a domain, or can be domain-independent and usable by any domain in the current namespace:

  • A domain-specific dictionary is assigned to a domain by specifying a domainId in the CreateDictionary() method. You specify the same domainId for the dictionary’s items, terms, and formats. This method returns a dictId as a sequential positive integer. Matching methods that use this dictionary reference it by this dictId.

  • A domain-independent dictionary is not assigned to a domain. Instead, you specify a domainId of 0 in the CreateDictionary() method. You also specify a domainId of 0 for the dictionary’s items, terms, and formats. This method returns a dictId as a sequential positive integer. Matching methods that use this dictionary reference it by a negative dictId; for example, the dictionary identified by dictId 8 is referenced by the dictId value -8.

    Using a domain-independent dictionary has important consequences for stemming. When you create a domain-specific dictionary of ordinary terms, NLP automatically stems the dictionary terms if the domain is configured as stemmed, and therefore dictionary terms and source text match. When you create a domain-independent dictionary, stem conversion of the dictionary terms is not performed. You can either create a dictionary of ordinary (unstemmed) terms, or a dictionary of stemmed terms. A domain-independent dictionary of ordinary terms cannot be matched against a stemmed domain. A domain-independent dictionary of stemmed terms cannot be matched against an unstemmed domain.

Just as several domains can all have a domain-specific dictionary with the same dictId value, both a domain-specific dictionary and a domain-independent dictionary can have the same integer dictId value. Dictionary match operations can use any combination of domain-specific dictionaries (specified as positive integer IDs) and domain-independent dictionaries (specified as negative integer IDs).

Queries in the Matching API returning matching results will return negative identifiers (for the dictId, itemId, and termId) when the match corresponds to an entry in a domain-independent dictionary. All queries will return the combined results for domain-specific and domain-independent dictionary matches, with the exception of GetDictionaryMatches() and GetDictionaryMatchesById(), which only return results for either domain-specific or domain-independent dictionaries, depending on the values specified in the dictIds parameter. The default is domain-specific dictionary matches.
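The sign convention described above can be sketched as follows. The two dictionary IDs are hypothetical, and the sketch assumes that the dictIds parameter of a matching method is supplied as a %List; check the class reference for the method you are calling before relying on that.

```objectscript
  // Hypothetical IDs: 3 was returned for a domain-specific dictionary,
  // 8 was returned for a domain-independent dictionary (created with domain ID 0)
  SET domainDictId=3
  SET sharedDictId=8
  // A domain-independent dictionary is referenced by the negative of its ID
  SET dictIds=$LISTBUILD(domainDictId,-sharedDictId)
  // The same sign test applies to dictId, itemId, and termId values in match results
  FOR i=1:1:$LISTLENGTH(dictIds) {
    SET id=$LIST(dictIds,i)
    IF id<0 { WRITE "domain-independent dictionary ",-id,! }
    ELSE { WRITE "domain-specific dictionary ",id,! }
  }
```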

Dictionary Creation Examples

Refer to A Note on Program Examples for details on the coding and data used in the examples in this book.

The following example creates a dictionary named "AviationTerms" and populates it with two items and their associated terms. This dictionary is assigned to a specific domain.

  SET domId=##class(%iKnow.Domain).GetOrCreateId("mydomain")
  /* ... */
CreateDictionary
  SET dictname="AviationTerms"
  SET dictdesc="A dictionary of aviation terms"
  SET dictId=##class(%iKnow.Matching.DictionaryAPI).CreateDictionary(domId,dictname,dictdesc)
  IF dictId=-1 {WRITE "Dictionary ",dictname," already exists",!
                GOTO ResetForNextTime }
  ELSE {WRITE "created a dictionary ",dictId,!}
PopulateDictionaryItem1
  /* Create the item "aircraft", then attach two terms to it */
  SET itemId=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryItem(domId,dictId,"aircraft",domId_dictId_"aircraft")
  SET term1Id=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryTerm(domId,itemId,"airplane")
  SET term2Id=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryTerm(domId,itemId,"helicopter")
PopulateDictionaryItem2
  /* Create the item "weather" together with a term of the same value, then add more terms */
  SET itemId2=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryItemAndTerm(domId,dictId,"weather",domId_dictId_"weather")
  SET i2term1Id=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryTerm(domId,itemId2,"meteorological information")
  SET i2term2Id=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryTerm(domId,itemId2,"visibility")
  SET i2term3Id=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryTerm(domId,itemId2,"winds")
DisplayDictionary
  SET stat=##class(%iKnow.Matching.DictionaryAPI).GetDictionaryItemsAndTerms(.result,domId,dictId)
  SET i=1
  WHILE $DATA(result(i)) {
      WRITE $LISTTOSTRING(result(i),",",1),!
      SET i=i+1 }
  WRITE "End of items in dictionary ",dictId,!!
   /* ... */
ResetForNextTime
  IF dictId = -1 {
     SET dictId=##class(%iKnow.Matching.DictionaryAPI).GetDictionaryId(domId,dictname)}
  SET stat=##class(%iKnow.Matching.DictionaryAPI).DropDictionary(domId,dictId)
  IF stat {WRITE "deleted dictionary ",dictId,! }
  ELSE    { WRITE "DropDictionary error: ",$System.Status.GetErrorText(stat),! }

The following example creates the same dictionary as the previous example, except that this dictionary can be used by any domain within the current namespace:

CreateDictionary
  SET dictname="AviationTerms"
  SET dictdesc="A dictionary of aviation terms"
  SET dictId=##class(%iKnow.Matching.DictionaryAPI).CreateDictionary(0,dictname,dictdesc)
  IF dictId=-1 {WRITE "Dictionary ",dictname," already exists",!
                GOTO ResetForNextTime }
  ELSE {WRITE "created a dictionary ",dictId,!}
PopulateDictionaryItem1
  /* Domain ID 0 marks the item and its terms as domain-independent */
  SET itemId=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryItem(0,dictId,"aircraft",0_dictId_"aircraft")
  SET term1Id=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryTerm(0,itemId,"airplane")
  SET term2Id=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryTerm(0,itemId,"helicopter")
PopulateDictionaryItem2
  /* Create the item "weather" together with a term of the same value, then add more terms */
  SET itemId2=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryItemAndTerm(0,dictId,"weather",0_dictId_"weather")
  SET i2term1Id=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryTerm(0,itemId2,"meteorological information")
  SET i2term2Id=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryTerm(0,itemId2,"visibility")
  SET i2term3Id=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryTerm(0,itemId2,"winds")
DisplayDictionary
  SET domId=##class(%iKnow.Domain).GetOrCreateId("mydomain")
  SET stat=##class(%iKnow.Matching.DictionaryAPI).GetDictionaryItemsAndTerms(.result,0,dictId)
  SET i=1
  WHILE $DATA(result(i)) {
      WRITE $LISTTOSTRING(result(i),",",1),!
      SET i=i+1 }
  WRITE "End of items in dictionary ",dictId,!!
   /* ... */
ResetForNextTime
  IF dictId = -1 {
     SET dictId=##class(%iKnow.Matching.DictionaryAPI).GetDictionaryId(0,dictname)}
  SET stat=##class(%iKnow.Matching.DictionaryAPI).DropDictionary(0,dictId)
  IF stat {WRITE "deleted dictionary ",dictId,! }
  ELSE    { WRITE "DropDictionary error: ",$System.Status.GetErrorText(stat),! }

Defining a Format Term

The %iKnow.Matching.Formats package provides three simple format classes. You can create additional format classes as needed.

The following example uses %iKnow.Matching.Formats.SimpleSuffixFormat. It first defines a dictionary containing one item: speed. The “speed” item contains two terms: “excessive speed” and the suffix format term “mph” (miles per hour). This suffix format will match any entity that ends with the suffix “mph”, for example “65mph”:

  SET domId=##class(%iKnow.Domain).GetOrCreateId("mydomain")
  /* ... */
CreateDictionary
  SET dictname="Traffic"
  SET dictdesc="A dictionary of traffic enforcement terms"
  SET dictId=##class(%iKnow.Matching.DictionaryAPI).CreateDictionary(domId,dictname,dictdesc)
  IF dictId=-1 {WRITE "Dictionary ",dictname," already exists",!
                GOTO ResetForNextTime }
  ELSE {WRITE "created a dictionary ",dictId,!}
CreateDictionaryItemAndTerms
  SET item1Id=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryItem(domId,dictId,"speed",domId_dictId_"speed")
  SET term1Id=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryTerm(domId,item1Id,
            "excessive speed")
  SET term2Id=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryTermFormat(domId,
            item1Id,"%iKnow.Matching.Formats.SimpleSuffixFormat",$LB("mph",0,3))
  WRITE "dictionary=",dictId,!,"item=",item1Id,!,"terms=",term1Id," ",term2Id,!!
   /* ... */
ResetForNextTime
  IF dictId = -1 {
     SET dictId=##class(%iKnow.Matching.DictionaryAPI).GetDictionaryId(domId,dictname)}
  SET stat=##class(%iKnow.Matching.DictionaryAPI).DropDictionary(domId,dictId)
  IF stat {WRITE "deleted dictionary ",dictId,! }
  ELSE { WRITE "DropDictionary error: ",$System.Status.GetErrorText(stat),! }

Multiple Formats in a Dictionary Term

You can input dictionary formats directly as part of a dictionary term. This allows you to create a dictionary term containing multiple elements, including one or more format elements, as well as string elements.

To use this feature, you specify a "coded" description of the format as part of the string submitted to the CreateDictionaryTerm() method. This coded description has the following format:

@@@User.MyFormatClass@@@param1@@@param2@@@

This description consists of the full class name of the format class (implementing %iKnow.Matching.Formats.Format), a @@@ separator, and a @@@-delimited list of the format parameters to be passed to the format class. The entire description is delimited with @@@ markers at the beginning and end.

If the format class takes no parameters, or the defaults are to be used, specify the format class name delimited by @@@ markers.
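The coded description can be assembled and taken apart with ordinary string functions. The following is a minimal sketch, reusing the placeholder class name and parameters from the template above; it only manipulates the string and does not call the NLP API.

```objectscript
  // Placeholder class name and parameters, taken from the template above
  SET formatClass="User.MyFormatClass"
  SET params=$LISTBUILD("param1","param2")
  // Join the class name and parameters with @@@ and wrap the whole in @@@ markers
  SET coded="@@@"_formatClass
  FOR i=1:1:$LISTLENGTH(params) { SET coded=coded_"@@@"_$LIST(params,i) }
  SET coded=coded_"@@@"
  WRITE coded,!
  // Reading it back: piece 1 and the last piece are empty, piece 2 is the class name
  WRITE "class: ",$PIECE(coded,"@@@",2),!
  FOR p=3:1:$LENGTH(coded,"@@@")-1 { WRITE "param: ",$PIECE(coded,"@@@",p),! }
```

With an empty parameter list, the same code produces the class name delimited by @@@ markers, which is the form used when the format class takes no parameters.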

When including this format in a dictionary term string, you must make sure that NLP will recognize it as a single entity. For example, the term "was born in @@@User.MyYearFormat@@@" is interpreted as a single entity, but the term "was born in the year @@@User.MyYearFormat@@@" is not.

If NLP cannot find the specified format class, the @@@ usage is considered intentional and the whole entity is treated as a simple string element.

Using this syntax makes it easier to load dictionaries from files or tables without requiring separate steps or actions for the formats.
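A term containing a coded format is created with the same CreateDictionaryTerm() call used for ordinary terms. The sketch below follows the pattern of the earlier examples; the dictionary name, item name, and User.MyYearFormat are placeholders, and a format class with that name is assumed to exist in the namespace.

```objectscript
  SET domId=##class(%iKnow.Domain).GetOrCreateId("mydomain")
  // "Biography" and "birth" are illustrative names only
  SET dictId=##class(%iKnow.Matching.DictionaryAPI).CreateDictionary(domId,"Biography","A dictionary of biographical terms")
  IF dictId=-1 { WRITE "Dictionary already exists",!  QUIT }
  SET itemId=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryItem(domId,dictId,"birth",domId_dictId_"birth")
  // User.MyYearFormat is a placeholder; if NLP cannot find the class, the whole entity is treated as a plain string
  SET termId=##class(%iKnow.Matching.DictionaryAPI).CreateDictionaryTerm(domId,itemId,"was born in @@@User.MyYearFormat@@@")
  WRITE "item=",itemId," term=",termId,!
  // Clean up so that the example can be run again
  SET stat=##class(%iKnow.Matching.DictionaryAPI).DropDictionary(domId,dictId)
```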

Listing and Copying Dictionaries

The %iKnow.Matching.DictionaryAPI class has a number of methods to count or list existing dictionaries and their items and terms.

The %iKnow.Utils.CopyUtils class has a number of methods to copy a dictionary or all dictionaries from one domain to another.

Listing Existing Dictionaries

The following example lists all of the dictionaries in the domain. For the purpose of demonstration, this example first creates two empty dictionaries, one in English (the default language) and one in French:

  SET domId=##class(%iKnow.Domain).GetOrCreateId("mydomain")
  SET dictname1="Diseases",dictname2="Maladies"
  SET dictdesc1="English disease terms",dictdesc2="French disease terms"
CreateFirstDictionary
  SET dictId1=##class(%iKnow.Matching.DictionaryAPI).CreateDictionary(domId,dictname1,dictdesc1)
  IF dictId1 = -1 {
     SET dictId=##class(%iKnow.Matching.DictionaryAPI).GetDictionaryId(domId,dictname1)
     SET stat=##class(%iKnow.Matching.DictionaryAPI).DropDictionary(domId,dictId)
     IF stat '= 1 { WRITE "DropDictionary error: ",$System.Status.GetErrorText(stat),!
                    QUIT }
     GOTO CreateFirstDictionary }
  ELSE {WRITE "created a dictionary ",dictId1,!}
CreateSecondDictionary
  SET dictId2=##class(%iKnow.Matching.DictionaryAPI).CreateDictionary(domId,dictname2,dictdesc2,"fr")
  IF dictId2 = -1 {
     SET dictId=##class(%iKnow.Matching.DictionaryAPI).GetDictionaryId(domId,dictname2)
     SET stat=##class(%iKnow.Matching.DictionaryAPI).DropDictionary(domId,dictId)
     IF stat '= 1 { WRITE "DropDictionary error: ",$System.Status.GetErrorText(stat),!
                    QUIT }
     GOTO CreateSecondDictionary }
  ELSE {WRITE "created a dictionary ",dictId2,!}

GetDictionaries
  SET stat=##class(%iKnow.Matching.DictionaryAPI).GetDictionaries(.dicts,domId)
  WRITE "get dictionaries status is: ",$SELECT(stat:"OK",1:$System.Status.GetErrorText(stat)),!!
  SET k=1
  WHILE $DATA(dicts(k)) {
      WRITE $LISTTOSTRING(dicts(k)),!
      SET k=k+1 }
  WRITE "End of list of dictionaries"

GetDictionaries() lists the Id, name, description, and language for each dictionary.
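To work with individual fields rather than printing each row as a string, you can unpack the list elements. The following sketch assumes that each row is a %List whose first four elements appear in the order given above (Id, name, description, language); verify the order against the class reference for your version.

```objectscript
  SET domId=##class(%iKnow.Domain).GetOrCreateId("mydomain")
  SET stat=##class(%iKnow.Matching.DictionaryAPI).GetDictionaries(.dicts,domId)
  // Walk the result array; the three-argument $ORDER copies each row into "row"
  SET k=""
  FOR {
    SET k=$ORDER(dicts(k),1,row)  QUIT:k=""
    // Assumed element order: Id, name, description, language
    WRITE "id=",$LISTGET(row,1)," name=",$LISTGET(row,2)
    WRITE " description=",$LISTGET(row,3)," language=",$LISTGET(row,4),!
  }
```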

Copying Dictionaries

You can copy dictionaries from one domain to another within the current namespace.

Extending Dictionary Constructs

Though NLP only describes simple dictionaries in the Matching API, this does not restrict you from using more advanced tools like ontologies, taxonomies or other more hierarchical constructs. The goal of the Matching API is to provide the hooks for just the matching, rather than yet another generic structure that tries to cover every construct. Therefore, you should just flatten the structure of the ontology or taxonomy you have. By appropriately choosing your dictionary item URIs, you'll be able to reconstruct or interpret the matching results within the context of your ontology or taxonomy.
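One way to flatten a hierarchy is to encode each node's path in the item URI. The following sketch uses a hypothetical two-level taxonomy held in a local array and an invented ":icd10:category:code" URI layout; it only builds and parses the URIs, which you would then pass as the URI argument of CreateDictionaryItem().

```objectscript
  // Hypothetical two-level taxonomy: category -> codes (illustrative values only)
  SET tax("A00","A00.0")="",tax("A00","A00.1")="",tax("A01","A01.0")=""
  SET cat=""
  FOR {
    SET cat=$ORDER(tax(cat))  QUIT:cat=""
    SET code=""
    FOR {
      SET code=$ORDER(tax(cat,code))  QUIT:code=""
      // Encode the path in the URI so that the hierarchy can be recovered from a match result
      SET uri=":icd10:"_cat_":"_code
      WRITE "item URI ",uri," -> category ",$PIECE(uri,":",3),", code ",$PIECE(uri,":",4),!
    }
  }
```

Because the category is part of every URI, matches from a single flat dictionary can still be grouped by category afterwards.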

Formats in the Matching API are pluggable: you can supply your own format class (for example, one that performs regular expression matching) by implementing the %iKnow.Matching.Formats.Format interface.