
Stemming

iKnow supports stemming, an optional feature when performing iKnow indexing of sources. Stemming allows iKnow to recognize the stem form of a word that has several grammatical forms. For example, the stem of a noun is typically its nominative singular form. The stem of a verb is typically its infinitive form. Stemming occurs at the entity level. This means that iKnow establishes the stem form of an entity, rather than the stem forms of the entity’s individual words. It is an additional level of normalization from the original source text.

Note:

Stemming support significantly increases iKnow indexing space requirements and impacts iKnow indexing performance. It is recommended that you only activate stemming for a domain when there is a real need for this functionality. Stemming is recommended for use with Russian and Ukrainian text sources.

Stemming is not performed on path-relevant entities.

Stemming is not supported for Japanese at this time.

Stemming may result in a stemmed entity that is no longer grammatically valid. For example, the stem form of the entity “two bananas” would be “two banana”. Because the stem form of an entity may not actually exist in the indexed sources, iKnow pairs the stem form with a “representation form”, which is the closest entity (as measured by the Levenshtein distance) to the stem form that actually exists as an indexed entity in the sources. The representation form can, of course, be identical to the stem form, if the stem form exists as an indexed entity in the sources.
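To make the pairing concrete, the following Python sketch picks, from a list of indexed entities, the one with the smallest Levenshtein distance to a given stem form. It illustrates the idea only and is not iKnow's implementation; the function names and the tie-breaking rule (first candidate wins) are assumptions made for this example.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (standard dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca from a
                current[j - 1] + 1,      # insert cb into a
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


def closest_indexed_entity(stem: str, indexed_entities: Iterable[str]) -> str:
    """Hypothetical helper: the indexed entity nearest to the stem form.

    Assumption: ties are resolved in favor of the first candidate seen.
    Raises ValueError if no indexed entities are supplied.
    """
    return min(indexed_entities, key=lambda entity: levenshtein(stem, entity))
```

With the entities "two bananas", "three apples", and "banana" indexed, the stem "two banana" is paired with "two bananas", which is one edit away. If the stem itself is an indexed entity, the distance is zero and the stem is its own representation form.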

iKnow uses a plug-in architecture for generating stems, enabling the use of third-party stemming tools, if available. iKnow uses Hunspell to generate stems. This means that when stemming is required, iKnow searches for the Hunspell affix (.aff) file for the specified language in the INSTALLDIR/dev/hunspell directory. If no Hunspell dictionary is provided for the specified language, iKnow searches for the %Text subclass for the specified language. Caché provides %Text datatype classes for five of the iKnow supported languages: %Text.English (en), %Text.French (fr), %Text.German (de), %Text.Portuguese (pt), and %Text.Spanish (es). If neither a Hunspell dictionary file nor a language-specific %Text class is available, iKnow uses the %Text.Text class. Refer to the %Text package class documentation in the InterSystems Class Reference for further details.
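The lookup order described above can be sketched as a simple fallback chain. This Python fragment only models the decision. It assumes, for illustration, that the affix file is named after the two-letter language code (for example ru.aff), which the text above does not state, and the function name and return strings are hypothetical.

```python
from pathlib import Path

# Language-specific %Text classes listed in the text above.
TEXT_CLASSES = {
    "en": "%Text.English",
    "fr": "%Text.French",
    "de": "%Text.German",
    "pt": "%Text.Portuguese",
    "es": "%Text.Spanish",
}


def choose_stemmer(language: str, hunspell_dir: Path) -> str:
    """Hypothetical model of the stemmer lookup order.

    1. A Hunspell affix file for the language, if one is present.
    2. Otherwise the language-specific %Text class, if one exists.
    3. Otherwise the generic %Text.Text class.
    """
    # Assumption: the affix file is named "<language code>.aff".
    affix_file = hunspell_dir / f"{language}.aff"
    if affix_file.is_file():
        return f"hunspell:{affix_file.name}"
    return TEXT_CLASSES.get(language, "%Text.Text")
```

Under these assumptions, a directory that contains only ru.aff yields the Hunspell stemmer for "ru", %Text.French for "fr", and %Text.Text for a language with neither resource.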

Configuring Stemming

Stemming can only be activated or deactivated for an empty domain. Once a domain is populated with source texts, you cannot change the stemming option for that domain, except by first removing all domain contents.

To activate stemming for an empty iKnow domain, set the $$$IKPSTEMMING domain parameter to 1. Stemming is inactive by default. To deactivate stemming for an empty iKnow domain, set $$$IKPSTEMMING to 0.

The domain must be empty when setting this parameter. When you add source texts to a domain where stemming is activated, iKnow creates a stem form for every entity and populates appropriate data structures so that stems can be queried.

Create an instance of the stemmer and specify the default language (or languages, using %iKnow.Stemming.MultiLanguageConfig). Specify languages using their ISO 639-1 two-character abbreviations.

Hunspell

Caché provides Hunspell files for Russian (ru) and Ukrainian (uk) in the dev/hunspell directory. You should place Hunspell dictionary files for other languages in the Caché dev/hunspell directory.

iKnow modifies the behavior of Hunspell stemming in one important respect. Hunspell stemming removes prefixes from words. iKnow restores prefixes removed by Hunspell before indexing the resulting stem forms.

If Hunspell returns more than one possible stem for a word, iKnow will attempt to disambiguate these options using the entity context to determine if the entity is a concept or a relation.

Stem Retrieval Methods

The %iKnow.Queries.EntityAPI class provides methods that return stem values.

You can also return the frequency and spread for a specified stem ID.

Using Stems

Many iKnow query methods allow you to query on stems rather than on entities. By setting the method argument pUseStems=1, these query methods return values based on stem form values rather than exact entity values. The default, pUseStems=0, causes these query methods to return values based on exact entity values.

For example, the GetTop() method, by default, returns the most frequently occurring entities in a domain. If you specify pUseStems=1, this method returns the most frequently occurring stem forms in a domain, potentially merging together as a single stem form multiple entities that differ only in grammatical form.
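The merging effect can be illustrated with a small Python sketch. It does not call iKnow: the function name top_entries, the use_stems flag, and the entity-to-stem mapping are hypothetical stand-ins for the behavior that pUseStems controls.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_entries(entity_counts: Dict[str, int],
                stem_of: Dict[str, str],
                use_stems: bool = False,
                top_n: int = 10) -> List[Tuple[str, int]]:
    """Hypothetical model of a top-frequency query.

    With use_stems=False, entities are ranked by their own frequency.
    With use_stems=True, the frequencies of all entities that share a
    stem form are summed and the stem forms are ranked instead.
    """
    if not use_stems:
        return Counter(entity_counts).most_common(top_n)
    merged = Counter()
    for entity, count in entity_counts.items():
        merged[stem_of[entity]] += count
    return merged.most_common(top_n)


# Toy data: two grammatical forms of the same entity plus one other entity.
counts = {"banana": 3, "bananas": 4, "apple": 5}
stems = {"banana": "banana", "bananas": "banana", "apple": "apple"}
```

With this toy data, ranking by entity puts "apple" first with a frequency of 5, while ranking by stem puts "banana" first with a combined frequency of 7.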

If you set pUseStems=1 in a domain that does not support stemming, the method returns no results.

For further details, refer to the iKnow Queries chapter of this manual.
