
Stemming

Important:

InterSystems has deprecated InterSystems IRIS® Natural Language Processing (NLP). It may be removed from future versions of InterSystems products. The following documentation is provided as reference for existing users only. Existing users who would like assistance identifying an alternative solution should contact the WRC.

NLP supports stemming, an optional feature when performing NLP indexing of sources. Stemming allows NLP to recognize the stem form of a word that has several grammatical forms. For example, the stem of a noun is typically its nominative singular form. The stem of a verb is typically its infinitive form. Stemming occurs at the entity level. This means that NLP establishes the stem form of an entity, rather than the stem forms of the entity’s individual words. It is an additional level of normalization from the original source text.

Note:

Stemming support significantly increases NLP indexing space requirements and impacts NLP indexing performance. It is recommended that you only activate stemming for a domain when there is a real need for this functionality. Stemming is recommended for use with Russian and Ukrainian text sources.

Stemming is not performed on path-relevant entities.

Stemming is not supported for Japanese at this time.

Stemming may result in a stemmed entity that is no longer grammatically valid. For example, the stem form of the entity “two bananas” would be “two banana”. Because the stem form of an entity may not actually exist in the indexed sources, NLP pairs the stem form with a “representation form”: the entity closest to the stem form (as measured by the Levenshtein distance) that actually exists as an indexed entity in the sources. The representation form is identical to the stem form when the stem form itself exists as an indexed entity in the sources.
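The pairing of a stem form with its representation form can be illustrated with a small sketch. This is not the NLP implementation; it is a minimal Python model that assumes the indexed entities are available as a plain list of strings. The function names levenshtein and representation_form are hypothetical and exist only for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def representation_form(stem: str, indexed_entities: list) -> str:
    """Hypothetical helper: pick the indexed entity closest to the stem form."""
    if stem in indexed_entities:
        # The stem form itself occurs in the sources, so it represents itself.
        return stem
    return min(indexed_entities, key=lambda entity: levenshtein(stem, entity))


entities = ["two bananas", "three apples"]
print(representation_form("two banana", entities))  # two bananas
```

In this model, “two banana” does not occur in the sources, so the closest indexed entity, “two bananas” (distance 1), stands in for it.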

NLP uses a plug-in architecture for generating stems, enabling the use of third-party stemming tools, if available. NLP uses Hunspell to generate stems. This means that when stemming is required, NLP searches for the Hunspell affix (.aff) file for the specified language in the INSTALLDIR/dev/hunspell directory. If no Hunspell dictionary is provided for the specified language, NLP searches for the %Text subclass for the specified language. InterSystems IRIS® data platform provides %Text datatype classes for five of the NLP-supported languages: %Text.English (en), %Text.French (fr), %Text.German (de), %Text.Portuguese (pt), and %Text.Spanish (es). If neither a Hunspell dictionary file nor a language-specific %Text class is available, NLP uses the %Text.Text class. Refer to the %Text package class documentation in the InterSystems Class Reference for further details.
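The lookup order described above can be summarized in a short sketch. This is a conceptual Python model, not InterSystems code: the function resolve_stemmer is hypothetical, and the set of languages that have a Hunspell affix file is passed in directly, so no assumption is made about how the files in the hunspell directory are named.

```python
# Language-specific %Text classes listed in the paragraph above.
TEXT_CLASSES = {
    "en": "%Text.English",
    "fr": "%Text.French",
    "de": "%Text.German",
    "pt": "%Text.Portuguese",
    "es": "%Text.Spanish",
}


def resolve_stemmer(language: str, hunspell_languages: set) -> str:
    """Hypothetical helper: return the stemmer that would be chosen for a language.

    hunspell_languages is the set of ISO 639-1 codes for which a Hunspell
    affix file is assumed to be present in the hunspell directory.
    """
    if language in hunspell_languages:
        return "Hunspell"                 # 1. Hunspell dictionary, if provided
    if language in TEXT_CLASSES:
        return TEXT_CLASSES[language]     # 2. language-specific %Text class
    return "%Text.Text"                   # 3. generic fallback


shipped = {"ru", "uk"}  # languages with Hunspell files in this example
print(resolve_stemmer("ru", shipped))  # Hunspell
print(resolve_stemmer("fr", shipped))  # %Text.French
print(resolve_stemmer("nl", shipped))  # %Text.Text
```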

Configuring Stemming

Stemming can only be activated or deactivated for an empty domain. Once a domain is populated with source texts, you cannot change the stemming option for that domain, except by first removing all domain contents.

To activate stemming for an empty NLP domain, set the $$$IKPSTEMMING domain parameter to 1. Stemming is inactive by default. To deactivate stemming for an empty NLP domain, set $$$IKPSTEMMING to 0.

The domain must be empty when setting this parameter. When you add source texts to a domain where stemming is activated, NLP creates a stem form for every entity and populates appropriate data structures so that stems can be queried.

Create an instance of the stemmer and specify the default language (or languages, using %iKnow.Stemming.MultiLanguageConfig). Specify languages using their ISO 639-1 two-character abbreviations.

Hunspell

InterSystems IRIS provides Hunspell files for Russian (ru) and Ukrainian (uk) in the dev/hunspell directory. You should place Hunspell dictionary files for other languages in the InterSystems IRIS dev/hunspell directory.

NLP modifies the behavior of Hunspell stemming in one important respect. Hunspell stemming removes prefixes from words. NLP restores prefixes removed by Hunspell before indexing the resulting stem forms.

If Hunspell returns more than one possible stem for a word, NLP will attempt to disambiguate these options using the entity context to determine if the entity is a concept or a relation.

Stem Retrieval Methods

The following %iKnow.Queries.EntityAPI methods return stem values:

You can also return the frequency and spread for a specified stem ID.

Using Stems

Many NLP query methods allow you to query on stems rather than on entities. By setting the method argument pUseStems=1, these query methods return values based on stem form values rather than exact entity values. The default, pUseStems=0, causes these query methods to return values based on exact entity values.

For example, the GetTop() method, by default, returns the most frequently occurring entities in a domain. If you specify pUseStems=1, this method returns the most frequently occurring stem forms in a domain, potentially merging together as a single stem form multiple entities that differ only in grammatical form.
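The effect of merging entities under a shared stem form can be shown with a toy model. The Python sketch below is not the GetTop() implementation and does not call any NLP API; the function get_top, its use_stems argument, and the sample frequencies are all assumptions made up for illustration.

```python
from collections import Counter


def get_top(entity_counts: dict, stem_of: dict, use_stems: bool = False, n: int = 3) -> list:
    """Hypothetical model: rank entities, or stem forms, by frequency."""
    if not use_stems:
        # Exact entity values: each grammatical form is counted separately.
        return Counter(entity_counts).most_common(n)
    merged = Counter()
    for entity, count in entity_counts.items():
        # Entities with no entry in stem_of are treated as their own stem.
        merged[stem_of.get(entity, entity)] += count
    return merged.most_common(n)


counts = {"banana": 2, "bananas": 3, "apple": 4}   # invented frequencies
stems = {"bananas": "banana"}                      # invented stem mapping

print(get_top(counts, stems, use_stems=False, n=1))  # [('apple', 4)]
print(get_top(counts, stems, use_stems=True, n=1))   # [('banana', 5)]
```

With exact entity values, “apple” ranks first. Once “banana” and “bananas” are merged under one stem form, their combined frequency of 5 overtakes it.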

The semantic proximity GetProfile() method also accepts pUseStems=1, in which case it returns the proximity profile of stem forms, rather than entities, in a domain.

If you set pUseStems=1 in a domain that does not support stemming, the method returns no results.

For further details, refer to the NLP Queries chapter of this manual.
