NLP supports stemming, an optional feature when performing NLP indexing of sources. Stemming allows NLP to recognize the stem form of a word that has several grammatical forms. For example, the stem of a noun is typically its nominative singular form. The stem of a verb is typically its infinitive form. Stemming occurs at the entity level. This means that NLP establishes the stem form of an entity, rather than the stem forms of the entity’s individual words. It is an additional level of normalization from the original source text.
Stemming support significantly increases NLP indexing space requirements and impacts NLP indexing performance. It is recommended that you only activate stemming for a domain when there is a real need for this functionality. Stemming is recommended for use with Russian and Ukrainian text sources.
Stemming is not performed on path-relevant entities.
Stemming is not supported for Japanese at this time.
Stemming may result in a stemmed entity that is no longer grammatically valid. For example, the stem form of the entity “two bananas” would be “two banana”. Because the stem form of an entity may not actually exist in the indexed sources, NLP pairs the stem form with a “representation form”, which is the entity closest to the stem form (as measured by the Levenshtein distance) that actually exists as an indexed entity in the sources. If the stem form itself exists as an indexed entity in the sources, the representation form is identical to the stem form.
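The selection of a representation form can be illustrated with a minimal Python sketch. This is not the NLP implementation; the function names levenshtein and representation_form are hypothetical, and the tie-breaking rule (first candidate wins) is an assumption. It only shows how the closest indexed entity to a stem form could be chosen by Levenshtein distance.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def representation_form(stem: str, indexed_entities: Iterable[str]) -> str:
    """Hypothetical helper: return the indexed entity closest to the stem.

    Assumption: on a tie, the first candidate encountered is returned.
    """
    return min(indexed_entities, key=lambda entity: levenshtein(stem, entity))


entities = ["two bananas", "three bananas", "banana split"]
closest = representation_form("two banana", entities)
print(closest)
```

Here the stem form “two banana” does not occur among the indexed entities, so the sketch selects “two bananas”, which is one edit away.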
NLP uses a plug-in architecture for generating stems, which enables the use of third-party stemming tools, if available. NLP uses Hunspell to generate stems: when stemming is required, NLP searches for the Hunspell affix (.aff) file for the specified language in the INSTALLDIR/dev/hunspell directory. If no Hunspell dictionary is provided for the specified language, NLP searches for the %Text subclass for the specified language. InterSystems IRIS® data platform provides %Text datatype classes for five of the NLP supported languages: %Text.English (en), %Text.French (fr), %Text.German (de), %Text.Portuguese (pt), and %Text.Spanish (es). If neither a Hunspell dictionary file nor a language-specific %Text class is available, NLP uses the %Text.Text class. Refer to the %Text package class documentation in the InterSystems Class Reference for further details.
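The lookup order described above can be summarized in a short Python sketch. It models only the decision logic, not the actual NLP code. The function name resolve_stemmer, the returned labels, and the file naming scheme (language code plus .aff, such as ru.aff) are assumptions made for illustration.

```python
from pathlib import Path

# Language-specific %Text classes listed in the documentation.
TEXT_CLASSES = {
    "en": "%Text.English",
    "fr": "%Text.French",
    "de": "%Text.German",
    "pt": "%Text.Portuguese",
    "es": "%Text.Spanish",
}


def resolve_stemmer(language: str, hunspell_dir: Path) -> str:
    """Hypothetical helper: name the stemmer that would be chosen.

    Assumption: the affix file is named <language>.aff inside hunspell_dir.
    """
    # 1. Prefer a Hunspell affix file for the language, if one is present.
    if (hunspell_dir / f"{language}.aff").is_file():
        return "hunspell"
    # 2. Otherwise use the language-specific %Text class, if one exists;
    # 3. fall back to the generic %Text.Text class.
    return TEXT_CLASSES.get(language, "%Text.Text")
```

For example, with only a Russian affix file in the directory, this sketch would return "hunspell" for ru, "%Text.English" for en, and "%Text.Text" for a language such as nl that has neither.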
Stemming can only be activated or deactivated for an empty domain. Once a domain is populated with source texts, you cannot change the stemming option for that domain, except by first removing all domain contents.
To activate stemming for an empty NLP domain, set the $$$IKPSTEMMING domain parameter to 1. Stemming is inactive by default. To deactivate stemming for an empty NLP domain, set $$$IKPSTEMMING to 0.
The domain must be empty when setting this parameter. When you add source texts to a domain where stemming is activated, NLP creates a stem form for every entity and populates appropriate data structures so that stems can be queried.
Create an instance of the stemmer and specify the default language (or languages, using %iKnow.Stemming.MultiLanguageConfig). Specify languages using their ISO 639-1 two-character abbreviations.
InterSystems IRIS provides Hunspell files for Russian (ru) and Ukrainian (uk) in the dev/hunspell directory. To use Hunspell stemming for other languages, place the Hunspell dictionary files for those languages in the InterSystems IRIS dev/hunspell directory.
NLP modifies the behavior of Hunspell stemming in one important respect. Hunspell stemming removes prefixes from words. NLP restores prefixes removed by Hunspell before indexing the resulting stem forms.
If Hunspell returns more than one possible stem for a word, NLP attempts to disambiguate among them by using the entity context to determine whether the entity is a concept or a relation.
Stem Retrieval Methods
The following %iKnow.Queries.EntityAPI methods return stem values:
GetStem() returns the stem string corresponding to a specified entity string.
GetStemId() returns the stem ID for a specified stem string, if the stem string exists within the domain.
GetStemValue() returns the stem string for a specified stem ID.
GetStemRepresentationForm() returns the representation form string for a specified stem ID.
You can also return the frequency and spread for a specified stem ID.
Many NLP query methods allow you to query on stems rather than on entities. When you set the method argument pUseStems=1, these query methods return values based on stem form values rather than exact entity values. The default, pUseStems=0, causes these query methods to return values based on exact entity values.
For example, the GetTop() method, by default, returns the most frequently occurring entities in a domain. If you specify pUseStems=1, this method returns the most frequently occurring stem forms in a domain, potentially merging together as a single stem form multiple entities that differ only in grammatical form.
If you set pUseStems=1 in a domain that does not support stemming, the method returns no results.
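The effect of pUseStems on a frequency query can be modeled with a small Python sketch. This is a conceptual illustration, not the GetTop() implementation; the function top_entries, its parameters, and the sample data are hypothetical.

```python
from collections import Counter


def top_entries(entity_frequencies, stem_of, use_stems=False,
                stemming_enabled=True, count=10):
    """Hypothetical model of a top-frequency query.

    entity_frequencies maps entity string -> frequency.
    stem_of maps entity string -> stem form (entities without an entry
    are treated as their own stem, an assumption of this sketch).
    """
    # Querying on stems in a domain without stemming yields no results.
    if use_stems and not stemming_enabled:
        return []
    totals = Counter()
    for entity, frequency in entity_frequencies.items():
        key = stem_of.get(entity, entity) if use_stems else entity
        totals[key] += frequency  # entities sharing a stem are merged
    return totals.most_common(count)


frequencies = {"banana": 3, "bananas": 4, "apple": 5}
stems = {"bananas": "banana"}
by_entity = top_entries(frequencies, stems)
by_stem = top_entries(frequencies, stems, use_stems=True)
```

In this sample, "apple" ranks first when querying on exact entities, but "banana" ranks first when querying on stems, because the frequencies of "banana" and "bananas" are combined.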
For further details, refer to the NLP Queries chapter of this manual.