Inherited description: The default dictionary for properties of this class. By overriding the
DICTIONARY parameter, you can create separate dictionaries for different kinds
of properties in the same language. For example, email documents, legal briefs, and
medical records might each have a separate dictionary so that term frequency and document
similarity can be estimated appropriately within each domain.
parameter NOISEWORDS100 = the of and a to in is you that it he for was on are as with his they at be this from I have or by one had not but what all were when we there can an your which their said if do will each about how up out them then she many some so these would other into has more her two like him see time could no make than first been its who now my made over did down only way find use may long little very after called just where most know get through back;
Inherited description: NOISEWORDSnnn lists the most common words in the language, in order of their frequency of occurrence.
See http://www.ranks.nl/stopwords/ for a list of commonly used noise words for many different languages.
parameter NOISEWORDS200 = much before go good new write our used me man too any day same right look also around another came come work three word must because does part even place well such here take why things help put years different away again off went old number great tell men say small every found still between name should Mr Mrs home big give set own under read last never us left end along while might next below saw something thought both few those always looked show often together asked don going want people water words air line sound large house;
parameter NOISEWORDS300 = world school important until 1 form food keep children feet land side without boy once animals life enough took sometimes four head above kind began almost live page got earth need far hand high year mother light parts country father let night following 2 picture being study second eyes soon times story boys since white days ever paper hard near sentence better best across during today others however sure means knew its try told young miles sun ways thing whole hear example heard several change answer room against top turned 3 learn point city play toward five using himself usually;
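The NOISEWORDSnnn parameters above are whitespace-separated word lists. As a minimal sketch of how such lists could be used to drop noise words from a token stream, the Python below combines two abbreviated lists into a set and filters against it. The helper names build_noise_set and remove_noise are hypothetical, the lists are shortened for illustration, and the case-insensitive comparison is an assumption; this does not show how the class itself applies the parameters.

```python
# Abbreviated, illustrative copies of the parameter values (assumption:
# the real parameters hold many more words each).
NOISEWORDS100 = "the of and a to in is you that it"
NOISEWORDS200 = "much before go good new write our used me man"


def build_noise_set(*lists):
    """Combine whitespace-separated noise-word lists into one lower-case set."""
    words = set()
    for text in lists:
        words.update(w.lower() for w in text.split())
    return words


def remove_noise(tokens, noise):
    """Return the tokens that are not noise words (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in noise]


noise = build_noise_set(NOISEWORDS100, NOISEWORDS200)
filtered = remove_noise("The patient used a new inhaler".split(), noise)
print(filtered)
```

Storing the words in a set keeps each membership check constant-time regardless of how many NOISEWORDSnnn lists are combined.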
parameter SOURCELANGUAGE = en;
Inherited description: SOURCELANGUAGE specifies the default source language to translate
documents or queries from. This enables documents written and stored in multiple languages to
be queried in a single common language.
cvc(i) is TRUE <=> the letters at i-2, i-1, i have the form consonant - vowel - consonant
and the second consonant is not w, x or y. This is used when trying to
restore an e at the end of a short word, e.g.
cav(e), lov(e), hop(e), crim(e), but not
snow, box, tray.
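The cvc test can be sketched in Python as follows. This is an illustration based on the description above and on the published Porter algorithm, not the class's own implementation. It uses 0-based string indexes, and the helper name is_consonant is an assumption. Following Porter's definition, y counts as a consonant at the start of a word or after a vowel.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense (lower-case input)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of the word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cvc(word, i):
    """True if word[i-2], word[i-1], word[i] is consonant-vowel-consonant
    and the final consonant is not w, x or y."""
    if i < 2:
        return False
    if not (is_consonant(word, i - 2)
            and not is_consonant(word, i - 1)
            and is_consonant(word, i)):
        return False
    return word[i] not in "wxy"


for stem in ("cav", "lov", "hop", "crim", "snow", "box", "tray"):
    print(stem, cvc(stem, len(stem) - 1))
```

The first four stems satisfy the test, so a final e may be restored; snow, box and tray end in w, x and y and are rejected.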
m() measures the number of consonant sequences between positions k0=1 and j.
If c is a consonant sequence and v a vowel sequence, and <..> indicates arbitrary
presence,
<c><v> gives 0
<c>vc<v> gives 1
<c>vcvc<v> gives 2
<c>vcvcvc<v> gives 3
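A minimal Python sketch of the measure follows. It counts each transition from a vowel sequence to a consonant sequence, which is equivalent to counting the vc pairs in the pattern above. The function name measure and the helper is_consonant are assumptions for illustration; the code takes a whole lower-case stem rather than a buffer with k0 and j indexes.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense (lower-case input)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of the word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return the number of vowel-sequence/consonant-sequence (vc) pairs."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1  # a vowel sequence has just been closed by a consonant
        prev_vowel = not cons
    return m


for stem in ("tree", "by", "trouble", "oats", "troubles", "private"):
    print(stem, measure(stem))
```

With this definition, tree and by have measure 0, trouble and oats have measure 1, and troubles and private have measure 2.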
The main part of the stemming algorithm starts here. b is a buffer
holding a word to be stemmed. The letters are in b[k0], b[k0+1] ...
ending at b[k]. k is readjusted downwards as the stemming progresses.
Note that only lower case sequences are stemmed. Forcing to lower case
should be done before stem(...) is called.
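To illustrate why input must be forced to lower case first, and how the end of the word moves downwards as suffixes are removed, here is a sketch of only the first plural-stripping step (step 1a of the published Porter algorithm). The function names step1a and stem_first_step are hypothetical, and the sketch uses string slicing instead of the b, k0 and k buffer variables described above. Words of one or two letters are left unchanged, as in Porter's reference implementation.

```python
def step1a(word):
    """Porter step 1a: strip plural endings. Expects lower-case input."""
    if len(word) <= 2:
        return word           # very short words are not stemmed
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word


def stem_first_step(raw):
    """Force lower case, then apply step 1a only (not the full stemmer)."""
    return step1a(raw.lower())


print(step1a("CATS"))           # upper-case suffix is not recognized
print(stem_first_step("CATS"))  # lower-casing first allows the match
```

Because the suffix comparisons are against lower-case literals, an upper-case word passes through step1a unchanged, which is the behavior the note above warns about.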