datatype class %Text.English extends %Text.Text
ODBC Type: VARCHAR
See %Text.Text
The %Text.English class implements the English language-specific stemming algorithm and initializes the language-specific list of noise words.
parameter DICTIONARY = 2;
Inherited description: The default dictionary for properties of this class. By overriding the DICTIONARY you can create separate dictionaries for different kinds of properties in the same language. For example, email documents, legal briefs, and medical records might each have a separate dictionary so that term frequency and document similarity can be appropriately estimated in each separate domain.
parameter NOISEBIGRAMS100 = thousand dollar,last night,twenti five,half hour,five hundr,hundr fifti,next morn,feet high,never heard,sundai school,hundr dollar,never mind,don want,hundr mile,never seen,hundr feet,human be,pretti soon,few dai,four hundr,those dai,those peopl,never saw,hundr thousand,per cent,human race,young ladi,look upon,hundr yard,half dozen,young fellow,ever seen,young girl,yes sir,four hour,twenti four,sever time,ten thousand,ever sinc,don care,five minute,fell upon,don think,ten dai,thousand feet,sure enough,six hundr,ever saw,thirti five,ten minute,should think,didn want,col seller,four five,five thousand,ask question,let alone,thousand mile,five mile,ever mark,whole thing,pilot hous,five six,everi night,differ between,hundr ago,half past,both side,yrs ever,middl ag,ever heard,next letter,don mind,noth els,few minute,without doubt,scienc health,don mean,fifteen minute,anybodi els,week ago,women children,dear sir,anyth els,shall never,left hand,everi thing,sai don,never got,human nature,half mile,don believ,centuri ago,never thought,last year,sort thing,six month,poor thing,next moment;
parameter NOISEBIGRAMS200 = poor fellow,five dollar,sai myself,feet above,worth while,sincere your,four dai,month ago,thou art,mother church,gener grant,letter written,fifti mile,keep still,wait till,someth els,low voic,seven hundr,run across,never anyth,ladi gentlemen,everi year,dai ago,ain got,ain go,ten mile,six feet,hour half,fifti dollar,eight hundr,don don,shook head,own hand,onc twice,never never,mont blanc,feet deep,without know,side side,sever dai,last moment,hour ago,think think,feet wide,don ever,depend upon,twenti minute,thou shalt,thing done,talk talk,rest upon,mile below,left behind,god bless,five feet,face face,six seven,four thousand,five cent,dai later,thousand time,quarter mile,hand upon,found himself,boi girl,read book,quarri farm,last week,gener thing,eye upon,clock morn,noth left,father peter,year year,ten twelv,nobodi ever,hour hour,haven got,four time,fifteen hundr,don rememb,didn anyth,stood still,somebodi els,poor creature,hundr time,forti five,young peopl,yes yes,whole world,twenti seven;
parameter NOISEBIGRAMS300 = four feet,upon head,everybodi els,etc etc,done done,don anyth,thou hast,thing ever,six thousand,set forth,odd end,month later,hundr twenti,hour later,fifti thousand,didn seem,care noth,yet never,till got,ten dollar,own self,never let,minute later,fifti ago,far wide,everi bodi,confer upon,call mind;
parameter NOISEWORDS100 = the of and a to in is you that it he for was on are as with his they at be this from I have or by one had not but what all were when we there can an your which their said if do will each about how up out them then she many some so these would other into has more her two like him see time could no make than first been its who now my made over did down only way find use may long little very after called just where most know get through back;
Inherited description: NOISEWORDSnnn lists the most common words in the language, in order of their frequency of occurrence. See http://www.ranks.nl/stopwords/ for a list of commonly used noise words for many different languages.
parameter NOISEWORDS200 = much before go good new write our used me man too any day same right look also around another came come work three word must because does part even place well such here take why things help put years different away again off went old number great tell men say small every found still between name should Mr Mrs home big give set own under read last never us left end along while might next below saw something thought both few those always looked show often together asked don going want people water words air line sound large house;
parameter NOISEWORDS300 = world school important until 1 form food keep children feet land side without boy once animals life enough took sometimes four head above kind began almost live page got earth need far hand high year mother light parts country father let night following 2 picture being study second eyes soon times story boys since white days ever paper hard near sentence better best across during today others however sure means knew its try told young miles sun ways thing whole hear example heard several change answer room against top turned 3 learn point city play toward five using himself usually;
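As an illustration of how a space-separated noise word list like the NOISEWORDSnnn parameters might be applied, here is a minimal Python sketch. It is not the class's own implementation: the names NOISE_WORDS_SAMPLE and remove_noise_words are hypothetical, the sample is only the first part of NOISEWORDS100, and whitespace tokenization with case-insensitive matching is an assumption.

```python
# Hypothetical sample: the first entries of NOISEWORDS100, copied from the parameter above.
NOISE_WORDS_SAMPLE = (
    "the of and a to in is you that it he for was on are as with his "
    "they at be this from I have or by one had not"
).split()


def remove_noise_words(text: str, noise_words: list[str]) -> list[str]:
    """Return the lower-cased tokens of text that are not in noise_words.

    Assumption: tokens are separated by whitespace and matching ignores case.
    """
    noise = {word.lower() for word in noise_words}
    return [token for token in text.lower().split() if token not in noise]


if __name__ == "__main__":
    print(remove_noise_words("The cat is on the mat", NOISE_WORDS_SAMPLE))
```

Holding the list in a set keeps each membership check at constant time, which matters when the same list is applied to many documents.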
parameter SOURCELANGUAGE = en;
Inherited description: SOURCELANGUAGE specifies the default source language to translate documents or queries from. This enables documents written and stored in multiple languages to be queried in a single common language.
Returns TRUE if the character is a consonant; otherwise returns FALSE.
cvc(i) is TRUE <=> the characters at i-2, i-1, i have the form consonant - vowel - consonant and the second consonant is not w, x or y. This is used when trying to restore an e at the end of a short word, e.g. cav(e), lov(e), hop(e), crim(e), but snow, box, tray.
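The consonant test and the cvc test can be sketched in Python as follows. This follows the Porter reference implementation in C rather than this class's source, so treat it as an approximation: the function names are hypothetical, indexes are 0-based, and the word is assumed to be lower case already. In the reference implementation, y counts as a consonant at the start of a word or after a vowel, and as a vowel after a consonant.

```python
def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] is a consonant in the Porter sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of the word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cvc(word: str, i: int) -> bool:
    """Return True if word[i-2 : i+1] is consonant-vowel-consonant
    and the final consonant is not w, x or y."""
    if i < 2:
        return False
    if not (
        is_consonant(word, i - 2)
        and not is_consonant(word, i - 1)
        and is_consonant(word, i)
    ):
        return False
    return word[i] not in "wxy"


if __name__ == "__main__":
    for sample in ("cav", "lov", "hop", "crim", "snow", "box", "tray"):
        print(sample, cvc(sample, len(sample) - 1))
```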
m() measures the number of consonant sequences between positions k0=1 and j. If c is a consonant sequence and v a vowel sequence, and <..> indicates arbitrary presence,
<c><v> gives 0
<c>vc<v> gives 1
<c>vcvc<v> gives 2
<c>vcvcvc<v> gives 3
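The measure can be computed by counting each point where a vowel sequence is followed by a consonant sequence. The Python sketch below is one way to do that. It is an assumption-laden illustration, not the class's code: measure and is_consonant are hypothetical names, and the function takes the whole stem as a string instead of working on a buffer between k0 and j.

```python
def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] is a consonant in the Porter sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of the word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the vowel-to-consonant transitions in stem."""
    count = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            count += 1
        previous_was_vowel = not consonant
    return count


if __name__ == "__main__":
    for sample in ("tree", "by", "trouble", "oats", "troubles", "private"):
        print(sample, measure(sample))
```

With this sketch, "tree" and "by" give 0, "trouble" and "oats" give 1, and "troubles" and "private" give 2.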
The main part of the stemming algorithm starts here. b is a buffer holding a word to be stemmed. The letters are in b[k0], b[k0+1] ... ending at b[k]. k is readjusted downwards as the stemming progresses. Note that only lower case sequences are stemmed. Forcing to lower case should be done before stem(...) is called. See: http://www.tartarus.org/~martin/PorterStemmer/c.txt
Gets rid of plurals and -ed or -ing.
Turns terminal y to i when there is another vowel in the stem.
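The terminal y rule is small enough to sketch in full. The Python version below follows the Porter reference implementation, not this class's source. The name step1c is borrowed from that reference, is_consonant is a hypothetical helper, and the input is assumed to be lower case. Stemmed forms such as "dai" and "twenti" in the NOISEBIGRAMS parameters are consistent with this rule.

```python
def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] is a consonant in the Porter sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of the word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def step1c(word: str) -> str:
    """Turn a terminal y into i when the part before it contains a vowel."""
    if not word.endswith("y"):
        return word
    stem_length = len(word) - 1
    has_vowel = any(not is_consonant(word, i) for i in range(stem_length))
    return word[:-1] + "i" if has_vowel else word


if __name__ == "__main__":
    for sample in ("happy", "day", "twenty", "sky", "by"):
        print(sample, "->", step1c(sample))
```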
Maps double suffixes to single ones, so -ization ( = -ize plus -ation) maps to -ize, etc. Note that the string before the suffix must give m() > 0.
Handles -ic-, -full, -ness etc., using a strategy similar to that of step2.
Takes off -ant, -ence etc., in context <c>vcvc<v> (that is, when the remaining stem gives m() > 1).
Removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
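A Python sketch of this final step, modeled on the Porter reference implementation in C linked above, might look like this. It is an approximation and may differ from this class's code: all function names are hypothetical, and the input is assumed to be lower case. Note one difference from the one-line description: the reference implementation also removes the final -e when m() is exactly 1 and the stem does not end in a cvc pattern, and the sketch includes that clause.

```python
def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] is a consonant in the Porter sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the vowel-to-consonant transitions in stem."""
    count, previous_was_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            count += 1
        previous_was_vowel = not consonant
    return count


def ends_cvc(stem: str) -> bool:
    """True if stem ends consonant-vowel-consonant, last letter not w, x or y."""
    i = len(stem) - 1
    return (
        i >= 2
        and is_consonant(stem, i - 2)
        and not is_consonant(stem, i - 1)
        and is_consonant(stem, i)
        and stem[i] not in "wxy"
    )


def step5(word: str) -> str:
    """Remove a final -e and reduce a final -ll, subject to the measure."""
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        # Extra clause from the reference implementation: m == 1 and no cvc ending.
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            word = stem
    if word.endswith("ll") and measure(word) > 1:
        word = word[:-1]
    return word


if __name__ == "__main__":
    for sample in ("probate", "rate", "cease", "controll", "roll"):
        print(sample, "->", step5(sample))
```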