The %Text.Japanese class implements (or calls) the Japanese language-specific stemming algorithm
and initializes the language-specific list of noise words.
Inherited description: CASEINSENSITIVE=1 causes comparisons to be performed by %CONTAINS
in a case-insensitive manner when the collation of the underlying property is case
insensitive. Setting CASEINSENSITIVE=1 improves
matching and typically reduces both the size of the index and index update time.
Note that CASEINSENSITIVE is not applicable to the %CONTAINSTERM operator,
since %CONTAINSTERM always compares terms using the collation of the specified property.
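The effect of case folding on index size can be illustrated with a small Python model. This is a sketch of the idea only, not Caché code: build_index is a hypothetical helper, and splitting on whitespace is an assumption made for brevity.

```python
def build_index(docs, case_insensitive):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            # Fold case only when the (hypothetical) flag is set.
            term = word.lower() if case_insensitive else word
            index.setdefault(term, set()).add(doc_id)
    return index


docs = {1: "Tokyo tower", 2: "tokyo TOWER", 3: "Kyoto"}
exact = build_index(docs, case_insensitive=False)
folded = build_index(docs, case_insensitive=True)

print(len(exact), len(folded))   # 5 distinct terms versus 3
print(sorted(folded["tokyo"]))   # documents 1 and 2 share one entry
```

In this model the folded index holds fewer distinct terms, and documents that differ only in case land under the same entry.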
parameter DICTIONARY = 6;
Inherited description: The default dictionary for properties of this class. By overriding the
DICTIONARY you can create separate dictionaries for different kinds
of properties in the same language. For example, email documents, legal briefs, and
medical records might each have a separate dictionary so that term frequency and document
similarity can be appropriately estimated in each separate domain.
parameter FILTERNOISEWORDS = 0;
Inherited description: FILTERNOISEWORDS controls whether common-word filtering is enabled.
Specifying a list of noise words can greatly reduce the size of a text index and the associated
index update time; however, to perform text search it is necessary to also remove noise words
from the search pattern, and this can produce some counter-intuitive results. See example below.
Setting up noise word filtering is
a two-step process: First enable noise word filtering by setting FILTERNOISEWORDS=1. Second,
populate the noise word dictionary by calling the ExcludeCommonTerms() method
with the desired number of noise words to populate the corresponding DICTIONARY. ExcludeCommonTerms()
purges the previous set of noise words, so it may be called any number of times, but it is necessary
to rebuild all text indexes on the corresponding properties whenever the list of noise words is changed.
Note: The SQL predicate:
SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('to be or not to be')
will not find any qualifying rows if 'to', 'be', 'or', and 'not' are all noise words; however, if any
of these terms is not a noise word, then only the non-noise words will participate in the matching
process.
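The behavior described in the note can be modeled with a short Python sketch. It is not the %CONTAINS implementation: filter_noise and contains are hypothetical names, and the rule that every remaining search term must appear in the document is an assumption of this model.

```python
def filter_noise(terms, noise_words):
    """Drop every term that appears in the noise word list."""
    return [t for t in terms if t not in noise_words]


def contains(doc_text, pattern, noise_words):
    """Model of a search in which noise words are removed from both sides."""
    doc_terms = set(filter_noise(doc_text.lower().split(), noise_words))
    query_terms = filter_noise(pattern.lower().split(), noise_words)
    if not query_terms:
        # Every search term was a noise word, so nothing can qualify.
        return False
    # Assumption: all remaining (non-noise) terms must be present.
    return all(t in doc_terms for t in query_terms)


doc = "to be or not to be that is the question"
all_noise = {"to", "be", "or", "not"}
some_noise = {"to", "be", "or"}

print(contains(doc, "to be or not to be", all_noise))   # False
print(contains(doc, "to be or not to be", some_noise))  # True, matched on "not" alone
```

With the second noise list, a document containing only "to be" would no longer match, because "not" is the single term left in the search pattern.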
parameter MINWORDLEN = 1;
Inherited description: MINWORDLEN specifies the minimum length of a word that will be retained,
excluding ngram words and post-stemmed words.
MINWORDLEN provides a simple means of excluding terms based on their
length, since it is usually the case that short words such as 'a', 'to', 'an', etc., are
connectives that contain little information content. The length refers to the number of
characters in the original document. Note that if stemming or thesaurus translation is
enabled, then a term in a text index may have fewer than MINWORDLEN
characters.
Note: MINWORDLEN should typically be set to 3 or less when STEMMING=1,
since otherwise a word stem could be classified as a noise word even though alternate forms of the
word would not be classified as a noise word. For example, with MINWORDLEN=5 "jump" would be discarded
as a noise word, whereas "jumps" would not.
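The interaction between MINWORDLEN and stemming can be sketched in Python. This is a simplified model under stated assumptions: toy_stem is a hypothetical stand-in that only strips a trailing "s", and the length check is applied to the original word before stemming, as the description above states.

```python
def toy_stem(word):
    """Hypothetical stemmer: strips a single trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


def index_terms(words, min_word_len, stemming=True):
    """Filter on the original word length, then stem what is kept."""
    kept = [w for w in words if len(w) >= min_word_len]
    return [toy_stem(w) if stemming else w for w in kept]


print(index_terms(["jump", "jumps"], min_word_len=5))  # ['jump']
print(index_terms(["jump", "jumps"], min_word_len=3))  # ['jump', 'jump']
```

With a minimum length of 5, only "jumps" survives the filter, yet the term stored for it is the four-character stem "jump". With a minimum length of 3, both forms are kept and reduce to the same term.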
parameter NGRAMLEN = 2;
Inherited description: NGRAMLEN is the maximum number of words that will be regarded as a single
search term. When NGRAMLEN=2, two-word combinations will be added to any
index, in addition to single words. Noise words are not counted when determining which words are consecutive.
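A minimal Python sketch of n-gram term generation follows. The function name ngram_terms is hypothetical, and dropping noise words before pairing is one reading of the description above rather than confirmed behavior.

```python
def ngram_terms(words, ngram_len, noise_words=frozenset()):
    """Return single words plus word combinations up to ngram_len words long."""
    # Assumption: noise words are removed first, so the words on either
    # side of a noise word count as consecutive.
    kept = [w for w in words if w not in noise_words]
    terms = []
    for n in range(1, ngram_len + 1):
        for i in range(len(kept) - n + 1):
            terms.append(" ".join(kept[i:i + n]))
    return terms


print(ngram_terms(["king", "of", "spain"], 2, {"of"}))
# ['king', 'spain', 'king spain']
```

Setting ngram_len to 1 in this model yields single words only.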
parameter NUMCHARS;
Inherited description: NUMCHARS specifies the characters other than digits that may appear
in a number. Note that if "," is included in NUMCHARS, then "1,000" will be considered a
single number, but the comma will be removed so that "1,000" will match "1000" using the
%CONTAINS SQL predicate. The characters "." and "-" are also special and mark the beginning of
a numeric term when the next character is numeric, regardless of how NUMCHARS is defined.
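The following Python sketch models the NUMCHARS rules described above. It is an illustration under assumptions, not the class's parser: numeric_terms is a hypothetical name, only the comma is stripped from the stored term, and trailing NUMCHARS characters are treated as punctuation.

```python
DIGITS = "0123456789"


def numeric_terms(text, num_chars=","):
    """Extract numeric terms, allowing num_chars inside a number."""
    terms = []
    i = 0
    while i < len(text):
        ch = text[i]
        next_is_digit = i + 1 < len(text) and text[i + 1] in DIGITS
        # "." and "-" start a number when followed by a digit,
        # regardless of how num_chars is defined.
        if ch in DIGITS or (ch in ".-" and next_is_digit):
            j = i + 1
            while j < len(text) and (text[j] in DIGITS or text[j] in num_chars):
                j += 1
            token = text[i:j].rstrip(num_chars)   # drop trailing punctuation
            terms.append(token.replace(",", ""))  # "1,000" is stored as "1000"
            i = j
        else:
            i += 1
    return terms


print(numeric_terms("price 1,000 yen", ","))  # ['1000']
print(numeric_terms("price 1,000 yen", ""))   # ['1', '000']
print(numeric_terms("-5 and .75", ","))       # ['-5', '.75']
```

When the comma is not part of num_chars, the model splits "1,000" into two separate numbers, so a search for "1000" would no longer match it.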
parameter SEPARATEWORDS = 1;
Inherited description: Languages such as Japanese require the raw document text to be parsed and
separated into words before being processed by the class methods.
If SEPARATEWORDS=1, the SeparateWords() class method is called.
parameter SOURCELANGUAGE = ja;
Inherited description: SOURCELANGUAGE specifies the default source language to translate
documents or queries from. This enables documents written and stored in multiple languages to
be queried in a single common language.
parameter STEMMING = 0;
Inherited description: STEMMING replaces each word by its language-specific stem to improve the
matching quality. Note that stemmed words are modified, and may or may
not correspond to real words in the language. If stemming is enabled, then
search patterns must also be stemmed prior to searching.
Note: Stemming of search
strings is performed automatically by the %CONTAINS Caché SQL predicate if stemming is enabled
on the corresponding property; however, stemming is not automatically performed by
the more primitive FOR SOME %ELEMENT SQL predicate.
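Why the search pattern must also be stemmed can be shown with a small Python model. Everything here is hypothetical: toy_stem is a crude suffix stripper standing in for a real language-specific stemmer, and search models a predicate that either does or does not stem its pattern.

```python
def toy_stem(word):
    """Hypothetical stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_index(text):
    """Index the stemmed form of every word in the document."""
    return {toy_stem(w) for w in text.lower().split()}


def search(index, pattern, stem_query):
    """Look up each pattern term, optionally stemming it first."""
    terms = pattern.lower().split()
    if stem_query:
        terms = [toy_stem(t) for t in terms]
    return all(t in index for t in terms)


index = build_index("the dog jumped over fences")
print(search(index, "jumped", stem_query=False))  # False: index holds "jump"
print(search(index, "jumping", stem_query=True))  # True: both reduce to "jump"
```

The unstemmed lookup fails even though the word appears verbatim in the document, because the index stores only the stem.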
Inherited description: Classifies the nTerms most common words in the current language as noise words. The words specified
in NOISEWORDS100, NOISEWORDS200, and NOISEWORDS300
are the 300 most common words of the current language, in order of their frequency. Similarly, NOISEBIGRAMSn00 lists
the 300 most common bigrams of the current language that would not typically be considered useful for searching.
classmethod SeparateWords(rawText As %String) as %String
Inherited description: Separates individual terms with whitespace, for languages such as Japanese.