%Text.Text
datatype class %Text.Text extends %Library.String
ODBC Type: VARCHAR
The %Text.Text data type class implements the methods used by Caché for full text indexing, text search, similarity scoring, automatic classification, dictionary management, word stemming, n-gram key creation, and noise word filtering.

Usage
Creating a Text Property and a Full-Text Index
To create a %Text property and an index that supports Boolean queries, declare the property using the %Library.Text class and create a full-text index on the property, specifying (KEYS) in the ON clause, as shown below:
PROPERTY myDocument As %Text (MAXLEN = 256, LANGUAGECLASS = "%Text.English");
INDEX myIndex ON myDocument(KEYS) [ TYPE=BITMAP ];
Several class parameters control how terms are generated and indexed:

- MINWORDLEN discards all terms containing fewer than this number of characters
- FILTERNOISEWORDS=1 enables common-word filtering, in combination with calling the ExcludeCommonTerms() class method. Calling ExcludeCommonTerms() with an argument of 175 causes the 175 most common words and two-word combinations to be ignored, resulting in a very substantial reduction of index size (also see Dictionary Management, below)
- STEMMING conflates multiple forms of a word to a common "stem". For example, in English, the common word endings -s, -ing, -ed, and so on may be removed so that the various word forms can all match against each other. A declaration combining these parameters is sketched after this list.
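For example, a property declaration combining these parameters might look like the following sketch (the parameter values shown are illustrative, not recommendations):

PROPERTY myDocument As %Text (MAXLEN = 256, LANGUAGECLASS = "%Text.English", MINWORDLEN = 3, FILTERNOISEWORDS = 1, STEMMING = 1);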
The %CONTAINS Operator
With the declarations above, the following SQL query could be issued to find all documents containing both the terms "Intersystems" and "Ensemble":

SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('Intersystems', 'Ensemble')
Contrast this with the equivalent query written with the SQL [ (Contains) operator, which performs a simple substring match and cannot use the full-text index:

SELECT myDocument FROM table t WHERE myDocument [ 'Intersystems' AND myDocument [ 'Ensemble'
The %CONTAINS operator may also be used to search for multi-word phrases, such as in the following query:
SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('New Guinea') OR myDocument %CONTAINS ('West Africa')
The next query illustrates the use of the STEMMING parameter. The language-specific subclasses of the %Text.Text class each strip off common word endings to put each term into a standard form, so with STEMMING=1 the following query also matches documents containing "jump", "jumps", or "jumped":
SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('jumping')
Stemming is applied to each word of a multi-word phrase as well:

SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('jumping through hoops')
Additional flexibility beyond what is available from the %CONTAINS operator can be obtained by using the FOR SOME %ELEMENT predicate. For example, wildcarding can be specified if STEMMING=0 and can optionally be combined with other WHERE clause predicates as follows:
SELECT myDocument FROM table t WHERE FOR SOME %ELEMENT(myDocument) (%KEY LIKE 'myo%opy') AND myDocument %CONTAINS ('heart')
The %SIMILARITY Operator
Many text-search applications require the ability to rank the results of a Boolean query by their relevance to a set of related terms. Caché supports this capability with the %SIMILARITY SQL extension. The following example finds all documents containing the terms 'Intersystems' and 'Ensemble', and then ranks them in descending order of their similarity to any or all of the terms 'Intersystems Ensemble Queue Messaging':

SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('Intersystems', 'Ensemble') ORDER BY %SIMILARITY (myDocument, 'Intersystems Ensemble Queue Messaging') DESC
Caché uses a state-of-the-art similarity algorithm based on the Okapi BM25 term weighting strategy and the cosine similarity metric. If desired, you can adjust the Okapi BM25 model parameters OKAPIBM25B, OKAPIBM25K1, and OKAPIBM25K3 to fine-tune the ranking algorithm when a mixture of large and small documents needs to be ranked. Alternatively, you may override the default similarity algorithms with your own algorithms and/or special index structures.
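For example, to use the shorter-text setting noted in the weight definitions later in this document (b = .25), one approach is to subclass the language class and override the parameter. This is a sketch; the subclass name is illustrative:

Class MyApp.ShortText Extends %Text.English
{
Parameter OKAPIBM25B = 0.25;
}

The property would then name this subclass in its LANGUAGECLASS parameter.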
The second operand to %SIMILARITY may be any text-valued expression, so to find documents that contain both the terms "Intersystems" and "Ensemble", but to rank the documents based on references to "integration", "platform", or "integration platform", the following query could be used:
SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('Intersystems', 'Ensemble') ORDER BY %SIMILARITY (myDocument, 'Integration platform') DESC
Dictionary Management
Just as the %CONTAINS operator may be used without an index or without %SIMILARITY ranking, %SIMILARITY ranking can be used without dictionary support; however, a critically important aspect of similarity ranking is the ability to assess the information content of different words. For example, the word "the" has low utility as a search term, whereas the word "London" is much more specific and useful as a search term.

To reduce the size of the index, and to enable the similarity algorithm to more easily ignore words with low information content, you will usually want to call the ExcludeCommonTerms() class method to specify noise words for the current dictionary. By calling ExcludeCommonTerms() with argument n and setting the class parameter FILTERNOISEWORDS=1, the n most common words and 2-word combinations in the current language will be ignored. For English text, the most common 100 words represent about 50% of all word occurrences.
Each language-specific subclass of the %Text.Text class is associated with a particular DICTIONARY identifier, so by default English words go into a different dictionary than French words, and so on; however, you can also create multiple dictionaries for each language. For example, it may be useful to have a different dictionary for email than for legal briefs, because words that are common in one domain may be uncommon and useful in another.
To collect statistics about the frequency of different terms, call the AddDocToDictionary() class method. Since words that were rare yesterday are likely to be rare tomorrow (except in special applications such as news feeds), the dictionary can be populated initially and then updated as an infrequent database maintenance operation (rebuilding the dictionary on a monthly or quarterly schedule, for example). The following loop drops the current dictionary, then repopulates it:
do ##class(%Text.English).DropDictionary()
do ##class(%Text.English).ExcludeCommonTerms(175)
&sql(DECLARE C CURSOR FOR SELECT myDocument, category INTO :myDoc, :category FROM myTable T)
&sql(OPEN C)
quit:SQLCODE<0 SQLCODE
for {
    &sql(FETCH C)
    quit:SQLCODE=100
    do ##class(%Text.English).AddDocToDictionary(myDoc, category)
}
&sql(CLOSE C)
You can find relevant documents more easily by specifying a dictionary-specific thesaurus. If the class parameter THESAURUS=1, then terms in each document and in each %CONTAINS predicate are replaced by the standard term in the thesaurus. The API for adding or removing a term from the English language thesaurus is:
do ##class(%Text.English).AddToThesaurus(term, standardTerm)
do ##class(%Text.English).RemoveFromThesaurus(term)
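For example, to have occurrences of "colour" treated as the standardized term "color" (illustrative terms):

do ##class(%Text.English).AddToThesaurus("colour", "color")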
A thesaurus may also be loaded in bulk from a file by using the LoadThesaurus() method:

do ##class(%Text.English).LoadThesaurus("EnglishThesaurus.txt")
Automatic Classification
The example above not only repopulates the English dictionary, it also associates a category with each document. For example, if myDocument is an email, then category might be "junk" or "normal", or if myDocument is a problem report, then category might be the name of the person who resolved the problem. Classifying documents in this fashion makes it possible to automatically classify new and unseen documents into one of the known categories, based on the similarity of the previously unseen document to the documents in each category. The Classify() class method computes the probability that a given document belongs to each of the known categories, and returns a $list of the n most likely categories, in decreasing order of probability.

A more whimsical (but hopefully interesting) example that illustrates the potential power of automatic classification would be to evaluate the true authorship of a document. A few literary scholars have speculated that some of the famous later works attributed to William Shakespeare were actually authored by Christopher Marlowe. Marlowe and Shakespeare attended the same school, and probably knew each other in England before Marlowe was forced to flee in secrecy and live in hiding in Italy. The theory is that Marlowe continued to publish his works in England through Shakespeare. If the theory is true, then The Merchant of Venice is among the works most likely to have been written by Marlowe, since Marlowe lived in Italy and Shakespeare is not known to have ever visited Italy. This question could be researched by calling AddDocToDictionary() to attribute each passage of each work attributed to Marlowe to the category "Marlowe", and each passage of each early work attributed to Shakespeare (up to the time of Marlowe's departure to Italy) to the category "Shakespeare". The Classify() class method could then directly estimate whether each passage of The Merchant of Venice is more similar to the early works attributed to Shakespeare than to the works attributed to Marlowe.
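A minimal sketch of classifying a previously unseen document follows. The exact signature of Classify() is an assumption here (the document text plus the number of categories to return); consult the method's own documentation for the actual arguments:

// Assumption: Classify(document, n) returns a $list of the n most likely categories
set categories = ##class(%Text.English).Classify(newDoc, 3)
write $listget(categories, 1) // most probable category first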
Method Inventory
- AddDocToDictionary()
- AddToDictionary()
- AddToThesaurus()
- BuildValueArray()
- ChooseSearchKey()
- Classify()
- CreateQList()
- DecompressOffsets()
- DropDictionary()
- EndOfWord()
- ExcludeCommonTerms()
- LoadThesaurus()
- MakeSearchTerms()
- RemoveDocFromDictionary()
- RemoveFromThesaurus()
- SeparateWords()
- Similarity()
- SimilarityIdx()
- Standardize()
- Translate()
- ends()
- setto()
- stemWord()
Parameters
Setting up noise word filtering is a two-step process. First, enable noise word filtering by setting FILTERNOISEWORDS=1. Second, populate the noise word dictionary by calling ExcludeCommonTerms() with the desired number of noise words to populate the corresponding DICTIONARY. ExcludeCommonTerms() purges the previous set of noise words, so it may be called any number of times, but it is necessary to rebuild all text indexes on the corresponding properties whenever the list of noise words is changed.
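A sketch of the two steps, using the parameter and method names from this class (the count 175 is illustrative, matching the dictionary example above):

PROPERTY myDocument As %Text (LANGUAGECLASS = "%Text.English", FILTERNOISEWORDS = 1);

do ##class(%Text.English).ExcludeCommonTerms(175)

After this call, rebuild any full-text indexes on properties that use this dictionary.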
Note: In the SQL predicate below, every word of the search string is a common English word. If FILTERNOISEWORDS=1 and these words have been excluded by ExcludeCommonTerms(), they are filtered out of both the index and the search string, so the query cannot match documents on these terms as intended:
SELECT myDocument FROM table t WHERE myDocument %CONTAINS ('to be or not to be')
The first ..#MAXOCCURS-1 positions, the last position, and the total count of occurrences are returned in the %value portion of the valueArray in the format: count ^ pos1 ^ deltaPos2 ^ deltaPos3 ... ^ deltaPosN-1 ^ posN, where the separator "^" is defined as the "metachar" and may be redefined if necessary. The "deltaPos" values are delta-compressed positions, while the first and last positions are simple character offsets into the document. The second position can be recovered by summing pos1+deltaPos2, the third by summing pos1+deltaPos2+deltaPos3, and so on.
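For example, assuming MAXOCCURS is large enough, a term occurring at character offsets 5, 12, and 20 would be encoded as 3^5^7^20: the count 3, the first offset 5, the delta 12-5=7, and the final absolute offset 20.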
Note: MINWORDLEN should typically be set to 3 or less when STEMMING=1, since otherwise a word stem could be classified as a noise word even though alternate forms of the word would not be classified as a noise word. For example, with MINWORDLEN=5 "jump" would be discarded as a noise word, whereas "jumps" would not.
Note: Stemming of search strings is performed automatically by the %CONTAINS Caché SQL predicate if stemming is enabled on the corresponding property; however, stemming is not automatically performed by the more primitive FOR SOME %ELEMENT SQL predicate, so a search key compared against %KEY must already be in stemmed form. A sketch follows.
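A sketch of preparing a stemmed key for a FOR SOME %ELEMENT query. The exact signature of stemWord() is an assumption here (a single word in, its stem out):

set stem = ##class(%Text.English).stemWord("jumping") // assumed to return the stem, e.g. "jump"
&sql(SELECT COUNT(*) INTO :n FROM myTable WHERE FOR SOME %ELEMENT(myDocument) (%KEY = :stem))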
Methods
The statistics include the term count in $p(statistics,"#",1), and optionally include the character positions where the term appears in the document in subsequent #-delimited pieces, where "#" is a non-word meta-character that may be redefined by an application if necessary.
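For example, a statistics value of 3#5#12#20 would indicate a term occurring 3 times, at character positions 5, 12, and 20 (an illustrative value; whether positions are included depends on the indexing options).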
Three special values are also returned in the valueArray:
- valueArray("#doclen") holds the number of non-noise terms in the document
- valueArray("#norm") holds a statistic needed by the cosine metric (see SimilarityIdx())
- valueArray itself (the unsubscripted node) holds the number of distinct terms in the document
For background information (not used in this implementation), see www-2.cs.cmu.edu/~mccallum/bow/. Also see Dr. Dobb's Journal, May 2005.
A basic explanation of Bayes' Rule is as follows:
Naive Bayes assumes a particular generative model for text documents. Assumptions built into the model are that (a) the data are produced by a mixture model, (b) there is a one-to-one correspondence between mixture components and classes, (c) the probability that any given word appears in a document is conditionally independent of the probability of appearance of any other word, and (d) the probability that document Di is associated with class Cj is independent of the length of the document.
Under these assumptions, the probability that document Di could be generated by parameters T is given by:

p(Di | T) = sum( p(Cj | T) * p(Di | Cj; T), j=1:|C| )

and

p(Di | Cj; T) = p(|Di|) * product( p(word(Di,k) | Cj; T), k=1:|Di| )
Thus the parameters of an individual mixture component are a multinomial distribution over words, i.e. the collection of word probabilities. Since the model assumes that document length is identically distributed for all classes, it does not need to be parameterized to classify a document.
Learning a Naive Bayes classifier consists of estimating the parameters of the generative model by using a set of pre-classified training samples. The goal of the training procedure is to determine the parameters T that maximize p(T | class(Di) = Cj, i=1:|D|, j=1:|C|).
p(Category|Document) = ( p(Document|Category) * p(Category) ) / p(Document)

exp(metric) = p(Document|Category) * p(Category)
            = Product(p(Word|Category)) * p(Category)
            = Product(count(word,doc)/count(word,corpus)) * (nWordsInCategory / nWordsInDictionary)
The resulting p(Document|Category) * p(Category) can then be compared across all categories to identify the category with maximum score, and hence the maximum p(Category|Document). This is the predicted category.
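As an illustrative numeric example (hypothetical counts): with two categories A and B, p(A) = p(B) = 0.5, and a two-word document where p(w1|A) = 0.02, p(w2|A) = 0.01, p(w1|B) = 0.001, and p(w2|B) = 0.03, the scores are 0.02 * 0.01 * 0.5 = 0.0001 for A and 0.001 * 0.03 * 0.5 = 0.000015 for B, so the document is assigned to category A.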
Note that the use of ..#NGRAMLEN>1 invalidates the mathematical justification for using Bayesian probabilities; however, biasing the probability score in favor of documents that match multi-word combinations is justifiable because it partially addresses the absence of joint probability information, which is the main deficiency of the naive Bayesian algorithm. Therefore, when ..#NGRAMLEN>1, we call this a "semi-naive" Bayesian classifier.
This feature is not available prior to Caché 2007.1
An index that supports both Boolean queries (the %CONTAINS operator) and ranking queries (the %SIMILARITY operator) may be created by removing TYPE=BITMAP and by specifying "[ DATA = myDocument(ELEMENTS) ]". If such an index is created, then also specify the name of the index in the SIMILARITYINDEX parameter of the corresponding property as follows:
PROPERTY myDocument As %Text (MAXLEN = 256, LANGUAGECLASS = "%Text.English", SIMILARITYINDEX = "bigIndex");
INDEX bigIndex ON myDocument(KEYS) [ DATA=myDocument(ELEMENTS) ];
This method computes a score that relates the similarity of a query document to a reference document. Many similarity heuristics have been proposed, and have been shown to be effective on real data sets. A variation of one effective and commonly used statistic is the cosine measure:
C(q,d) = SUM( w(q,t) * w(d,t) ) / ( SQRT(SUM(w(d,t)^2)) * SQRT(SUM(w(q,t)^2)) )

where the numerator sums over the terms t appearing in both q and d, and the sums in the denominator range over all terms t.
The weights w(d,t) and w(q,t) are Okapi BM25 weights, calculated as follows:
w(d,t) = dtf / (dtf + sizeAdj)
    dtf     = term frequency in the document
    sizeAdj = k1 * ((1-b) + (b * doclen/avgdoclen))
    b = .75, k1 = 2

w(q,t) = qtf * IDF(N, df)
    qtf       = term frequency in the query
    IDF(N,df) = (ln(N/df) + 1) / (ln(N) + 1)
    N  = the number of documents classified
    df = document frequency, or the number of documents containing the term
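For example (illustrative values): for a document of average length (doclen = avgdoclen), sizeAdj = k1 = 2, so a term occurring 3 times gives w(d,t) = 3/(3+2) = 0.6. For a query term occurring once, with N = 1000 documents of which df = 10 contain the term, IDF = (ln(100)+1)/(ln(1000)+1) ≈ 5.61/7.91 ≈ 0.71, so w(q,t) ≈ 0.71.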
OkapiBM25 = SUM( QTF * ln(IDF) * DTF )

Where:
- IDF = (N - n + .5) / (n + .5)
- DTF = (k1+1)*tf / ((k1*sizeAdjD) + tf)
- tf = frequency of occurrences of the term in the document
- sizeAdjD = (1-b) + b*doclen/avgdoclen
- QTF = (k3+1)*qtf / (k3 + qtf)
- qtf = frequency of occurrences of the term in the query
- doclen = document length
- avgdoclen = average document length
- N = the number of documents in the collection
- n = the number of documents containing the word
- k1 = 1.2
- b = 0.75 or 0.25 (recommend .75 for full text and .25 for shorter representations)
- k3 = 7; set to 7 or 1000; controls the effect of the query term frequency on the weight