Using iKnow
Text Categorization

Text categorization allows you to assign category labels to source texts, based on the contents of those texts.

For example, suppose you anticipate having a large number of Aviation Event texts that you will wish to categorize by AircraftCategory label: “airplane”, “helicopter”, “glider”, and so forth. By determining what iKnow entities in the contents of these sample texts correspond strongly to a category label, you can create a text classification model that you can apply to future Aviation Event texts that do not yet have an assigned category.
Defining appropriate categories is an essential preliminary to text categorization.
Text Categorization Implementation
iKnow supports two approaches to building a text classification model: an analytic approach, which derives the model from a training set of pre-categorized texts, and a rules-based approach.
The descriptions that follow apply to the analytic approach to building a text classification model. Note that an analytic approach can use any analytic method, such as Naive Bayes statistical analysis or user-defined decision rules.
To perform text categorization, you must first create a Text Classifier (a text classification model). This model is based on a training set of source texts that have already been assigned category labels. By analyzing the contents of these training set texts, iKnow determines which iKnow entities correspond strongly to which category. You build and test a Text Classifier that statistically associates these iKnow text entities with categories. Once you have an accurate Text Classifier, you can then use it on new source texts to assign category labels to them.
Typically, categories are specified in a metadata field. Each text is associated with a single category label. The number of different category labels should be low relative to the number of source texts, with each category well represented by a number of texts in your training set.
iKnow text categorization starts from iKnow entities (not just words) within the source texts. It can use in its analysis not only the frequency of entities within the source texts, but the context of the entity, such as whether the entity is negated, and the entity’s appearance in larger text units such as CRCs and sentences. By using the full range of iKnow semantic analysis, text categorization can provide precise and valuable categorization of texts.
Analytic text categorization consists of three activities: building a Text Classifier, testing it against a test set, and using it to categorize new texts.
Implementation Interfaces
You can implement a text classification model in either of two ways: programmatically, using the %iKnow.Classification.Builder API, or interactively, using the Management Portal Text Categorization Model builder.
Establishing a Training Set and a Test Set
Regardless of which interface you use, before building a Text Classifier you must load into a domain a group of data sources with associated category labels. These sources are used to train and test the Text Classifier.
It is possible to create a rules-based Text Classifier that does not require a pre-existing group of sources with assigned category labels. However, in the examples in this chapter the use of training set and test set sources is required.
You need to be able to divide these loaded sources into (at least) two groups: a training set and a test set. You use the training set to establish which entities are good indicators for particular categories. You use the test set (or multiple test sets) to determine whether this predictive assignment of category labels holds for sources other than the training set. This prevents “overfitting” the terms to a particular group of sources. The training set should be the larger of the two, containing roughly 70% of the sources, with the remaining 30% as the test set.
One common method for dividing SQL sources into a training set and a test set is to use a field of the source as a metadata field. You supply less than (<) and greater than (>) operators to AddField() so that you can perform a boolean test on the values of that field, dividing the sources into two groups. This division of sources should be as random as possible; using the SQL RowID as the metadata field usually achieves this goal.
The Management Portal Text Categorization Model builder is designed to use the values of a metadata field to divide a group of sources into a training set and a test set.
The following example establishes the SQL RowID as a metadata field that can be used to divide the loaded sources into a training set and a test set:
  SET myquery="SELECT ID,SkyConditionCeiling,Type,NarrativeFull FROM Aviation.Event"
  SET idfld="ID"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")   // text data field
  SET metaflds=$LB("SkyConditionCeiling","ID")
  DO ##class(%iKnow.Queries.MetadataAPI).AddField(domId,"SkyConditionCeiling")  // categories field
  DO ##class(%iKnow.Queries.MetadataAPI).AddField(domId,"ID",$LB("=","<=",">")) // set divider field
You can also divide loaded sources of any type into groups using iKnow source ID values. You can use the %iKnow.Filters.SourceIdFilter class to divide a group of sources into a training set and a test set. The following example uses modulo division on the source IDs to place two-thirds of the loaded sources in tTrainingSet, and the remaining sources in tTestSet:
  SET numsrc = ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,numsrc)
  SET j=1
  SET filtlist=""
  WHILE $DATA(result(j)) {
    SET intId = $LISTGET(result(j),1)
    IF intId#3 > 0 {SET filtlist=filtlist_$SELECT(filtlist="":"",1:",")_intId}
    SET j=j+1 }
  SET tTrainingSet = ##class(%iKnow.Filters.SourceIdFilter).%New(domId,filtlist)
  SET tTestSet = ##class(%iKnow.Filters.GroupFilter).%New(domId,"AND",1)  // NOT filter
  DO tTestSet.AddSubFilter(tTrainingSet)
  SET trainsrc = ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId,tTrainingSet)
  WRITE "The training set contains ",trainsrc," sources",!
  SET testsrc = ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId,tTestSet)
  WRITE "The test set contains ",testsrc," sources",!
Note that %iKnow.Filters.RandomFilter is another way to divide a group of sources. However, each time you invoke %iKnow.Filters.RandomFilter the resulting training set consists of different sources.
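A random split can be sketched as follows. This assumes that the second %New() argument of %iKnow.Filters.RandomFilter specifies the fraction of sources to select; verify the exact signature in the class reference before use. The test set is built as the negation of the training set, using the same GroupFilter pattern shown above:

  // Randomly select roughly 70% of the sources as the training set
  SET tTrainingSet = ##class(%iKnow.Filters.RandomFilter).%New(domId,0.7)
  // The test set is everything NOT in the training set
  SET tTestSet = ##class(%iKnow.Filters.GroupFilter).%New(domId,"AND",1)
  DO tTestSet.AddSubFilter(tTrainingSet)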
Building a Text Classifier Programmatically
To build a text classifier, you use a %iKnow.Classification.Builder object. The description that follows applies to the analytic approach to building a Text Classifier.
Create a Text Classifier
To create a Text Classifier, you must first instantiate a Builder object, supplying it with the domain name and the oref of the training set filter. You then configure the ClassificationMethod algorithm that the Text Classifier will use. The easiest-to-use algorithm is Naive Bayes, which is based on Bayes’ theorem: it combines the individual entities’ probabilities for each category in the training set to calculate the overall probability of a new text belonging to that category:
  SET tBuilder = ##class(%iKnow.Classification.IKnowBuilder).%New("mydomain",tTrainingSet)
  SET tBuilder.ClassificationMethod="naiveBayes"
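As a simplified sketch of this idea (not the actual iKnow implementation; the probabilities below are invented for illustration), a text containing the terms “helipad” and “engine” might be scored as follows:

  // Hypothetical per-term conditional probabilities P(term | category)
  SET p("Helicopter","helipad")=0.90
  SET p("Helicopter","engine")=0.55
  SET p("Airplane","helipad")=0.05
  SET p("Airplane","engine")=0.60
  // Naive Bayes treats the terms as independent and multiplies their probabilities
  SET score("Helicopter") = p("Helicopter","helipad") * p("Helicopter","engine")
  SET score("Airplane") = p("Airplane","helipad") * p("Airplane","engine")
  // The text is assigned to the category with the highest combined score
  WRITE $SELECT(score("Helicopter")>score("Airplane"):"Helicopter",1:"Airplane")

Here “helipad” dominates the result: 0.90 × 0.55 is far larger than 0.05 × 0.60, so the text is assigned to Helicopter.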
You then specify the categories that the Text Classifier will use. If your sources supply the category labels as a metadata field, you can make a single call to the %LoadMetadataCategories() method. You do not need to specify either the category values or even the number of categories. In the following example, the AircraftCategory metadata field of Aviation.Aircraft is used as a category field assigning each record to a category: “Airplane”, “Helicopter”, “Glider”, “Balloon”, etc. The following example shows the use of this metadata field to specify categories:
  SET myquery="SELECT TOP 100 E.ID,A.AircraftCategory,E.Type,E.NarrativeFull "_
  "FROM Aviation.Aircraft AS A,Aviation.Event AS E "_
  "WHERE A.Event=E.ID"
  SET idfld="ID"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")
  SET metaflds=$LB("AircraftCategory")
  SET mstat=##class(%iKnow.Queries.MetadataAPI).AddField(domId,"AircraftCategory")
  IF mstat=1 {
    SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds,metaflds) }  // flister: SQL Lister created earlier
  WRITE tBuilder.%LoadMetadataCategories("AircraftCategory")
Note that this is a useful (but not an ideal) category field, because a large percentage of the records (>80%) are assigned to “Airplane”, whereas most other labels have only a handful of records assigned to them. Ideally, each category label should correspond to roughly equivalent numbers of texts. As long as each category represents at least 10% of the training set, most classification methods should work fine. A category label must be associated with more than one source text; it may, therefore, be useful to combine potential category values with very low numbers of texts into a catch-all category, with a category label such as “miscellaneous”.
Populate the Terms Dictionary
Once you have established categories, you select terms which the Text Classifier will locate in each text and use to determine what category label to assign to it. You can either assign terms individually, or use %PopulateTerms() to add multiple terms found in the texts according to some metric.
%PopulateTerms() allows you to automatically specify a number of terms based on their frequency in the texts. By default the terms are selected using the Naive Bayes algorithm (most differentiating per-category probability):
  SET stat=tBuilder.%PopulateTerms(50)
Implementations for metrics other than Naive Bayes can be provided by subclasses. You can use %PopulateTerms() to select a specified number of top-ranked terms from the training set documents using the BM25 or TFIDF metric.
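For example, a call of the following form would select the 50 top-ranked terms by BM25 score. This assumes the metric name is accepted as an additional argument; verify the exact %PopulateTerms() signature in the class reference:

  SET stat=tBuilder.%PopulateTerms(50,"BM25")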
You will typically use a combination of %PopulateTerms() and %AddEntity() methods to create the desired set of terms.
You specify individual terms to include in the Text Classifier, using the %AddEntity(), %AddCRC(), and %AddCooccurrence() methods:
%AddEntity() can add an entity as a single term, or add multiple entities as a single composite term by supplying the entities as an array or list. iKnow aggregates the counts and scores of these entities, allowing you to capture synonyms or group variants of a term.
  DO tBuilder.%AddEntity("hang glider")
  DO tBuilder.%AddEntity("fixed wing aircraft","explicit","partialCount")
    SET tData(1)="helicopter",tData(2)="helicopters",tData(3)="twin-rotor helicopter"
  DO tBuilder.%AddEntity(.tData)
%AddEntity() can optionally specify how to handle negation and how to handle partial matches, as shown in the second %AddEntity() in the previous example.
%AddCRC() can add a CRC as a single term. Because text classification depends on the frequency of matches amongst the source texts, it is unusual for a CRC to be common enough to be useful as a Text Classifier term. However, if there is a very specific sequence of entities (a CRC) that is a strong indicator for a particular category, adding CRCs can make sense.
%AddCooccurrence() allows you to add as a single term the appearance of two specified entities in the same sentence (in any order). You can optionally specify how to handle negation and how to handle partial matches:
  WRITE tBuilder.%AddCooccurrence($LISTBUILD("landed","helicopter pad"))
Note that these terms are not associated with a particular category. The Builder will automatically calculate how well each text containing these terms correlates to each category.
Run the Classification Optimizer
When developing a Text Classifier, you do not have to add or remove terms by trial and error. You can use the methods of the %iKnow.Classification.Optimizer class to include those entities that will have the largest impact on predictive accuracy.
  1. Create an Optimizer object and use its Builder property to specify the %iKnow.Classification.Builder object to associate with it. Optionally, set the ScoreMetric property to specify how you want to measure performance (the default is MacroFMeasure).
      SET tOpt = ##class(%iKnow.Classification.Optimizer).%New(domId,tBuilder)
      SET tOpt.ScoreMetric="MicroPrecision"
  2. Include a large number of candidate terms, either from an array (using LoadTermsArray()) or using an SQL query (using LoadTermsSQL()).
  3. Run the Optimize() method. This will automatically add terms and remove terms based on their ScoreMetric values. Optimize() performs the specified number of rounds of adding potentially high-value terms, calculating their impact, then removing low-value terms.
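Putting these steps together, the overall flow looks like the following sketch. The LoadTermsSQL() query and the Optimize() arguments are illustrative placeholders; consult the class reference for the exact signatures:

  SET tOpt = ##class(%iKnow.Classification.Optimizer).%New(domId,tBuilder)
  SET tOpt.ScoreMetric="MacroFMeasure"
  // Load a large pool of candidate terms; here, a hypothetical query
  // returning one candidate term per row
  SET tSC = tOpt.LoadTermsSQL("SELECT TOP 200 ...")
  // Add high-value candidates and remove low-value terms over several rounds
  SET tSC = tOpt.Optimize()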
Generate the Text Classifier
Once you have identified categories and terms, you generate a Text Classifier class. This text classifier class contains code to identify the most appropriate category label based on the terms found in the source. You specify a class name for your Text Classifier:
   WRITE tBuilder.%CreateClassifierClass("User.MyClassifier")
The operation performed by this method depends on the ClassificationMethod you specified. For Naive Bayes, the Builder first creates a matrix containing the match score/count for each term in each source text for which we also know the actual category. This builds a model of how well the specified terms are predictive of the assigned category.
In the example used here, the categories were taken from the AircraftCategory metadata field values. Each term is correlated with each source to determine how predictive that term is in determining the category. For example, the appearance of the term “helipad” is strongly predictive of a source with AircraftCategory=helicopter. The term “engine” is indicative of several categories — airplane or helicopter, but not glider or balloon — and is thus weakly predictive of a single category. However, including a term of this type may be helpful for eliminating some categories. The term “passenger” is only weakly predictive of any category, and is therefore probably not a good term for your text classifier model. You can use %AddEntity() and %RemoveTerm() to fit your dictionary of terms based on their contribution to the determination of a category.
Testing a Text Classifier
Your text classifier model has been fitted to its training set of documents, so that the set of terms in its term dictionary accurately determines the category. You now need to test the model on a separate set of documents to determine whether it is accurate for documents other than those in the training set. For this, you use the test set of documents. Like the training set, these documents also have a defined category label.
You can use the %TestClassifier() method to return a single accuracy value. The accuracy is the number of correct predictions divided by the total number of records tested. The higher the accuracy against the test set documents, the better the model.
   WRITE tBuilder.%TestClassifier(tTestSet,,.accuracy),!
   WRITE "model accuracy: ",$FNUMBER(100*accuracy,"L",2)," percent"
It is likely that the predictive accuracy for all categories is not the same. You should therefore test the accuracy for individual categories.
The following example returns both the overall accuracy and the individual incorrect prediction results:
  WRITE tBuilder.%TestClassifier(tTestSet,.testresult,.accuracy),!
  WRITE "model accuracy: ",$FNUMBER(accuracy*100,"L",2)," percent",!
  SET n=1
  SET wrongcnt=0
  WHILE $DATA(testresult(n)) {
    IF $LISTGET(testresult(n),2) '= $LISTGET(testresult(n),3) {
      SET wrongcnt=wrongcnt+1
      WRITE "WRONG: ",$LISTGET(testresult(n),1)
      WRITE " actual ",$LISTGET(testresult(n),2)
      WRITE " pred. ",$LISTGET(testresult(n),3),! }
    SET n=n+1 }
  WRITE wrongcnt," incorrect predictions out of ",n-1," test sources",!
Predictive accuracy for a category is calculated based on four possible outcomes of matching a prediction to a known category: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN).
These counts are used to generate the following ratios:
Precision is the ratio of correct results to the number of results returned for a particular category: TP / (TP+FP). For example, the term “helipad” would contribute to a high precision ratio for the category Helicopter; nearly all texts that mention “helipad” are in the category Helicopter.
Recall is the ratio of correct results to the number of results that should have been returned for a particular category: TP / (TP+FN). For example, the term “helipad” is not likely to improve the recall ratio for the category “Helicopter” because only a few of these texts mention “helipad”.
The F-measure (F1) of the model for a category combines the Precision and Recall values as their harmonic mean: F1 = (2 × Precision × Recall) / (Precision + Recall). Note that an increase in Precision may cause a decrease in Recall, and vice versa. Which of the two you wish to maximize depends on your use case. For example, in a medical screening application you may wish to accept more False Positives to minimize the number of False Negatives.
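As an illustration (the counts below are invented for one hypothetical category), these ratios can be computed directly from the outcome counts:

  // Hypothetical counts for one category in a test run
  SET tp=40,fp=5,fn=15
  SET precision = tp / (tp + fp)                            // 40/45
  SET recall = tp / (tp + fn)                               // 40/55
  SET f1 = (2 * precision * recall) / (precision + recall)  // harmonic mean: .8 for these counts
  WRITE "F-measure: ",$FNUMBER(f1,"",3)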
If a category value was not found in the training set, the Text Classifier cannot predict that category for a text in the test set. In this case, both the True Positive (TP) and False Positive (FP) counts will be zero, and the False Negative (FN) count will be the full count of texts with that category specified.
Using Test Results
If there is a significant discrepancy between the accuracy of the training set and the accuracy of the test set, the terms dictionary has been “overfitted” to the training set. To correct this problem, go back to the Build process and revise the term dictionary. You can generalize the term dictionary by replacing an individual term with a term array:
  SET stat=tBuilder.%RemoveTerm("Bell helicopter")
  SET tData(1)="Bell helicopter",tData(2)="Bell 206 helicopter",tData(3)="Bell 206A helicopter",
      tData(4)="Bell 206A-1 helicopter",tData(5)="Bell 206L helicopter",tData(6)="Bell 206L LongRanger"
  SET stat=tBuilder.%AddEntity(.tData)
You can also generalize the term dictionary by changing an individual term to allow for partial matches ("partialCount"), rather than only an exact match.
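For example, you can re-add a term with the explicit negation and partial-match options shown earlier for %AddEntity():

  DO tBuilder.%RemoveTerm("fixed wing aircraft")
  DO tBuilder.%AddEntity("fixed wing aircraft","explicit","partialCount")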
Building a Text Classifier Using the UI
You can build a Text Classifier using the Caché Management Portal. From the Management Portal System Explorer option, select the iKnow option, then select Text Categorization. This displays two options: Model builder and Model tester.
All iKnow domains exist within a specific namespace. Therefore, you must specify which namespace you wish to use. A namespace must be iKnow-enabled before it can be used. Selecting an iKnow-enabled namespace displays the iKnow Text Categorization option.
You cannot use the %SYS namespace for iKnow operations. The Management Portal iKnow option is non-functional (greyed out) while in the %SYS namespace. Therefore, you must specify which existing namespace you wish to use by clicking the Switch option at the top of any Management Portal interface page before using the iKnow option.
You may also need to activate %iKnow UI classes for your web application. Open the Caché Terminal and run the activation utility for the desired namespace, as follows: DO EnableIKnow^%SYS.cspServer("/csp/samples/"). (This example activates the Samples namespace.)
In this interface, you can either open an existing Text Classifier or build a new one. To build a new Text Classifier, you must already have a defined domain containing data sources. The data sources must contain a category field.
Define a Data Set for the UI
To create a new Text Classifier, you must have created a domain and populated it with data sources that can be used as the training set and the test set. Commonly, these data sources should supply the following: a text data field containing the source text (in the example below, NarrativeFull); a metadata field containing the category labels (AircraftCategory); and a metadata field that can be used to divide the sources into a training set and a test set (ID).
The following is an example of data source field definitions:
  SET myquery="SELECT TOP 200 E.ID,A.AircraftCategory,E.Type,E.NarrativeFull "_
     "FROM Aviation.Aircraft AS A,Aviation.Event AS E "_
     "WHERE A.Event=E.ID"
  SET idfld="ID"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")
  SET metaflds=$LB("AircraftCategory","ID")
  DO ##class(%iKnow.Queries.MetadataAPI).AddField(domId,"AircraftCategory")
  DO ##class(%iKnow.Queries.MetadataAPI).AddField(domId,"ID",$LB("=","<=",">"))
Build a Text Classifier
To create a new text classifier, select the New button. This displays the Create a new classifier window. Create a new Class name for your Text Classifier. Select an iKnow Domain from the drop-down list of existing domains. Select the Category field containing the category labels from the drop-down list of metadata fields defined for the domain. For Training Set, select a metadata field from the drop-down list, an operator from the drop-down list, and specify a value. For example: EventDate <= 2004. For the Test Set, select the same metadata field, and a complementary operator and value. For example: EventDate > 2004. Alternatively, you can specify the Training Set and Test Set using an SQL query.
For Populate terms select a method of deriving terms from the drop-down list, and specify the number of top terms to derive. For example Top n terms by NB differentiation for Naive Bayes (NB). Then click Create.
This displays a screen with three panels:
The right panel displays the Model properties. Normally, you would not change these values. Clicking on the Data source domain name allows you to change the Category field, Training Set, or Test Set specified in the previous step. If you click the Gears icon in the button bar, you'll display some additional advanced controls.
The central panel (Selected terms) shows a tree view of the terms that you have already selected as part of the model. The left panel (Add terms) allows you to add terms to the model.
Terms Selection
You can add entities to the terms dictionary in several ways; the most common are described below.
You can use the check boxes to select individual entities as terms, or you can scroll to the bottom of the list and click select all for the current page of listed entities. You can go on to additional pages, if desired. Once you have selected entities, scroll back up to the top of the Add terms list and click the Add button. This adds the selected entities to the Selected terms list in the central panel. (You may get an informational message if you have selected duplicate terms.)
Optional Advanced Term Selection
The Append button allows you to create composite terms. In a composite term, multiple similar entities all resolve to the same single term. Defining composite terms helps to make the Text Classifier more robust and general, less “fitted” to specific entities in the training set. You can also create a composite term in the Selected terms list by dragging and dropping one term atop another.
By default, all multiword entities you select are added to the terms dictionary as an exact occurrence: all of the words in the entity must appear in the specified order in a text to qualify as a match. However, you can instead specify individual entities as partial match entities. When adding a multiword entity, click the Gears icon to display the Count drop-down box. Select partial match count. Defining common multiword entities as partial match terms helps to make the Text Classifier more robust and general, less “fitted” to specific entities in the training set.
You can use the CRCs tab or the Cooccurrences tab to add these entities to the terms dictionary. Type an entity in the box provided, then press Enter. The top CRCs and cooccurrences appear in the Add terms list in descending order. (A cooccurrence is the appearance of two Concept entities, in any order, in the same sentence.) Commonly, most CRCs and cooccurrences are too specific to add to a terms dictionary. However, you may wish to add the most common CRCs and/or cooccurrences to the terms dictionary.
You can remove terms from the Selected terms list in the central panel; click on an individual term (or entity within a composite term), then click the Remove button.
When you have finalized the Model properties and list of Selected terms, click the Save button (or the Save as button). This builds your Text Classifier.
Optimize the Text Classifier
Once you have saved a Text Classifier, you can use the Optimizer to automatically optimize the term dictionary. The Optimizer takes a list of additional candidate terms, tests each term, and adds those terms with the highest impact on the accuracy of the Text Classifier.
Click the Optimize button. This displays the Optimize term selection popup window. Select a Relevance metric (BM25, TFIDF, or Dominance) from the drop-down list. Specify a number of candidate terms using that metric, and click Load. The right panel lists the Candidate terms to test. Click Next.
This displays the Settings panel with default values. You can accept these defaults and click Start. This runs the Optimizer, adding and removing terms. When the optimization process completes, you can close the Optimize term selection popup window. Note that the Selected terms list in the central panel has changed. Click the Save button to save these additions to your terms dictionary.
You can run the Optimizer several times with different settings. After each optimization you can test the Text Classifier, as described in the following section.
Test the Text Classifier against a Test Set of Data
Click the Test button. This displays the Model tester in a new browser tab. The Model tester allows you to test your Text Classifier against test data. After testing it, you can return to the Model builder browser tab and add or remove terms, either manually using Add terms , or by running (or re-running) the Optimizer.
The Model tester provides two ways of testing your Text Classifier: against a test set of already-categorized sources, or against an individual uncategorized text string.
Test the Text Classifier on Uncategorized Data
You can use the Text Classifier on a text string to test how it derives a category. Select the Test button. This displays the Text input window. Specify a text string, then press Categorize!.
Using a Text Classifier
Once you have built an accurate text classifier, you will want to apply it to source texts that have not yet been assigned a category label. Using methods of %iKnow.Classification.Classifier, your text classifier can be used to predict the category of any unit of text.
To run a Text Classifier, you must first instantiate the subclass of %iKnow.Classification.Classifier that represents the Text Classifier you wish to run. Once created, a text classifier is completely portable; you can use this text classifier class independently of the domain that contains the training set and test set data.
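For example, assuming the generated Text Classifier class is named User.MyClassifier (as in the earlier %CreateClassifierClass() call), and that the input text here is an invented sample:

  SET tClassifier = ##class(User.MyClassifier).%New()
  DO tClassifier.%Categorize(.categories,"The helicopter landed safely on the helipad.")
  ZWRITE categories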
ZWRITE categories displays the match score data for each category.
In SQL you can execute a Text Classifier against a text string using a method stored procedure, as follows:
SELECT User.MyClassifier_sys_CategorizeSQL('input string') AS CategoryLabel
This returns a category label.
The following Embedded SQL example uses the Sample.MyTC Text Classifier to determine a category label for the first 25 records in Aviation.Event:
  ZNSPACE "Samples"
  FOR i=1:1:25 {
    SET rec=i
    &sql(SELECT %ID,NarrativeFull INTO :id,:inputstring FROM Aviation.Event WHERE %ID=:rec)
    WRITE "Record ",id
    &sql(SELECT Sample.MyTC_sys_CategorizeSQL(:inputstring) INTO :CategoryLabel)
    WRITE " assigned category: ",CategoryLabel,! }
Once texts have been classified, you can use this classification to filter texts using the %iKnow.Filters.SimpleMetadataFilter class. See Filtering by User-defined Metadata for further details.