
Text Categorization

Important:

InterSystems has deprecated InterSystems IRIS® Natural Language Processing (NLP). It may be removed from future versions of InterSystems products. The following documentation is provided as reference for existing users only. Existing users who would like assistance identifying an alternative solution should contact the WRC.

Text categorization allows you to assign category labels to source texts, based on the contents of those texts.

For example, suppose you anticipate having a large number of Aviation Event texts that you will wish to categorize by AircraftCategory label: “airplane”, “helicopter”, “glider”, and so forth. By determining what NLP entities in the contents of these texts correspond strongly to a category label, you can create a text classification model that you can apply to future Aviation Event texts that do not yet have an assigned category.

Defining appropriate categories is an essential preliminary to text categorization:

  • Each source can be assigned only one category. A source is assigned a category; a category is not assigned sources. Every source must correspond unambiguously to one of the defined categories.

  • The number of category values (labels) is fixed. Because you cannot add more category labels to a text classification model, all possible future sources should be assignable to one of the initial category labels.

  • The number of category values (labels) should be small. Categories should be designed so that roughly equal numbers of sources will be assigned to each category value.

Text Categorization Implementation

NLP supports two approaches to building a text classification model:

  • Analytic: analyze a set of existing texts that have category labels to determine which entities within the texts are the strongest indicators of membership in each of these categories. This requires you to have a representative sample of texts that have already been categorized.

  • Rules-based: perform NLP entity queries on a set of existing texts; for example, determine the top TFIDF or BM25 entities. Define categories using %AddCategory(). Develop rules (boolean tests) for the presence of high-value entities within the texts, associating specific categories with specific rules. Applying these rules collectively determines the membership of a text in a category: the category with the highest number of successful boolean tests is assigned. This approach does not require a set of texts that have already been categorized.
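The rules-based workflow might be outlined as follows. This is a hypothetical sketch only: it assumes a Builder object (tBuilder) of the kind used in the programmatic examples later in this chapter, and the %AddCategory() arguments shown here are assumptions; consult the class reference for the exact signatures.

```
  // Hypothetical sketch of the rules-based approach.
  // Define the categories first; no pre-categorized training set is needed.
  DO tBuilder.%AddCategory("helicopter")
  DO tBuilder.%AddCategory("glider")
  // Add high-value entities (for example, top BM25 or TFIDF entities) as terms.
  // Rules (boolean tests on the presence of these terms) are then associated
  // with specific categories; a text is assigned the category whose rules
  // yield the highest number of successful tests.
  DO tBuilder.%AddEntity("rotor blade")
  DO tBuilder.%AddEntity("winch launch")
```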

The descriptions that follow apply to the analytic approach to building a text classification model. Note that an analytic approach can use any analytic method, such as Naive Bayes statistical analysis or user-defined decision rules.

To perform text categorization, you must first create a Text Classifier (a text classification model). This model is based on a training set of source texts that have already been assigned category labels. By analyzing the contents of these training set texts, NLP determines which NLP entities correspond strongly to which category. You build and test a Text Classifier that statistically associates these NLP text entities with categories. Once you have an accurate Text Classifier, you can then use it on new source texts to assign category labels to them.

Typically, categories are specified in a metadata field. Each text is associated with a single category label. The number of different category labels should be low relative to the number of source texts, with each category well represented by a number of texts in your training set.

NLP text categorization starts from NLP entities (not just words) within the source texts. It can use in its analysis not only the frequency of entities within the source texts, but the context of the entity, such as whether the entity is negated, and the entity’s appearance in larger text units such as CRCs and sentences. By using the full range of NLP semantic analysis, text categorization can provide precise and valuable categorization of texts.

Analytic text categorization consists of three activities:

  • Build a Text Classifier. This requires a set of texts (the training set), each of which has been assigned a category label. In this step you select a set of terms (entities) that are found in these texts and may serve to differentiate them. You also select a ClassificationMethod (algorithm) that determines how to correlate the appearance of these terms in the training set texts with the associated category label.

  • Test the Text Classifier to determine its fit. This requires another set of texts (the test set), each of which has been assigned a category label. Based on this test information, you can revisit the build step, adding or removing terms, and thus iteratively improving the accuracy of the text classifier model.

  • Use the Text Classifier to categorize texts that do not have an assigned category.

Implementation Interfaces

You can implement a text classification model in either of two ways:

  • Programmatically, using the methods of the %iKnow.Classification.Builder class, as described in Building a Text Classifier Programmatically.

  • Interactively, using the Management Portal Model builder interface, as described in Building a Text Classifier Using the UI.

Managing your Text Classification Model

Regardless of the way you choose to train, optimize, generate, and test them, Text Classifiers are stored as class definitions by InterSystems IRIS. You can manage them in the same way as any other ObjectScript class in your InterSystems IRIS environment.

The Management Portal provides a Classes page where you can easily manage the Text Classifiers in your text classification model. You can use this page to delete, import, and export Text Classifier classes.

Establishing a Training Set and a Test Set

Regardless of which interface you use, before building a Text Classifier you must load into a domain a group of data sources with associated category labels. These sources are used to train and test the Text Classifier.

Note:

It is possible to create a rules-based Text Classifier that does not require a pre-existing group of sources with assigned category labels. However, in the examples in this chapter the use of training set and test set sources is required.

You need to divide these loaded sources into (at least) two groups: a training set and a test set. You use the training set to establish which entities are good indicators for particular categories. You use the test set (or multiple test sets) to determine whether this predictive assignment of category labels holds for sources other than the training set. This prevents “overfitting” the terms to a particular group of sources. The training set should be the larger of the two sets, containing roughly 70% of the sources, with the remaining 30% as the test set.

One common method for dividing SQL sources into a training set and a test set is to use a field of the source as a metadata field. You supply less than (<) and greater than (>) operators to AddField() so that you can perform a boolean test on the values of that field, dividing the sources into two groups. This division of sources should be as random as possible; using the SQL RowID as the metadata field usually achieves this goal.

The Management Portal Text Categorization Model builder is designed to use the values of a metadata field to divide a group of sources into a training set and a test set.

Refer to A Note on Program Examples for details on the coding and data used in the examples in this book.

The following example establishes the SQL RowID as a metadata field that can be used to divide the loaded sources into a training set and a test set:

  SET myquery="SELECT ID,SkyConditionCeiling,Type,NarrativeFull FROM Aviation.Event"
  SET idfld="ID"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")   // text data field
  SET metaflds=$LB("SkyConditionCeiling","ID")
  DO ##class(%iKnow.Queries.MetadataAPI).AddField(domId,"SkyConditionCeiling")  // categories field
  DO ##class(%iKnow.Queries.MetadataAPI).AddField(domId,"ID",$LB("=","<=",">")) // set divider field

You can also divide loaded sources of any type into groups using NLP source ID values. You can use the %iKnow.Filters.SourceIdFilter class to divide a group of sources into a training set and a test set. The following example uses modulo division on the source IDs to place two-thirds of the loaded sources in tTrainingSet, and the remaining sources in tTestSet:

FilterBySrcId
  SET numsrc = ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId)
  DO ##class(%iKnow.Queries.SourceAPI).GetByDomain(.result,domId,1,numsrc)
  SET j=1
  SET filtlist=""
  WHILE $DATA(result(j)) {
    SET intId = $LISTGET(result(j),1)
    // keep source IDs not divisible by 3 (roughly two-thirds of the sources)
    IF intId#3 > 0 {
      SET filtlist=filtlist_$SELECT(filtlist="":"",1:",")_intId }
    SET j=j+1
  }
  SET tTrainingSet = ##class(%iKnow.Filters.SourceIdFilter).%New(domId,filtlist)
  SET tTestSet = ##class(%iKnow.Filters.GroupFilter).%New(domId,"AND",1)  // negated group: NOT training set
  DO tTestSet.AddSubFilter(tTrainingSet)
DisplaySourceCounts
  SET trainsrc = ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId,tTrainingSet)
  WRITE "The training set contains ",trainsrc," sources",!
  SET testsrc = ##class(%iKnow.Queries.SourceAPI).GetCountByDomain(domId,tTestSet)
  WRITE "The test set contains ",testsrc," sources",!

Note that %iKnow.Filters.RandomFilter provides another way to divide a group of sources. However, each invocation of %iKnow.Filters.RandomFilter produces a training set consisting of different sources.
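A random 70/30 split might be obtained as follows. This is a sketch: the assumption that the second %New() argument is the fraction of sources to select should be checked against the %iKnow.Filters.RandomFilter class reference; the negated GroupFilter pattern mirrors the SourceIdFilter example above.

```
  // Sketch: select roughly 70% of the domain's sources at random.
  // (Assumes the second %New() argument is the fraction to include;
  // each invocation yields a different random set.)
  SET tTrainingSet = ##class(%iKnow.Filters.RandomFilter).%New(domId,0.7)
  // The complementary test set: everything NOT in the training set.
  SET tTestSet = ##class(%iKnow.Filters.GroupFilter).%New(domId,"AND",1)
  DO tTestSet.AddSubFilter(tTrainingSet)
```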

Building a Text Classifier Programmatically

To build a text classifier, you use a %iKnow.Classification.Builder object. The description that follows applies to the analytic approach to building a Text Classifier.

Refer to A Note on Program Examples for details on the coding and data used in the examples in this book.

Create a Text Classifier

To create a Text Classifier, you must first instantiate a Builder object, supplying it the domain name and the oref for the training set. You then configure the ClassificationMethod algorithm that the Text Classifier will use. The easiest-to-use algorithm is based on the Naive Bayes theorem. Naive Bayes combines individual entities’ probabilities for each category in the training set to calculate the overall probability of a new text belonging to that category:

  SET tBuilder = ##class(%iKnow.Classification.IKnowBuilder).%New("mydomain",tTrainingSet)
  SET tBuilder.ClassificationMethod="naiveBayes"

You then specify the categories that the Text Classifier will use. If your sources supply the category labels as a metadata field, you can make a single call to the %LoadMetadataCategories() method. You do not need to specify either the category values or even the number of categories. In the following example, the AircraftCategory metadata field of Aviation.Aircraft is used as a category field assigning each record to a category: “Airplane”, “Helicopter”, “Glider”, “Balloon”, etc. The following example shows the use of this metadata field to specify categories:

  SET myquery="SELECT TOP 100 E.ID,A.AircraftCategory,E.Type,E.NarrativeFull "_
              "FROM Aviation.Aircraft AS A,Aviation.Event AS E "_
              "WHERE A.Event=E.ID"
  SET idfld="ID"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")
  SET metaflds=$LB("AircraftCategory")
  SET mstat=##class(%iKnow.Queries.MetadataAPI).AddField(domId,"AircraftCategory")
  IF mstat=1 {
    SET stat=flister.AddListToBatch(myquery,idfld,grpfld,dataflds,metaflds) }
  .
  .
  .
  WRITE tBuilder.%LoadMetadataCategories("AircraftCategory")

Note that this is a useful (but not an ideal) category field, because a large percentage of the records (>80%) are assigned to “Airplane”, whereas most other labels have only a handful of records assigned to them. Ideally, each category label should correspond to roughly equivalent numbers of texts. As long as each category represents at least 10% of the training set, most classification methods should work fine. A category label must be associated with more than one source text; it may, therefore, be useful to combine potential category values with very low numbers of texts into a catch-all category, with a category label such as “miscellaneous”.

Populate the Terms Dictionary

Once you have established categories, you select the terms that the Text Classifier will locate in each text and use to determine which category label to assign. You can either add terms individually, or use %PopulateTerms() to add multiple terms found in the texts according to some metric.

%PopulateTerms() allows you to automatically specify a number of terms based on their frequency in the texts. By default the terms are selected using the Naive Bayes algorithm (most differentiating per-category probability):

  SET stat=tBuilder.%PopulateTerms(50)

Implementations for metrics other than Naive Bayes can be provided by subclasses. You can use %PopulateTerms() to specify the top N terms from the training set documents using the BM25 or TFIDF algorithm.
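For example, the term dictionary might be populated by BM25 score as follows. This is a sketch: the form of the metric argument is an assumption; check the %PopulateTerms() entry in the %iKnow.Classification.Builder class reference for the exact signature.

```
  // Sketch: populate the dictionary with the top 100 terms by BM25 score.
  // (The second, metric-naming argument is an assumption; see the
  // %iKnow.Classification.Builder class reference for the exact signature.)
  SET stat = tBuilder.%PopulateTerms(100,"BM25")
```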

You will typically use a combination of %PopulateTerms() and %AddEntity() methods to create the desired set of terms.

You specify individual terms to include in the Text Classifier, using the %AddEntity(), %AddCRC(), and %AddCooccurrence() methods:

%AddEntity() can add an entity as a single term, or add multiple entities as a single composite term by supplying the entities as an array or list. NLP aggregates the counts and scores of these entities, allowing you to capture synonyms or group variants of a term.

  DO tBuilder.%AddEntity("hang glider")
  DO tBuilder.%AddEntity("fixed wing aircraft","explicit","partialCount")
  SET tData(1)="helicopter",tData(2)="helicopters",tData(3)="twin-rotor helicopter"
  DO tBuilder.%AddEntity(.tData)

%AddEntity() can optionally specify how to handle negation and how to handle partial matches, as shown in the second %AddEntity() in the previous example.

%AddCRC() can add a CRC as a single term. Because text classification depends on the frequency of matches amongst the source texts, it is unusual for a CRC to be common enough to be useful as a Text Classifier term. However, if there is a very specific sequence of entities (a CRC) that is a strong indicator for a particular category, adding CRCs can make sense.

%AddCooccurrence() allows you to add as a single term the appearance of two specified entities in the same sentence (in any order). You can optionally specify how to handle negation and how to handle partial matches:

  WRITE tBuilder.%AddCooccurrence($LISTBUILD("landed","helicopter pad"))

Note that these terms are not associated with a particular category. The Builder will automatically calculate how well each text containing these terms correlates to each category.

Run the Classification Optimizer

When developing a Text Classifier, you do not have to add or remove terms by trial and error. You can use the methods of the %iKnow.Classification.Optimizer class to include those entities that will have the largest impact on predictive accuracy.

  1. Create an Optimizer object and use its Builder property to specify the %iKnow.Classification.Builder object to associate with it. Optionally, set the ScoreMetric property to specify how you want to measure performance (the default is MacroFMeasure).

      SET tOpt = ##class(%iKnow.Classification.Optimizer).%New(domId,tBuilder)
      SET tOpt.ScoreMetric="MicroPrecision"
    
  2. Include a large number of candidate terms, either from an array (using LoadTermsArray()) or from an SQL query (using LoadTermsSQL()).

  3. Run the Optimize() method. This will automatically add terms and remove terms based on their ScoreMetric values. Optimize() performs the specified number of rounds of adding potentially high-value terms, calculating their impact, then removing low-value terms.
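Taken together, the optimization steps above might look like this. This is a sketch under stated assumptions: the LoadTermsArray() input format and the Optimize() round-count argument are inferred from the descriptions above; verify both against the %iKnow.Classification.Optimizer class reference.

```
  // 1. Associate an Optimizer with the Builder and pick a score metric.
  SET tOpt = ##class(%iKnow.Classification.Optimizer).%New(domId,tBuilder)
  SET tOpt.ScoreMetric = "MicroPrecision"
  // 2. Load a pool of candidate terms (array input format is an assumption).
  SET tTerms(1)="helipad",tTerms(2)="rotor",tTerms(3)="glide path"
  SET stat = tOpt.LoadTermsArray(.tTerms)
  // 3. Run a number of add/remove rounds; low-value terms are dropped.
  SET stat = tOpt.Optimize(5)
  // The associated Builder now reflects the optimized term dictionary.
```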

Generate the Text Classifier

Once you have identified categories and terms, you generate a Text Classifier class. This text classifier class contains code to identify the most appropriate category label based on the terms found in the source. You specify a class name for your Text Classifier:

   WRITE tBuilder.%CreateClassifierClass("User.MyClassifier")

The operation performed by this method depends on the ClassificationMethod you specified. For Naive Bayes, the Builder first creates a matrix containing the match score/count for each term in each source text for which we also know the actual category. This builds a model of how well the specified terms are predictive of the assigned category.

In the example used here, the categories were taken from the AircraftCategory metadata field values. Each term is correlated with each source to determine how predictive that term is in determining the category. For example, the appearance of the term “helipad” is strongly predictive of a source with AircraftCategory=helicopter. The term “engine” is indicative of several categories — airplane or helicopter, but not glider or balloon — and is thus weakly predictive of a single category. However, including a term of this type may be helpful for eliminating some categories. The term “passenger” is only weakly predictive of any category, and is therefore probably not a good term for your text classifier model. You can use %AddEntity() and %RemoveTerm() to fit your dictionary of terms based on their contribution to the determination of a category.

Testing a Text Classifier

Your text classifier model has been fitted to its training set of documents so that the set of terms in its term dictionary accurately determine the category. You now need to test the model on a separate set of documents to determine if it is accurate for documents other than those in the training set. For this, you use the test set of documents. Like the training set, these documents also have a defined category label.

You can use the %TestClassifier() method to return a single accuracy value. The accuracy is the number of correct predictions divided by the total number of records tested. The higher the accuracy against the test set documents, the better the model.

   WRITE tBuilder.%TestClassifier(tTestSet,,.accuracy),!
   WRITE "model accuracy: ",$FNUMBER(100*accuracy,"L",2)," percent"

It is likely that the predictive accuracy for all categories is not the same. You should therefore test the accuracy for individual categories.

The following example returns both the overall accuracy and the individual incorrect prediction results:

TestClassifier
  WRITE tBuilder.%TestClassifier(tTestSet,.testresult,.accuracy),!
  WRITE "model accuracy: ",$FNUMBER(accuracy*100,"L",2)," percent",!
  SET n=1
  SET wrongcnt=0
  WHILE $DATA(testresult(n)) {
    IF $LISTGET(testresult(n),2) '= $LISTGET(testresult(n),3) {
      SET wrongcnt=wrongcnt+1
      WRITE "WRONG: ",$LISTGET(testresult(n),1)
      WRITE " actual ",$LISTGET(testresult(n),2)
      WRITE " pred. ",$LISTGET(testresult(n),3),! }
    SET n=n+1 }
  WRITE wrongcnt," out of ",n-1,!

Predictive accuracy for a category is calculated based on four possible outcomes of matching a prediction to a known category:

  • True Positive (TP): predicted as Category X, actually in Category X.

  • False Positive (FP): predicted as Category X, actually in some other category.

  • False Negative (FN): predicted as some other category, actually in Category X.

  • True Negative (TN): predicted as some other category, actually in some other category.

These counts are used to generate the following ratios:

Precision is the ratio of correct results to the number of results returned for a particular category: TP / (TP+FP). For example, the term “helipad” would contribute to a high precision ratio for the category Helicopter; nearly all texts that mention “helipad” are in the category Helicopter.

Recall is the ratio of correct results to the number of results that should have been returned for a particular category: TP / (TP+FN). For example, the term “helipad” is not likely to improve the recall ratio for the category “Helicopter” because only a few of these texts mention “helipad”.

The F-measure (F1) of the model for Category X combines the Precision and Recall values into their harmonic mean: F1 = 2 × (Precision × Recall) / (Precision + Recall). Note that an increase in Precision may cause a decrease in Recall, and vice versa. Which of the two you wish to maximize depends on your use case. For example, in a medical screening application you may wish to accept more False Positives to minimize the number of False Negatives.
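The ratios above can be illustrated with hypothetical counts for a single category (the TP/FP/FN values here are invented for the arithmetic only):

```
  // Illustration: suppose 40 TP, 10 FP, and 20 FN for category "Helicopter".
  SET tp=40,fp=10,fn=20
  SET precision=tp/(tp+fp)                        // 40/50 = 0.8
  SET recall=tp/(tp+fn)                           // 40/60 ≈ 0.667
  SET f1=2*precision*recall/(precision+recall)    // ≈ 0.727
  WRITE "Precision: ",precision,"  Recall: ",recall,"  F1: ",f1,!
```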

Note:

If a category value was not found in the training set, the Text Classifier cannot predict that category for a text in the test set. In this case, both the True Positive (TP) and False Positive (FP) counts will be zero, and the False Negative (FN) count will be the full count of texts with that category specified.

Using Test Results

If there is a significant discrepancy between the accuracy of the training set and the accuracy of the test set, the terms dictionary has been “overfitted” to the training set. To correct this problem, go back to the Build process and revise the term dictionary. You can generalize the term dictionary by replacing an individual term with a term array:

  SET stat=tBuilder.%RemoveTerm("Bell helicopter")
  SET tData(1)="Bell helicopter",tData(2)="Bell 206 helicopter",tData(3)="Bell 206A helicopter"
  SET tData(4)="Bell 206A-1 helicopter",tData(5)="Bell 206L helicopter",tData(6)="Bell 206L LongRanger"
  SET stat=tBuilder.%AddEntity(.tData)

You can also generalize the term dictionary by changing an individual term to allow for partial matches ("partialCount"), rather than only an exact match.
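For example, to change an existing exact-match term to a partial-match term, you can remove it and re-add it with the partial-match option; the parameter values here mirror the earlier %AddEntity() example:

```
  // Remove the exact-match term, then re-add it counting partial matches,
  // so that partial occurrences of the entity also contribute to its score.
  SET stat = tBuilder.%RemoveTerm("fixed wing aircraft")
  SET stat = tBuilder.%AddEntity("fixed wing aircraft","explicit","partialCount")
```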

Building a Text Classifier Using the UI

You can build a Text Classifier using the InterSystems IRIS Management Portal. From the Management Portal Analytics option, select the Text Analytics option, then select Text Categorization. This displays two options: Model builder and Model tester.

All NLP domains exist within a specific namespace. Therefore, you must specify which namespace you wish to use. A namespace must be enabled for analytics before it can be used. Selecting an enabled namespace displays the Text Analytics option.

Note:

You cannot use the %SYS namespace for NLP operations. The Management Portal Analytics option is non-functional (greyed out) while in the %SYS namespace. Therefore, you must specify which existing namespace you wish to use by clicking the name of the current namespace at the top of any Management Portal interface page.

In this interface, you can either open an existing Text Classifier or build a new one. To build a new Text Classifier, you must already have a defined domain containing data sources. The data sources must contain a category field.

Define a Data Set for the UI

To create a new Text Classifier, you must have created a domain and populated it with data sources that can be used as the training set and the test set. Commonly, data from these sources should specify the following:

  • One or more data fields containing the text to be analyzed by the Text Classifier.

  • A metadata field containing the categories used by the Text Classifier. The data sources must contain at least one source for every possible category value.

  • A metadata field containing values that can be used to divide the data sources into a training set and a test set. This field should have no connection to the actual contents of the data, thus enabling a random division of the sources; for example, the source ID number or a date or time value. In many cases, you will need to specify less than (<) and greater than (>) operators to enable the division of sources into sets.

The following is an example of data source field definitions:

  SET myquery="SELECT TOP 200 E.ID,A.AircraftCategory,E.Type,E.NarrativeFull "_
     "FROM Aviation.Aircraft AS A,Aviation.Event AS E "_
     "WHERE A.Event=E.ID"
  SET idfld="ID"
  SET grpfld="Type"
  SET dataflds=$LB("NarrativeFull")
  SET metaflds=$LB("AircraftCategory","ID")
  DO ##class(%iKnow.Queries.MetadataAPI).AddField(domId,"AircraftCategory")
  DO ##class(%iKnow.Queries.MetadataAPI).AddField(domId,"ID",$LB("=","<=",">"))

Build a Text Classifier

To create a new text classifier, select the New button. This displays the Create a new classifier window. Enter a new Class name for your Text Classifier. Select an NLP Domain from the drop-down list of existing domains. Select the Category field containing the category labels from the drop-down list of metadata fields defined for the domain. For Training Set, select a metadata field from the drop-down list, an operator from the drop-down list, and specify a value. For example: EventDate <= 2004. For the Test Set, select the same metadata field, and a complementary operator and value. For example: EventDate > 2004. Alternatively, you can specify the Training Set and the Test Set using an SQL query.

For Populate terms, select a method of deriving terms from the drop-down list, and specify the number of top terms to derive; for example, Top n terms by NB differentiation for Naive Bayes (NB). Then click Create.

This displays a screen with three panels:

The right panel displays the Model properties. Normally, you would not change these values. Clicking the Data source domain name allows you to change the Category field, Training Set, or Test Set specified in the previous step. Clicking the Gears icon in the button bar displays additional advanced controls.

The central panel (Selected terms) shows a tree view of the terms that you have already selected as part of the model. The left panel (Add terms) allows you to add terms to the model.

Terms Selection

Here are the most common ways to add entities to a terms dictionary:

  • Entities tab: type a substring in the box provided. All of the entities that contain that substring will be listed with their frequency and spread counts.

  • Top tab: select a metric (BM25 or TFIDF) from the drop-down list. The top entities according to that metric will be listed with their calculated score.

You can use the check boxes to select individual entities as terms, or you can scroll to the bottom of the list and click select all for the current page of listed entities. You can go on to additional pages, if desired. Once you have selected entities, scroll back up to the top of the Add terms list and click the Add button. This adds the selected entities to the Selected terms list in the central panel. (You may get an informational message if you have selected duplicate terms.)

Optional Advanced Term Selection

The Append button allows you to create composite terms. In a composite term, multiple similar entities all resolve to the same single term. Defining composite terms helps to make the Text Classifier more robust and general, less “fitted” to specific entities in the training set. You can also create a composite term in the Selected terms list by dragging and dropping one term atop another.

By default, all multiword entities you select are added to the terms dictionary as exact occurrences: all of the words in the entity must appear in the specified order in a text to qualify as a match. However, you can instead specify individual entities as partial match entities. When adding a multiword entity, click the Gears icon to display the Count drop-down box, then select partial match count. Like composite terms, partial-match terms help to make the Text Classifier more robust and general, less “fitted” to specific entities in the training set.

You can use the CRCs tab or the Cooccurrences tab to add these entities to the terms dictionary. Type an entity in the box provided, then press Enter. The top CRCs and cooccurrences appear in the Add terms list in descending order. (A cooccurrence is the appearance of two Concept entities, in any order, in the same sentence.) Commonly, most CRCs and cooccurrences are too specific to add to a term dictionary. However, you may wish to add the most common CRCs and/or cooccurrences to the term dictionary.

You can remove terms from the Selected terms list in the central panel; click on an individual term (or entity within a composite term), then click the Remove button.

Save

When you have finalized the Model properties and list of Selected terms, click the Save button (or the Save as button). This builds your Text Classifier.

Note:

Text Classifiers are saved as ObjectScript classes by InterSystems IRIS. You can delete, import, and export Text Classifiers from the Management Portal Classes page.

Optimize the Text Classifier

Once you have saved a Text Classifier, you can use the Optimizer to automatically optimize the term dictionary. The Optimizer takes a list of additional candidate terms, tests each term, and adds those terms with the highest impact on the accuracy of the Text Classifier.

Click the Optimize button. This displays the Optimize term selection popup window. Select a Relevance metric (BM25, TFIDF, or Dominance) from the drop-down list. Specify a number of candidate terms using that metric, and click Load. The right panel lists the Candidate terms to test. Click Next.

This displays the Settings panel with default values. You can accept these defaults and click Start. This runs the Optimizer, adding and removing terms. When the optimization process completes, you can close the Optimize term selection popup window. Note that the Selected terms list in the central panel has changed. Click the Save button to save these additions to your terms dictionary.

You can run the Optimizer several times with different settings. After each optimization you can test the Text Classifier, as described in the following section.

Test the Text Classifier against a Test Set of Data

Click the Test button. This displays the Model tester in a new browser tab. The Model tester allows you to test your Text Classifier against test data. After testing it, you can return to the Model builder browser tab and add or remove terms, either manually using Add terms, or by running (or re-running) the Optimizer.

The Model tester provides two ways of testing your Text Classifier:

  • Domain tab: the Domain, Category field, and Test filter fields should take their values from the Model builder. In the Model tester click the Run button. This displays overall test results, and Detail test results for each category.

  • SQL tab: you can specify an SQL query to supply the test data in the Data Source SQL section. Use the _Text and _Category column aliases to identify the source text and the metadata category columns, as shown in the following example:

    SELECT E.ID,A.AircraftCategory AS _Category,E.Type,E.NarrativeFull AS _Text 
    FROM Aviation.Aircraft AS A,Aviation.Event AS E WHERE A.Event=E.ID

    Then click Test. This compares the actual category value with the category determined by the Text Classifier. It displays the overall test results, and details for each category.

Test the Text Classifier on Uncategorized Data

You can use the Text Classifier on a text string to test how it derives a category. Select the Test button. This displays the Text input window. Specify a text string, then press Categorize!.

  • The Text tab displays the text highlighted with the terms from the terms dictionary.

  • The Categories tab displays score bars for each category, with green representing correct and brown representing incorrect.

  • The Trace info tab displays probability bars for each term found in the input text. By using the Weights for category drop-down list you can determine the probability for each term for all categories (the default), or for individual categories.

Using a Text Classifier

Once you have built an accurate text classifier, you will want to apply it to source texts that have not yet been assigned a category label. Using methods of %iKnow.Classification.Classifier, your text classifier can be used to predict the category of any unit of text.

To run a Text Classifier, you must first instantiate the subclass of %iKnow.Classification.Classifier that represents the Text Classifier you wish to use. Once created, a text classifier is completely portable; you can use this text classifier class independently of the domain that contains the training set and test set data.

  • Use %Categorize() for NLP source texts. If a source text to be categorized has already been indexed in an NLP domain, you can use %Categorize() to match against the categories. It returns a match score for each category, in descending order by score.

       SET tClassifier = ##class(User.MyClassifier).%New("iKnow","MyDomain")
       WRITE tClassifier.%Categorize(.categories,srcId)
       ZWRITE categories
  • Use %CategorizeText() for a text specified as a string. If a source text to be categorized is a string, you can use %CategorizeText() to match an input string against the categories. It returns a match score for each category, in descending order by score.

       SET tClassifier = ##class(User.MyClassifier).%New()
       WRITE tClassifier.%CategorizeText(.categories,inputstring)
       ZWRITE categories

ZWRITE categories returns match score data such as the following:

categories=4
categories(1)=$lb("AIRPLANE",.4314703807485703701)
categories(2)=$lb("HELICOPTER",.04128622469233822948)
categories(3)=$lb("GLIDER",.0228365968611826442)
categories(4)=$lb("GYROCRAFT",.005880588058805880587)

In SQL you can execute a Text Classifier against a text string using a method stored procedure, as follows:

SELECT User.MyClassifier_sys_CategorizeSQL('input string') AS CategoryLabel

This returns a category label.

The following Embedded SQL example uses the Sample.MyTC Text Classifier to determine a category label for the first 25 records in Aviation.Event:

  FOR i=1:1:25 {
    SET rec=i
    &sql(SELECT %ID,NarrativeFull INTO :id,:inputstring FROM Aviation.Event WHERE %ID=:rec)
    WRITE "Record ",id
    &sql(SELECT Sample.MyTC_sys_CategorizeSQL(:inputstring) INTO :CategoryLabel)
    WRITE " assigned category: ",CategoryLabel,!
  }

Once texts have been classified, you can use this classification to filter texts using the %iKnow.Filters.SimpleMetadataFilter class. See Filtering by User-defined Metadata for further details.
