%iKnow.Classification.Builder
abstract class %iKnow.Classification.Builder extends %Library.RegisteredObject
This is the framework class for building Text Categorization models, generating valid %iKnow.Classification.Classifier subclasses.
Here's an example using the %iKnow.Classification.IKnowBuilder:
    // First initialize training and test sets
    set tDomainId = $system.iKnow.GetDomainId("Standalone Aviation demo")
    set tTrainingSet = ##class(%iKnow.Filters.SimpleMetadataFilter).%New(tDomainId, "Year", "<", 2007)
    set tTestSet = ##class(%iKnow.Filters.GroupFilter).%New(tDomainId, "AND", 1) // NOT filter
    do tTestSet.AddSubFilter(tTrainingSet)

    // Initialize Builder instance with domain name and training set
    set tBuilder = ##class(%iKnow.Classification.IKnowBuilder).%New("Standalone Aviation demo", tTrainingSet)

    // Configure it to use a Naive Bayes classifier
    set tBuilder.ClassificationMethod = "naiveBayes"

    // Load category info from metadata field "AircraftCategory"
    write tBuilder.%LoadMetadataCategories("AircraftCategory")

    // Manually add a few terms
    write tBuilder.%AddEntity("ultralight vehicle")
    set tData(1) = "helicopter", tData(2) = "helicopters"
    write tBuilder.%AddEntity(.tData)
    write tBuilder.%AddEntity("balloon",, "partialCount")
    write tBuilder.%AddCooccurrence($lb("landed", "helicopter pad"))

    // Or add them in bulk by letting the Builder instance decide
    write tBuilder.%PopulateTerms(50)

    // After populating the term dictionary, let the Builder generate a classifier class
    write tBuilder.%CreateClassifierClass("User.MyClassifier")
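The filter logic in the example amounts to a complementary split: the training set holds the records with Year < 2007, and the negated group filter selects every other record as the test set. The following Python sketch illustrates only that idea on made-up records. The record layout and the `split_by_year` helper are hypothetical and are not part of the iKnow API.

```python
def split_by_year(records, cutoff):
    """Split records into a training set (year < cutoff) and its complement."""
    training = [r for r in records if r["year"] < cutoff]
    training_ids = {r["id"] for r in training}
    # The test set negates the training filter: every record not selected above.
    test = [r for r in records if r["id"] not in training_ids]
    return training, test


# Made-up records standing in for documents with a "Year" metadata field.
records = [
    {"id": 1, "year": 2005},
    {"id": 2, "year": 2006},
    {"id": 3, "year": 2007},
    {"id": 4, "year": 2009},
]
training, test = split_by_year(records, 2007)
print([r["id"] for r in training])
print([r["id"] for r in test])
```

Because the test set is defined as the negation of the training filter, the two sets never overlap and together cover the whole domain.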
Property Inventory
- ClassificationMethod
- Description
- DocumentVectorLocalWeights
- DocumentVectorNormalization
- MinimumSpread
- MinimumSpreadPercent
Method Inventory
- %AddCRC()
- %AddCategory()
- %AddCooccurrence()
- %AddEntity()
- %AddTermsFromSQL()
- %CreateClassifierClass()
- %DispatchGetProperty()
- %DispatchMethod()
- %DispatchSetProperty()
- %ExportDataTable()
- %GenerateClassifier()
- %GetCategoryInfo()
- %GetRecordCount()
- %GetTerms()
- %LoadFromDefinition()
- %PopulateTerms()
- %RemoveTerm()
- %RemoveTermAtIndex()
- %RemoveTermEntryAtIndex()
- %Reset()
- %TestClassifier()
Properties
ClassificationMethod accepts one of the following values:
- "naiveBayes" uses a probabilistic approach based on Bayes' theorem, with the naive assumption that terms are independent.
- "rules" runs through a set of straightforward decision rules based on boolean expressions, each contributing to a single category's score if it fires. The category with the highest score wins.
- "euclideanDistance" treats the per-category term weights as a vector in the same vector space as the document term vector and calculates the Euclidean distance between each of these vectors and the document term vector.
- "cosineSimilarity" also treats the per-category term weights as a vector in the same vector space as the document term vector and looks at the (cosine of the) angle between these vectors.
- "linearRegression" considers the per-category term weights to be coefficients in a linear regression formula for calculating a category score, with the highest value winning.
- "pmml" delegates the mathematical work to a predictive model defined in PMML. See also %iKnow.Classification.Methods.pmml.
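To make the vector-based options concrete, the sketch below scores one document term vector against per-category weight vectors in three ways. It is a plain Python illustration under stated assumptions, not the iKnow implementation: the function names, the sample weights, the intercept parameter and the rule that the nearest vector wins for "euclideanDistance" are all assumptions made for this example.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector has no direction
    return dot / (norm_a * norm_b)


def linear_score(weights, doc, intercept=0.0):
    # Weights act as regression coefficients applied to the document's term values.
    return intercept + sum(w * x for w, x in zip(weights, doc))


def classify(doc, category_weights, method):
    if method == "euclideanDistance":
        # Assumption: the category whose vector lies closest to the document wins.
        return min(category_weights,
                   key=lambda c: euclidean_distance(category_weights[c], doc))
    if method == "cosineSimilarity":
        return max(category_weights,
                   key=lambda c: cosine_similarity(category_weights[c], doc))
    if method == "linearRegression":
        return max(category_weights,
                   key=lambda c: linear_score(category_weights[c], doc))
    raise ValueError(f"method not covered by this sketch: {method}")


# Term order (made up): ["helicopter", "balloon", "runway"]
category_weights = {
    "Helicopter": [3.0, 0.0, 1.0],
    "Balloon": [0.0, 2.0, 0.0],
    "Airplane": [0.0, 0.0, 4.0],
}
doc = [2.0, 0.0, 1.0]

for method in ("euclideanDistance", "cosineSimilarity", "linearRegression"):
    print(method, classify(doc, category_weights, method))
```

Distance is sensitive to vector length while cosine similarity only compares direction, so the two can disagree on documents that are much longer or shorter than the training material.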
Methods
%AddCRC()
Adds one or more CRCs as a single term to the Text Categorization model's term dictionary. The term is counted only if it appears in the negation context defined by pNegation. If pCount is "exactCount", only exact occurrences of this CRC are counted to calculate the base score fed into the categorization algorithm. If it is "partialCount", both exact and partial matches are counted. If it is "partialScore", the scores of all exact and partial matches are summed to form this term's base score.
Multiple CRCs can be supplied as a one-dimensional array of 3-element %Lists.
%AddCooccurrence()
Adds one or more cooccurrences as a single term to the Text Categorization model's term dictionary. The term is counted only if it appears in the negation context defined by pNegation. If pCount is "exactCount", only exact occurrences of this cooccurrence's entities are counted to calculate the base score fed into the categorization algorithm. If it is "partialCount", both exact and partial matches are counted. If it is "partialScore", the scores of all exact and partial matches are summed to form this term's base score.
A single cooccurrence can be supplied as a one-dimensional array of strings or as a %List. Multiple cooccurrences can be supplied either as a one-dimensional array of %Lists or as a two-dimensional array of strings.
%AddEntity()
Adds one or more entities as a single term to the Text Categorization model's term dictionary. The term is counted only if it appears in the negation context defined by pNegation. If pCount is "exactCount", only exact occurrences of this entity are counted to calculate the base score fed into the categorization algorithm. If it is "partialCount", both exact and partial matches are counted. If it is "partialScore", the scores of all exact and partial matches are summed to form this term's base score.
Multiple entities can be supplied either as a one-dimensional array or as a %List.
%AddTermsFromSQL()
Adds all terms selected by pSQL as pType, taking the string value from the column named "term", with negation context pNegationContext and count policy pCount. If the query also selects columns named "type", "negation" or "count", any values in these columns are used instead of the defaults supplied through the respective parameters.
When adding CRC or Cooccurrence terms, use colons to separate the composing entities.
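The column handling described for SQL-sourced terms can be sketched as follows. This is a Python illustration of the described behavior only, not the library's code: the `terms_from_rows` helper, the dictionary-based rows, the type names "entity", "crc" and "cooccurrence", and the "undefined" negation value are all assumptions made for the example.

```python
def terms_from_rows(rows, p_type, p_negation, p_count):
    """Turn query rows into term specifications, letting row columns override defaults."""
    terms = []
    for row in rows:
        # Optional columns win over the defaults passed as parameters.
        term_type = row.get("type") or p_type
        negation = row.get("negation") or p_negation
        count = row.get("count") or p_count
        value = row["term"]
        if term_type.lower() in ("crc", "cooccurrence"):
            # Composite terms list their composing entities separated by colons.
            value = value.split(":")
        terms.append({"type": term_type, "value": value,
                      "negation": negation, "count": count})
    return terms


# Made-up rows, as a query selecting "term" plus some optional columns might return them.
rows = [
    {"term": "balloon"},
    {"term": "landed:helicopter pad", "type": "cooccurrence", "count": "partialCount"},
    {"term": "pilot:reported:engine failure", "type": "crc"},
]
# "undefined" is a placeholder negation value chosen for this sketch.
terms = terms_from_rows(rows, "entity", "undefined", "exactCount")
for term in terms:
    print(term)
```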
%CreateClassifierClass()
Generates a classifier definition and saves it to a %iKnow.Classification.Classifier subclass named pClassName. This will overwrite any existing class with that name if pOverwrite is 1. See also %GenerateClassifier().
%GenerateClassifier()
Generates a %iKnow.Classification.Definition.Classifier XML tree based on the current set of categories and terms, with the appropriate weights and parameters calculated by the builder implementation (see %OnGenerateClassifier()).
Use pIncludeBuilderInfo to include specifications of how this classifier was built so it can be "reloaded" from the classifier XML to retrain the model.
%LoadFromDefinition()
Note: this does not load any (custom) weight information from the definition.
%PopulateTerms()
Adds pCount terms of type pType to this classifier's set of terms, selecting those terms that have a high relevance for the categorization task based on metric pMetric and/or the specifics of this builder implementation.
If pPerCategory is 1, (pCount \ [number of categories]) terms are selected using the specified metric as calculated within each category, where \ is integer division. This often gives better results, but might not be supported for every metric or builder.
Builder implementations should ensure these terms meet the conditions set by MinimumSpread and MinimumSpreadPercent. MinimumSpreadPercent can be ignored if pPerCategory = 1.
This method implements a populate method for pMetric = "NaiveBayes", selecting terms based on their highest average per-category probability. In this case, the value of pPerCategory is ignored (automatically treated as 1). Implementations for other metrics can be provided by subclasses.
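The per-category selection can be pictured with a small sketch: each category gets an equal quota of pCount integer-divided by the number of categories, and the top-scoring terms within each category fill that quota. The Python below is an illustration under assumptions, not the builder's implementation. The `select_per_category` helper and the metric values are made up, and spread thresholds such as MinimumSpread are not modeled.

```python
def select_per_category(scores, count):
    """Pick the top terms per category.

    scores maps category -> {term: metric value within that category}.
    """
    quota = count // len(scores)  # mirrors pCount \ [number of categories]
    selected = []
    for category, per_term in scores.items():
        ranked = sorted(per_term, key=lambda t: per_term[t], reverse=True)
        for term in ranked[:quota]:
            if term not in selected:  # a term picked for two categories is kept once
                selected.append(term)
    return selected


# Made-up per-category metric values for a handful of terms.
scores = {
    "Helicopter": {"helicopter": 0.9, "rotor": 0.7, "runway": 0.2, "basket": 0.0},
    "Balloon": {"basket": 0.8, "burner": 0.6, "helicopter": 0.1, "runway": 0.1},
}
selected = select_per_category(scores, 5)
print(selected)
```

With a requested count of 5 and two categories, the quota is 2 per category, so fewer terms than requested can come back. The same happens when one term ranks highly in several categories.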
%TestClassifier()
Utility method to batch-test the classifier against a test set pTestSet.
Per-record results are returned through pResult:
pResult(n) = $lb([record ID], [actual category], [predicted category])
pAccuracy will contain the raw accuracy (# of records predicted correctly) of the current model. Use %iKnow.Classification.Utils for more advanced model testing.
If the current model's category options were added through %AddCategory() without an appropriate category specification, use pCategorySpec to refer to the actual category values to test against.
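As an illustration of how per-record results of the shape [record ID], [actual category], [predicted category] relate to a raw accuracy figure, the following Python sketch counts correct predictions over a made-up result list. The `summarize` helper and the sample records are hypothetical. The sketch reports both the count and the fraction of correct records without claiming which of the two the method returns in pAccuracy.

```python
def summarize(results):
    """results: list of (record_id, actual_category, predicted_category) tuples."""
    total = len(results)
    correct = sum(1 for _, actual, predicted in results if actual == predicted)
    fraction = correct / total if total else 0.0
    # Misclassified records are often the most useful part of a test run.
    misses = [record_id for record_id, actual, predicted in results
              if actual != predicted]
    return correct, fraction, misses


# Made-up per-record results.
results = [
    (101, "Helicopter", "Helicopter"),
    (102, "Balloon", "Airplane"),
    (103, "Airplane", "Airplane"),
    (104, "Balloon", "Balloon"),
]
correct, fraction, misses = summarize(results)
print(correct, fraction, misses)
```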
Inherited Members
Inherited Methods
- %AddToSaveSet()
- %ClassIsLatestVersion()
- %ClassName()
- %ConstructClone()
- %DispatchClassMethod()
- %DispatchGetModified()
- %DispatchSetModified()
- %DispatchSetMultidimProperty()
- %Extends()
- %GetParameter()
- %IsA()
- %IsModified()
- %New()
- %NormalizeObject()
- %ObjectModified()
- %OriginalNamespace()
- %PackageName()
- %RemoveFromSaveSet()
- %SerializeObject()
- %SetModified()
- %ValidateObject()