IntegratedML is a feature within InterSystems IRIS® data platform which allows you to use automated machine learning functions directly from SQL to create and use predictive models.
Creating Model Definitions
Before you can train a model, you must use the CREATE MODEL statement to specify a model definition. A model definition is a template that IntegratedML uses to train models; it contains metadata about the input fields (features), predicted field (label), and the data types of these fields. Only the structure of the data is stored within the model definition, not the data itself.
IntegratedML can be used to create three kinds of models: regression, classification, and time series. The syntax for creating models for regression or classification tasks differs from the syntax for creating models for time series tasks. You can refer to these varying syntaxes on the CREATE MODEL reference page.
Examples — CREATE MODEL
The following examples highlight use of different clauses for your CREATE MODEL statements. The first three examples outline various options when creating a regression or classification model, while the last example shows how to create a time series model.
Selecting Feature Columns with FROM
The following command creates a model definition HousePriceModel. The label column, or the column to be predicted, is Price. The columns of the HouseData table are implicitly sourced as the feature columns of the model definition by using a FROM clause:
CREATE MODEL HousePriceModel PREDICTING (Price) FROM HouseData
Important:
If you do not use FROM in your CREATE MODEL statement for a classification or regression model, FROM is required in your TRAIN MODEL statement.
Selecting Feature Columns with WITH
The following command creates the same model definition as above, HousePriceModel, but uses a WITH clause to explicitly name the feature columns and their data types:
CREATE MODEL HousePriceModel PREDICTING (Price) WITH (TotSqft numeric, num_beds integer, num_baths numeric)
Selecting Training Parameters with USING
The following command uses the optional USING clause to specify parameters for the provider to train with. See Adding Training Parameters (the USING clause) for further discussion of the USING clause.
CREATE MODEL HousePriceModel PREDICTING (Price) FROM HouseData USING {"seed": 3}
Creating a Time Series Model
As opposed to the other examples, which create models used for regression or classification tasks, this example creates a time series model that predicts subsequent rows at 3 steps into the future.
CREATE TIME SERIES MODEL ForecastModel PREDICTING (*) BY (date) FROM WeatherData USING {"FORWARD":3 }
Preparing Data for your Model
Before creating a model definition, you should consider the following items to prepare your dataset:
Training Models
After creating a model definition, you can use the TRAIN MODEL statement to train a predictive model. IntegratedML trains this model using the provider specified by your ML configuration. The provider uses a structured process to compare the performance of different machine learning model types (linear regression, random forest, etc.) with the data and return the appropriate model.
Refer to the TRAIN MODEL reference page for a syntax overview and full description.
Examples — TRAIN MODEL
The following examples highlight use of different clauses for your TRAIN MODEL statements:
Simplest Syntax
The following command trains a model with the HousePriceModel model definition:
TRAIN MODEL HousePriceModel
Important:
If you did not use FROM in your CREATE MODEL statement, FROM is required in your TRAIN MODEL statement.
Selecting Training Data with FROM
The following command trains a model with the HousePriceModel model definition and HouseData as training data:
TRAIN MODEL HousePriceModel FROM HouseData
Naming the Training Run with AS
The following command trains a model with the HousePriceModel model definition. This trained model is saved with the name HousePriceModelTrained
TRAIN MODEL HousePriceModel AS HousePriceModelTrained FROM HouseData
Matching Feature Columns with WITH
The following command trains a model with the HousePriceModel model definition, and uses the FOR and WITH clauses to explicitly match the label and feature columns, respectively, between the training set and the model definition:
TRAIN MODEL HousePriceModel FOR house_price WITH (TotSqft = house_area, num_beds = beds, num_baths = bathrooms) FROM OtherHouseData
Selecting Training Parameters with USING
The following command uses the optional USING clause to specify parameters for the provider to train with. See Adding Training Parameters (the USING clause) for further discussion of the USING clause.
TRAIN MODEL HousePriceModel USING {"seed": 3}
Adding Training Parameters (the USING clause)
The USING clause allows you to specify values for parameters that affect how your provider trains models. Machine learning experts can use this feature to fine-tune training runs to their needs.
For example:
TRAIN MODEL my-model USING {"seed": 3}
You can use the USING clause to pass provider-specific training parameters. This clause accepts a JSON string containing key-value pairs of parameter names and parameter values. These value pairs are case-sensitive.
You can pass a USING clause in your CREATE MODEL and TRAIN MODEL statements, as well as in your ML configurations. They resolve as follows:
-
Any parameters you specify with a USING clause in your TRAIN MODEL command override values for the same parameters you may have specified in your CREATE MODEL command or in your default ML configuration.
-
Any parameters you specify with a USING clause in your CREATE MODEL command are implicitly used for your TRAIN MODEL command, and override values for the same parameters you may have specified in your default ML configuration.
-
If you do not specify a USING in your CREATE MODEL or TRAIN MODEL commands, your model uses the USING clause specified by your default ML configuration.
All parameter names must be passed as strings, and the values must be passed in the type specific to the parameter. Lists should be input in the form of a string with commas as delimiters.
See below for information about the parameters available to each of the following providers:
Validating Models
While training, the provider performs validation throughout the process of outputting a trained model. IntegratedML supplies the VALIDATE MODEL statement so that you can perform your own validation on a model. VALIDATE MODEL returns simple metrics for regression, classification, and time series models based on the provided testing set.
Refer to the VALIDATE MODEL reference page for a syntax overview and full description.
Examples — VALIDATE MODEL
The following examples highlight use of different clauses for your VALIDATE MODEL statements:
Simplest Syntax
The following command validates the trained HousePriceModel using HouseTesting as a testing data set:
VALIDATE MODEL HousePriceModel From HouseTesting
Naming the Validation Run with AS
The following command validates the trained HousePriceModel and saves the validation run as HousePriceValidation using HouseTesting as a testing data set:
VALIDATE MODEL HousePriceModel AS HousePriceValidation From HouseTesting
Matching Feature Columns with WITH
The following command validates the trained HousePriceModel and uses a WITH clause to explicitly match feature columns from the testing data set, HouseTesting:
VALIDATE MODEL HousePriceModel WITH (TotSqft = area, num_beds = beds, num_baths = baths) From HouseTesting
Making Predictions
Each trained model has a specialized function, PREDICT, that calls on the provider to predict the result for each row in the applicable row-set. Classification models additionally have the PROBABILITY function, that calls on the provider to return the probability that the specified value is the correct result for the model.
These are scalar functions. and can be used anywhere in a SQL query and in any combination with other fields and functions.
PREDICT
You can use the PREDICT function to return the estimated (for regression models) or most likely (for classification models) value for the label column, by applying the given model (and hence provider) to each row in the applicable row-set. Each row provides the input columns (feature columns), from which the model returns the output (label). A row-set can be any set of rows that includes the required feature columns and the label column.
Syntax
The PREDICT function has the following syntax:
PREDICT(model-name [ USE trained-model-name ] [ WITH feature-column-clause ] )
Examples
The following statements use the specialized PREDICT function of the model HousePriceModel in various forms:
SELECT *, PREDICT(HousePriceModel) FROM NewHouseData
SELECT * FROM NewHouseData WHERE PREDICT(HousePriceModel) > 500000
You can use the WITH clause to make a prediction based on specified values corresponding to columns in the trained model, rather than a full dataset. Arguments must be ordered exactly as specified in your CREATE MODEL statement; missing arguments can be indicated by empty commas. For example, the following statement makes predictions based on two sets of column data:
SELECT PREDICT(HousePriceModel WITH ({4200, 5, 4},{3800, , 3)))
You can also use WITH to specify the mapping of columns between the model and a dataset different from the one it was created on. The following statement maps three feature columns from the model to columns in the alternate dataset:
SELECT PREDICT(HousePriceModel WITH (TotSqft = house_area, num_beds = beds, num_baths = bathrooms) FROM OtherHouseData
See More
For complete information about the PREDICT function, see the InterSystems SQL Reference.
PROBABILITY
When using classification models, which predict discrete values such as true/false or country name (as opposed to regression models, which predict continuous numeric values such as cost or lab result), you can use the PROBABILITY function to return for each row the probability that the specified label value is the predicted label value. This allows you to evaluate the relative strength of predictions of that value.
Syntax
The PROBABILITY function has the following syntax:
PROBABILITY(model-name [ USE trained-model-name ] FOR label-value [ WITH feature-column-clause ] )
Examples
The following statements use the specialized PROBABILITY function of the model Iris_mdoel in various forms:
SELECT *, PROBABILITY(Iris_Model FOR 'iris-setosa') FROM Iris_Flower_Set
SELECT * FROM Iris_Flower_Set WHERE PROBABILITY(Iris_Model FOR 'iris-setosa') < 0.3
The following statement uses the specialized PROBABILITY function of the model EmailFilter. Since this is a binary classification model, with boolean values of 0 or 1 as the sole output, it can use the implicit FOR value of 1 to omit the FOR clause:
SELECT * EmailData WHERE PROBABILITY(EmailFilter) > 0.7
See More
For complete information about the PROBABILITY function, see the InterSystems SQL Reference.
SELECT WITH PREDICTIONS
Since time series models predict rows of data, rather than specific elements, you must execute a SELECT statement to see a model’s predictions.
Syntax
A SELECT WITH PREDICTIONS command has the following syntax:
SELECT WITH PREDICTIONS(model-name) columns FROM table-name
Examples
The following example selects all rows and columns from a table:
SELECT WITH PREDICTIONS(WeatherModel) * FROM Weather.Data
The following example selects only the rows after a certain date, allowing you to view a section of the table, such as the set of predicted rows:
SELECT WITH PREDICTIONS(WeatherModel) * FROM Weather.Data WHERE Date > 20220101
See More
For complete information about selecting with predictions, see the InterSystems SQL Reference.
Walkthrough
This walkthrough illustrates the simple and powerful syntax IntegratedML offers through application to a real world scenario. Using a small number of SQL queries, the user develops a validated predictive model using their data.
Administrators in a health system have grown concerned about the increasing readmission rate for patients. Clinicians could be more cautions across the board when evaluating patient systems, but there are no defined criteria that would inform them of what to look for. Before investing fully into a new analytical solution, they task their data analyst with quickly developing a model to find trends in the profiles of patients that are readmitted. With their data stored on the InterSystems IRIS® data platform database platform, the analyst knows that using IntegratedML would be far faster than any other solution that requires manually formatting and moving their data outside the platform.
Preparing the Data
Before using IntegratedML, the analyst prepares the data to make sure it is clean and ready for training. Any data the analyst needs from multiple tables are put into a singular view, for ease of use. In this example, the view is named Hospital.PatientDataView.
Customizing the Configuration
The analyst chooses to go with the default configuration for using IntregratedML. While the analyst is aware of the different providers they could use to train the model, for speed and ease of use they have gone with the default configuration with no additional syntax required.
Creating the Model
Data in hand, organized into a singular view, the analyst creates the model definition to be trained by an automated machine learning function. This definition, named PatientReadmission, specifies IsReadmitted as the label column to be predicted:
CREATE MODEL PatientReadmission PREDICTING (IsReadmitted) FROM Hospital.PatientDataView
Training the Model
The analyst now trains the model:
TRAIN MODEL PatientReadmission
The analyst does not need to specify any customized parameters for training.
Validating the Model
The analyst validates the model using a testing dataset they prepared (Hospital.PatientDataViewTesting), to get metrics on performance, and views these metrics:
VALIDATE MODEL PatientReadmission FROM Hospital.PatientDataViewTesting
SELECT * FROM INFORMATION_SCHEMA.ML_VALIDATION_METRICS
Making Predictions with the Model
With the model trained and validated, the analyst can now apply the model to make predictions on different datasets with the same schema. The analyst applies the model to Hospital.NewPatientDataView, a dataset containing information for patients that have been admitted in the past week, to see if any are susceptible for readmission:
SELECT ID FROM Hospital.NewPatientDataView WHERE PREDICT(PatientReadmission) = 1
Summary
In total, the analyst entered the following SQL queries to go from raw data to an active predictive model:
CREATE MODEL PatientReadmission PREDICTING (IsReadmitted) FROM Hospital.PatientDataView
TRAIN MODEL PatientReadmission
VALIDATE MODEL PatientReadmission FROM Hospital.PatientDataViewTesting
SELECT * FROM INFORMATION_SCHEMA.ML_VALIDATION_METRICS
SELECT ID FROM Hospital.NewPatientDataView WHERE PREDICT(PatientReadmission) = 1