IntegratedML Basics
IntegratedML is a feature within InterSystems IRIS® data platform which allows you to use automated machine learning functions directly from SQL to create and use predictive models.
-
To use IntegratedML, you begin by specifying a model definition, which contains metadata about the input fields (features), predicted field (label), and the data types of these fields. Only the structure of the data is stored within the model definition, not the data itself.
-
Optional — You can select the ML configuration, which specifies a provider to perform training. You can customize this configuration before training, or use the system-default configuration without any action needed.
-
-
You train the model on data, using the provider specified in the active ML configuration. The provider uses a structured process to compare the performance of different machine learning model types (linear regression, random forest, etc.) with the dataset and return the appropriate model. This process varies by the provider.
-
Optional — After training the model, you can validate the model using test data to evaluate the predictive performance of the model.
-
-
Your trained model can now be invoked by SQL functions to make predictions on data.
See below for definitions of IntegratedML-specific terms:
Models are the primary objects used in IntegratedML. There are two types of model entities:
-
Model Definitions — With IntegratedML, models are part of the database schema, like tables or indexes. The CREATE MODEL statement introduces a new model definition into the schema. This model definition specifies the features, labels, and data types, along with the ML configuration to be used for training.
-
Trained Models— The TRAIN MODEL command uses a model definition to train a model with a provider specified by your configuration. This trained model is used to make predictions on data.
Several organizations offer ML-as-a-Service, supplying the tools and computational power to develop machine learning models based on datasets supplied by customers. These automated solutions often come in standalone applications, with no framework that connects directly to your datasets. You are then burdened with exporting your data to other workflows, subject to conditions that vary based on the machine learning framework.
IntegratedML addresses these issues by bringing automated machine learning capabilities directly inside the InterSystems IRIS® data platform, facilitating the connection between your data in InterSystems IRIS and these automated workflows. Providers are powerful machine learning frameworks that are accessible in a common interface in IntegratedML. The following providers are available:
-
AutoML — a machine learning engine developed by InterSystems, housed in InterSystems IRIS
-
H2O — an open-source automated machine learning platform
-
DataRobot — an advanced enterprise automated machine learning platform
An ML configuration is a collection of settings that IntegratedML uses to train a model. Primarily, a configuration specifies a machine learning provider that will perform training. Depending on the provider, the configuration may also specify requisite information for connection such as a URL and/or an API token. A default ML configuration is immediately active upon installation, requiring no adjustment in a simplest case. Optionally, you can create and select additional configurations to suit individual needs.
Creating Model Definitions
Before you can train a model, you must use the CREATE MODEL statement to specify a model definition. A model definition is a template that IntegratedML uses to train models; it contains metadata about the input fields (features), predicted field (label), and the data types of these fields. Only the structure of the data is stored within the model definition, not the data itself.
IntegratedML can be used to create three kinds of models: regression, classification, and time series. The syntax for creating models for regression or classification tasks differs from the syntax for creating models for time series tasks. You can refer to these varying syntaxes on the CREATE MODEL reference page.
Examples — CREATE MODEL
The following examples highlight use of different clauses for your CREATE MODEL statements. The first three examples outline various options when creating a regression or classification model, while the last example shows how to create a time series model.
The following command creates a model definition HousePriceModel. The label column, or the column to be predicted, is Price. The columns of the HouseData table are implicitly sourced as the feature columns of the model definition by using a FROM clause:
CREATE MODEL HousePriceModel PREDICTING (Price) FROM HouseData
If you do not use FROM in your CREATE MODEL statement for a classification or regression model, FROM is required in your TRAIN MODEL statement.
The following command creates the same model definition as above, HousePriceModel, but uses a WITH clause to explicitly name the feature columns and their data types:
CREATE MODEL HousePriceModel PREDICTING (Price) WITH (TotSqft numeric, num_beds integer, num_baths numeric)
The following command uses the optional USING clause to specify parameters for the provider to train with. See Adding Training Parameters (the USING clause) for further discussion of the USING clause.
CREATE MODEL HousePriceModel PREDICTING (Price) FROM HouseData USING {"seed": 3}
As opposed to the other examples, which create models used for regression or classification tasks, this example creates a time series model that predicts subsequent rows at 3 steps into the future.
CREATE TIME SERIES MODEL ForecastModel PREDICTING (*) BY (date) FROM WeatherData USING {"FORWARD":3 }
Preparing Data for your Model
Before creating a model definition, you should consider the following items to prepare your dataset:
-
Organize your data into a singular view or table.
-
Evaluate your features:
-
If you have a column that is missing values for several rows, or contains NULL values for several rows, you may want to remove the column as this could adversely affect your trained model. You can also consider using a CASE expression to replace NULLs in your columns however you like.
-
Text-heavy data makes model training much slower.
-
See More
You can view model definitions in the INFORMATION_SCHEMA.ML_MODELS view.
See Model Maintenance for more operations you can perform with your model definitions.
For complete information about the CREATE MODEL statement, see the InterSystems SQL Reference.
Training Models
After creating a model definition, you can use the TRAIN MODEL statement to train a predictive model. IntegratedML trains this model using the provider specified by your ML configuration. The provider uses a structured process to compare the performance of different machine learning model types (linear regression, random forest, etc.) with the data and return the appropriate model.
Refer to the TRAIN MODEL reference page for a syntax overview and full description.
Examples — TRAIN MODEL
The following examples highlight use of different clauses for your TRAIN MODEL statements:
The following command trains a model with the HousePriceModel model definition:
TRAIN MODEL HousePriceModel
If you did not use FROM in your CREATE MODEL statement, FROM is required in your TRAIN MODEL statement.
The following command trains a model with the HousePriceModel model definition and HouseData as training data:
TRAIN MODEL HousePriceModel FROM HouseData
The following command trains a model with the HousePriceModel model definition. This trained model is saved with the name HousePriceModelTrained
TRAIN MODEL HousePriceModel AS HousePriceModelTrained FROM HouseData
The following command trains a model with the HousePriceModel model definition, and uses the FOR and WITH clauses to explicitly match the label and feature columns, respectively, between the training set and the model definition:
TRAIN MODEL HousePriceModel FOR house_price WITH (TotSqft = house_area, num_beds = beds, num_baths = bathrooms) FROM OtherHouseData
The following command uses the optional USING clause to specify parameters for the provider to train with. See Adding Training Parameters (the USING clause) for further discussion of the USING clause.
TRAIN MODEL HousePriceModel USING {"seed": 3}
Adding Training Parameters (the USING clause)
The USING clause allows you to specify values for parameters that affect how your provider trains models. Machine learning experts can use this feature to fine-tune training runs to their needs.
For example:
TRAIN MODEL my-model USING {"seed": 3}
You can use the USING clause to pass provider-specific training parameters. This clause accepts a JSON string containing key-value pairs of parameter names and parameter values. These value pairs are case insensitive.
You can pass a USING clause in your CREATE MODEL and TRAIN MODEL statements, as well as in your ML configurations. They resolve as follows:
-
Any parameters you specify with a USING clause in your TRAIN MODEL command override values for the same parameters you may have specified in your CREATE MODEL command or in your default ML configuration.
-
Any parameters you specify with a USING clause in your CREATE MODEL command are implicitly used for your TRAIN MODEL command, and override values for the same parameters you may have specified in your default ML configuration.
-
If you do not specify a USING in your CREATE MODEL or TRAIN MODEL commands, your model uses the USING clause specified by your default ML configuration.
All parameter names must be passed as strings, and the values must be passed in the type specific to the parameter. Lists should be input in the form of a string with commas as delimiters.
See below for information about the parameters available to each of the following providers:
See More
You can view trained models and the results of training runs in the INFORMATION_SCHEMA.ML_TRAINED_MODELS view and INFORMATION_SCHEMA.ML_TRAINING_RUNS view, respectively. Trained models are associated with the model definition from which they were trained.
See Model Maintenance for more operations you can perform with your trained models.
For complete information about the TRAIN MODEL statement, see the InterSystems SQL Reference.
Validating Models
While training, the provider performs validation throughout the process of outputting a trained model. IntegratedML supplies the VALIDATE MODEL statement so that you can perform your own validation on a model. VALIDATE MODEL returns simple metrics for regression, classification, and time series models based on the provided testing set.
Refer to the VALIDATE MODEL reference page for a syntax overview and full description.
Examples — VALIDATE MODEL
The following examples highlight use of different clauses for your VALIDATE MODEL statements:
The following command validates the trained HousePriceModel using HouseTesting as a testing data set:
VALIDATE MODEL HousePriceModel From HouseTesting
The following command validates the trained HousePriceModel and saves the validation run as HousePriceValidation using HouseTesting as a testing data set:
VALIDATE MODEL HousePriceModel AS HousePriceValidation From HouseTesting
The following command validates the trained HousePriceModel and uses a WITH clause to explicitly match feature columns from the testing data set, HouseTesting:
VALIDATE MODEL HousePriceModel WITH (TotSqft = area, num_beds = beds, num_baths = baths) From HouseTesting
See More
You can see validation runs and their results in the INFORMATION_SCHEMA.ML_VALIDATION_RUNS view and INFORMATION_SCHEMA.ML_VALIDATION_METRICS view, respectively
For complete information about the VALIDATE MODEL statement and validation metrics, see the InterSystems SQL Reference.
Making Predictions
Each trained model has a specialized function, PREDICT, that calls on the provider to predict the result for each row in the applicable row-set. Classification models additionally have the PROBABILITY function, that calls on the provider to return the probability that the specified value is the correct result for the model.
These are scalar functions. and can be used anywhere in a SQL query and in any combination with other fields and functions.
PREDICT
You can use the PREDICT function to return the estimated (for regression models) or most likely (for classification models) value for the label column, by applying the given model (and hence provider) to each row in the applicable row-set. Each row provides the input columns (feature columns), from which the model returns the output (label). A row-set can be any set of rows that includes the required feature columns and the label column.
The PREDICT function has the following syntax:
PREDICT(model-name [ USE trained-model-name ] [ WITH feature-column-clause ] )
The following statements use the specialized PREDICT function of the model HousePriceModel in various forms:
SELECT *, PREDICT(HousePriceModel) FROM NewHouseData
SELECT * FROM NewHouseData WHERE PREDICT(HousePriceModel) > 500000
You can use the WITH clause to make a prediction based on specified values corresponding to columns in the trained model, rather than a full dataset. Arguments must be ordered exactly as specified in your CREATE MODEL statement; missing arguments can be indicated by empty commas. For example, the following statement makes predictions based on two sets of column data:
SELECT PREDICT(HousePriceModel WITH ({4200, 5, 4},{3800, , 3)))
You can also use WITH to specify the mapping of columns between the model and a dataset different from the one it was created on. The following statement maps three feature columns from the model to columns in the alternate dataset:
SELECT PREDICT(HousePriceModel WITH (TotSqft = house_area, num_beds = beds, num_baths = bathrooms) FROM OtherHouseData
For complete information about the PREDICT function, see the InterSystems SQL Reference.
PROBABILITY
When using classification models, which predict discrete values such as true/false or country name (as opposed to regression models, which predict continuous numeric values such as cost or lab result), you can use the PROBABILITY function to return for each row the probability that the specified label value is the predicted label value. This allows you to evaluate the relative strength of predictions of that value.
The PROBABILITY function has the following syntax:
PROBABILITY(model-name [ USE trained-model-name ] FOR label-value [ WITH feature-column-clause ] )
The following statements use the specialized PROBABILITY function of the model Iris_model in various forms:
SELECT *, PROBABILITY(Iris_Model FOR 'iris-setosa') FROM Iris_Flower_Set
SELECT * FROM Iris_Flower_Set WHERE PROBABILITY(Iris_Model FOR 'iris-setosa') < 0.3
The following statement uses the specialized PROBABILITY function of the model EmailFilter. Since this is a binary classification model, with boolean values of 0 or 1 as the sole output, it can use the implicit FOR value of 1 to omit the FOR clause:
SELECT * EmailData WHERE PROBABILITY(EmailFilter) > 0.7
For complete information about the PROBABILITY function, see the InterSystems SQL Reference.
SELECT WITH PREDICTIONS
Since time series models predict rows of data, rather than specific elements, you must execute a SELECT statement to see a model’s predictions.
A SELECT WITH PREDICTIONS command has the following syntax:
SELECT WITH PREDICTIONS(model-name) columns FROM table-name
The following example selects all rows and columns from a table:
SELECT WITH PREDICTIONS(WeatherModel) * FROM Weather.Data
The following example selects only the rows after a certain date, allowing you to view a section of the table, such as the set of predicted rows:
SELECT WITH PREDICTIONS(WeatherModel) * FROM Weather.Data WHERE Date > 20220101
For complete information about selecting with predictions, see the InterSystems SQL Reference.
Walkthrough
This walkthrough illustrates the simple and powerful syntax IntegratedML offers through application to a real world scenario. Using a small number of SQL queries, the user develops a validated predictive model using their data.
Administrators in a health system have grown concerned about the increasing readmission rate for patients. Clinicians could be more cautions across the board when evaluating patient systems, but there are no defined criteria that would inform them of what to look for. Before investing fully into a new analytical solution, they task their data analyst with quickly developing a model to find trends in the profiles of patients that are readmitted. With their data stored on the InterSystems IRIS® data platform database platform, the analyst knows that using IntegratedML would be far faster than any other solution that requires manually formatting and moving their data outside the platform.
Before using IntegratedML, the analyst prepares the data to make sure it is clean and ready for training. Any data the analyst needs from multiple tables are put into a singular view, for ease of use. In this example, the view is named Hospital.PatientDataView.
The analyst chooses to go with the default configuration for using IntregratedML. While the analyst is aware of the different providers they could use to train the model, for speed and ease of use they have gone with the default configuration with no additional syntax required.
Data in hand, organized into a singular view, the analyst creates the model definition to be trained by an automated machine learning function. This definition, named PatientReadmission, specifies IsReadmitted as the label column to be predicted:
CREATE MODEL PatientReadmission PREDICTING (IsReadmitted) FROM Hospital.PatientDataView
The analyst now trains the model:
TRAIN MODEL PatientReadmission
The analyst does not need to specify any customized parameters for training.
The analyst validates the model using a testing dataset they prepared (Hospital.PatientDataViewTesting), to get metrics on performance, and views these metrics:
VALIDATE MODEL PatientReadmission FROM Hospital.PatientDataViewTesting
SELECT * FROM INFORMATION_SCHEMA.ML_VALIDATION_METRICS
With the model trained and validated, the analyst can now apply the model to make predictions on different datasets with the same schema. The analyst applies the model to Hospital.NewPatientDataView, a dataset containing information for patients that have been admitted in the past week, to see if any are susceptible for readmission:
SELECT ID FROM Hospital.NewPatientDataView WHERE PREDICT(PatientReadmission) = 1
In total, the analyst entered the following SQL queries to go from raw data to an active predictive model:
CREATE MODEL PatientReadmission PREDICTING (IsReadmitted) FROM Hospital.PatientDataView
TRAIN MODEL PatientReadmission
VALIDATE MODEL PatientReadmission FROM Hospital.PatientDataViewTesting
SELECT * FROM INFORMATION_SCHEMA.ML_VALIDATION_METRICS
SELECT ID FROM Hospital.NewPatientDataView WHERE PREDICT(PatientReadmission) = 1