Skip to main content
Previous sectionNext section

IntegratedML Basics

Important:

IntegratedML is only available in special preview versions of InterSystems IRIS®.

IntegratedML is a feature within InterSystems IRIS® which allows you to use automated machine learning functions directly from SQL to create and use predictive models.

IntegratedML Workflow
images/giml_workflow.png
  1. To use IntegratedML, you begin by specifying a model definition, which contains metadata about the input fields (features), predicted field (label), and the data types of these fields. The data itself is not stored within the model definition; just the structure of the data.

    • Optional — You can select the ML configuration, which specifies a provider to perform training. You can customize this configuration before training, or use the system default configuration without any action needed.

  2. You train the model on data, using the provider specified by your ML configuration. The provider uses a structured process to compare the performance of different machine learning model types (linear regression, random forest, etc.) with the dataset and return the appropriate model.

    • Optional — After training, you can validate the model using test data to evaluate the predictive performance of the model.

  3. Your trained model can now be invoked by SQL functions to make predictions on data.

Definitions

See below for definitions of IntegratedML-specific terms:

Models

Models are the primary objects used in IntegratedML. There are two types of model entities:

  • Model Definitions — With IntegratedML, models are part of the database schema, like tables or indexes. The CREATE MODEL statement introduces a new model definition into the schema. This model definition specifies the features, labels, and data types, along with the ML configuration to be used for training.

  • Trained Models— The TRAIN MODEL command uses a model definition to train a model with a provider specified by your configuration. This trained model is used to make predictions on data.

Providers

Several organizations offer ML-as-a-Service, supplying the tools and computation power to develop machine learning models based on datasets supplied by customers. These automated solutions often come in standalone applications, with no framework that connects directly to your datasets. You are then burdened with exporting your data to other workflows, subject to conditions that vary based on the machine learning framework.

IntegratedML addresses these issues by bringing automated machine learning capabilities directly inside the InterSystems IRIS® data platform, facilitating the connection between your data in InterSystems IRIS and these automated workflows. Providers are powerful machine learning frameworks that are accessible in a common interface in IntegratedML. The following providers are available:

  • AutoML — a machine learning engine developed by InterSystems, housed in InterSystems IRIS

  • H2O — an open-source automated machine learning platform

  • DataRobot — an advanced enterprise automated machine learning platform

ML Configurations

An ML configuration is a collection of settings that IntegratedML uses to train a model. Primarily, a configuration specifies a machine learning provider that will perform training. Depending on the provider, the configuration may also specify requisite information for connection such as a URL and/or an API token. A default ML configuration is immediately active upon installation, requiring no adjustment in a simplest case. Optionally, you can create and select additional configurations to suit individual needs.

Creating Model Definitions

Before you can train a model, you must use the CREATE MODEL statement to specify a model definition. A model definition is a template that IntegratedML uses to train models; it contains metadata about the input fields (features), predicted field (label), and the data types of these fields. Only the structure of the data is stored within the model definition, not the data itself.

Syntax

The CREATE MODEL statement has the following syntax:

CREATE MODEL model-name PREDICTING (label-column) [ WITH feature-column-clause ] FROM model-source [ USING json-object-string ]
Copy code to clipboard

Or

CREATE MODEL model-name PREDICTING (label-column) WITH feature-column-clause [ FROM model-source ] [ USING json-object-string ]
Copy code to clipboard

Or

CREATE MODEL model-name PREDICTING (label-column) WITH feature-column-clause FROM model-source [ USING json-object-string ]
Copy code to clipboard
Examples

The following examples highlight use of different clauses for your CREATE MODEL statements:

Selecting Feature Columns with FROM

The following command creates a model definition HousePriceModel. The label column, or the column to be predicted, is Price. The columns of the HouseData table are implicitly sourced as the feature columns of the model definition by using a FROM clause:

CREATE MODEL HousePriceModel PREDICTING (Price) FROM HouseData
Copy code to clipboard

Important:

If you do not use FROM in your CREATE MODEL statement, FROM is required in your TRAIN MODEL statement.

Selecting Feature Columns with WITH

The following command creates the same model definition as above, HousePriceModel, but uses a WITH clause to explicitly name the feature columns and their data types:

CREATE MODEL HousePriceModel PREDICTING (Price) WITH (TotSqft numeric, num_beds integer, num_baths numeric)
Copy code to clipboard
Selecting Training Parameters with USING

The following command uses the optional USING clause to specify parameters for the provider to train with. See Parameter Customization for further discussion of the USING clause.

CREATE MODEL HousePriceModel PREDICTING (Price) FROM HouseData USING {"seed": 3}
Copy code to clipboard
See More

You can view model definitions in the INFORMATION_SCHEMA.ML_MODELS view.

See Model Maintenance for more operations you can perform with your model definitions.

For complete information about the CREATE MODEL statement, see the reference.

Preparing Data for your Model

Before creating a model definition, you should consider the following items to prepare your dataset:

  • Organize your data into a singular view or table.

  • Evaluate your features:

    • If you have a column that is missing/NULL values for several rows, you may want to remove the column as this could adversely affect your trained model. You can also consider using a CASE expression to replace NULLs in your columns however you like.

    • Text-heavy data makes model training much slower.

Training Models

After creating a model definition, you can use the TRAIN MODEL statement to train a predictive model. IntegratedML trains this model using the provider specified by your ML configuration. The provider uses a structured process to compare the performance of different machine learning model types (linear regression, random forest, etc.) with the data and return the appropriate model.

Syntax

The TRAIN MODEL statement has the following syntax:

TRAIN MODEL model-name [ AS preferred-model-name ] [ NOT DEFAULT ] [ FOR label-column ] [ WITH feature-column-clause ] [ FROM model-source ] [ USING json-object-string ]
Copy code to clipboard
Examples

The following examples highlight use of different clauses for your TRAIN MODEL statements:

Simplest Syntax

The following command trains a model with the HousePriceModel model definition:

TRAIN MODEL HousePriceModel
Copy code to clipboard

Important:

If you did not use FROM in your CREATE MODEL statement, FROM is required in your TRAIN MODEL statement.

Selecting Training Data with FROM

The following command trains a model with the HousePriceModel model definition and HouseData as training data:

TRAIN MODEL HousePriceModel FROM HouseData
Copy code to clipboard
Naming the Training Run with AS

The following command trains a model with the HousePriceModel model definition. This trained model is saved with the name HousePriceModelTrained

TRAIN MODEL HousePriceModel AS HousePriceModelTrained FROM HouseData
Copy code to clipboard
Matching Feature Columns with WITH

The following command trains a model with the HousePriceModel model definition, and uses the FOR and WITH clauses to explicitly match the label and feature columns, respectively, between the training set and the model definition:

TRAIN MODEL HousePriceModel FOR house_price WITH (TotSqft = house_area, num_beds = beds, num_baths = bathrooms) FROM OtherHouseData
Copy code to clipboard
Selecting Training Parameters with USING

The following command uses the optional USING clause to specify parameters for the provider to train with. See Parameter Customization for further discussion of the USING clause.

TRAIN MODEL HousePriceModel USING {"seed": 3}
Copy code to clipboard
See More

You can view trained models and the results of training runs in the INFORMATION_SCHEMA.ML_TRAINED_MODELS view and INFORMATION_SCHEMA.ML_TRAINING_RUNS view, respectively. Trained models are associated with the model definition from which they were trained.

See Model Maintenance for more operations you can perform with your trained models.

For complete information about the TRAIN MODEL statement, see the reference.

Parameter Customization

The USING clause allows you to specify values for parameters that affect how your provider trains models. Machine learning experts can use this feature to fine-tune training runs to their needs.

TRAIN MODEL my-model USING {"seed": 3}
Copy code to clipboard

You can use the USING clause to pass provider-specific parameters for training. This clause accepts a JSON string containing key-value pairs of parameters and parameter values.

You can pass a USING clause in your CREATE MODEL and TRAIN MODEL statements, as well as in your ML configurations. They resolve as follows:

  • Any parameters you specify with a USING clause in your TRAIN MODEL command override values for the same parameters you may have specified in your CREATE MODEL command or in your default ML configuration.

  • Any parameters you specify with a USING clause in your CREATE MODEL command are implicitly used for your TRAIN MODEL command, and override values for the same parameters you may have specified in your default ML configuration.

  • If you do not specify a USING in your CREATE MODEL or TRAIN MODEL commands, your model uses the USING clause specified by your default ML configuration.

All parameter names must be passed as strings, and the values must be passed in the type specific to the parameter. Lists should be input in the form of a string with commas as delimiters.

See Providers for information about the parameters available to each provider.

Validating Models

While training, the provider performs validation throughout the process of outputting a trained model. IntegratedML supplies the VALIDATE statement so that you can perform your own validation on a model. VALIDATE returns simple metrics for both regression and classification models based on the provided testing set.

Syntax

The VALIDATE MODEL statement has the following syntax:

VALIDATE MODEL trained-model-name [ AS validation-run-name ] [ USE preferred-trained-model-name ] [ WITH feature-column-clause ] FROM testing-data-set
Copy code to clipboard
Examples

The following examples highlight use of different clauses for your VALIDATE MODEL statements:

Simplest Syntax

The following command validates the trained HousePriceModel using HouseTesting as a testing data set:

VALIDATE MODEL HousePriceModel From HouseTesting
Copy code to clipboard
Naming the Validation Run with AS

The following command validates the trained HousePriceModel and saves the validation run as HousePriceValidation using HouseTesting as a testing data set:

VALIDATE MODEL HousePriceModel AS HousePriceValidation From HouseTesting
Copy code to clipboard
Matching Feature Columns with WITH

The following command validates the trained HousePriceModel and uses a WITH clause to explicitly match feature columns from the testing data set, HouseTesting :

VALIDATE MODEL HousePriceModel WITH (TotSqft = area, num_beds = beds, num_baths = baths) From HouseTesting
Copy code to clipboard
See More

You can see validation runs and their results in the INFORMATION_SCHEMA.ML_VALIDATION_RUNS view and INFORMATION_SCHEMA.ML_VALIDATION_METRICS view, respectively

For complete information about the VALIDATE MODEL statement and validation metrics, see the reference.

Making Predictions

Each trained model has a specialized function, PREDICT, that calls on the provider to predict the result for each row in the applicable row-set. Classification models additionally have the PROBABILITY function, that calls on the provider to return the probability that the specified value is the correct result for the model.

These are scalar functions. and can be used anywhere in a SQL query and in any combination with other fields and functions.

PREDICT

You can use the PREDICT function to return the estimated (for regression models) or most likely (for classification models) value for the label column, by applying the given model (and hence provider) to each row in the applicable row-set. Each row provides the input columns (feature columns), from which the model returns the output (label).

Syntax

The PREDICT function has the following syntax:

PREDICT(model-name [ USE trained-model-name ] [ WITH feature-column-clause ] )
Copy code to clipboard
Examples

The following statements use the specialized PREDICT function of the model HousePriceModel in various forms:

SELECT *, PREDICT(HousePriceModel) FROM NewHouseData
Copy code to clipboard
SELECT * FROM NewHouseData WHERE PREDICT(HousePriceModel) > 500000
Copy code to clipboard
See More

For complete information about the PREDICT function, see the reference.

PROBABILITY

For classification models, you can use the PROBABILITY function to return the probability that a specified value is true for the given input:

Syntax

The PROBABILITY function has the following syntax:

PROBABILITY(model-name [ USE trained-model-name ] FOR label-value [ WITH feature-column-clause ] )
Copy code to clipboard

The following statements use the specialized PROBABILITY function of the model EmailSpamModel in various forms:

SELECT *, PROBABILITY(Iris_Model FOR 'iris-setosa') FROM Iris_Flower_Set
Copy code to clipboard
SELECT * FROM Iris_Flower_Set WHERE PROBABILITY(Iris_Model FOR 'iris-setosa') < 0.3
Copy code to clipboard

The following statement uses the specialized PROBABILITY function of the model EmailFilter. Since this is a binary classification model, with boolean values of 0 or 1 as the sole output, it can use the implicit FOR value of 1 to omit the FOR clause:

SELECT * EmailData WHERE PROBABILITY(EmailFilter) > 0.7
Copy code to clipboard
See More

For complete information about the PROBABILITY function, see the reference.

Walkthrough

This walkthrough illustrates the simple and powerful syntax IntegratedML offers through application to a real world scenario. Using a small number of SQL queries, the user develops a validated predictive model using their data.

Administrators in a health system have grown concerned about the increasing readmission rate for patients. Clinicians could be more cautions across the board when evaluating patient systems, but there are no defined criteria that would inform them of what to look for. Before investing fully into a new analytical solution, they task their data analyst with quickly developing a model to find trends in the profiles of patients that are readmitted. With their data stored on the InterSystems IRIS® database platform, the analyst knows that using IntegratedML would be far faster than any other solution that requires manually formatting and moving their data outside the platform.

Preparing the Data

Before using IntegratedML, the analyst prepares the data to make sure it is clean and ready for training. Any data the analyst needs from multiple tables are put into a singular view, for ease of use. In this example, the view is named Hospital.PatientDataView.

Customizing the Configuration

The analyst chooses to go with the default configuration for using IntregratedML. While the analyst is aware of the different providers they could use to train the model, for speed and ease of use they have gone with the default configuration with no additional syntax required.

Creating the Model

Data in hand, organized into a singular view, the analyst creates the model definition to be trained by an automated machine learning function. This definition, named PatientReadmission, specifies IsReadmitted as the label column to be predicted:

CREATE MODEL PatientReadmission PREDICTING (IsReadmitted) FROM Hospital.PatientDataView
Copy code to clipboard
Training the Model

The analyst now trains the model:

TRAIN MODEL PatientReadmission
Copy code to clipboard

The analyst does not need to specify any customized parameters for training.

Validating the Model

The analyst validates the model using a testing dataset they prepared (Hospital.PatientDataViewTesting), to get metrics on performance, and views these metrics:

VALIDATE MODEL PatientReadmission FROM Hospital.PatientDataViewTesting
SELECT * FROM INFORMATION_SCHEMA.ML_VALIDATION_METRICS 
Copy code to clipboard
Making Predictions with the Model

With the model trained and validated, the analyst can now apply the model to make predictions on different datasets with the same schema. The analyst applies the model to Hospital.NewPatientDataView, a dataset containing information for patients that have been admitted in the past week, to see if any are susceptible for readmission:

SELECT ID FROM Hospital.NewPatientDataView WHERE PREDICT(PatientReadmission) = 1
Copy code to clipboard
Summary

In total, the analyst entered the following SQL queries to go from raw data to an active predictive model:

CREATE MODEL PatientReadmission PREDICTING (IsReadmitted) FROM Hospital.PatientDataView
TRAIN MODEL PatientReadmission
VALIDATE MODEL PatientReadmission FROM Hospital.PatientDataViewTesting
SELECT * FROM INFORMATION_SCHEMA.ML_VALIDATION_METRICS
SELECT ID FROM Hospital.NewPatientDataView WHERE PREDICT(PatientReadmission) = 1
Copy code to clipboard