Providers

Providers are powerful machine learning frameworks that are accessible through a common interface in IntegratedML. To choose a provider for training, select an ML configuration that specifies the desired provider.

You can pass additional parameters specific to these providers with a USING clause. See Adding Training Parameters (the USING clause) for further discussion.

AutoML

AutoML is an automated machine learning system developed by InterSystems, housed within InterSystems IRIS® data platform. AutoML trains models quickly to produce accurate results. Additionally, AutoML features basic natural language processing (NLP), allowing the provider to smartly incorporate feature columns with unstructured text into machine learning models.

%AutoML is the system-default ML configuration for IntegratedML, and points to AutoML as the provider.

To use AutoML, you must configure your instance to use Python 3.11 or later. See Use the Flexible Python Runtime Feature for information on how to configure your instance to use a supported Python version.

Installing AutoML

AutoML is a Python package installed with pip. There are two versions of AutoML that you can install: intersystems-iris-automl and intersystems-iris-automl-tf. intersystems-iris-automl-tf uses TensorFlow to aid in the construction of machine learning models. The installed module is called iris_automl.

To install the intersystems-iris-automl package in your instance of InterSystems IRIS, execute the following command through the command line:

python -m pip install --index-url https://registry.intersystems.com/pypi/simple --no-cache-dir --target <your installation directory>\mgr\python intersystems-iris-automl

To install the intersystems-iris-automl-tf package in your instance of InterSystems IRIS, execute the following command through the command line:

python -m pip install --index-url https://registry.intersystems.com/pypi/simple --no-cache-dir --target <your installation directory>\mgr\python intersystems-iris-automl-tf

Training Parameters — AutoML

You can pass training parameters with a USING clause. For example:

TRAIN MODEL my_model USING {"seed": 3}

With AutoML, you can pass the following parameters into your training queries:

Training Parameter	Description
seed	A seed to initialize the random number generator. You can manually set any integer as the seed for reproducibility between training runs. By default, seed is set to “None”.
verbosity	Determines how verbose the output of each training run is. This output can be found in the ML_TRAINING_RUNS view. You can specify any of the following options for verbosity: 0 — Minimal/no output. 1 — Moderate output. 2 — Full output. This is the default setting for verbosity.
TrainMode	Determines the model selection metric for classification models. You can specify one of the following options for TrainMode: “TIME” — Model selection prioritizes faster training time. “BALANCE” — Model selection compares models by an equal proportion of each model’s respective score and training time. “SCORE” — Model selection does not factor training run time at all. This is the default setting for TrainMode. See the AutoML Reference for more information about these different modes.
MaxTime	The number of minutes allotted for initiating training runs. This does not necessarily limit training time. For example, if the MaxTime is set to 3000 minutes and there are 2 minutes remaining after a model is trained, another model could still be trained. By default, MaxTime is set to 14400 minutes. Note: This parameter is only applicable if TrainMode is set to “TIME”.
MinimumDesiredScore	The minimum score to allow for classification model selection, irrespective of the training mode selected. You can set any value between 0 and 1. By default, MinimumDesiredScore is set to 0. Note: This parameter is only applicable if TrainMode is set to “TIME”. If the trained logistic regression or random forest classifier model exceeds the MinimumDesiredScore, then AutoML does not train the neural network model. See the AutoML Reference for more information about the different models used for classification models.
IsRegression	Specifies whether AutoML should perform a regression task or a classification task. There are two values: 1 (for a regression task) and 0 (for a classification task). If IsRegression is omitted or if it is set to any value besides 0 or 1, AutoML determines which type of task to perform.
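
The parameter table above can be turned into a small client-side check before a TRAIN MODEL statement is sent. The following is a minimal sketch, not part of IntegratedML: build_automl_using is a hypothetical helper, and its validation rules are simply read off the table. It is stricter than AutoML for IsRegression, which the table says falls back to automatic detection for other values.

```python
import json

# Allowed TrainMode values, as listed in the parameter table.
TRAIN_MODES = ("TIME", "BALANCE", "SCORE")


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Hypothetical validation rules derived from the table; not an official API.
CHECKS = {
    "seed": _is_int,
    "verbosity": lambda v: _is_int(v) and v in (0, 1, 2),
    "TrainMode": lambda v: v in TRAIN_MODES,
    "MaxTime": lambda v: _is_number(v) and v > 0,
    "MinimumDesiredScore": lambda v: _is_number(v) and 0 <= v <= 1,
    "IsRegression": lambda v: _is_int(v) and v in (0, 1),
}


def build_automl_using(**params):
    """Validate AutoML training parameters and return a USING clause string."""
    for name, value in params.items():
        if name not in CHECKS:
            raise ValueError(f"unknown AutoML training parameter: {name}")
        if not CHECKS[name](value):
            raise ValueError(f"invalid value for {name}: {value!r}")
    return "USING " + json.dumps(params)


if __name__ == "__main__":
    # Reproduces the USING clause from the example above.
    print("TRAIN MODEL my_model " + build_automl_using(seed=3))
```

Rejecting unknown parameter names on the client side catches typos such as a misspelled TrainMode before the statement reaches the server.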

Feature Engineering

AutoML uses feature engineering to modify existing features, create new ones, and remove unnecessary ones. These steps, which improve training speed and performance, include:

Column type classification to correctly use features in models
Feature elimination to remove redundancy and improve accuracy
One-hot encoding of categorical features
Filling in missing or null values in incomplete datasets
Creating new columns pertaining to hours/days/months/years, wherever applicable, to generate time-related insights from your data
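
To make three of these steps concrete, here is a minimal sketch in plain Python. It illustrates the general techniques only and does not show how AutoML implements them. The function names are hypothetical, and filling missing values with the column mean is an assumption; this page does not say which fill strategy AutoML uses.

```python
from datetime import datetime


def one_hot(values):
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


def fill_missing(values):
    """Replace None with the mean of the observed values (assumed strategy)."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def date_parts(timestamps):
    """Derive hour/day/month/year columns from datetime values."""
    return {
        "hour": [t.hour for t in timestamps],
        "day": [t.day for t in timestamps],
        "month": [t.month for t in timestamps],
        "year": [t.year for t in timestamps],
    }


if __name__ == "__main__":
    print(one_hot(["red", "blue", "red"]))
    print(fill_missing([1.0, None, 3.0]))
    print(date_parts([datetime(2024, 3, 15, 9, 30)]))
```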

Model Selection

If a regression model is determined to be appropriate, AutoML uses a single process for developing a regression model.

For classification models, AutoML uses the following selection process to determine the most accurate model:

If the dataset is too large, AutoML samples down the data to speed up the model selection process. The full dataset is still used for training after model selection.
AutoML determines if the dataset presents a binary classification problem, or if multiple classes are present, to use the proper scoring metric.
Using Monte Carlo cross validation, AutoML selects the model with the best scoring metrics for training on the entire dataset.

Note:

A more detailed description of this model selection process can be found in the AutoML Reference.
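
For readers unfamiliar with Monte Carlo cross validation, the idea is to score each candidate model on many random train/test splits and keep the one with the best average score. The sketch below shows the general technique in plain Python. It is not AutoML's implementation: the toy models, the accuracy metric, the split count, and the test fraction are all assumptions chosen for illustration.

```python
import random


def monte_carlo_cv(rows, labels, models, n_splits=10, test_fraction=0.3, seed=3):
    """Return (best_model_name, mean_scores) over repeated random splits."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    n_test = max(1, int(len(rows) * test_fraction))
    totals = {name: 0.0 for name in models}
    for _ in range(n_splits):
        # Each iteration draws a fresh random train/test partition.
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        train_x = [rows[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        for name, fit in models.items():
            predict = fit(train_x, train_y)
            correct = sum(predict(rows[i]) == labels[i] for i in test_idx)
            totals[name] += correct / n_test  # accuracy on this split
    means = {name: total / n_splits for name, total in totals.items()}
    return max(means, key=means.get), means


def fit_majority(xs, ys):
    """Toy baseline: always predict the most common training label."""
    majority = max(set(ys), key=ys.count)
    return lambda x: majority


def fit_threshold(xs, ys):
    """Toy model: split at the midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0


if __name__ == "__main__":
    rows = list(range(20))
    labels = [1 if x >= 10 else 0 for x in rows]
    models = {"majority": fit_majority, "threshold": fit_threshold}
    best, means = monte_carlo_cv(rows, labels, models)
    print(best, means)
```

Unlike k-fold cross validation, the splits here are drawn independently, so a given row may appear in several test sets or in none.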

Platform Support and Known Issues

Supported Platforms

The AutoML provider is not supported on the following platforms:

any IBM AIX® platform
Red Hat Enterprise Linux 8 for ARM
Ubuntu 20.04 or Ubuntu 24.04 for ARM

pip Version Requirement

Installation of AutoML requires pip version 20.3 or later. By default, Python versions 3.8.x and 3.9.0–3.9.2 do not include this version of pip, and you must manually upgrade pip to at least 20.3 to correctly install AutoML.

To update your version of pip, execute the following command:

pip install --upgrade pip
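
Before upgrading, you may want to check which pip version the interpreter currently has. The following is a minimal sketch using the standard library's importlib.metadata; the helper names parse_version and pip_is_new_enough are hypothetical, and the parser only handles plain numeric versions, treating anything after the first non-numeric component as absent.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum pip version stated in this section.
MINIMUM_PIP = (20, 3)


def parse_version(text):
    """Turn '20.3.4' into (20, 3, 4), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def pip_is_new_enough(pip_version):
    """Return True if the given pip version string is at least 20.3."""
    return parse_version(pip_version) >= MINIMUM_PIP


if __name__ == "__main__":
    try:
        installed = version("pip")
    except PackageNotFoundError:
        print("pip is not installed for this interpreter")
    else:
        status = "OK" if pip_is_new_enough(installed) else "upgrade required"
        print(f"pip {installed}: {status}")
```

Comparing integer tuples rather than raw strings matters here: as strings, "9.0.1" would sort after "20.3".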

AutoML and Embedded Python Package Isolation

AutoML is implemented using Python, which may lead to improper isolation between AutoML Python packages and Embedded Python packages. As a result, AutoML may be unable to find packages it needs to work correctly. To avoid this issue, add <path to instance>/lib/automl to the Python sys.path within your instance of InterSystems IRIS. To do so, open a Python shell with %SYS.Python.Shell() and enter the following commands:

import sys
sys.path.append("<path to instance>\\lib\\automl")

Additionally, on Windows, AutoML requires using Python 3.11.
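
If the sys.path commands above are run more than once in the same session, the directory is appended each time. The following sketch avoids duplicate entries; ensure_on_sys_path is a hypothetical helper, not something provided by InterSystems IRIS.

```python
import sys


def ensure_on_sys_path(directory, path_list=None):
    """Append directory to the path list only if it is not already present.

    Returns True if the directory was added and False if it was already there.
    """
    if path_list is None:
        path_list = sys.path
    if directory in path_list:
        return False
    path_list.append(directory)
    return True


if __name__ == "__main__":
    # Placeholder path from this section; substitute your instance directory.
    ensure_on_sys_path("<path to instance>\\lib\\automl")
```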

See More

For more information about how AutoML works, see the AutoML Reference.

PMML

IntegratedML supports PMML (Predictive Model Markup Language) as a PMML consumer, making it easy for you to import and execute your PMML models using SQL.

How PMML Models Work in IntegratedML

As with any other provider, you use a CREATE MODEL statement to specify a model definition, including features and labels. This model definition must contain the same features and label that your PMML model contains.

The TRAIN MODEL statement operates differently. Instead of training a model on data, the TRAIN MODEL statement imports your PMML model. No training is necessary because the PMML model already has the properties of a trained model, including information on features and labels. The PMML model to import is identified by a USING clause.

Important:

The feature and label columns specified in your model definition must match the feature and label columns of the PMML model.

While a FROM clause is still required in either your CREATE MODEL or TRAIN MODEL statement, the data it specifies is not used.

Using your “trained” PMML model to make predictions works the same as any other trained model in IntegratedML. You can use the PREDICT function with any data that contains feature columns matching your PMML definition.

How to Import a PMML Model

Before you can use a PMML model, set %PMML as your ML configuration, or select a different ML configuration where PROVIDER points to PMML.

You can specify a PMML model with a USING clause. You can choose one of the following parameters:

By Class Name

You can use the "class_name" parameter to specify the class name of a PMML model. For example:

USING {"class_name" : "IntegratedML.pmml.PMMLModel"}

By Directory Path

You can use the "file_name" parameter to specify the path to a PMML model file. For example:

USING {"file_name" : "C:\temp\mydir\pmml_model.xml"}

Examples

The following examples show the different ways of passing a USING clause to specify a PMML model.

Specifying a PMML Model in an ML Configuration

The following series of statements creates a PMML configuration which specifies a PMML model for house prices by file name, and then imports the model with a TRAIN MODEL statement.

CREATE ML CONFIGURATION pmml_configuration PROVIDER PMML USING {"file_name" : "C:\PMML\pmml_house_model.xml"}
SET ML CONFIGURATION pmml_configuration
CREATE MODEL HousePriceModel PREDICTING (Price) WITH (TotSqft numeric, num_beds integer, num_baths numeric)
TRAIN MODEL HousePriceModel FROM HouseData
SELECT * FROM NewHouseData WHERE PREDICT(HousePriceModel) > 500000

Specifying a PMML Model in the TRAIN MODEL Statement

The following series of statements uses the provided %PMML configuration, and then specifies a PMML model by class name in the TRAIN MODEL statement.

SET ML CONFIGURATION %PMML
CREATE MODEL HousePriceModel PREDICTING (Price) WITH (TotSqft numeric, num_beds integer, num_baths numeric)
TRAIN MODEL HousePriceModel FROM HouseData USING {"class_name" : "IntegratedML.pmml.PMMLHouseModel"}
SELECT * FROM NewHouseData WHERE PREDICT(HousePriceModel) > 500000

Additional Parameters

If your PMML file contains multiple models, IntegratedML uses the first model in the file by default. To point to a different model within the file, use the model_name parameter in your USING clause:

TRAIN MODEL my_pmml_model FROM data USING {"class_name" : "my_pmml_file", "model_name" : "model_2_name"}
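
The PMML parameters described in this section (class_name, file_name, and the optional model_name) can be assembled programmatically. The sketch below is a hypothetical helper, not part of IntegratedML. Requiring exactly one of class_name or file_name is an assumption based on the "choose one" wording above. Note also that json.dumps escapes backslashes, so a Windows path would be emitted with doubled backslashes; whether IntegratedML expects that form is not confirmed here, so the example uses a class name.

```python
import json


def build_pmml_using(class_name=None, file_name=None, model_name=None):
    """Return a USING clause that identifies a PMML model.

    Assumes exactly one of class_name or file_name is given; model_name
    optionally selects a model within a multi-model PMML file.
    """
    if (class_name is None) == (file_name is None):
        raise ValueError("specify exactly one of class_name or file_name")
    params = {}
    if class_name is not None:
        params["class_name"] = class_name
    else:
        params["file_name"] = file_name
    if model_name is not None:
        params["model_name"] = model_name
    return "USING " + json.dumps(params)


if __name__ == "__main__":
    clause = build_pmml_using(
        class_name="IntegratedML.pmml.PMMLHouseModel",
        model_name="model_2_name",
    )
    print("TRAIN MODEL HousePriceModel FROM HouseData " + clause)
```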
