Providers
Providers are machine learning frameworks that IntegratedML exposes through a common interface. To choose a provider for training, select an ML configuration that specifies the desired provider.
You can pass additional parameters specific to these providers with a USING clause. See Adding Training Parameters (the USING clause) for further discussion.
AutoML
AutoML is an automated machine learning system developed by InterSystems, housed within InterSystems IRIS® data platform. AutoML trains models quickly to produce accurate results. Additionally, AutoML features basic natural language processing (NLP), allowing the provider to smartly incorporate feature columns with unstructured text into machine learning models.
%AutoML is the system-default ML configuration for IntegratedML, and points to AutoML as the provider.
To use AutoML, you must configure your instance to use Python 3.11 or later. See Use the Flexible Python Runtime Feature for information on how to configure your instance to use a supported Python version.
Installing AutoML
AutoML is a Python package installed with pip. There are two versions of AutoML that you can install: intersystems-iris-automl and intersystems-iris-automl-tf. intersystems-iris-automl-tf uses TensorFlow to aid in the construction of machine learning models. The installed module is called iris_automl.
To install the intersystems-iris-automl package in your instance of InterSystems IRIS, execute the following command through the command line:
python -m pip install --index-url https://registry.intersystems.com/pypi/simple --no-cache-dir --target <your installation directory>\mgr\python intersystems-iris-automl
To install the intersystems-iris-automl-tf package in your instance of InterSystems IRIS, execute the following command through the command line:
python -m pip install --index-url https://registry.intersystems.com/pypi/simple --no-cache-dir --target <your installation directory>\mgr\python intersystems-iris-automl-tf
Training Parameters — AutoML
You can pass training parameters with a USING clause. For example:
TRAIN MODEL my_model USING {"seed": 3}
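When training parameters are assembled in application code, the USING clause can be generated from a dictionary rather than typed by hand. The following is a minimal sketch, not part of IntegratedML: the helper build_train_statement and the model name my_model are hypothetical, and the sketch only produces the statement text without executing it.

```python
import json


def build_train_statement(model_name, params=None):
    """Return a TRAIN MODEL statement, adding a USING clause when params are given."""
    statement = f"TRAIN MODEL {model_name}"
    if params:
        # The USING clause carries provider-specific parameters as a JSON object.
        statement += " USING " + json.dumps(params)
    return statement


print(build_train_statement("my_model", {"seed": 3}))
# prints: TRAIN MODEL my_model USING {"seed": 3}
```

Serializing with json.dumps keeps the quoting of keys and string values consistent, which is easy to get wrong when concatenating strings manually.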
With AutoML, you can pass the following parameters into your training queries:
Training Parameter | Description |
---|---|
seed | A seed to initialize the random number generator. You can manually set any integer as the seed for reproducibility between training runs. By default, seed is set to “None”. |
verbosity | Determines how verbose the output of each training run is. This output can be found in the ML_TRAINING_RUNS view. You can specify any of the following options for verbosity:
|
TrainMode | Determines the model selection metric for classification models. You can specify one of the following options for TrainMode:
|
MaxTime | The number of minutes allotted for initiating training runs. This does not necessarily limit training time. For example, if the MaxTime is set to 3000 minutes and there are 2 minutes remaining after a model is trained, another model could still be trained. By default, MaxTime is set to 14400 minutes. Note: This parameter is only applicable if TrainMode is set to “TIME”. |
MinimumDesiredScore | The minimum score to allow for classification model selection, irrespective of the training mode selected. You can set any value between 0 and 1. By default, MinimumDesiredScore is set to 0. Note: This parameter is only applicable if TrainMode is set to “TIME”. If the trained logistic regression or random forest classifier model exceeds the MinimumDesiredScore, then AutoML does not train the neural network model. See the AutoML Reference for more information about the different models used for classification models. |
IsRegression | Specifies whether AutoML should perform a regression task or a classification task. There are two values: 1 (for a regression task) and 0 (for a classification task). If IsRegression is omitted or if it is set to any value besides 0 or 1, AutoML determines which type of task to perform. |
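The MaxTime description is easy to misread as a hard limit on training time. The sketch below models only the behavior stated in the table: a run may be initiated while any of the budget remains, and a run that has started is allowed to finish. The function simulate_runs and the run durations are hypothetical; this is not AutoML's actual scheduler.

```python
def simulate_runs(durations_minutes, max_time=14400):
    """Start each run only while elapsed time is still under max_time.

    A run that has started always finishes, so the total elapsed time
    can exceed max_time.
    """
    elapsed = 0
    started = []
    for index, duration in enumerate(durations_minutes):
        if elapsed >= max_time:
            break  # no budget left to initiate another run
        started.append(index)
        elapsed += duration
    return started, elapsed


# The first run leaves 2 minutes of a 3000-minute budget, so the second still starts.
started, elapsed = simulate_runs([2998, 500], max_time=3000)
print(started, elapsed)  # prints: [0, 1] 3498
```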
Feature Engineering
AutoML uses feature engineering to modify existing features, create new ones, and remove unnecessary ones. These steps improve training speed and performance, and include:
- Column type classification to correctly use features in models
- Feature elimination to remove redundancy and improve accuracy
- One-hot encoding of categorical features
- Filling in missing or null values in incomplete datasets
- Creating new columns for hours, days, months, and years, wherever applicable, to capture time-related patterns in your data
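Three of the steps above (filling missing values, one-hot encoding, and deriving date parts) can be illustrated with the standard library alone. This is a sketch under stated assumptions, not AutoML's implementation: the function engineer_features, the column names, and the choice of mean imputation are all hypothetical.

```python
from datetime import datetime
from statistics import mean


def engineer_features(rows, numeric_col, categorical_col, timestamp_col):
    """Return new rows with imputed, one-hot encoded, and date-derived features."""
    # Assumed strategy: fill missing numeric values with the mean of observed values.
    observed = [r[numeric_col] for r in rows if r[numeric_col] is not None]
    fill_value = mean(observed)
    # Sort the categories so the generated column order is stable.
    categories = sorted({r[categorical_col] for r in rows})

    engineered = []
    for r in rows:
        value = r[numeric_col]
        out = {numeric_col: fill_value if value is None else value}
        # One-hot encoding: one 0/1 column per category.
        for category in categories:
            out[f"{categorical_col}_{category}"] = int(r[categorical_col] == category)
        # Date parts become separate numeric columns.
        ts = r[timestamp_col]
        out[f"{timestamp_col}_hour"] = ts.hour
        out[f"{timestamp_col}_day"] = ts.day
        out[f"{timestamp_col}_month"] = ts.month
        out[f"{timestamp_col}_year"] = ts.year
        engineered.append(out)
    return engineered


rows = [
    {"sqft": 1000, "kind": "condo", "listed": datetime(2024, 3, 15, 9, 30)},
    {"sqft": None, "kind": "house", "listed": datetime(2023, 12, 1, 18, 0)},
    {"sqft": 2000, "kind": "house", "listed": datetime(2024, 1, 20, 7, 45)},
]
features = engineer_features(rows, "sqft", "kind", "listed")
print(features[1])
```

In the second row, the missing sqft value is replaced by the mean of the other two rows, and the kind column becomes the two indicator columns kind_condo and kind_house.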
Model Selection
If a regression model is determined to be appropriate, AutoML uses a single process for developing a regression model.
For classification models, AutoML uses the following selection process to determine the most accurate model:
- If the dataset is too large, AutoML samples down the data to speed up the model selection process. The full dataset is still used for training after model selection.
- AutoML determines if the dataset presents a binary classification problem, or if multiple classes are present, to use the proper scoring metric.
- Using Monte Carlo cross validation, AutoML selects the model with the best scoring metrics for training on the entire dataset.
A more detailed description of this model selection process can be found in the AutoML Reference.
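Monte Carlo cross validation scores each candidate on many random train/test splits and compares the average scores. The sketch below shows the general technique on a toy problem. It makes no claim about AutoML's internals: the two candidate models, the accuracy metric, the split sizes, and every name in the code are assumptions for illustration.

```python
import random


def fit_majority(X_train, y_train):
    """Baseline candidate: always predict the most common training label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda x: majority


def fit_threshold(X_train, y_train):
    """Candidate: predict 1 when x exceeds the midpoint between the class means."""
    ones = [x for x, label in zip(X_train, y_train) if label == 1]
    zeros = [x for x, label in zip(X_train, y_train) if label == 0]
    midpoint = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
    return lambda x: 1 if x > midpoint else 0


def monte_carlo_cv(models, X, y, n_splits=20, test_size=6, seed=3):
    """Return the best model name and the mean accuracy of every candidate."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    totals = {name: 0.0 for name in models}
    for _ in range(n_splits):
        # Each iteration draws a fresh random split; splits may overlap.
        rng.shuffle(indices)
        test_idx, train_idx = indices[:test_size], indices[test_size:]
        for name, fit in models.items():
            predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            correct = sum(predict(X[i]) == y[i] for i in test_idx)
            totals[name] += correct / test_size
    means = {name: total / n_splits for name, total in totals.items()}
    return max(means, key=means.get), means


# Toy binary problem: the label is 1 for the upper half of the values.
X = list(range(20))
y = [0] * 10 + [1] * 10
best, scores = monte_carlo_cv({"majority": fit_majority, "threshold": fit_threshold}, X, y)
print(best, scores)
```

After selection, the winning candidate would be refit on the full dataset, which matches the last step described above.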
Platform Support and Known Issues
The AutoML provider is not supported on the following platforms:
- Any IBM AIX® platform
- Red Hat Enterprise Linux 8 for ARM
- Ubuntu 20.04 or Ubuntu 24.04 for ARM
Installation of AutoML requires a pip version of 20.3 or later. By default, Python versions 3.8.x or 3.9.0–3.9.2 do not use this version of pip, and you must manually upgrade pip to at least 20.3 to correctly install AutoML.
To update your version of pip, execute the following command:
pip install --upgrade pip
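Before upgrading, you may want to check whether the installed pip already meets the 20.3 minimum. The helper below is a hypothetical sketch that compares a version string numerically. A plain string comparison would be misleading here, because "9.0.1" sorts after "20.3" as text.

```python
import re


def pip_version_ok(version, minimum=(20, 3)):
    """Return True when a pip version string such as '21.0.1' meets the minimum."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # Compare (major, minor) as integers, not as text.
    return (int(match.group(1)), int(match.group(2))) >= minimum


print(pip_version_ok("20.2.4"))  # prints: False
print(pip_version_ok("23.0"))    # prints: True
```

The version string itself can be read from the output of python -m pip --version.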
AutoML is implemented using Python, which may lead to improper isolation between AutoML Python packages and Embedded Python packages. As a result, AutoML may be unable to find packages it needs to work correctly. To avoid this issue, add <path to instance>/lib/automl to the Python sys.path within your instance of InterSystems IRIS. To do so, open a Python shell with %SYS.Python.Shell() and enter the following commands:
import sys
sys.path.append("<path to instance>\\lib\\automl")
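If these commands may be run more than once in the same Python session, a guarded variant avoids adding duplicate entries. This is an optional sketch; the helper add_to_sys_path is hypothetical, and the path placeholder is the same one used above.

```python
import sys


def add_to_sys_path(path):
    """Append path to sys.path unless it is already present; return True if added."""
    if path in sys.path:
        return False
    sys.path.append(path)
    return True


# Replace the placeholder with the actual path to your instance.
add_to_sys_path("<path to instance>\\lib\\automl")
```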
Additionally, on Windows, AutoML requires using Python 3.11.
See More
For more information about how AutoML works, see the AutoML Reference.
PMML
IntegratedML supports Predictive Model Markup Language (PMML) as a PMML consumer, making it easy for you to import and execute your PMML models using SQL.
How PMML Models Work in IntegratedML
As with any other provider, you use a CREATE MODEL statement to specify a model definition, including features and labels. This model definition must contain the same features and label that your PMML model contains.
The TRAIN MODEL statement operates differently. Instead of “training” data, the TRAIN MODEL statement imports your PMML model. No training is necessary because the PMML model exhibits the properties of a trained model, including information on features and labels. The model is identified by a USING clause.
The feature and label columns specified in your model definition must match the feature and label columns of the PMML model.
While you still require a FROM clause in either your CREATE MODEL or TRAIN MODEL statement, the data specified is not used whatsoever.
Using your “trained” PMML model to make predictions works the same as any other trained model in IntegratedML. You can use the PREDICT function with any data that contains feature columns matching your PMML definition.
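Because the model definition must list the same features and label as the PMML model, it can help to read those names out of the PMML file before writing the CREATE MODEL statement. The sketch below uses Python's xml.etree.ElementTree and the MiningSchema element of the PMML standard. The function names and the trimmed sample document are hypothetical, and a real PMML file contains more elements than shown here.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical PMML document; real files also carry a Header, model body, etc.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <RegressionModel modelName="HousePrice" functionName="regression">
    <MiningSchema>
      <MiningField name="Price" usageType="target"/>
      <MiningField name="TotSqft"/>
      <MiningField name="num_beds"/>
      <MiningField name="num_baths"/>
    </MiningSchema>
  </RegressionModel>
</PMML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def pmml_fields(pmml_text):
    """Return (features, targets) from the first MiningSchema in a PMML document."""
    root = ET.fromstring(pmml_text)
    features, targets = [], []
    for element in root.iter():
        if local_name(element.tag) != "MiningSchema":
            continue
        for field in element:
            if local_name(field.tag) != "MiningField":
                continue
            # In PMML, a MiningField without usageType defaults to "active".
            usage = field.get("usageType", "active")
            if usage in ("target", "predicted"):
                targets.append(field.get("name"))
            elif usage == "active":
                features.append(field.get("name"))
        break  # only inspect the first schema found
    return features, targets


features, targets = pmml_fields(SAMPLE_PMML)
print(features, targets)
# prints: ['TotSqft', 'num_beds', 'num_baths'] ['Price']
```

The returned names can then be compared against the WITH and PREDICTING columns of the model definition.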
How to Import a PMML Model
Before you can use a PMML model, set %PMML as your ML configuration, or select a different ML configuration where PROVIDER points to PMML.
You can specify a PMML model with a USING clause. You can choose one of the following parameters:
You can use the "class_name" parameter to specify the class name of a PMML model. For example:
USING {"class_name" : "IntegratedML.pmml.PMMLModel"}
You can use the "file_name" parameter to specify the file path to a PMML model. For example:
USING {"file_name" : "C:\temp\mydir\pmml_model.xml"}
Examples
The following examples highlight the multiple methods of passing a USING clause to specify a PMML model.
The following series of statements creates a PMML configuration which specifies a PMML model for house prices by file name, and then imports the model with a TRAIN MODEL statement.
CREATE ML CONFIGURATION pmml_configuration PROVIDER PMML USING {"file_name" : "C:\PMML\pmml_house_model.xml"}
SET ML CONFIGURATION pmml_configuration
CREATE MODEL HousePriceModel PREDICTING (Price) WITH (TotSqft numeric, num_beds integer, num_baths numeric)
TRAIN MODEL HousePriceModel FROM HouseData
SELECT * FROM NewHouseData WHERE PREDICT(HousePriceModel) > 500000
The following series of statements uses the provided %PMML configuration, and then specifies a PMML model by class name in the TRAIN MODEL statement.
SET ML CONFIGURATION %PMML
CREATE MODEL HousePriceModel PREDICTING (Price) WITH (TotSqft numeric, num_beds integer, num_baths numeric)
TRAIN MODEL HousePriceModel FROM HouseData USING {"class_name" : "IntegratedML.pmml.PMMLHouseModel"}
SELECT * FROM NewHouseData WHERE PREDICT(HousePriceModel) > 500000
Additional Parameters
If your PMML file contains multiple models, IntegratedML uses the first model in the file by default. To point to a different model within the file, use the model_name parameter in your USING clause:
TRAIN MODEL my_pmml_model FROM data USING {"class_name" : my_pmml_file, "model_name" : "model_2_name"}
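To find the value to pass as model_name, you can list the models in a PMML file in document order. The following sketch assumes the standard PMML layout, in which models are direct children of the root element alongside a small set of non-model elements. The function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Top-level PMML elements that are not models, per the PMML schema.
NON_MODEL_ELEMENTS = {
    "Header", "MiningBuildTask", "DataDictionary",
    "TransformationDictionary", "Extension",
}

# Trimmed, hypothetical PMML document with two models.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header/>
  <DataDictionary/>
  <TreeModel modelName="model_1_name" functionName="classification"/>
  <RegressionModel modelName="model_2_name" functionName="regression"/>
</PMML>"""


def pmml_model_names(pmml_text):
    """Return the modelName of each top-level model, in document order."""
    root = ET.fromstring(pmml_text)
    names = []
    for child in root:
        tag = child.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag not in NON_MODEL_ELEMENTS:
            # modelName is optional in PMML, so this may be None.
            names.append(child.get("modelName"))
    return names


print(pmml_model_names(SAMPLE_PMML))
# prints: ['model_1_name', 'model_2_name']
```

The first name in the list corresponds to the model used by default, and any later name is a candidate for the model_name parameter.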