Providers

Providers are machine learning frameworks that IntegratedML makes accessible through a common interface. To choose a provider for training, select an ML configuration that specifies the desired provider.

You can pass additional parameters specific to these providers with a USING clause. See Adding Training Parameters (the USING clause) for further discussion.

AutoML

AutoML is an automated machine learning system developed by InterSystems, housed within InterSystems IRIS®. Within IntegratedML, AutoML trains models quickly to produce accurate results. Additionally, AutoML features basic natural language processing (NLP), which allows the provider to incorporate feature columns containing unstructured text into machine learning models.

%AutoML is the system-default ML configuration for IntegratedML, and points to AutoML as the provider.
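As a minimal sketch of selecting a provider through its ML configuration, the following statements select %AutoML explicitly and then train a model with it. The model name ChurnModel, the table CustomerData, and the column Churned are hypothetical names used only for illustration.

```sql
-- Select the system-default configuration, which points to AutoML
SET ML CONFIGURATION %AutoML

-- Hypothetical model definition: predict Churned from the other columns of CustomerData
CREATE MODEL ChurnModel PREDICTING (Churned) FROM CustomerData

-- Train using the currently selected ML configuration
TRAIN MODEL ChurnModel
```

Because %AutoML is already the system default, the SET ML CONFIGURATION statement is only needed if a different configuration was selected earlier.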

Training Parameters — AutoML

You can pass training parameters with a USING clause. For example:

TRAIN MODEL my-model USING {"seed": 3}

With AutoML, you can pass the following parameters into your training queries:

Training Parameter Description
seed A seed to initialize the random number generator. You can manually set any integer as the seed for reproducibility between training runs. By default, seed is set to “None”.
verbosity Determines how verbose the output of each training run is. This output can be found in the ML_TRAINING_RUNS view. You can specify any of the following options for verbosity:
  • 0: Minimal or no output.
  • 1: Moderate output.
  • 2: Full output. This is the default setting for verbosity.
TrainMode Determines the model selection metric for classification models. You can specify one of the following options for TrainMode:
  • “TIME”: Model selection prioritizes faster training time.
  • “BALANCE”: Model selection compares models by an equal proportion of each model’s respective score and training time.
  • “SCORE”: Model selection does not factor training run time at all. This is the default setting for TrainMode.
See the AutoML Reference for more information about these different modes.
MaxTime The number of hours allotted for initiating training runs. This does not necessarily limit training time. For example, if the MaxTime is set to 50 hours and there are 2 minutes remaining after a model is trained, another model could still be trained. This also does not account for the time required to fit the selected model to your data. By default, MaxTime is set to 240 hours.
MinimumDesiredScore The minimum score to allow for classification model selection, irrespective of the training mode selected. You can set any value between 0 and 1. By default, MinimumDesiredScore is set to 0.
Note:
If the TIME training mode is selected and the trained logistic regression or random forest classifier model exceeds the MinimumDesiredScore, AutoML saves time by not training the neural network model. See the AutoML Reference for more information about the different models used for classification models.
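Several of the parameters above can be combined in a single USING clause. The following is a sketch only: the model name ChurnModel is hypothetical, and the specific values are arbitrary choices for illustration rather than recommendations.

```sql
-- Fix the seed for reproducibility, reduce log output, weigh score against training time,
-- and require a minimum score of 0.8 during classification model selection
TRAIN MODEL ChurnModel USING {"seed": 42, "verbosity": 1, "TrainMode": "BALANCE", "MinimumDesiredScore": 0.8}
```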

Feature Engineering

AutoML uses feature engineering to modify existing features, create new ones, and remove unnecessary ones. These steps, which improve training speed and performance, include:

  • Column type classification to correctly use features in models

  • Feature elimination to remove redundancy and improve accuracy

  • One-hot encoding of categorical features

  • Filling in missing or null values in incomplete datasets

  • Creating new columns for hours, days, months, and years, wherever applicable, to surface time-related patterns in your data
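AutoML performs these steps automatically, so you do not write them yourself. Purely to illustrate what some of the transformations in the list above mean, the following query expresses a manual analogue in SQL. It is not AutoML's implementation; the table CustomerData, its columns, and the choice of 0 as a fill value are all hypothetical.

```sql
SELECT
    -- One-hot encoding: one 0/1 column per category value
    CASE WHEN Region = 'North' THEN 1 ELSE 0 END AS Region_North,
    CASE WHEN Region = 'South' THEN 1 ELSE 0 END AS Region_South,
    -- Filling in a missing value (a constant is assumed here for simplicity)
    COALESCE(Income, 0) AS Income_Filled,
    -- Deriving time-related columns from a timestamp column
    YEAR(SignupDate) AS Signup_Year,
    MONTH(SignupDate) AS Signup_Month,
    HOUR(SignupDate) AS Signup_Hour
FROM CustomerData
```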

Model Selection

If a regression model is determined to be appropriate, AutoML uses a single process for developing the regression model.

For classification models, AutoML uses the following selection process to determine the most accurate model:

  1. If the dataset is too large, AutoML samples down the data to speed up the model selection process. The full dataset is still used for training after model selection.

  2. AutoML determines if the dataset presents a binary classification problem, or if multiple classes are present, to use the proper scoring metric.

  3. Using Monte Carlo cross validation, AutoML selects the model with the best scoring metrics for training on the entire dataset.

Note:

A more detailed description of this model selection process can be found in the AutoML Reference.

Known Issues

The AutoML provider may be inoperable on RHEL7 installations, due to missing libraries required by AutoML. Typical errors encountered during a TRAIN MODEL statement may include text such as xgboost.core.XGBoostError. Users may be able to remedy the issue by installing required packages mentioned by these error messages.

See More

For more information about how AutoML works, see the AutoML Reference.

H2O

You can specify H2O as your provider by setting %H2O as your ML configuration.

You can also create a new ML configuration where PROVIDER points to H2O.
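Both options can be sketched as follows. The configuration name h2o_configuration is a hypothetical name chosen for this example.

```sql
-- Option 1: select the provided %H2O configuration
SET ML CONFIGURATION %H2O

-- Option 2: create a new configuration whose PROVIDER is H2O, then select it
CREATE ML CONFIGURATION h2o_configuration PROVIDER H2O
SET ML CONFIGURATION h2o_configuration
```

A separate configuration is mainly useful when you want to attach default training parameters to it with a USING clause instead of repeating them in each TRAIN MODEL statement.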

Training Parameters — H2O

You can pass training parameters with a USING clause. For example:

TRAIN MODEL my-model USING {"seed": 3}

See the H2O documentation for information regarding expected input and how these parameters are handled. Unknown parameters result in an error during training.

When training a model using the H2O provider, the max_models parameter is set to 5 by default.
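If the default does not suit your data, the parameter can presumably be overridden like any other training parameter. The following sketch assumes that max_models is accepted in the USING clause; the model name H2OModel and the value 10 are hypothetical.

```sql
-- Ask H2O to consider up to 10 models instead of the default 5
TRAIN MODEL H2OModel USING {"max_models": 10}
```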

Model Selection

Label columns of type string, integer, or binary result in a classification model. All other types result in a regression model. To have H2O train a regression model on an integer label column, add the key-value pair "model_type": "regression" to your USING clause. For example:

TRAIN MODEL h2o-model USING {"model_type": "regression"}

Training Log Output

You can query the LOG column of the INFORMATION_SCHEMA.ML_TRAINING_RUNS view after training models using H2O.
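A query along the following lines retrieves the log. Only the LOG column and the view name come from the text above; the MODEL_NAME, TRAINING_RUN_NAME, and RUN_STATUS columns and the model name H2OModel are assumptions, so check them against the view definition on your instance.

```sql
-- Retrieve the training log for the runs of one model
SELECT MODEL_NAME, TRAINING_RUN_NAME, RUN_STATUS, LOG
FROM INFORMATION_SCHEMA.ML_TRAINING_RUNS
WHERE MODEL_NAME = 'H2OModel'
```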

Known Issues

  • When training with the H2O provider, you may see the following error message:

    LogMessage: %ML Provider '%ML.H2O.Provider' is not available on this instance
      > ERROR #5002: ObjectScript error: <READ>%GetResponse+4^%Net.Remote.Object.1

    If you do, you can address this issue by performing the following:

    1. Log into the Management Portal.

    2. Go to System Administration > Configuration > Connectivity > External Language Servers.

    3. Select the server named %IntegratedML Server.

    4. Add the following to the JVM arguments field:

      -Djava.net.preferIPv6Addresses=true -Djava.net.preferIPv4Addresses=false
  • Setting the seed parameter with a USING clause for the H2O provider does not guarantee reproducible training runs. This is because the default training settings for H2O include the parameter max_models being set to 5, which triggers an early stopping mode. Reproducibility for the Gradient Boosting Model algorithm in H2O is a complex topic, as documented by H2O.

See More

For more information about H2O, see the H2O documentation.

DataRobot

Important:

You must have a business relationship with DataRobot to use their AutoML capabilities.

DataRobot clients can use IntegratedML to train models with data stored within InterSystems IRIS®.

You can specify DataRobot as your provider by selecting a DataRobot configuration as your default ML configuration:

SET ML CONFIGURATION datarobot_configuration

where datarobot_configuration is the name of an ML configuration where PROVIDER points to DataRobot.
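Creating such a configuration might look like the following sketch. The %URL and %APITOKEN clauses are written from memory of the CREATE ML CONFIGURATION syntax and should be verified against its reference page; the URL and the token shown are placeholders, not real values.

```sql
-- Hypothetical DataRobot configuration; replace the placeholder URL and token with your own
CREATE ML CONFIGURATION datarobot_configuration PROVIDER DataRobot %URL 'https://datarobot.example.com' %APITOKEN 'your-api-token'

-- Select it for subsequent TRAIN MODEL statements
SET ML CONFIGURATION datarobot_configuration
```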

Training Parameters — DataRobot

You can pass training parameters with a USING clause. For example:

TRAIN MODEL my-model USING {"seed": 3}

IntegratedML uses the DataRobot API to make an HTTP request to start modeling. Consult the DataRobot documentation for information regarding expected input and how these parameters are handled. Unknown parameters result in an error during training.

When training a model using the DataRobot provider, the quickrun parameter is set to true by default.
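Assuming quickrun can be overridden in the USING clause like other training parameters, a run without the quick-run setting might be requested as follows. The model name DRModel is hypothetical.

```sql
-- Turn off the default quick-run behavior for this training run
TRAIN MODEL DRModel USING {"quickrun": false}
```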

PMML

IntegratedML supports PMML as a PMML consumer, making it easy for you to import and execute your PMML models using SQL.

How PMML Models work in IntegratedML

As with any other provider, you use a CREATE MODEL statement to specify a model definition, including features and labels. This model definition must contain the same features and label that your PMML model contains.

The TRAIN MODEL statement operates differently. Instead of training a model on data, the TRAIN MODEL statement imports your PMML model. No training is necessary because the PMML model already has the properties of a trained model, including information on features and labels. The PMML model to import is identified by a USING clause.

Important:

The feature and label columns specified in your model definition must match the feature and label columns of the PMML model.

While you still require a FROM clause in either your CREATE MODEL or TRAIN MODEL statement, the data specified is not used whatsoever.

Using your “trained” PMML model to make predictions works the same as any other trained model in IntegratedML. You can use the PREDICT function with any data that contains feature columns matching your PMML definition.
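As a brief sketch, the following query returns a prediction for each row of a table. It reuses the HousePriceModel model and NewHouseData table from the examples later in this section; the PredictedPrice alias is a hypothetical name.

```sql
-- Return each row's feature values alongside the price predicted by the imported PMML model
SELECT TotSqft, num_beds, num_baths, PREDICT(HousePriceModel) AS PredictedPrice
FROM NewHouseData
```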

How to import a PMML Model

Before you can use a PMML model, set %PMML as your ML configuration, or select a different ML configuration where PROVIDER points to PMML.

You can specify a PMML model with a USING clause. You can choose one of the following parameters:

By Class Name

You can use the "class_name" parameter to specify the class name of a PMML model. For example:

USING {"class_name" : "IntegratedML.pmml.PMMLModel"}

By Directory Path

You can use the "file_name" parameter to specify the full path, including the directory, of a PMML model file. For example:

USING {"file_name" : "C:\temp\mydir\pmml_model.xml"}

Examples

The following examples highlight the multiple methods of passing a USING clause to specify a PMML model.

Specifying a PMML Model in an ML Configuration

The following series of statements creates a PMML configuration which specifies a PMML model for house prices by file name, and then imports the model with a TRAIN MODEL statement.

CREATE ML CONFIGURATION pmml_configuration PROVIDER PMML USING {"file_name" : "C:\PMML\pmml_house_model.xml"}
SET ML CONFIGURATION pmml_configuration
CREATE MODEL HousePriceModel PREDICTING (Price) WITH (TotSqft numeric, num_beds integer, num_baths numeric)
TRAIN MODEL HousePriceModel FROM HouseData
SELECT * FROM NewHouseData WHERE PREDICT(HousePriceModel) > 500000

Specifying a PMML Model in the TRAIN MODEL Statement

The following series of statements uses the provided %PMML configuration, and then specifies a PMML model by class name in the TRAIN MODEL statement.

SET ML CONFIGURATION %PMML
CREATE MODEL HousePriceModel PREDICTING (Price) WITH (TotSqft numeric, num_beds integer, num_baths numeric)
TRAIN MODEL HousePriceModel FROM HouseData USING {"class_name" : "IntegratedML.pmml.PMMLHouseModel"}
SELECT * FROM NewHouseData WHERE PREDICT(HousePriceModel) > 500000

Additional Parameters

If your PMML file contains multiple models, IntegratedML uses the first model in the file by default. To point to a different model within the file, use the model_name parameter in your USING clause:

TRAIN MODEL my_pmml_model FROM data USING {"class_name" : "my_pmml_file", "model_name" : "model_2_name"}