Providers

Important:

IntegratedML is only available in special preview versions of InterSystems IRIS®.

Providers are machine learning frameworks that IntegratedML makes accessible through a common interface. To choose a provider for training, select an ML configuration that specifies the desired provider.

You can pass additional parameters specific to these providers with a USING clause. See Parameter Customization for further discussion.
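As a minimal sketch of this flow, the statements below select a configuration and pass one provider-specific parameter at training time. The model, table, and column names (SpamFilter, EmailData, IsSpam) are hypothetical, and "seed" is used here on the assumption that the selected provider accepts it; check the provider's documentation before relying on it.

```sql
-- Select the ML configuration, and with it the provider
SET ML CONFIGURATION %H2O

-- Define a model; SpamFilter, IsSpam, and EmailData are hypothetical names
CREATE MODEL SpamFilter PREDICTING (IsSpam) FROM EmailData

-- Pass a provider-specific parameter as a JSON object in the USING clause
TRAIN MODEL SpamFilter USING {"seed": 3}
```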

AutoML

AutoML is an automated machine learning system developed by InterSystems and housed within InterSystems IRIS®. With IntegratedML, AutoML trains models quickly to produce accurate results. Additionally, AutoML features basic natural language processing (NLP), which allows the provider to incorporate feature columns containing unstructured text into machine learning models.

%AutoML is the system-default ML configuration for IntegratedML, and points to AutoML as the provider.

Caution:

AutoML is currently not supported on Windows. %H2O is the system-default ML configuration for IntegratedML on Windows.

Feature Engineering

AutoML uses feature engineering to modify existing features, create new ones, and remove unnecessary ones. These steps, which improve training speed and model performance, include:

  • Column type classification to correctly use features in models

  • Feature elimination to remove redundancy and improve accuracy

  • One-hot encoding of categorical features

  • Filling in missing or null values in incomplete datasets

  • Creating new columns for hours, days, months, and years, where applicable, to capture time-related patterns in your data

Model Selection

If a regression model is determined to be appropriate, AutoML uses a single process for developing the regression model.

For classification models, AutoML uses the following selection process to determine the most accurate model:

  1. If the dataset is too large, AutoML samples down the data to speed up the model selection process. The full dataset is still used for training after model selection.

  2. AutoML determines whether the dataset presents a binary classification problem or a multiclass problem, so that it can use the proper scoring metric.

  3. Using Monte Carlo cross validation, AutoML selects the model with the best scoring metrics for training on the entire dataset.

See More

For more information about AutoML, see About AutoML.

H2O

You can specify H2O as your provider by setting %H2O as your ML configuration.

You can also create a new ML configuration where PROVIDER points to H2O.
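For illustration, the two approaches might look like the following. The configuration name h2o_configuration is hypothetical.

```sql
-- Option 1: use the system-provided H2O configuration
SET ML CONFIGURATION %H2O

-- Option 2: create a named configuration whose PROVIDER is H2O, then select it
-- (h2o_configuration is a hypothetical name)
CREATE ML CONFIGURATION h2o_configuration PROVIDER H2O
SET ML CONFIGURATION h2o_configuration
```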

Parameters

You can pass the parameters listed in the H2O documentation with a USING clause. Please consult this source for information regarding expected input and how these parameters are handled. Unknown parameters result in an error during training.

By default, max_models is set to 5.
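A sketch of overriding this default at training time follows. The model name SalesModel is hypothetical and is assumed to have been defined already with CREATE MODEL, with an H2O configuration selected.

```sql
-- Raise max_models from the default of 5 to 10 for this training run
-- (SalesModel is a hypothetical, previously created model)
TRAIN MODEL SalesModel USING {"max_models": 10}
```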

Model Selection

Label columns of type string, integer, or binary result in a classification model. All other types result in a regression model. To have H2O train a regression model on an integer label column, add the key-value pair "model_type":"regression" to your USING clause.
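For example, the following sketch forces regression on an integer label. The model, table, and column names (BedroomModel, HouseData, num_beds) are hypothetical, and an H2O configuration is assumed to be selected.

```sql
-- num_beds is assumed to be an integer column, so H2O would
-- treat it as a classification label by default
CREATE MODEL BedroomModel PREDICTING (num_beds) FROM HouseData

-- Request a regression model for the integer label instead
TRAIN MODEL BedroomModel USING {"model_type": "regression"}
```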

Training Log Output

You can query the LOG column of the INFORMATION_SCHEMA.ML_TRAINING_RUNS view after training models using H2O.
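A query along these lines retrieves the log for one model's training runs. The MODEL_NAME and TRAINING_RUN_NAME columns are assumed to be part of the view, and the model name is hypothetical; only the LOG column and the view name come from the text above.

```sql
-- Fetch the H2O training log for each run of a given model
-- (MODEL_NAME and TRAINING_RUN_NAME are assumed column names;
--  'BedroomModel' is a hypothetical model)
SELECT MODEL_NAME, TRAINING_RUN_NAME, LOG
FROM INFORMATION_SCHEMA.ML_TRAINING_RUNS
WHERE MODEL_NAME = 'BedroomModel'
```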

See More

For more information about H2O, see the H2O documentation.

DataRobot

Important:

You must have a business relationship with DataRobot to use their AutoML capabilities and view their documentation.

DataRobot clients can use IntegratedML to train models with data stored within InterSystems IRIS®.

You can specify DataRobot as your provider by selecting a DataRobot configuration as your default ML configuration:

SET ML CONFIGURATION datarobot_configuration

where datarobot_configuration is the name of an ML configuration where PROVIDER points to DataRobot.

Parameters

You can pass parameters to DataRobot with a USING clause. IntegratedML uses the DataRobot API to make an HTTP request to start modeling. Consult the DataRobot documentation for the expected input and for how these parameters are handled. Unknown parameters result in an error during training.

By default, quickrun is set to true.
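As a sketch, overriding this default for a single training run could look like the following. The model name ChurnModel is hypothetical, a DataRobot configuration is assumed to be selected, and passing the value as a JSON boolean is an assumption.

```sql
-- Turn off quickrun for this training run
-- (ChurnModel is a hypothetical, previously created model;
--  the JSON boolean form of the value is assumed)
TRAIN MODEL ChurnModel USING {"quickrun": false}
```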

PMML

IntegratedML acts as a consumer of PMML (Predictive Model Markup Language), making it easy for you to import and execute your PMML models using SQL.

How PMML Models Work in IntegratedML

As with any other provider, you use a CREATE MODEL statement to specify a model definition, including features and labels. This model definition must contain the same features and label that your PMML model contains.

The TRAIN MODEL statement operates differently for PMML. Instead of training a model on data, it imports your PMML model. No training is necessary because the PMML model already has the properties of a trained model, including information on features and labels. You identify the PMML model to import with a USING clause.

Important:

The feature and label columns specified in your model definition must match the feature and label columns of the PMML model.

While you still require a FROM clause in either your CREATE MODEL or TRAIN MODEL statement, the data specified is not used whatsoever.

Using your “trained” PMML model to make predictions works the same as any other trained model in IntegratedML. You can use the PREDICT function with any data that contains feature columns matching your PMML definition.
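For instance, a prediction can be returned as a column alongside the input features. The names below are borrowed from the house price example in the Examples section and are otherwise hypothetical; the model is assumed to have been imported already with TRAIN MODEL.

```sql
-- Return each row's features together with the predicted price
SELECT TotSqft, num_beds, num_baths,
       PREDICT(HousePriceModel) AS PredictedPrice
FROM NewHouseData
```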

How to Import a PMML Model

Before you can use a PMML model, set %PMML as your ML configuration, or select a different ML configuration where PROVIDER points to PMML.

Specify the PMML model with a USING clause, using one of the following parameters:

By Class Name

You can use the "class_name" parameter to specify the class name of a PMML model. For example:

USING {"class_name" : "IntegratedML.pmml.PMMLModel"}
By Directory Path

You can use the "file_name" parameter to specify the full file path to a PMML model. For example:

USING {"file_name" : "C:\temp\mydir\pmml_model.xml"}

Examples

The following examples show two ways of passing a USING clause to specify a PMML model.

Specifying a PMML Model in an ML Configuration

The following series of statements creates a PMML configuration which specifies a PMML model for house prices by file name, and then imports the model with a TRAIN MODEL statement.

CREATE ML CONFIGURATION pmml_configuration PROVIDER PMML USING {"file_name" : "C:\PMML\pmml_house_model.xml"}
SET ML CONFIGURATION pmml_configuration
CREATE MODEL HousePriceModel PREDICTING (Price) WITH (TotSqft numeric, num_beds integer, num_baths numeric)
TRAIN MODEL HousePriceModel FROM HouseData
SELECT * FROM NewHouseData WHERE PREDICT(HousePriceModel) > 500000
Specifying a PMML Model in the TRAIN MODEL Statement

The following series of statements uses the provided %PMML configuration, and then specifies a PMML model by class name in the TRAIN MODEL statement.

SET ML CONFIGURATION %PMML
CREATE MODEL HousePriceModel PREDICTING (Price) WITH (TotSqft numeric, num_beds integer, num_baths numeric)
TRAIN MODEL HousePriceModel FROM HouseData USING {"class_name" : "IntegratedML.pmml.PMMLHouseModel"}
SELECT * FROM NewHouseData WHERE PREDICT(HousePriceModel) > 500000

Additional Parameters

If your PMML file contains multiple models, IntegratedML uses the first model in the file by default. To point to a different model within the file, use the model_name parameter in your USING clause:

TRAIN MODEL my_pmml_model FROM data USING {"class_name" : "my_pmml_file", "model_name" : "model_2_name"}