CREATE MODEL (SQL)
Synopsis
CREATE MODEL [ IF NOT EXISTS ] model-name
PREDICTING ( label-column )
FROM model-source
[ USING json-object ]
CREATE MODEL [ IF NOT EXISTS ] model-name
PREDICTING ( label-column )
WITH feature-column-clause
[ USING json-object ]
CREATE MODEL [ IF NOT EXISTS ] model-name
PREDICTING ( label-column )
WITH feature-column-clause
FROM model-source
[ USING json-object ]
CREATE [ TIME ] SERIES MODEL [ IF NOT EXISTS ] model-name
PREDICTING ( label-column1, label-column2, ...)
BY ( timestep )
FROM model-source
[ USING json-object ]
Arguments
This synopsis shows the valid forms of CREATE MODEL. The CREATE MODEL command must have either a FROM or WITH clause (or both).
model-name | The name for the model definition being created. A valid identifier, subject to the same additional naming restrictions as a table name. A model name is unqualified (modelname). An unqualified model name takes the default schema name. |
PREDICTING ( label-column ) | The name of the column being predicted, aka, the label column. A standard identifier. See details below. |
WITH feature-column-clause | Inputs to the model, aka the feature columns, as either the name of a column and it’s datatype or as a comma-separated list of the names of columns and datatypes. Each column name is a standard identifier. |
FROM model-source | The table or view from which the model is being built. This can be a table, view, or results of a join. |
USING json-object-string | Optional — A JSON string specifying one or more key-value pairs. See more details below. |
BY ( timestep ) | The column containing the time-based data that a time series model will be built on. |
Description
The CREATE MODEL command creates a model definition of the structure specified. This includes, at a minimum:
-
The model name
-
The label column (or columns, for a time series model)
-
The feature column(s)
Regression and classification models are largely created in the same way and have the same considerations. However, time series models employ a slightly different syntax because they require different considerations. These differences between these types of models are enumerated in the applicable clauses below.
Predicting
You must specify the output column (or label column) that your model predicts, given the input columns (or feature columns). For example, if you are designing a SpamFilter model which identifies emails that are spam mail, you may have a label column named IsSpam, which is a boolean value designating whether a given email is spam or not. You can also specify the data type of this column; otherwise, IntegratedML infers the type:
CREATE MODEL SpamFilter PREDICTING (IsSpam) FROM EmailData
CREATE MODEL SpamFilter PREDICTING (IsSpam binary) FROM EmailData
When creating a time series model, you will often want to predict values for multiple columns. To do so, specify the names of the columns that you would like to predict in a comma-separated list. You may also specify the data type of this column; otherwise, IntegratedML infers the type. To specify that the model should predict values for every column in the table, use an asterisk (*).
CREATE TIME SERIES MODEL WeatherForecast PREDICTING (Temp, Precipitation, Humidity, UVIndex) BY (Date) FROM WeatherData
CREATE TIME SERIES MODEL WeatherForecast PREDICTING (*) BY (DATE) FROM WeatherData
WITH and FROM
A classification or regression model definition must contain a WITH or FROM or both to specify the schema characteristics of the model. A time series model must contain a FROM clause and cannot have a WITH.
Using WITH, you can specify which input columns (features) to include in your model definition. Note that you must specify the data type of each column, even when using a FROM clause in your statement:
CREATE MODEL SpamFilter PREDICTING (IsSpam) WITH (email_length int, subject_title varchar)
CREATE MODEL SpamFilter PREDICTING (IsSpam) WITH (email_length int, subject_title varchar) FROM EmailData
FROM allows you to use every single column from a specified table or view, without having to identify each column individually:
CREATE MODEL SpamFilter PREDICTING (IsSpam) FROM EmailData
This clause is fully general, and can specify any subquery expression. IntegratedML infers the data types of each column. By using FROM, you supply a default data set for future TRAIN MODEL statements using this model definition. You can use FROM along with WITH to both supply a default data set and to explicitly name feature columns.
Without a WITH clause, IntegratedML infers the data types of each column, and implicitly uses the result of the FROM clause as if it were the following query:
SELECT * FROM model-source
USING
You can specify a default USING clause for your model definition. This clause accepts a JSON string with one or more key-value pairs. When TRAIN MODEL is executed, by default the USING clause of the model definition is used. All parameters specified in the USING clause of your ML configuration overwrite those same parameters in the USING clause of your model definition.
You must make sure that the parameters you specify are recognized by the provider you select. Failing to do so may result in an error when training.
Time Series Parameters
Time series models also support three optional parameters in a USING clause. These parameters are case insensitive. They are:
-
forward specifies the number of timesteps in the future that you would like to predict as a positive integer. Predicted rows will appear after the latest time or date in the original dataset. You may specify both this and the backward setting at the same time.
-
backward specifies the number of timesteps in the past that you would like to predict as a positive integer. Predicted rows will appear before the earliest time or date in the original dataset. You may specify both this and the forward setting at the same time. The AutoML provider ignores this parameter.
-
frequency specifies both the size and unit of the predicted timesteps as a positive integer followed by a letter that denotes the unit of time. If this value is not specified, the most common timestep in the data is supplied. The DataRobot provider ignores this parameter.
The letter abbreviations for units of time are outlined in the following table:
Abbreviation Unit of Time y year m month w week d day h hour t minute s second
Required Security Privileges
Calling CREATE MODEL requires %MANAGE_MODEL privileges; otherwise, there is a SQLCODE –99 error (Privilege Violation). To assign %MANAGE_MODEL privileges, use the GRANT command.
Model Naming Conventions
Model names follow identifier conventions, subject to the restrictions below. By default, model names are simple identifiers. A model name should not exceed 256 characters. Model names are not case-sensitive.
InterSystems IRIS uses the model name to generate a corresponding class name. A class name contains only alphanumeric characters (letters and numbers) and must be unique within the first 96 characters. To generate this class name, InterSystems IRIS first strips punctuation characters from the model name, and then generates an identifier that is unique within the first 96 characters, substituting an integer (beginning with 0) for the final character when needed to create a unique class name. InterSystems IRIS generates a unique class name from a valid model name, but this name generation imposes the following restrictions on the naming of models:
-
A model name must include at least one letter. Either the first character of the view name or the first character after initial punctuation characters must be a letter
-
InterSystems IRIS supports 16-bit (wide) characters for model names. A character is a valid letter if it passes the $ZNAME test.
-
If the first character of the model name is a punctuation character, the second character cannot be a number. This results in an SQLCODE -400 error, with a %msg value of “ERROR #5053: Class name 'schema.name' is invalid” (without the punctuation character). For example, specifying the model name %7A generates the %msg “ERROR #5053: Class name 'User.7A' is invalid”.
-
Because generated class names do not include punctuation characters, it is not advisable (though possible) to create a model name that differs from an existing model name only in its punctuation characters. In this case, InterSystems IRIS substitutes an integer (beginning with 0) for the final character of the name to create a unique class name.
-
A model name may be much longer than 96 characters, but model names that differ in their first 96 alphanumeric characters are much easier to work with.
A model name can only be unqualified. An unqualified model name (viewname) takes the system-wide default schema name.
Examples
CREATE MODEL PatientReadmit PREDICTING (IsReadmitted) FROM patient_table USING {"seed": 3}
CREATE MODEL PatientReadmit PREDICTING (IsReadmitted) WITH (age, gender, encounter_type, admit_reason, starttime, endtime, prior_visits, diagnosis, comorbitities)
CREATE TIME SERIES MODEL BusinessGrowth PREDICTING (*) BY (date) FROM BusinessData USING {"Forward":5}