Model Selection Process
If the label column is of type float or complex, AutoML trains a regression model using XGBRegressor.
For classification models, AutoML uses the following selection process to determine the most accurate model:
If the dataset is too large, AutoML downsamples the data to speed up the model selection process. The full dataset is still used for training after model selection.
The size of the dataset is calculated by multiplying the number of columns by the number of rows. If this calculated size is larger than the target size, sampling is needed. The number of rows to sample is calculated by dividing the target size by the number of columns. That many rows are randomly selected from the entire dataset and used only for model selection.
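The sampling rule above can be sketched in a few lines of Python. This is a minimal illustration, not AutoML's actual code: the function name, the `seed` parameter, and the value passed for `target_size` are assumptions, since the target size itself is not stated here.

```python
import random


def rows_for_model_selection(n_rows, n_cols, target_size, seed=None):
    """Return the row indices to use for model selection.

    target_size is a hypothetical threshold; its real value is not documented here.
    Assumes n_cols >= 1.
    """
    # Dataset size = rows x columns; no sampling needed if it fits the target.
    if n_rows * n_cols <= target_size:
        return list(range(n_rows))

    # Number of rows that fit in the target size.
    n_sample = target_size // n_cols

    # Randomly select that many distinct rows from the entire dataset.
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), n_sample))
```

For instance, with 1,000 rows, 10 columns, and an assumed target size of 5,000, the dataset size is 10,000, so 5,000 / 10 = 500 rows would be sampled.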
AutoML determines if the dataset presents a binary classification problem, or if multiple classes are present.
If it is a binary classification problem, the ROC AUC scoring metric is used.
Otherwise, the F1 scoring metric is used.
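The metric choice can be illustrated as follows. This is only a sketch: the function name and the returned strings are labels made up for the example, not identifiers from AutoML.

```python
def choose_scoring_metric(labels):
    """Pick the scoring metric from the number of distinct classes in the label column."""
    n_classes = len(set(labels))
    # Exactly two classes: binary classification, scored with ROC AUC.
    # Anything else: scored with F1.
    return "roc_auc" if n_classes == 2 else "f1"
```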
These scoring metrics are then computed for each model using Monte Carlo cross validation, with three training/testing splits of 70%/30%. Depending on the training mode, the best model is determined as follows:
Note: For the mathematical expressions listed below, model_score represents the scoring metric from step 2 (ROC AUC or F1), while model_time is the time spent training the model.
Training Mode    Expression for Model Comparison
TIME             (model_score)/(model_time^1.2)
BALANCE          (model_score)/(model_time)
SCORE            model_score
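The three 70%/30% splits used to compute the scoring metric can be sketched as repeated independent random shuffles, which is what distinguishes Monte Carlo cross validation from k-fold. The function name and parameters below are hypothetical, and the sketch assumes plain uniform shuffling; whether AutoML stratifies the splits is not stated here.

```python
import random


def monte_carlo_splits(n_rows, n_splits=3, test_fraction=0.3, seed=None):
    """Yield (train_indices, test_indices) pairs from independent random shuffles."""
    rng = random.Random(seed)
    n_test = round(n_rows * test_fraction)
    for _ in range(n_splits):
        indices = list(range(n_rows))
        rng.shuffle(indices)  # each split reshuffles the whole dataset
        # First n_test rows form the testing set, the rest the training set.
        yield indices[n_test:], indices[:n_test]
```

Because every split is drawn from the full dataset, a given row may land in the testing set of more than one split, or of none.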
For example, if the following three models were being compared:
Model      model_score    model_time
Model A    0.7            500
Model B    0.85           600
Model C    0.87           800
In the TIME training mode, Model A would be selected.
In the BALANCE training mode, Model B would be selected.
In the SCORE training mode, Model C would be selected.
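These selections can be checked by evaluating each expression on the example values. The sketch below assumes that the model with the highest expression value wins, which is consistent with the results listed above; the function and variable names are illustrative only.

```python
def comparison_value(model_score, model_time, mode):
    """Evaluate the model comparison expression for a training mode."""
    if mode == "TIME":
        return model_score / model_time ** 1.2
    if mode == "BALANCE":
        return model_score / model_time
    if mode == "SCORE":
        return model_score
    raise ValueError(f"unknown training mode: {mode}")


def best_model(models, mode):
    """Return the name of the model with the highest comparison value."""
    return max(models, key=lambda name: comparison_value(*models[name], mode))


# (model_score, model_time) pairs from the example table.
models = {
    "Model A": (0.7, 500),
    "Model B": (0.85, 600),
    "Model C": (0.87, 800),
}

for mode in ("TIME", "BALANCE", "SCORE"):
    print(mode, best_model(models, mode))
```

In BALANCE mode, for example, the values are 0.7/500 = 0.00140, 0.85/600 ≈ 0.00142, and 0.87/800 ≈ 0.00109, so Model B narrowly beats Model A.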