Model Selection Process

If the label column is of type float or complex, AutoML trains a regression model using XGBRegressor.

For classification models, AutoML uses the following selection process to determine the most accurate model:

  1. If the dataset is too large, AutoML downsamples the data to speed up model selection. The full dataset is still used for training after a model has been selected.

    The dataset size is the number of rows multiplied by the number of columns. If this size exceeds the target size, sampling is needed. The number of rows to keep is the target size divided by the number of columns; that many rows are selected at random from the full dataset and used only for model selection.

  2. AutoML determines if the dataset presents a binary classification problem, or if multiple classes are present.

    • If it is a binary classification problem, the ROC AUC scoring metric is used.

    • Otherwise, the F1 scoring metric is used.

  3. The scoring metric is then computed for each candidate model using Monte Carlo cross-validation, with three training/testing splits of 70%/30%. Depending on the training mode, the best model is determined as follows:

    In the expressions listed below, model_score is the scoring metric from step 2, and model_time is the time spent training the model. The model with the highest value of the expression is selected.

    Training Mode    Expression for Model Comparison
    TIME             (model_score)/(model_time^1.2)
    BALANCE          (model_score)/(model_time)
    SCORE            model_score

    For example, if the following three models were being compared:

    Model      model_score    model_time
    Model A    0.7            500
    Model B    0.85           600
    Model C    0.87           800

    In the TIME training mode, Model A would be selected.

    In the BALANCE training mode, Model B would be selected.

    In the SCORE training mode, Model C would be selected.
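
The sampling rule in step 1 and the comparison expressions in step 3 can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not AutoML's actual implementation: the function names (rows_for_selection, comparison_value, select_model) are hypothetical, the use of integer division when computing the sampled row count is an assumption, and the target size passed in the usage note is a made-up value because the real one is not stated here.

```python
def rows_for_selection(n_rows, n_cols, target_size):
    """Number of rows used for model selection (hypothetical helper).

    Dataset size is rows * columns. If it exceeds target_size, keep
    target_size / n_cols rows (rounded down, which is an assumption).
    """
    if n_rows * n_cols <= target_size:
        return n_rows  # small enough: no sampling needed
    return target_size // n_cols


def comparison_value(mode, model_score, model_time):
    """Value of the comparison expression for one model in a training mode."""
    if mode == "TIME":
        return model_score / model_time ** 1.2
    if mode == "BALANCE":
        return model_score / model_time
    if mode == "SCORE":
        return model_score
    raise ValueError(f"unknown training mode: {mode}")


def select_model(mode, candidates):
    """Return the name of the candidate with the highest comparison value.

    candidates maps a model name to a (model_score, model_time) pair.
    """
    return max(
        candidates,
        key=lambda name: comparison_value(mode, *candidates[name]),
    )


# The three models from the example table above.
candidates = {
    "Model A": (0.7, 500),
    "Model B": (0.85, 600),
    "Model C": (0.87, 800),
}

if __name__ == "__main__":
    for mode in ("TIME", "BALANCE", "SCORE"):
        print(mode, select_model(mode, candidates))
```

Run against the example table, the sketch picks Model A for TIME, Model B for BALANCE, and Model C for SCORE, matching the selections listed above. With an illustrative target size of 5000, rows_for_selection(1000, 10, 5000) keeps 500 of the 1000 rows.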
