AutoMLBench

This project is maintained by DataSystemsGroupUT

Results

Tables and Results Summaries

Comparison table of the functionality of the AutoML frameworks considered in this study, as of 24/12/2021.

Wilcoxon pairwise test p-values for the AutoML frameworks over different time budgets.
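A pairwise comparison like the one in this table can be sketched with `scipy.stats.wilcoxon`, the standard paired signed-rank test. The framework names and per-dataset scores below are hypothetical placeholders, not values from the study:

```python
from itertools import combinations
from scipy.stats import wilcoxon

# Hypothetical per-dataset scores (one score per framework per dataset);
# the names and numbers are illustrative only.
scores = {
    "FrameworkA": [0.81, 0.77, 0.90, 0.65, 0.72, 0.88, 0.70, 0.83],
    "FrameworkB": [0.79, 0.75, 0.91, 0.60, 0.70, 0.85, 0.68, 0.80],
    "FrameworkC": [0.80, 0.78, 0.89, 0.63, 0.74, 0.86, 0.71, 0.82],
}

# One paired test per framework pair: the samples are paired because
# each dataset contributes one score to each framework.
pvalues = {}
for a, b in combinations(scores, 2):
    stat, p = wilcoxon(scores[a], scores[b])
    pvalues[(a, b)] = p
```

Each entry of `pvalues` corresponds to one cell of such a pairwise table; a small p-value indicates a statistically significant performance difference between the two frameworks.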

Mean_Succ, mean, and standard deviation of the predictive performance of the AutoML frameworks.

Summary of the impact of increasing the time budget.

Wilcoxon test p-values for all the AutoML frameworks.

The performance of AutoSklearn-v and AutoSklearn-m, and the gain in performance.

Performance comparison between the vanilla/base and ensembling versions of AutoSklearn and SmartML.
Number of successful runs.
Performance of the final pipeline per AutoML framework for 240 minutes.
Heatmaps
Heatmaps show the number of datasets on which a given AutoML framework outperforms another in terms of predictive performance over different time budgets. Two frameworks are considered to have the same performance on a task if their predictive performance is within 1% of each other.
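The win counts behind such a heatmap can be sketched as follows. The framework names and scores are hypothetical, and the 1% tolerance is interpreted here as a relative threshold (an assumption; an absolute threshold would also be consistent with the caption):

```python
import numpy as np

# Rows: frameworks, columns: datasets (hypothetical scores).
names = ["FrameworkA", "FrameworkB", "FrameworkC"]
scores = np.array([
    [0.90, 0.70, 0.80, 0.60],
    [0.85, 0.71, 0.82, 0.50],
    [0.91, 0.60, 0.79, 0.61],
])

n = len(names)
wins = np.zeros((n, n), dtype=int)  # wins[i, j]: datasets where i beats j
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        diff = scores[i] - scores[j]
        # Tie if the two scores are within 1% of each other (relative).
        tie = np.abs(diff) <= 0.01 * np.maximum(np.abs(scores[i]),
                                                np.abs(scores[j]))
        wins[i, j] = int(np.sum((diff > 0) & ~tie))
```

The matrix `wins` is exactly what the heatmap visualizes: cell (i, j) counts the datasets on which framework i outperforms framework j beyond the tie threshold.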
Performance
Performance of the final pipeline on multi-class classification tasks
Performance of the final pipeline on datasets with a large number of features and a small number of instances.
Performance of the different AutoML frameworks based on the various characteristics of datasets and tasks over 240 minutes.
Evaluation
Evaluation of the robustness of the AutoML frameworks.
The frequency with which different machine learning models are used by the AutoML frameworks.
The impact of using a static portfolio on each AutoML framework. Green markers represent better performance with the FC search space; blue markers represent comparable performance, with a difference of less than 1%; red markers represent better performance with the 3C search space; yellow markers on the left represent runs that failed with FC but succeeded with 3C; yellow markers on the right represent runs that failed with 3C but succeeded with FC; and yellow markers in the middle represent runs that failed with both FC and 3C.
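The per-dataset marker assignment described in the caption can be sketched as a small classification function. This is an illustrative sketch, not the study's code: `None` stands in for a failed run, and the 1% comparability threshold is applied here as an absolute difference (whether the study used an absolute or relative threshold is an assumption):

```python
def classify(fc_score, c3_score, tol=0.01):
    """Assign one dataset's outcome to a marker category when comparing
    the full (FC) and static-portfolio (3C) search spaces.

    None marks a failed run; tol is the assumed 1% comparability
    threshold (absolute difference).
    """
    if fc_score is None and c3_score is None:
        return "failed_both"       # yellow marker in the middle
    if fc_score is None:
        return "failed_fc_only"    # yellow marker on the left
    if c3_score is None:
        return "failed_3c_only"    # yellow marker on the right
    if abs(fc_score - c3_score) < tol:
        return "comparable"        # blue marker
    return "fc_better" if fc_score > c3_score else "3c_better"  # green / red
```

Applying this function to every dataset yields the color of each marker in the plot.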