| Title: | Multiple Imputation by Super Learning |
|---|---|
| Description: | Performs multiple imputation of missing data using an ensemble super learner built with the tidymodels framework. For each incomplete column, a stacked ensemble of candidate learners is trained on a bootstrap sample of the observed data and used to generate imputations via predictive mean matching (continuous), probability draws (binary), or cumulative probability draws (categorical). Supports parallelism across imputed datasets via the future framework. |
| Authors: | Justin Manjourides [aut, cre] (ORCID: <https://orcid.org/0000-0002-2454-4489>), Thomas Carpenito [aut] (ORCID: <https://orcid.org/0000-0003-3591-0680>) |
| Maintainer: | Justin Manjourides <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 2.0.0 |
| Built: | 2026-05-30 09:36:14 UTC |
| Source: | https://github.com/justinmanjourides/misl |
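
The Description above mentions predictive mean matching for continuous columns. As a rough illustration of that idea only, the base R sketch below imputes one column by fitting a model on a bootstrap sample of the observed rows and borrowing observed values from the rows with the closest predictions. The helper name `pmm_impute`, its `donors` argument, and the use of a single `lm()` fit in place of a stacked ensemble are all assumptions made for this example; this is not the package's implementation.

```r
# Illustrative predictive mean matching (PMM) for one continuous column.
# `pmm_impute` and `donors` are hypothetical names; a plain lm() stands in
# for the stacked ensemble. Predictors are assumed to be complete.
pmm_impute <- function(data, target, donors = 5) {
  miss <- is.na(data[[target]])
  obs  <- data[!miss, , drop = FALSE]

  # Fit the model on a bootstrap sample of the observed rows.
  boot <- obs[sample(nrow(obs), replace = TRUE), , drop = FALSE]
  form <- reformulate(setdiff(names(data), target), response = target)
  fit  <- lm(form, data = boot)

  # Predict for both observed (donor) rows and rows with a missing target.
  pred_obs  <- predict(fit, newdata = obs)
  pred_miss <- predict(fit, newdata = data[miss, , drop = FALSE])

  # For each missing row, pick one observed value at random among the
  # `donors` observed rows whose predictions are closest.
  k <- min(donors, nrow(obs))
  data[[target]][miss] <- vapply(pred_miss, function(p) {
    nearest <- order(abs(pred_obs - p))[seq_len(k)]
    obs[[target]][nearest[sample.int(k, 1)]]
  }, numeric(1))
  data
}

set.seed(1)
n <- 100
demo_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), y = rnorm(n))
demo_data[sample(n, 10), "y"] <- NA
observed_y <- demo_data$y[!is.na(demo_data$y)]

completed <- pmm_impute(demo_data, "y")
```

Because every imputed value is copied from an observed donor, the completed column only ever contains values that actually occur in the data.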
Displays the built-in named learners available for use in
misl(). Note that any parsnip-compatible model spec can
also be passed directly via the *_method arguments.
list_learners(outcome_type = "all", installed_only = FALSE)
outcome_type |
One of |
installed_only |
If |
A tibble with columns learner, description,
package, installed, and outcome-type support flags
(when outcome_type = "all").
list_learners()
list_learners("continuous")
list_learners("ordinal")
list_learners("categorical", installed_only = TRUE)
Imputes missing values using multiple imputation by super learning.
misl(
  dataset,
  m = 5,
  maxit = 5,
  seed = NA,
  con_method = c("glm", "rand_forest", "boost_tree"),
  bin_method = c("glm", "rand_forest", "boost_tree"),
  cat_method = c("rand_forest", "boost_tree"),
  ord_method = c("polr", "rand_forest", "boost_tree"),
  cv_folds = 5,
  ignore_predictors = NA,
  quiet = TRUE
)
dataset |
A data frame or matrix containing the incomplete data.
Missing values are represented with NA.
m |
The number of multiply imputed datasets to create. Default 5.
maxit |
The number of iterations per imputed dataset. Default 5.
seed |
Integer seed for reproducibility, or |
con_method |
Character vector of learner IDs, a list of parsnip model
specs, or a mixed list of both, for continuous columns.
Default c("glm", "rand_forest", "boost_tree").
bin_method |
Character vector of learner IDs, a list of parsnip model
specs, or a mixed list of both, for binary columns
(values must be |
cat_method |
Character vector of learner IDs, a list of parsnip model
specs, or a mixed list of both, for unordered categorical columns.
Default c("rand_forest", "boost_tree").
ord_method |
Character vector of learner IDs, a list of parsnip model
specs, or a mixed list of both, for ordered categorical columns.
Default c("polr", "rand_forest", "boost_tree").
cv_folds |
Integer number of cross-validation folds used when stacking
multiple learners. Reducing this (e.g. to |
ignore_predictors |
Character vector of column names to exclude as
predictors. Default NA.
quiet |
Suppress console progress messages. Default TRUE.
Built-in named learners (see list_learners()):
"glm" - base R (logistic for binary, linear for continuous)
"rand_forest" - ranger
"boost_tree" - xgboost
"mars" - earth
"multinom_reg" - nnet (unordered categorical only)
"polr" - MASS (ordered categorical only)
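
Each named learner above relies on a backing package, so it can help to check which of them are installed before choosing learners. The base R sketch below does this with `requireNamespace()`. The vector name `learner_pkgs` is made up for the example, and mapping "glm" to the stats package is an assumption based on the "base R" note in the list; `list_learners(installed_only = TRUE)` is the package's own way to get this information.

```r
# Learner IDs and backing packages, copied from the list above.
# "glm" is listed as base R; stats is assumed here as the backing package.
learner_pkgs <- c(
  glm          = "stats",
  rand_forest  = "ranger",
  boost_tree   = "xgboost",
  mars         = "earth",
  multinom_reg = "nnet",
  polr         = "MASS"
)

# TRUE where the backing package can be loaded on this machine.
installed <- vapply(learner_pkgs, requireNamespace, logical(1), quietly = TRUE)

# Learner IDs that are usable right now.
usable <- names(installed)[installed]
usable
```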
Any parsnip-compatible model spec can also be passed directly via the
*_method arguments. Named strings and parsnip specs can be mixed
in the same list:
library(parsnip)
misl(data,
  con_method = list(
    "glm",
    rand_forest(trees = 500) |> set_engine("ranger")
  )
)
The mode (regression vs classification) is always enforced by misl
regardless of what is set on the spec.
A list of m named lists, each with:
datasets - A fully imputed tibble.
trace - A long-format tibble of mean/sd trace statistics per iteration, for convergence inspection.
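
Since the return value holds m completed datasets, a typical next step is to fit the same model to each one and combine the results. The sketch below shows the standard Rubin's rules combination in base R. The helper `pool_rubin` is a hypothetical name introduced for this example and is not part of the package; the commented usage assumes a result object `misl_imp` and columns `y`, `x1`, and `x2` like those in the examples.

```r
# Rubin's rules for one scalar parameter estimated on each of m imputed datasets.
# `pool_rubin` is a hypothetical helper, not a misl function.
pool_rubin <- function(estimates, variances) {
  m       <- length(estimates)
  q_bar   <- mean(estimates)                  # pooled point estimate
  within  <- mean(variances)                  # average within-imputation variance
  between <- var(estimates)                   # between-imputation variance
  total   <- within + (1 + 1 / m) * between   # total variance
  c(estimate = q_bar, within = within, between = between,
    total = total, se = sqrt(total))
}

# Toy numbers standing in for per-dataset estimates and squared standard errors.
pooled <- pool_rubin(estimates = c(1, 2, 3), variances = c(0.1, 0.1, 0.1))
pooled

# With a misl() result, the inputs could be obtained along these lines:
# fits <- lapply(misl_imp, function(imp) lm(y ~ x1 + x2, data = imp$datasets))
# est  <- vapply(fits, function(f) coef(f)[["x1"]], numeric(1))
# vars <- vapply(fits, function(f) vcov(f)["x1", "x1"], numeric(1))
# pool_rubin(est, vars)
```

The total variance adds the between-imputation spread, inflated by 1 + 1/m, to the average within-imputation variance, so it reflects the uncertainty due to the missing data.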
Imputation across the m datasets is parallelised via
future.apply. To enable parallel execution, set a future plan
before calling misl():
library(future)
plan(multisession, workers = 4)
result <- misl(data, m = 5)
plan(sequential)
# Using named learners (same as v1.0)
set.seed(1)
n <- 100
demo_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), y = rnorm(n))
demo_data[sample(n, 10), "y"] <- NA
misl_imp <- misl(demo_data, m = 2, maxit = 2, con_method = "glm")

# Using a custom parsnip spec
## Not run:
library(parsnip)
misl_imp <- misl(
  demo_data,
  m = 2,
  maxit = 2,
  con_method = list(
    "glm",
    rand_forest(trees = 500) |> set_engine("ranger")
  )
)
## End(Not run)
Plots the mean and standard deviation of imputed values across iterations for all incomplete variables, paginated in grids of up to 3 variables per page. Stable traces that mix well across datasets indicate convergence. Note that trace statistics are only computed for continuous and numeric binary columns – categorical and ordinal columns are excluded automatically.
plot_misl_trace(misl_result, ncol = 2, nrow = 3)
misl_result |
A list returned by misl().
ncol |
Number of columns per page. Default 2.
nrow |
Number of rows per page. Default 3.
Invisibly returns the long-format trace data frame used for plotting.
set.seed(1)
n <- 100
demo_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), y = rnorm(n))
demo_data[sample(n, 10), "y"] <- NA
misl_imp <- misl(demo_data, m = 3, maxit = 3, con_method = "glm")
plot_misl_trace(misl_imp)