| Title: | LASSO Auxiliary Variable Selection and Auxiliary Vector Diagnostics |
|---|---|
| Description: | Provides tools for assessing and selecting auxiliary variables using LASSO. The package includes functions for variable selection and diagnostics, facilitating survey calibration analysis with emphasis on robust auxiliary vector selection. For more details see Tibshirani (1996) <doi:10.1111/j.2517-6161.1996.tb02080.x> and Caughrey and Hartman (2017) <doi:10.2139/ssrn.3494436>. |
| Authors: | Gustaf Andersson [aut, cre, cph] |
| Maintainer: | Gustaf Andersson <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.1 |
| Built: | 2026-05-28 06:50:21 UTC |
| Source: | https://github.com/gustafanderssons/auxveclasso-r-package |
This function assesses the calibration of auxiliary variables in a survey design, performs various diagnostics, and optionally calibrates weights based on a specified calibration formula. It provides diagnostics on weight variation, register data alignment, and survey data alignment. The results are returned as a list of class "assess_aux_vector".
assess_aux_vector( design, df, calibration_formula = NULL, calibration_pop_totals = NULL, register_vars = NULL, register_pop_means = NULL, survey_vars = NULL, domain_vars = NULL, diagnostics = c("weight_variation", "register_diagnostics", "survey_diagnostics"), already_calibrated = FALSE, verbose = FALSE )assess_aux_vector( design, df, calibration_formula = NULL, calibration_pop_totals = NULL, register_vars = NULL, register_pop_means = NULL, survey_vars = NULL, domain_vars = NULL, diagnostics = c("weight_variation", "register_diagnostics", "survey_diagnostics"), already_calibrated = FALSE, verbose = FALSE )
design |
A survey design object, typically of class |
df |
A data frame containing the survey data to be used in the analysis. |
calibration_formula |
An optional formula object specifying the auxiliary variables
used for calibration (e.g., |
calibration_pop_totals |
An optional list of population totals for the auxiliary
variables in |
register_vars |
A character vector specifying the names of the auxiliary variables
from the register data that should be used in the diagnostics. If |
register_pop_means |
A list containing population means for the register variables. The list may include a "total" entry for the total population means and/or a "by_domain" entry for domain-specific population means. |
survey_vars |
A character vector specifying the names of the survey variables
to be used in the diagnostics. If |
domain_vars |
A character vector specifying the domain variables used to group
data for domain-specific diagnostics. If |
diagnostics |
A character vector specifying which diagnostics to compute. Possible values include:
The default is all three. |
already_calibrated |
A logical flag indicating whether the weights have already
been calibrated. If |
verbose |
A logical flag indicating whether to print additional messages during the execution of the function. This can be useful for debugging or monitoring progress. |
The function supports several diagnostic checks, including weight variation diagnostics, register diagnostics (total and by domain), and survey diagnostics (total and by domain).
The function may also calibrate survey weights based on a provided calibration formula and population totals. Calibration can be skipped if the weights are already calibrated.
The function supports several diagnostic checks, including weight variation diagnostics, register diagnostics (total and by domain), and survey diagnostics (total and by domain).
The function may also calibrate survey weights based on a provided calibration formula and population totals. Calibration can be skipped if the weights are already calibrated.
The weight diagnostics contain the following measures:
Descriptive statistics (min, max, median, mean, standard deviation (sd), range, bottom percentile, top percentile)
Inequality measures (coefficient of variation, Gini index, entropy)
Skewness and (excess) kurtosis
A list of class "assess_aux_vector" containing the results of the
diagnostic assessments. The list includes the following components:
weight_variation: A numeric vector or matrix containing the results
of the weight variation diagnostics.
register_diagnostics: A list containing diagnostics based on the
register data. This may include the total diagnostics and/or domain-specific
diagnostics.
survey_diagnostics: A list containing diagnostics based on the survey
data. This may include the total diagnostics and/or domain-specific diagnostics.
calibrate for the calibration function.
## ============================================================ ## Example 1: Calibrate weights, then run all diagnostics ## (register + survey, with a by-domain breakdown) ## ============================================================ if (requireNamespace("survey", quietly = TRUE)) { set.seed(42) options(survey.lonely.psu = "adjust") ## --- Simulate a tiny sample n <- 200 sex <- factor(sample(c("F", "M"), n, replace = TRUE)) sex[1:2] <- c("F", "M") sex <- factor(sex, levels = c("F", "M")) region <- factor(sample(c("N", "S"), n, replace = TRUE)) region[1:2] <- c("N", "S") region <- factor(region, levels = c("N", "S")) age <- round(rnorm(n, mean = 41, sd = 12)) ## Register variable we have population means for: reg_income <- 50000 + 2000 * (region == "S") + rnorm(n, sd = 4000) ## A couple of survey variables to diagnose: y1 <- 10 + 2 * (sex == "M") + rnorm(n, sd = 2) y2 <- 100 + 5 * (region == "S") + rnorm(n, sd = 5) ## Some unequal weights (to make weight-variation meaningful) w <- runif(n, 0.6, 2.2) * 50 df <- data.frame(sex, region, age, reg_income, y1, y2, w) design <- survey::svydesign(ids = ~1, weights = ~w, data = df) ## --- Calibration setup (simple main-effects formula) ## Model matrix columns will be: (Intercept), sexM, regionS, age Npop <- 5000 pop_mean_age <- 40 calibration_formula <- ~ sex + region + age calibration_pop_totals <- c( "(Intercept)" = Npop, "sexM" = round(0.45 * Npop), # 45% of population is male "regionS" = round(0.40 * Npop), # 40% in region S "age" = pop_mean_age * Npop # totals (mean * N) ) ## --- Register population means: total + by domain (single register var) register_vars <- "reg_income" register_pop_means <- list( total = c(reg_income = 51000), # overall pop mean by_domain = list( region = c(N = 50000, S = 52000) # domain-specific pop means ) ) out1 <- assess_aux_vector( design = design, df = df, calibration_formula = calibration_formula, calibration_pop_totals = calibration_pop_totals, register_vars = register_vars, register_pop_means = register_pop_means, survey_vars = c("y1", "y2"), domain_vars = c("region"), diagnostics = c("weight_variation", "register_diagnostics", "survey_diagnostics"), already_calibrated = FALSE, verbose = FALSE ) ## Peek at key outputs: out1$weight_variation out1$register_diagnostics$total out1$register_diagnostics$by_domain$region out1$survey_diagnostics$total } ## ============================================================ ## Example 2: Skip calibration; survey diagnostics by domain ## ============================================================ if (requireNamespace("survey", quietly = TRUE)) { set.seed(99) options(survey.lonely.psu = "adjust") n <- 120 region <- factor(sample(c("N", "S"), n, replace = TRUE)) region[1:2] <- c("N", "S") region <- factor(region, levels = c("N", "S")) sex <- factor(sample(c("F", "M"), n, replace = TRUE)) sex[1:2] <- c("F", "M") sex <- factor(sex, levels = c("F", "M")) age <- round(rnorm(n, 39, 11)) yA <- rnorm(n, mean = 50 + 3 * (region == "S")) yB <- rnorm(n, mean = 30 + 1.5 * (sex == "M")) w <- runif(n, 0.7, 1.8) * 40 toy <- data.frame(region, sex, age, yA, yB, w) des <- survey::svydesign(ids = ~1, weights = ~w, data = toy) out2 <- assess_aux_vector( design = des, df = toy, calibration_formula = NULL, # skip calibration calibration_pop_totals = NULL, register_vars = NULL, # no register diagnostics survey_vars = c("yA", "yB"), domain_vars = "region", diagnostics = c("weight_variation", "survey_diagnostics"), already_calibrated = TRUE, # explicitly skip calibration verbose = FALSE ) out2$weight_variation out2$survey_diagnostics$by_domain$region }## ============================================================ ## Example 1: Calibrate weights, then run all diagnostics ## (register + survey, with a by-domain breakdown) ## ============================================================ if (requireNamespace("survey", quietly = TRUE)) { set.seed(42) options(survey.lonely.psu = "adjust") ## --- Simulate a tiny sample n <- 200 sex <- factor(sample(c("F", "M"), n, replace = TRUE)) sex[1:2] <- c("F", "M") sex <- factor(sex, levels = c("F", "M")) region <- factor(sample(c("N", "S"), n, replace = TRUE)) region[1:2] <- c("N", "S") region <- factor(region, levels = c("N", "S")) age <- round(rnorm(n, mean = 41, sd = 12)) ## Register variable we have population means for: reg_income <- 50000 + 2000 * (region == "S") + rnorm(n, sd = 4000) ## A couple of survey variables to diagnose: y1 <- 10 + 2 * (sex == "M") + rnorm(n, sd = 2) y2 <- 100 + 5 * (region == "S") + rnorm(n, sd = 5) ## Some unequal weights (to make weight-variation meaningful) w <- runif(n, 0.6, 2.2) * 50 df <- data.frame(sex, region, age, reg_income, y1, y2, w) design <- survey::svydesign(ids = ~1, weights = ~w, data = df) ## --- Calibration setup (simple main-effects formula) ## Model matrix columns will be: (Intercept), sexM, regionS, age Npop <- 5000 pop_mean_age <- 40 calibration_formula <- ~ sex + region + age calibration_pop_totals <- c( "(Intercept)" = Npop, "sexM" = round(0.45 * Npop), # 45% of population is male "regionS" = round(0.40 * Npop), # 40% in region S "age" = pop_mean_age * Npop # totals (mean * N) ) ## --- Register population means: total + by domain (single register var) register_vars <- "reg_income" register_pop_means <- list( total = c(reg_income = 51000), # overall pop mean by_domain = list( region = c(N = 50000, S = 52000) # domain-specific pop means ) ) out1 <- assess_aux_vector( design = design, df = df, calibration_formula = calibration_formula, calibration_pop_totals = calibration_pop_totals, register_vars = register_vars, register_pop_means = register_pop_means, survey_vars = c("y1", "y2"), domain_vars = c("region"), diagnostics = c("weight_variation", "register_diagnostics", "survey_diagnostics"), already_calibrated = FALSE, verbose = FALSE ) ## Peek at key outputs: out1$weight_variation out1$register_diagnostics$total out1$register_diagnostics$by_domain$region out1$survey_diagnostics$total } ## ============================================================ ## Example 2: Skip calibration; survey diagnostics by domain ## ============================================================ if (requireNamespace("survey", quietly = TRUE)) { set.seed(99) options(survey.lonely.psu = "adjust") n <- 120 region <- factor(sample(c("N", "S"), n, replace = TRUE)) region[1:2] <- c("N", "S") region <- factor(region, levels = c("N", "S")) sex <- factor(sample(c("F", "M"), n, replace = TRUE)) sex[1:2] <- c("F", "M") sex <- factor(sex, levels = c("F", "M")) age <- round(rnorm(n, 39, 11)) yA <- rnorm(n, mean = 50 + 3 * (region == "S")) yB <- rnorm(n, mean = 30 + 1.5 * (sex == "M")) w <- runif(n, 0.7, 1.8) * 40 toy <- data.frame(region, sex, age, yA, yB, w) des <- survey::svydesign(ids = ~1, weights = ~w, data = toy) out2 <- assess_aux_vector( design = des, df = toy, calibration_formula = NULL, # skip calibration calibration_pop_totals = NULL, register_vars = NULL, # no register diagnostics survey_vars = c("yA", "yB"), domain_vars = "region", diagnostics = c("weight_variation", "survey_diagnostics"), already_calibrated = TRUE, # explicitly skip calibration verbose = FALSE ) out2$weight_variation out2$survey_diagnostics$by_domain$region }
Computes survey-weighted means and standard errors for one or more variables in a survey design, optionally stratified by domains (e.g., regions, groups). If population means are provided, it also calculates the bias and mean squared error (MSE) of the estimates compared to the population means.
estimate_mean_stats(design, vars, by = NULL, population_means = NULL)estimate_mean_stats(design, vars, by = NULL, population_means = NULL)
design |
A survey design object (e.g., from the survey package). The object must contain survey data and weights. |
vars |
A character vector of variable names to compute means for. These variables must
exist in |
by |
Optional. A one-sided formula (e.g., |
population_means |
Optional. A named numeric vector with population means (for |
Observations with non-finite weights are excluded from the analysis globally.
For each variable, observations with non-finite values are also dropped (in addition to the global weight filter). An error will occur if no valid data remains.
For domain-specific means, population_means must include all domain columns and the variable
being estimated. The rows are merged by domain key.
A named list where each component is a data frame with the following columns:
Columns for the domain variables (if by is specified).
The name of the variable.
The survey-weighted mean of the variable.
The standard error of the survey-weighted mean.
The difference between the survey-weighted mean and the population mean. NA if
population_means is not provided.
The mean squared error (MSE) of the survey-weighted mean. NA if population_means
is not provided.
## ============================================================ ## Example 1: Overall means with population means (bias/MSE) ## ============================================================ if (requireNamespace("survey", quietly = TRUE)) { set.seed(123) options(survey.lonely.psu = "adjust") n <- 200 region <- factor(sample(c("N", "S"), n, replace = TRUE)) sex <- factor(sample(c("F", "M"), n, replace = TRUE)) y <- 10 + 2 * (sex == "M") + rnorm(n, sd = 1.5) z <- 100 + 5 * (region == "S") + rnorm(n, sd = 3) w <- runif(n, 0.8, 1.8) * 50 df <- data.frame(region, sex, y, z, w) des <- survey::svydesign(ids = ~1, weights = ~w, data = df) ## Named vector of population means for overall case pop_means <- c(y = 11.2, z = 103.0) res_overall <- estimate_mean_stats( design = des, vars = c("y", "z"), by = NULL, population_means = pop_means ) ## Each element is a one-row data frame res_overall$y res_overall$z } ## ============================================================ ## Example 2: Domain means by region with population means table ## ============================================================ if (requireNamespace("survey", quietly = TRUE)) { set.seed(456) options(survey.lonely.psu = "adjust") n <- 150 region <- factor(sample(c("N", "S"), n, replace = TRUE), levels = c("N", "S")) sex <- factor(sample(c("F", "M"), n, replace = TRUE)) y <- 12 + 1.5 * (region == "S") + rnorm(n, sd = 1.2) z <- 95 + 6 * (region == "S") + rnorm(n, sd = 2.5) w <- runif(n, 0.7, 2.0) * 40 toy <- data.frame(region, sex, y, z, w) des2 <- survey::svydesign(ids = ~1, weights = ~w, data = toy) ## Population means by domain must include the domain column(s) + vars pop_by_region <- data.frame( region = c("N", "S"), y = c(12.2, 13.8), z = c(95.5, 101.0), stringsAsFactors = FALSE ) res_by <- estimate_mean_stats( design = des2, vars = c("y", "z"), by = ~region, # or equivalently: by = "region" population_means = pop_by_region # merged by the 'region' key ) ## Each element is a data frame with domain rows res_by$y res_by$z }## ============================================================ ## Example 1: Overall means with population means (bias/MSE) ## ============================================================ if (requireNamespace("survey", quietly = TRUE)) { set.seed(123) options(survey.lonely.psu = "adjust") n <- 200 region <- factor(sample(c("N", "S"), n, replace = TRUE)) sex <- factor(sample(c("F", "M"), n, replace = TRUE)) y <- 10 + 2 * (sex == "M") + rnorm(n, sd = 1.5) z <- 100 + 5 * (region == "S") + rnorm(n, sd = 3) w <- runif(n, 0.8, 1.8) * 50 df <- data.frame(region, sex, y, z, w) des <- survey::svydesign(ids = ~1, weights = ~w, data = df) ## Named vector of population means for overall case pop_means <- c(y = 11.2, z = 103.0) res_overall <- estimate_mean_stats( design = des, vars = c("y", "z"), by = NULL, population_means = pop_means ) ## Each element is a one-row data frame res_overall$y res_overall$z } ## ============================================================ ## Example 2: Domain means by region with population means table ## ============================================================ if (requireNamespace("survey", quietly = TRUE)) { set.seed(456) options(survey.lonely.psu = "adjust") n <- 150 region <- factor(sample(c("N", "S"), n, replace = TRUE), levels = c("N", "S")) sex <- factor(sample(c("F", "M"), n, replace = TRUE)) y <- 12 + 1.5 * (region == "S") + rnorm(n, sd = 1.2) z <- 95 + 6 * (region == "S") + rnorm(n, sd = 2.5) w <- runif(n, 0.7, 2.0) * 40 toy <- data.frame(region, sex, y, z, w) des2 <- survey::svydesign(ids = ~1, weights = ~w, data = toy) ## Population means by domain must include the domain column(s) + vars pop_by_region <- data.frame( region = c("N", "S"), y = c(12.2, 13.8), z = c(95.5, 101.0), stringsAsFactors = FALSE ) res_by <- estimate_mean_stats( design = des2, vars = c("y", "z"), by = ~region, # or equivalently: by = "region" population_means = pop_by_region # merged by the 'region' key ) ## Each element is a data frame with domain rows res_by$y res_by$z }
Fits a LASSO regression model (logistic regression for binary outcomes or linear regression for continuous outcomes) for a single outcome variable using cross-validation.
The function drops rows where the outcome is missing, and ensures that the predictors do not
have missing values. The model is fitted using the glmnet package, with the option to apply
cross-validation for selecting the optimal regularization parameter (lambda).
fit_outcome( yvar, df, X, penalty_factors, nfolds = 5, standardize = TRUE, parallel = FALSE, return_models = FALSE, verbose = FALSE )fit_outcome( yvar, df, X, penalty_factors, nfolds = 5, standardize = TRUE, parallel = FALSE, return_models = FALSE, verbose = FALSE )
yvar |
Character scalar. The name of the outcome variable in |
df |
Data frame containing the outcome and predictors.
The outcome variable ( |
X |
Model matrix (rows must align with |
penalty_factors |
Named numeric vector of penalty factors for each predictor variable.
The names should match the column names of |
nfolds |
Number of folds for cross-validation. Default is 5. |
standardize |
Logical; should the predictors be standardized before fitting the model? Default is |
parallel |
Logical; should the cross-validation be performed in parallel? Default is |
return_models |
Logical; should the fitted |
verbose |
Logical; if |
The outcome variable (yvar) can be either binary or continuous:
Binary outcomes: LASSO logistic regression is used. The outcome variable must have exactly two levels after missing values are removed.
Continuous outcomes: LASSO linear regression is used. The outcome variable should be numeric.
Rows with missing values for the outcome (yvar) or predictors (X) will be dropped.
The function uses the glmnet package to fit the LASSO model.
Cross-validation is used to select the optimal regularization parameter (lambda).
Model performance metrics, including AUC, accuracy, Brier score (for binary outcomes) or RSS, MSE, RMSE, MAE, R-squared (for continuous outcomes), are computed on the non-missing rows.
A list containing:
A character vector of the names of the selected variables (non-zero coefficients).
The value of lambda that minimizes the cross-validation error.
A list containing performance metrics for the model:
cross_validated: A list with cv_error (cross-validation error) and cv_error_sd (standard deviation of CV error).
full_data: A list with:
deviance_explained: Proportion of deviance explained by the model (only for binary outcomes).
auc: Area under the ROC curve (only for binary outcomes).
accuracy: Classification accuracy (only for binary outcomes).
brier_score: Brier score (mean squared error for probabilities) (only for binary outcomes).
rss: Residual sum of squares (only for continuous outcomes).
mse: Mean squared error (only for continuous outcomes).
rmse: Root mean squared error (only for continuous outcomes).
mae: Mean absolute error (only for continuous outcomes).
r_squared: R-squared (only for continuous outcomes).
raw_coefs: Raw coefficients from the LASSO model.
abs_coefs: Absolute values of the coefficients.
The fitted cv.glmnet model, if return_models = TRUE. Otherwise, NULL.
## ============================================================ ## Example 1: Binary outcome (binomial LASSO with CV) ## ============================================================ if (requireNamespace("glmnet", quietly = TRUE) && requireNamespace("pROC", quietly = TRUE)) { set.seed(101) n <- 180 x1 <- rnorm(n) x2 <- rnorm(n) f <- factor(sample(c("A", "B", "C"), n, replace = TRUE)) ## Construct a binary outcome with signal in x2 and x1:x2 lin <- -0.3 + 1.0 * x2 + 0.6 * (x1 * x2) - 0.7 * (f == "C") p <- 1 / (1 + exp(-lin)) yfac <- factor(rbinom(n, 1, p), labels = c("No", "Yes")) # 2-level factor df <- data.frame(y = yfac, x1 = x1, x2 = x2, f = f) ## Model matrix with main effects + one interaction, no intercept X <- model.matrix(~ x1 + x2 + f + x1:x2 - 1, data = df) ## Penalty factors must match X's columns (names + length). penalty_factors <- rep(1, ncol(X)) names(penalty_factors) <- colnames(X) ## (Optional) keep x1 unpenalized: if ("x1" %in% names(penalty_factors)) penalty_factors["x1"] <- 0 fit_bin <- fit_outcome( yvar = "y", df = df, X = X, penalty_factors = penalty_factors, nfolds = 3, standardize = TRUE, parallel = FALSE, return_models = FALSE, verbose = FALSE ) ## Peek at the results fit_bin$selected fit_bin$lambda_min fit_bin$goodness$full_data$auc fit_bin$goodness$full_data$accuracy } ## ============================================================ ## Example 2: Continuous outcome (gaussian LASSO with CV) ## ============================================================ if (requireNamespace("glmnet", quietly = TRUE)) { set.seed(202) n <- 160 x1 <- rnorm(n) x2 <- rnorm(n) f <- factor(sample(c("L", "H"), n, replace = TRUE)) y <- 1.5 * x1 + 0.8 * x2 - 1.0 * (f == "H") + 0.6 * (x1 * x2) + rnorm(n, sd = 0.7) df <- data.frame(y = y, x1 = x1, x2 = x2, f = f) ## Main effects only, no intercept X <- model.matrix(~ x1 + x2 + f - 1, data = df) penalty_factors <- rep(1, ncol(X)) names(penalty_factors) <- colnames(X) fit_cont <- fit_outcome( yvar = "y", df = df, X = X, penalty_factors = penalty_factors, nfolds = 3, standardize = TRUE, parallel = FALSE, return_models = FALSE, verbose = FALSE ) ## Key metrics fit_cont$selected fit_cont$lambda_min fit_cont$goodness$full_data$mse fit_cont$goodness$full_data$r_squared }## ============================================================ ## Example 1: Binary outcome (binomial LASSO with CV) ## ============================================================ if (requireNamespace("glmnet", quietly = TRUE) && requireNamespace("pROC", quietly = TRUE)) { set.seed(101) n <- 180 x1 <- rnorm(n) x2 <- rnorm(n) f <- factor(sample(c("A", "B", "C"), n, replace = TRUE)) ## Construct a binary outcome with signal in x2 and x1:x2 lin <- -0.3 + 1.0 * x2 + 0.6 * (x1 * x2) - 0.7 * (f == "C") p <- 1 / (1 + exp(-lin)) yfac <- factor(rbinom(n, 1, p), labels = c("No", "Yes")) # 2-level factor df <- data.frame(y = yfac, x1 = x1, x2 = x2, f = f) ## Model matrix with main effects + one interaction, no intercept X <- model.matrix(~ x1 + x2 + f + x1:x2 - 1, data = df) ## Penalty factors must match X's columns (names + length). penalty_factors <- rep(1, ncol(X)) names(penalty_factors) <- colnames(X) ## (Optional) keep x1 unpenalized: if ("x1" %in% names(penalty_factors)) penalty_factors["x1"] <- 0 fit_bin <- fit_outcome( yvar = "y", df = df, X = X, penalty_factors = penalty_factors, nfolds = 3, standardize = TRUE, parallel = FALSE, return_models = FALSE, verbose = FALSE ) ## Peek at the results fit_bin$selected fit_bin$lambda_min fit_bin$goodness$full_data$auc fit_bin$goodness$full_data$accuracy } ## ============================================================ ## Example 2: Continuous outcome (gaussian LASSO with CV) ## ============================================================ if (requireNamespace("glmnet", quietly = TRUE)) { set.seed(202) n <- 160 x1 <- rnorm(n) x2 <- rnorm(n) f <- factor(sample(c("L", "H"), n, replace = TRUE)) y <- 1.5 * x1 + 0.8 * x2 - 1.0 * (f == "H") + 0.6 * (x1 * x2) + rnorm(n, sd = 0.7) df <- data.frame(y = y, x1 = x1, x2 = x2, f = f) ## Main effects only, no intercept X <- model.matrix(~ x1 + x2 + f - 1, data = df) penalty_factors <- rep(1, ncol(X)) names(penalty_factors) <- colnames(X) fit_cont <- fit_outcome( yvar = "y", df = df, X = X, penalty_factors = penalty_factors, nfolds = 3, standardize = TRUE, parallel = FALSE, return_models = FALSE, verbose = FALSE ) ## Key metrics fit_cont$selected fit_cont$lambda_min fit_cont$goodness$full_data$mse fit_cont$goodness$full_data$r_squared }
Build a fixed model matrix on a population frame and return the
column totals needed for calibration (optionally weighted). The function
freezes dummy/interaction structure on the population by constructing
a terms object, so downstream use on respondent data can reuse
the exact same encoding.
generate_population_totals( population_df, calibration_formula, weights = NULL, contrasts = NULL, include_intercept = TRUE, sparse = FALSE, na_action = stats::na.pass, drop_zero_cols = FALSE )generate_population_totals( population_df, calibration_formula, weights = NULL, contrasts = NULL, include_intercept = TRUE, sparse = FALSE, na_action = stats::na.pass, drop_zero_cols = FALSE )
population_df |
A data frame containing the calibration population. |
calibration_formula |
A one-sided formula specifying main effects and interactions
(e.g., |
weights |
Optional numeric vector of population weights (length |
contrasts |
Optional named list of contrasts to pass to |
include_intercept |
Logical; if |
sparse |
Logical; if |
na_action |
NA handling passed to |
drop_zero_cols |
Logical; if |
An object of class "calib_totals": a list with
population_totals: named numeric vector of column totals
levels: list of factor levels observed in the population (for reproducibility)
terms: the terms object built on population_df
contrasts: the contrasts actually used (from the model matrix)
# Example using the API data from the survey package library(survey) data(api) # loads apipop, apisrs, apistrat, etc. # Build a population frame and create some binary fields used in a formula pop <- apipop pop$api00_bin <- as.factor(ifelse(pop$api00 >= 700, "700plus", "lt700")) pop$growth_bin <- as.factor(ifelse(pop$growth >= 0, "nonneg", "neg")) pop$ell_bin <- as.factor(ifelse(pop$ell >= 10, "highELL", "lowELL")) pop$comp.imp_bin <- as.factor(ifelse(pop$comp.imp >= 50, "highComp", "lowComp")) pop$hsg_bin <- as.factor(ifelse(pop$hsg >= 60, "highHSG", "lowHSG")) # A calibration formula with main effects + a few interactions cal_formula <- ~ stype + growth_bin + api00_bin + ell_bin + comp.imp_bin + hsg_bin + api00_bin:stype + hsg_bin:stype + comp.imp_bin:stype + api00_bin:growth_bin # (Optional) frame weights if available; here we use unweighted totals gp <- generate_population_totals( population_df = pop, calibration_formula = cal_formula, include_intercept = TRUE ) # Named totals ready for calibration: head(gp$population_totals) # If you later build a respondent model matrix, reuse gp$terms to ensure alignment: # X_resp <- model.matrix(gp$terms, data = apisrs) # stopifnot(identical(colnames(X_resp), names(gp$population_totals)))# Example using the API data from the survey package library(survey) data(api) # loads apipop, apisrs, apistrat, etc. # Build a population frame and create some binary fields used in a formula pop <- apipop pop$api00_bin <- as.factor(ifelse(pop$api00 >= 700, "700plus", "lt700")) pop$growth_bin <- as.factor(ifelse(pop$growth >= 0, "nonneg", "neg")) pop$ell_bin <- as.factor(ifelse(pop$ell >= 10, "highELL", "lowELL")) pop$comp.imp_bin <- as.factor(ifelse(pop$comp.imp >= 50, "highComp", "lowComp")) pop$hsg_bin <- as.factor(ifelse(pop$hsg >= 60, "highHSG", "lowHSG")) # A calibration formula with main effects + a few interactions cal_formula <- ~ stype + growth_bin + api00_bin + ell_bin + comp.imp_bin + hsg_bin + api00_bin:stype + hsg_bin:stype + comp.imp_bin:stype + api00_bin:growth_bin # (Optional) frame weights if available; here we use unweighted totals gp <- generate_population_totals( population_df = pop, calibration_formula = cal_formula, include_intercept = TRUE ) # Named totals ready for calibration: head(gp$population_totals) # If you later build a respondent model matrix, reuse gp$terms to ensure alignment: # X_resp <- model.matrix(gp$terms, data = apisrs) # stopifnot(identical(colnames(X_resp), names(gp$population_totals)))
S3 print method for objects of class assess_aux_vector. Displays a
formatted, colorized summary of weight variation metrics, register diagnostics
(overall and by domain), and survey diagnostics (overall and by domain).
## S3 method for class 'assess_aux_vector' print(x, ...)## S3 method for class 'assess_aux_vector' print(x, ...)
x |
An object of class
|
... |
Additional arguments (currently ignored). |
In addition to means and standard errors, the printer shows the
relative standard error (RSE = SE / |mean|) and—when population
means are supplied to estimate_mean_stats()—two-sided p-values for the bias
testing .
Requires the crayon package for colored output.
The print method outputs sections with colored headers for easier readability:
Weight Variation Metrics
Register Diagnostics summarized for all units and by domain
Survey Diagnostics summarized for all units and by domain
For each variable shown, the following metrics are printed when present:
Mean — survey-weighted mean.
SE — design-based standard error from survey.
RSE — relative standard error, .
Bias — difference between estimate and population mean (if supplied).
MSE — (if population means supplied).
p(Bias) — two-sided p-value testing ,
computed as with (shown when
population means are available).
Edge cases: if mean == 0 the RSE is reported as NA; if SE == 0,
p(Bias) is 1 when |Bias| is numerically zero and 0 otherwise.
Objects created with earlier versions that lack rse or p_bias columns
are handled gracefully (those fields are simply not printed).
If the crayon package is not installed, the function will stop with an error.
Invisibly returns the input object x.
## ============================================================ ## Example 1: Print with register + survey diagnostics ## (includes population means -> prints p(Bias)) ## ============================================================ if (requireNamespace("survey", quietly = TRUE) && requireNamespace("crayon", quietly = TRUE)) { set.seed(7) options(survey.lonely.psu = "adjust") ## --- Simulate a small survey n <- 180 sex <- factor(sample(c("F", "M"), n, replace = TRUE), levels = c("F", "M")) region <- factor(sample(c("N", "S"), n, replace = TRUE), levels = c("N", "S")) age <- round(rnorm(n, mean = 42, sd = 12)) reg_income <- 52000 + 1500 * (region == "S") + rnorm(n, sd = 3500) # register var y1 <- 10 + 1.8 * (sex == "M") + rnorm(n, sd = 2) # survey vars y2 <- 95 + 4.5 * (region == "S") + rnorm(n, sd = 3.5) w <- runif(n, 0.7, 2.1) * 40 df <- data.frame(sex, region, age, reg_income, y1, y2, w) des <- survey::svydesign(ids = ~1, weights = ~w, data = df) ## --- Optional calibration inputs (simple main effects) ## Model matrix columns: (Intercept), sexM, regionS, age Npop <- 4000 calibration_formula <- ~ sex + region + age calibration_pop_totals <- c( "(Intercept)" = Npop, "sexM" = round(0.48 * Npop), "regionS" = round(0.52 * Npop), "age" = 41 * Npop ) ## --- Population means for the register var: total + by domain register_vars <- "reg_income" register_pop_means <- list( total = c(reg_income = 52500), by_domain = list( region = c(N = 51500, S = 53500) ) ) ## --- Build assessment object aux1 <- assess_aux_vector( design = des, df = df, calibration_formula = calibration_formula, calibration_pop_totals = calibration_pop_totals, register_vars = register_vars, register_pop_means = register_pop_means, survey_vars = c("y1", "y2"), domain_vars = "region", diagnostics = c("weight_variation", "register_diagnostics", "survey_diagnostics"), already_calibrated = FALSE, verbose = FALSE ) ## Colorized, formatted summary: print(aux1) } ## ============================================================ ## Example 2: Print with survey diagnostics only (by domain) ## (no population means -> p(Bias) omitted) ## ============================================================ if (requireNamespace("survey", quietly = TRUE) && requireNamespace("crayon", quietly = TRUE)) { set.seed(11) options(survey.lonely.psu = "adjust") n <- 120 region <- factor(sample(c("N", "S"), n, replace = TRUE), levels = c("N", "S")) sex <- factor(sample(c("F", "M"), n, replace = TRUE), levels = c("F", "M")) yA <- 50 + 2.5 * (region == "S") + rnorm(n, sd = 2) yB <- 30 + 1.5 * (sex == "M") + rnorm(n, sd = 1.5) w <- runif(n, 0.8, 1.9) * 35 toy <- data.frame(region, sex, yA, yB, w) des2 <- survey::svydesign(ids = ~1, weights = ~w, data = toy) aux2 <- assess_aux_vector( design = des2, df = toy, calibration_formula = NULL, # skip calibration calibration_pop_totals = NULL, register_vars = NULL, # no register diagnostics survey_vars = c("yA", "yB"), domain_vars = "region", diagnostics = c("weight_variation", "survey_diagnostics"), already_calibrated = TRUE, verbose = FALSE ) print(aux2) }## ============================================================ ## Example 1: Print with register + survey diagnostics ## (includes population means -> prints p(Bias)) ## ============================================================ if (requireNamespace("survey", quietly = TRUE) && requireNamespace("crayon", quietly = TRUE)) { set.seed(7) options(survey.lonely.psu = "adjust") ## --- Simulate a small survey n <- 180 sex <- factor(sample(c("F", "M"), n, replace = TRUE), levels = c("F", "M")) region <- factor(sample(c("N", "S"), n, replace = TRUE), levels = c("N", "S")) age <- round(rnorm(n, mean = 42, sd = 12)) reg_income <- 52000 + 1500 * (region == "S") + rnorm(n, sd = 3500) # register var y1 <- 10 + 1.8 * (sex == "M") + rnorm(n, sd = 2) # survey vars y2 <- 95 + 4.5 * (region == "S") + rnorm(n, sd = 3.5) w <- runif(n, 0.7, 2.1) * 40 df <- data.frame(sex, region, age, reg_income, y1, y2, w) des <- survey::svydesign(ids = ~1, weights = ~w, data = df) ## --- Optional calibration inputs (simple main effects) ## Model matrix columns: (Intercept), sexM, regionS, age Npop <- 4000 calibration_formula <- ~ sex + region + age calibration_pop_totals <- c( "(Intercept)" = Npop, "sexM" = round(0.48 * Npop), "regionS" = round(0.52 * Npop), "age" = 41 * Npop ) ## --- Population means for the register var: total + by domain register_vars <- "reg_income" register_pop_means <- list( total = c(reg_income = 52500), by_domain = list( region = c(N = 51500, S = 53500) ) ) ## --- Build assessment object aux1 <- assess_aux_vector( design = des, df = df, calibration_formula = calibration_formula, calibration_pop_totals = calibration_pop_totals, register_vars = register_vars, register_pop_means = register_pop_means, survey_vars = c("y1", "y2"), domain_vars = "region", diagnostics = c("weight_variation", "register_diagnostics", "survey_diagnostics"), already_calibrated = FALSE, verbose = FALSE ) ## Colorized, formatted summary: print(aux1) } ## ============================================================ ## Example 2: Print with survey diagnostics only (by domain) ## (no population means -> p(Bias) omitted) ## ============================================================ if (requireNamespace("survey", quietly = TRUE) && requireNamespace("crayon", quietly = TRUE)) { set.seed(11) options(survey.lonely.psu = "adjust") n <- 120 region <- factor(sample(c("N", "S"), n, replace = TRUE), levels = c("N", "S")) sex <- factor(sample(c("F", "M"), n, replace = TRUE), levels = c("F", "M")) yA <- 50 + 2.5 * (region == "S") + rnorm(n, sd = 2) yB <- 30 + 1.5 * (sex == "M") + rnorm(n, sd = 1.5) w <- runif(n, 0.8, 1.9) * 35 toy <- data.frame(region, sex, yA, yB, w) des2 <- survey::svydesign(ids = ~1, weights = ~w, data = toy) aux2 <- assess_aux_vector( design = des2, df = toy, calibration_formula = NULL, # skip calibration calibration_pop_totals = NULL, register_vars = NULL, # no register diagnostics survey_vars = c("yA", "yB"), domain_vars = "region", diagnostics = c("weight_variation", "survey_diagnostics"), already_calibrated = TRUE, verbose = FALSE ) print(aux2) }
S3 print method for objects of class select_auxiliary_variables_lasso_cv.
Displays a formatted and colorized summary of the selected auxiliary variables,
their grouping by outcome, selected penalty parameters (lambdas),
penalty factors, stored models, goodness-of-fit metrics, coefficient estimates,
and interaction metadata.
## S3 method for class 'select_auxiliary_variables_lasso_cv' print(x, ...)## S3 method for class 'select_auxiliary_variables_lasso_cv' print(x, ...)
x |
An object of class
|
... |
Additional arguments (currently ignored). |
Requires the crayon package for colored output.
The print method outputs information using colored text (via crayon), making it easier to visually parse the summary. It organizes output into sections:
Selected variables and their counts
Variables selected by each outcome
Selected lambda tuning parameters
Summary of penalty factors
Number of stored models
Goodness-of-fit metrics for each outcome, including cross-validation error statistics and metrics on full data fit
Coefficients at the lambda minimizing error, ordered by magnitude
Interaction terms and main effects metadata
If the crayon package is not installed, the function will stop with an error.
Invisibly returns the input object x.
## ============================================================ ## Example 1: Binary + continuous outcomes, with interactions ## (prints selected vars, lambdas, GOF, coef table, interactions) ## ============================================================ if (requireNamespace("glmnet", quietly = TRUE) && requireNamespace("pROC", quietly = TRUE) && requireNamespace("crayon", quietly = TRUE)) { set.seed(123) n <- 180 x1 <- rnorm(n) x2 <- rnorm(n) grp <- factor(sample(c("A", "B", "C"), n, replace = TRUE)) ## Binary outcome with signal in x2, grp, and x1:x2 (make it a 2-level factor) eta <- -0.4 + 1.0 * x2 - 0.8 * (grp == "C") + 0.6 * (x1 * x2) p <- 1 / (1 + exp(-eta)) y_bin <- factor(rbinom(n, 1, p), labels = c("No", "Yes")) ## Continuous outcome with some interaction y_cont <- 1.4 * x1 + 0.9 * x2 - 1.1 * (grp == "B") + 0.5 * (x1 * x2) + rnorm(n, sd = 0.7) df <- data.frame(y_bin = y_bin, y_cont = y_cont, x1 = x1, x2 = x2, grp = grp) lasso_obj1 <- select_auxiliary_variables_lasso_cv( df = df, outcome_vars = c("y_bin", "y_cont"), auxiliary_vars = c("x1", "x2", "grp"), must_have_vars = c("x1", "grp"), # 'grp' expands to its dummy columns check_twoway_int = TRUE, # include all two-way interactions nfolds = 3, verbose = FALSE, standardize = TRUE, return_models = FALSE, # models not stored (printer still shows GOF & coefs) parallel = FALSE ) ## Colorized, formatted summary: print(lasso_obj1) } ## ============================================================ ## Example 2: Single continuous outcome, main effects only ## (stores model so the printer reports it) ## ============================================================ if (requireNamespace("glmnet", quietly = TRUE) && requireNamespace("crayon", quietly = TRUE)) { set.seed(456) n <- 140 a <- rnorm(n) b <- rnorm(n) f <- factor(sample(c("L", "H"), n, replace = TRUE)) y <- 2 * a + 0.8 * b - 1.2 * (f == "H") + rnorm(n, sd = 0.8) toy <- data.frame(y = y, a = a, b = b, f = f) lasso_obj2 <- select_auxiliary_variables_lasso_cv( df = toy, outcome_vars = "y", auxiliary_vars = c("a", "b", "f"), must_have_vars = "f", # keep factor (its dummies get zero penalty) check_twoway_int = FALSE, # main effects only nfolds = 3, verbose = FALSE, standardize = TRUE, return_models = TRUE, # store cv.glmnet model parallel = FALSE ) print(lasso_obj2) }## ============================================================ ## Example 1: Binary + continuous outcomes, with interactions ## (prints selected vars, lambdas, GOF, coef table, interactions) ## ============================================================ if (requireNamespace("glmnet", quietly = TRUE) && requireNamespace("pROC", quietly = TRUE) && requireNamespace("crayon", quietly = TRUE)) { set.seed(123) n <- 180 x1 <- rnorm(n) x2 <- rnorm(n) grp <- factor(sample(c("A", "B", "C"), n, replace = TRUE)) ## Binary outcome with signal in x2, grp, and x1:x2 (make it a 2-level factor) eta <- -0.4 + 1.0 * x2 - 0.8 * (grp == "C") + 0.6 * (x1 * x2) p <- 1 / (1 + exp(-eta)) y_bin <- factor(rbinom(n, 1, p), labels = c("No", "Yes")) ## Continuous outcome with some interaction y_cont <- 1.4 * x1 + 0.9 * x2 - 1.1 * (grp == "B") + 0.5 * (x1 * x2) + rnorm(n, sd = 0.7) df <- data.frame(y_bin = y_bin, y_cont = y_cont, x1 = x1, x2 = x2, grp = grp) lasso_obj1 <- select_auxiliary_variables_lasso_cv( df = df, outcome_vars = c("y_bin", "y_cont"), auxiliary_vars = c("x1", "x2", "grp"), must_have_vars = c("x1", "grp"), # 'grp' expands to its dummy columns check_twoway_int = TRUE, # include all two-way interactions nfolds = 3, verbose = FALSE, standardize = TRUE, return_models = FALSE, # models not stored (printer still shows GOF & coefs) parallel = FALSE ) ## Colorized, formatted summary: print(lasso_obj1) } ## ============================================================ ## Example 2: Single continuous outcome, main effects only ## (stores model so the printer reports it) ## ============================================================ if (requireNamespace("glmnet", quietly = TRUE) && requireNamespace("crayon", quietly = TRUE)) { set.seed(456) n <- 140 a <- rnorm(n) b <- rnorm(n) f <- factor(sample(c("L", "H"), n, replace = TRUE)) y <- 2 * a + 0.8 * b - 1.2 * (f == "H") + rnorm(n, sd = 0.8) toy <- data.frame(y = y, a = a, b = b, f = f) lasso_obj2 <- select_auxiliary_variables_lasso_cv( df = toy, outcome_vars = "y", auxiliary_vars = c("a", "b", "f"), must_have_vars = "f", # keep factor (its dummies get zero penalty) check_twoway_int = FALSE, # main effects only nfolds = 3, verbose = FALSE, standardize = TRUE, return_models = TRUE, # store cv.glmnet model parallel = FALSE ) print(lasso_obj2) }
This function performs LASSO-penalized regression (logistic regression for binary outcomes or linear regression for continuous outcomes) with cross-validation to select auxiliary variables for modeling one or more outcome variables. It allows for the inclusion of all two-way interactions among the auxiliary variables and the option to force certain variables to remain in the model through the use of zero penalty factors.
select_auxiliary_variables_lasso_cv( df, outcome_vars, auxiliary_vars, must_have_vars = NULL, check_twoway_int = TRUE, nfolds = 5, verbose = TRUE, standardize = TRUE, return_models = FALSE, parallel = FALSE )select_auxiliary_variables_lasso_cv( df, outcome_vars, auxiliary_vars, must_have_vars = NULL, check_twoway_int = TRUE, nfolds = 5, verbose = TRUE, standardize = TRUE, return_models = FALSE, parallel = FALSE )
df |
A data frame containing the data for modeling. |
outcome_vars |
Character vector of outcome variable names to model. These can be either binary or continuous outcomes. Each
must exist in |
auxiliary_vars |
Character vector of auxiliary variable names to be used as predictors. |
must_have_vars |
Optional character vector of variable names that must be included
in the model (penalty factor 0). If interactions are included, any interaction containing
a must-have variable is also assigned zero penalty. The variables in |
check_twoway_int |
Logical; include all two-way interactions among auxiliary variables. Defaults to |
nfolds |
Number of folds for cross-validation. Defaults to 5. |
verbose |
Logical; print progress messages. Defaults to |
standardize |
Logical; standardize predictors before fitting. Defaults to |
return_models |
Logical; return fitted |
parallel |
Logical; run cross-validation in parallel (requires doParallel). Defaults to |
The function supports both binary and continuous outcomes. For binary outcomes, logistic regression is used, and for continuous outcomes, linear regression is used. The function outputs a list with the selected variables across outcomes, the associated lambda values, the goodness-of-fit statistics, and optionally the fitted models and interaction terms.
The function supports two types of outcome variables:
Binary outcomes: LASSO logistic regression is used. The outcome variable must have exactly two levels after missing values are removed.
Continuous outcomes: LASSO linear regression is used. The outcome variable should be numeric.
For factor variables in auxiliary_vars, dummy variables are created to represent each level of the factor. If a factor variable is specified in must_have_vars,
its dummy variables will be included in the model, ensuring that any interactions containing those variables are also forced into the model.
An object of class "select_auxiliary_variables_lasso_cv" with the following components:
Character vector of variables selected across all outcome models. This includes the main effect variables and any interaction terms.
Named list of character vectors, each containing the selected variables for each outcome.
Named numeric vector of lambda values (specifically, lambda.min) for each outcome.
Named numeric vector with penalty factors (0 for must-keep, 1 otherwise).
List of cv.glmnet objects per outcome if return_models = TRUE, otherwise an empty list.
Named list per outcome with cross-validation metrics (cv_error, cv_error_sd) and full data metrics (deviance_explained for binary outcomes, auc, accuracy, brier_score, rss, mse, r_squared, raw_coefs).
List containing metadata on interaction terms, main effects in interactions, and the full formula used.
## ------------------------------------------------------------ ## Example 1: Binary + continuous outcomes, with interactions ## and must-have variables (factor expanded to dummies) ## ------------------------------------------------------------ set.seed(123) n <- 150 x1 <- rnorm(n) x2 <- rnorm(n) group <- factor(sample(c("A", "B", "C"), n, replace = TRUE)) ## Generate outcomes with some signal in x1, x2 and group, plus an interaction eta_bin <- -0.5 + 1.2 * x2 - 0.8 * (group == "C") + 0.5 * x1 * x2 p <- 1 / (1 + exp(-eta_bin)) y_bin <- rbinom(n, 1, p) y_cont <- 1.5 * x1 - 2 * (group == "B") + 0.7 * x1 * x2 + rnorm(n, sd = 0.7) df <- data.frame(y_bin = y_bin, y_cont = y_cont, x1 = x1, x2 = x2, group = group) res1 <- select_auxiliary_variables_lasso_cv( df = df, outcome_vars = c("y_bin", "y_cont"), auxiliary_vars = c("x1", "x2", "group"), must_have_vars = c("x1", "group"), # 'group' (factor) expands to its dummies check_twoway_int = TRUE, nfolds = 3, verbose = FALSE, standardize = TRUE, return_models = FALSE ) ## Inspect selections and metadata res1$selected_variables res1$by_outcome res1$selected_lambdas names(which(res1$penalty_factors == 0)) # must-keep terms (incl. factor dummies & interactions) res1$interaction_metadata$full_formula ## ------------------------------------------------------------ ## Example 2: Single continuous outcome, main effects only ## ------------------------------------------------------------ set.seed(456) n2 <- 120 a <- rnorm(n2) b <- rnorm(n2) f <- factor(sample(c("a", "b"), n2, replace = TRUE)) y <- 2 * a - 1 * (f == "b") + rnorm(n2, sd = 1) toy <- data.frame(y = y, a = a, b = b, f = f) res2 <- select_auxiliary_variables_lasso_cv( df = toy, outcome_vars = "y", auxiliary_vars = c("a", "b", "f"), check_twoway_int = FALSE, # main effects only nfolds = 3, verbose = FALSE ) res2$selected_variables res2$selected_lambdas res2$goodness_of_fit$y## ------------------------------------------------------------ ## Example 1: Binary + continuous outcomes, with interactions ## and must-have variables (factor expanded to dummies) ## ------------------------------------------------------------ set.seed(123) n <- 150 x1 <- rnorm(n) x2 <- rnorm(n) group <- factor(sample(c("A", "B", "C"), n, replace = TRUE)) ## Generate outcomes with some signal in x1, x2 and group, plus an interaction eta_bin <- -0.5 + 1.2 * x2 - 0.8 * (group == "C") + 0.5 * x1 * x2 p <- 1 / (1 + exp(-eta_bin)) y_bin <- rbinom(n, 1, p) y_cont <- 1.5 * x1 - 2 * (group == "B") + 0.7 * x1 * x2 + rnorm(n, sd = 0.7) df <- data.frame(y_bin = y_bin, y_cont = y_cont, x1 = x1, x2 = x2, group = group) res1 <- select_auxiliary_variables_lasso_cv( df = df, outcome_vars = c("y_bin", "y_cont"), auxiliary_vars = c("x1", "x2", "group"), must_have_vars = c("x1", "group"), # 'group' (factor) expands to its dummies check_twoway_int = TRUE, nfolds = 3, verbose = FALSE, standardize = TRUE, return_models = FALSE ) ## Inspect selections and metadata res1$selected_variables res1$by_outcome res1$selected_lambdas names(which(res1$penalty_factors == 0)) # must-keep terms (incl. factor dummies & interactions) res1$interaction_metadata$full_formula ## ------------------------------------------------------------ ## Example 2: Single continuous outcome, main effects only ## ------------------------------------------------------------ set.seed(456) n2 <- 120 a <- rnorm(n2) b <- rnorm(n2) f <- factor(sample(c("a", "b"), n2, replace = TRUE)) y <- 2 * a - 1 * (f == "b") + rnorm(n2, sd = 1) toy <- data.frame(y = y, a = a, b = b, f = f) res2 <- select_auxiliary_variables_lasso_cv( df = toy, outcome_vars = "y", auxiliary_vars = c("a", "b", "f"), check_twoway_int = FALSE, # main effects only nfolds = 3, verbose = FALSE ) res2$selected_variables res2$selected_lambdas res2$goodness_of_fit$y