Regularization for Macro

LASSO, Ridge, Elastic Net, and High-Dimensional Methods

The High-Dimensional Challenge

Modern macroeconomics faces a dimensionality problem:

  • Many potential predictors (hundreds of macro indicators)
  • Short time series (T = 50-200 quarters)
  • Traditional methods overfit: \(K > T\) makes OLS infeasible
NoteThe Core Trade-off

Bias-variance trade-off: More flexible models fit training data better but may generalize poorly. Regularization adds bias to reduce variance.

When Regularization Matters

Setting Problem Solution
Large VARs K variables → K²p parameters Minnesota prior (shrinkage)
Forecasting Many predictors, short samples LASSO/Ridge selection
Factor models High-dimensional data PCA, factor extraction
Causal inference Many potential confounders Double LASSO

Ridge Regression

The Idea

Add an \(\ell_2\) penalty to prevent coefficients from exploding:

\[ \hat{\beta}^{\text{ridge}} = \arg\min_\beta \left\{ \sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\} \]

Equivalently: \[ \hat{\beta}^{\text{ridge}} = (X'X + \lambda I)^{-1} X'y \]

Key Properties

Property Implication
Shrinks toward zero All coefficients are shrunk
Never exactly zero Does NOT do variable selection
Continuous Stable, smooth as function of \(\lambda\)
Handles multicollinearity \((X'X + \lambda I)\) always invertible
Code
# Simulate correlated predictors
n <- 100
p <- 20
X <- matrix(rnorm(n * p), n, p)
# Add correlation
for (j in 2:p) X[, j] <- 0.7 * X[, j-1] + 0.3 * X[, j]

# True coefficients: sparse
beta_true <- c(3, -2, 0, 0, 1.5, rep(0, p - 5))
y <- X %*% beta_true + rnorm(n, 0, 2)

# Ridge path
ridge_fit <- glmnet(X, y, alpha = 0)  # alpha = 0 is Ridge

# Extract coefficients along lambda path
coef_matrix <- as.matrix(coef(ridge_fit)[-1, ])  # remove intercept
lambda_seq <- ridge_fit$lambda

ridge_df <- as_tibble(t(coef_matrix)) %>%
  mutate(lambda = lambda_seq) %>%
  pivot_longer(-lambda, names_to = "variable", values_to = "coefficient")

ggplot(ridge_df, aes(x = log(lambda), y = coefficient, color = variable)) +
  geom_line(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Ridge Regularization Path",
       subtitle = "Coefficients shrink toward zero as lambda increases",
       x = expression(log(lambda)), y = "Coefficient") +
  theme_minimal() +
  theme(legend.position = "none")

Ridge regularization path

Connection to Bayesian Priors

Ridge regression is equivalent to Bayesian regression with: \[ \beta_j \sim N(0, \tau^2), \quad \tau^2 = \frac{\sigma^2}{\lambda} \]

This is exactly the Minnesota prior intuition: shrink toward zero with tightness \(\lambda\).

LASSO

The Idea

Replace \(\ell_2\) penalty with \(\ell_1\) (absolute value):

\[ \hat{\beta}^{\text{LASSO}} = \arg\min_\beta \left\{ \sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^p |\beta_j| \right\} \]

Key Properties

Property Implication
Shrinks to exactly zero Automatic variable selection
Sparse solutions Most coefficients are zero
Non-smooth Path has kinks (not differentiable)
Selects one from correlated group Arbitrary among correlated predictors
Code
# LASSO path
lasso_fit <- glmnet(X, y, alpha = 1)  # alpha = 1 is LASSO

coef_lasso <- as.matrix(coef(lasso_fit)[-1, ])
lambda_lasso <- lasso_fit$lambda

lasso_df <- as_tibble(t(coef_lasso)) %>%
  mutate(lambda = lambda_lasso) %>%
  pivot_longer(-lambda, names_to = "variable", values_to = "coefficient")

# Compare at specific lambda
lambda_compare <- 0.5

ridge_coef <- coef(ridge_fit, s = lambda_compare)[-1]
lasso_coef <- coef(lasso_fit, s = lambda_compare)[-1]

compare_df <- tibble(
  variable = paste0("V", 1:p),
  Ridge = as.vector(ridge_coef),
  LASSO = as.vector(lasso_coef),
  True = beta_true
) %>%
  pivot_longer(c(Ridge, LASSO, True), names_to = "method", values_to = "coefficient")

ggplot(compare_df, aes(x = variable, y = coefficient, fill = method)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7) +
  scale_fill_manual(values = c("#e74c3c", "#3498db", "#2ecc71")) +
  labs(title = paste0("Ridge vs LASSO at lambda = ", lambda_compare),
       subtitle = "LASSO sets many coefficients exactly to zero",
       x = "Variable", y = "Coefficient", fill = NULL) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")

LASSO vs Ridge: sparsity
Code
ggplot(lasso_df, aes(x = log(lambda), y = coefficient, color = variable)) +
  geom_line(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "LASSO Regularization Path",
       subtitle = "Coefficients hit exactly zero at different lambda values",
       x = expression(log(lambda)), y = "Coefficient") +
  theme_minimal() +
  theme(legend.position = "none")

LASSO path: coefficients drop to exactly zero

Elastic Net

Combining Ridge and LASSO

\[ \hat{\beta}^{\text{EN}} = \arg\min_\beta \left\{ \sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda \left[ \alpha \sum_j |\beta_j| + (1-\alpha) \sum_j \beta_j^2 \right] \right\} \]

\(\alpha\) Method
0 Ridge
1 LASSO
0.5 Elastic Net (balanced)

When to Use Elastic Net

  • Correlated predictors: LASSO arbitrarily selects one; Elastic Net keeps groups together
  • \(p > n\): LASSO selects at most \(n\) variables; Elastic Net can select more
  • Default in practice: Often use \(\alpha = 0.5\) as a robust choice
Code
alpha_values <- c(0, 0.5, 1)
en_fits <- map(alpha_values, ~glmnet(X, y, alpha = .x))

en_cv <- map(alpha_values, ~cv.glmnet(X, y, alpha = .x))
optimal_lambdas <- map_dbl(en_cv, ~.x$lambda.min)

en_coefs <- map2_dfr(en_fits, alpha_values, function(fit, a) {
  lambda_opt <- optimal_lambdas[which(alpha_values == a)]
  coef_vec <- coef(fit, s = lambda_opt)[-1]
  tibble(
    variable = paste0("V", 1:p),
    coefficient = as.vector(coef_vec),
    alpha = paste0("alpha = ", a)
  )
})

ggplot(en_coefs, aes(x = variable, y = coefficient, fill = alpha)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Elastic Net: Effect of Mixing Parameter",
       subtitle = "alpha = 0 (Ridge), 0.5 (Elastic Net), 1 (LASSO)",
       x = "Variable", y = "Coefficient at optimal lambda", fill = NULL) +
  scale_fill_manual(values = c("#3498db", "#9b59b6", "#e74c3c")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")

Elastic Net for different alpha values

Choosing Lambda: Cross-Validation

K-Fold Cross-Validation

  1. Split data into K folds
  2. For each \(\lambda\):
    • Train on K-1 folds, predict on held-out fold
    • Repeat K times
    • Average prediction error
  3. Choose \(\lambda\) that minimizes CV error
Code
# Cross-validation for LASSO
cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)

cv_df <- tibble(
  log_lambda = log(cv_lasso$lambda),
  cvm = cv_lasso$cvm,
  cvup = cv_lasso$cvup,
  cvlo = cv_lasso$cvlo
)

ggplot(cv_df, aes(x = log_lambda, y = cvm)) +
  geom_point(color = "#e74c3c") +
  geom_errorbar(aes(ymin = cvlo, ymax = cvup), alpha = 0.3) +
  geom_vline(xintercept = log(cv_lasso$lambda.min), linetype = "dashed", color = "#3498db") +
  geom_vline(xintercept = log(cv_lasso$lambda.1se), linetype = "dotted", color = "#2ecc71") +
  annotate("text", x = log(cv_lasso$lambda.min) + 0.5, y = max(cv_df$cvm) * 0.9,
           label = "lambda.min", color = "#3498db") +
  annotate("text", x = log(cv_lasso$lambda.1se) + 0.5, y = max(cv_df$cvm) * 0.8,
           label = "lambda.1se", color = "#2ecc71") +
  labs(title = "10-Fold Cross-Validation for LASSO",
       subtitle = "lambda.min = minimum CV error; lambda.1se = most parsimonious within 1 SE",
       x = expression(log(lambda)), y = "Mean Cross-Validation Error") +
  theme_minimal()

Cross-validation for lambda selection

lambda.min vs lambda.1se

Choice Description When to Use
lambda.min Minimizes CV error Maximum predictive power
lambda.1se Largest \(\lambda\) within 1 SE of min More parsimonious, often preferred

Double/Debiased LASSO

The Problem with Naive LASSO for Inference

LASSO gives biased estimates and invalid standard errors because:

  1. Regularization bias: Shrinkage toward zero
  2. Model selection uncertainty: Which variables are “truly” zero?
  3. No valid confidence intervals in the classical sense

The Solution: Double Selection

For causal effect of \(D\) on \(Y\) with high-dimensional controls \(X\):

Step 1: LASSO of \(Y\) on \(X\) → select controls \(\hat{S}_Y\)

Step 2: LASSO of \(D\) on \(X\) → select controls \(\hat{S}_D\)

Step 3: OLS of \(Y\) on \(D\) and \(\hat{S}_Y \cup \hat{S}_D\)

This is the Belloni, Chernozhukov, Hansen (2014) approach.

Code
# Simulate causal setting
n <- 200
p <- 50  # many potential confounders

# Generate confounders
X <- matrix(rnorm(n * p), n, p)

# Treatment depends on some confounders
D <- 0.5 * X[, 1] + 0.3 * X[, 2] + 0.2 * X[, 5] + rnorm(n, 0, 1)

# Outcome depends on treatment and confounders
tau_true <- 2  # true causal effect
Y <- tau_true * D + 1.5 * X[, 1] + 0.8 * X[, 3] + 0.5 * X[, 5] + rnorm(n, 0, 1)

# Naive OLS (omitting confounders)
naive_ols <- lm(Y ~ D)
cat("Naive OLS (biased):", round(coef(naive_ols)[2], 3), "\n")
Naive OLS (biased): 2.581 
Code
# Double LASSO
# Step 1: Y on X
lasso_Y <- cv.glmnet(X, Y, alpha = 1)
selected_Y <- which(coef(lasso_Y, s = "lambda.1se")[-1] != 0)

# Step 2: D on X
lasso_D <- cv.glmnet(X, D, alpha = 1)
selected_D <- which(coef(lasso_D, s = "lambda.1se")[-1] != 0)

# Step 3: Union and OLS
selected_union <- unique(c(selected_Y, selected_D))
X_selected <- X[, selected_union, drop = FALSE]
double_lasso <- lm(Y ~ D + X_selected)

cat("Double LASSO:", round(coef(double_lasso)[2], 3), "\n")
Double LASSO: 1.976 
Code
cat("True effect:", tau_true, "\n")
True effect: 2 
Code
cat("Selected variables:", length(selected_union), "\n")
Selected variables: 4 
Code
# Compare estimates
comparison_df <- tibble(
  Method = c("Naive OLS", "Double LASSO", "Oracle OLS"),
  Estimate = c(
    coef(naive_ols)[2],
    coef(double_lasso)[2],
    coef(lm(Y ~ D + X[, c(1, 3, 5)]))[2]
  ),
  SE = c(
    summary(naive_ols)$coefficients[2, 2],
    summary(double_lasso)$coefficients[2, 2],
    summary(lm(Y ~ D + X[, c(1, 3, 5)]))$coefficients[2, 2]
  )
)

ggplot(comparison_df, aes(x = Method, y = Estimate)) +
  geom_point(size = 4, color = "#3498db") +
  geom_errorbar(aes(ymin = Estimate - 1.96 * SE, ymax = Estimate + 1.96 * SE),
                width = 0.2, color = "#3498db") +
  geom_hline(yintercept = tau_true, linetype = "dashed", color = "#e74c3c") +
  annotate("text", x = 3.3, y = tau_true + 0.1, label = "True effect", color = "#e74c3c") +
  labs(title = "Double LASSO vs Naive OLS",
       subtitle = "Double LASSO reduces omitted variable bias",
       y = "Estimated Treatment Effect") +
  theme_minimal()

Double LASSO for causal inference

The hdm Package in R

library(hdm)

# Double selection
ds_result <- rlassoEffect(x = X, y = Y, d = D, method = "double selection")
summary(ds_result)

# Partialling out
po_result <- rlassoEffect(x = X, y = Y, d = D, method = "partialling out")
summary(po_result)

Applications in Macro

Large Bayesian VARs (Banbura, Giannone, Reichlin 2010)

Problem: K-variable VAR(p) has \(K^2 p\) parameters.

Solution: Tighter Minnesota prior as dimension grows

\[ \lambda_1^{\text{large}} = \lambda_1^{\text{small}} \times \frac{K_{\text{small}}}{K_{\text{large}}} \]

This is implicit Ridge regularization.

Factor-Augmented VAR (FAVAR)

Extract factors from large dataset, include in VAR:

\[ \begin{pmatrix} F_t \\ Y_t \end{pmatrix} = A \begin{pmatrix} F_{t-1} \\ Y_{t-1} \end{pmatrix} + u_t \]

where \(F_t\) are principal components of hundreds of macro series.

Sparse Granger Causality

Use LASSO to identify which variables Granger-cause others in high-dimensional settings.

Code
# Simulate sparse VAR(1)
K <- 10  # variables
T_sim <- 200

# True VAR coefficient matrix (sparse)
A_true <- matrix(0, K, K)
A_true[1, 1] <- 0.7
A_true[2, 2] <- 0.6
A_true[3, 1] <- 0.3  # 1 → 3
A_true[4, 2] <- 0.25 # 2 → 4
A_true[5, 5] <- 0.8

# Simulate
Y_var <- matrix(0, T_sim, K)
for (t in 2:T_sim) {
  Y_var[t, ] <- A_true %*% Y_var[t-1, ] + rnorm(K, 0, 0.5)
}

# Estimate with LASSO (equation by equation)
lasso_A <- matrix(0, K, K)
for (k in 1:K) {
  y_k <- Y_var[-1, k]
  X_lag <- Y_var[-T_sim, ]
  cv_fit <- cv.glmnet(X_lag, y_k, alpha = 1)
  lasso_A[k, ] <- coef(cv_fit, s = "lambda.1se")[-1]
}

# Compare
comparison_var <- tibble(
  i = rep(1:K, each = K),
  j = rep(1:K, K),
  True = as.vector(t(A_true)),
  LASSO = as.vector(t(lasso_A))
) %>%
  filter(True != 0 | abs(LASSO) > 0.05)

ggplot(comparison_var, aes(x = True, y = LASSO)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "#e74c3c") +
  labs(title = "Sparse VAR: True vs LASSO Estimates",
       subtitle = "LASSO recovers non-zero coefficients, shrinks others to zero",
       x = "True Coefficient", y = "LASSO Estimate") +
  theme_minimal()

Sparse VAR: identifying causal links

Practical Implementation

The glmnet Workflow

library(glmnet)

# Standardize predictors (glmnet does this internally)
X <- scale(X)

# Ridge (alpha = 0)
ridge_fit <- glmnet(X, y, alpha = 0)
cv_ridge <- cv.glmnet(X, y, alpha = 0)
coef(cv_ridge, s = "lambda.min")

# LASSO (alpha = 1)
lasso_fit <- glmnet(X, y, alpha = 1)
cv_lasso <- cv.glmnet(X, y, alpha = 1)
coef(cv_lasso, s = "lambda.1se")

# Elastic Net (alpha = 0.5)
en_fit <- glmnet(X, y, alpha = 0.5)
cv_en <- cv.glmnet(X, y, alpha = 0.5)

# Predictions
predict(cv_lasso, newx = X_new, s = "lambda.min")

Choosing Alpha

Use nested cross-validation:

# Grid search over alpha
alpha_grid <- seq(0, 1, 0.1)
cv_errors <- sapply(alpha_grid, function(a) {
  cv_fit <- cv.glmnet(X, y, alpha = a)
  min(cv_fit$cvm)
})
best_alpha <- alpha_grid[which.min(cv_errors)]

Time Series Considerations

Standard CV is problematic for time series (breaks temporal structure).

Solutions:

  1. Rolling window CV: Train on \(t=1,\ldots,s\), test on \(s+1,\ldots,s+h\)
  2. Expanding window CV: Train on \(t=1,\ldots,s\), test on \(s+1\), then expand
  3. Block bootstrap: Resample blocks to preserve autocorrelation
Code
# Illustrate rolling window CV
T_total <- 100
train_size <- 60
h <- 10  # horizon

cv_scheme <- tibble(
  fold = rep(1:3, each = T_total),
  t = rep(1:T_total, 3),
  set = NA_character_
)

for (f in 1:3) {
  train_end <- train_size + (f - 1) * h
  test_end <- train_end + h
  cv_scheme <- cv_scheme %>%
    mutate(set = case_when(
      fold == f & t <= train_end ~ "Train",
      fold == f & t > train_end & t <= test_end ~ "Test",
      TRUE ~ set
    ))
}

ggplot(cv_scheme %>% filter(!is.na(set)), aes(x = t, y = factor(fold), fill = set)) +
  geom_tile() +
  scale_fill_manual(values = c("#2ecc71", "#e74c3c")) +
  labs(title = "Rolling Window Cross-Validation",
       subtitle = "Train window moves forward; test window follows",
       x = "Time", y = "CV Fold", fill = NULL) +
  theme_minimal()

Time series cross-validation schemes

Common Pitfalls

WarningWatch Out For
  1. Post-selection inference: Don’t report OLS standard errors after LASSO selection
  2. Standardization: Always standardize predictors for fair penalization
  3. Time series CV: Don’t use random splits for time series
  4. Interpretation: LASSO coefficients are shrunken—not the “true” effects
  5. Correlated predictors: LASSO picks arbitrarily; consider Elastic Net
  6. Oracle property: LASSO is consistent under stringent conditions (irrepresentable condition)

Summary

Method Penalty Selection When to Use
Ridge \(\ell_2\) No Multicollinearity, all predictors matter
LASSO \(\ell_1\) Yes Sparse truth, variable selection
Elastic Net Both Yes Correlated predictors, \(p > n\)
Double LASSO \(\ell_1\) Yes Causal inference with many controls
TipFor Macro Applications
  • Forecasting: LASSO/Ridge with rolling CV
  • Large VARs: Minnesota prior (implicit Ridge)
  • Causal inference: Double LASSO for high-dimensional controls
  • Model selection: Elastic Net as robust default

Key References

Regularization Methods

  • Tibshirani (1996) “Regression Shrinkage and Selection via the Lasso” JRSS-B
  • Zou & Hastie (2005) “Regularization and Variable Selection via the Elastic Net” JRSS-B
  • Hastie, Tibshirani & Wainwright (2015) Statistical Learning with Sparsity — Comprehensive textbook

High-Dimensional Inference

  • Belloni, Chernozhukov & Hansen (2014) “Inference on Treatment Effects after Selection” RES — Double LASSO
  • Chernozhukov et al. (2018) “Double/Debiased Machine Learning” Econometrics Journal
  • Athey & Imbens (2019) “Machine Learning Methods That Economists Should Know About” Annual Review

Macro Applications

  • Banbura, Giannone & Reichlin (2010) “Large Bayesian VARs” JAE
  • Stock & Watson (2012) “Generalized Shrinkage Methods for Forecasting” JAE
  • Giannone, Lenza & Primiceri (2021) “Economic Predictions with Big Data” JAE

Software

  • glmnet: Friedman, Hastie & Tibshirani — R package for penalized regression
  • hdm: Chernozhukov et al. — High-dimensional metrics in R