Bayesian Vector Autoregressions

Shrinkage, Minnesota Priors, and Posterior Inference

Why Bayesian VARs?

Vector Autoregressions are powerful tools for macroeconomic analysis, but they suffer from a fundamental problem: too many parameters, too little data.

The Curse of Dimensionality

A VAR(p) with K variables has: \[ \text{Parameters} = K^2 p + K = K(Kp + 1) \]

Variables (K)   Lags (p)   Parameters   Feasibility with T = 200
3               4          39           OK
5               4          105          Marginal
7               4          203          Problematic
10              4          410          Unreliable

With quarterly data since 1960, you have ~250 observations. A 7-variable VAR(4) estimates 203 parameters—almost one parameter per observation!

Code
library(dplyr)    # %>% and mutate()
library(ggplot2)

param_count <- expand.grid(K = 2:12, p = 1:8) %>%
  mutate(params = K * (K * p + 1))

ggplot(param_count, aes(x = K, y = params, color = factor(p))) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  geom_hline(yintercept = c(100, 200), linetype = "dashed", alpha = 0.5) +
  annotate("text", x = 11, y = 110, label = "T = 100", hjust = 0, size = 3) +
  annotate("text", x = 11, y = 210, label = "T = 200", hjust = 0, size = 3) +
  scale_color_viridis_d(option = "plasma") +
  labs(title = "VAR Parameters Explode with System Size",
       subtitle = "Dashed lines show typical macro sample sizes",
       x = "Number of Variables (K)", y = "Number of Parameters",
       color = "Lags (p)")

The curse of dimensionality in VARs

The OLS Problem

With many parameters and few observations:

  • Overfitting: in-sample fit is excellent, but out-of-sample forecasts are poor
  • Imprecision: huge standard errors and wide confidence intervals
  • Instability: small changes in the data lead to large changes in the estimates

The Bayesian Solution: Shrinkage

Key insight: We have prior beliefs about macroeconomic dynamics. Use them!

Reasonable priors for macro VARs:

  1. Variables are persistent (close to random walks)
  2. Own lags matter more than other variables’ lags
  3. Recent lags matter more than distant lags
  4. Cross-variable effects are typically small

Bayesian estimation shrinks estimates toward these beliefs, reducing variance at the cost of some bias. With many parameters, this variance-bias tradeoff strongly favors shrinkage.

Code
# Demonstrate shrinkage benefit
set.seed(123)
n_sim <- 1000
true_beta <- 0.3
n_obs <- 30

# OLS estimates (high variance)
ols_estimates <- replicate(n_sim, {
  x <- rnorm(n_obs)
  y <- true_beta * x + rnorm(n_obs, 0, 2)
  coef(lm(y ~ x - 1))
})

# Bayesian estimates with shrinkage prior toward 0
# Prior: beta ~ N(0, 0.5^2)
prior_var <- 0.5^2
bayes_estimates <- replicate(n_sim, {
  x <- rnorm(n_obs)
  y <- true_beta * x + rnorm(n_obs, 0, 2)

  # Posterior with known sigma = 2
  sigma2 <- 4
  post_var <- 1 / (1/prior_var + sum(x^2)/sigma2)
  post_mean <- post_var * (sum(x*y)/sigma2)
  post_mean
})

estimates_df <- data.frame(
  estimate = c(ols_estimates, bayes_estimates),
  method = rep(c("OLS (no shrinkage)", "Bayesian (shrinkage)"), each = n_sim)
)

ggplot(estimates_df, aes(x = estimate, fill = method)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = true_beta, linetype = "dashed", linewidth = 1) +
  annotate("text", x = true_beta + 0.1, y = 1.5,
           label = paste0("True β = ", true_beta), hjust = 0) +
  scale_fill_manual(values = c("OLS (no shrinkage)" = "#e74c3c",
                                "Bayesian (shrinkage)" = "#3498db")) +
  labs(title = "Shrinkage Reduces Estimation Variance",
       subtitle = "Bayesian estimates cluster closer to truth despite small bias",
       x = expression(hat(beta)), y = "Density",
       fill = "") +
  theme(legend.position = "top")

Shrinkage reduces variance at the cost of small bias

The Minnesota Prior

Litterman (1986) and Doan-Litterman-Sims (1984)

The Minnesota prior is the workhorse prior for macroeconomic BVARs. It encodes the belief that macro variables behave approximately like random walks.

Prior Mean

For variables in levels (GDP, prices, etc.):

\[ E[A_1] = I_K, \quad E[A_j] = 0 \text{ for } j > 1 \]

Each variable’s first own lag coefficient is centered at 1 (random walk), all other coefficients at 0.

For stationary variables (growth rates, spreads):

\[ E[A_j] = 0 \text{ for all } j \]

Prior Variance (The Shrinkage Structure)

The key innovation is the structured prior variance:

Own lag \(l\) of variable \(i\): \[ \text{Var}(a_{ii,l}) = \left(\frac{\lambda_1}{l^d}\right)^2 \]

Cross lag \(l\) of variable \(j\) on equation \(i\): \[ \text{Var}(a_{ij,l}) = \left(\frac{\lambda_1 \cdot \lambda_2 \cdot \sigma_i}{l^d \cdot \sigma_j}\right)^2 \]

Constant term: \[ \text{Var}(c_i) = (\lambda_1 \cdot \lambda_3)^2 \]

where:

  • \(\sigma_i\) = residual standard deviation from a univariate AR(p) fit to variable \(i\)
  • \(\lambda_1\) = overall tightness (controls shrinkage strength)
  • \(\lambda_2\) = cross-variable shrinkage (typically < 1)
  • \(\lambda_3\) = constant tightness (typically large, ~100)
  • \(d\) = lag decay rate (typically 1 or 2)
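These variance formulas can be sketched directly in code. The NumPy helper below builds the diagonal of the Minnesota prior variance for a stacked \(\text{vec}(B)\), one equation at a time; `minnesota_variances` is an illustrative function written for this document, not part of any package.

```python
import numpy as np

def minnesota_variances(sigma, p, lambda1=0.2, lambda2=0.5, lambda3=100, d=1):
    """Diagonal of the Minnesota prior variance for vec(B).

    sigma : residual std devs from univariate AR fits (length K).
    Ordering per equation: [constant, lag-1 vars, lag-2 vars, ...].
    """
    K = len(sigma)
    M = K * p + 1
    V = np.zeros(M * K)
    for eq in range(K):
        V[eq * M] = (lambda1 * lambda3) ** 2              # constant: loose
        for lag in range(1, p + 1):
            for var in range(K):
                idx = eq * M + 1 + (lag - 1) * K + var
                if var == eq:                             # own lag
                    V[idx] = (lambda1 / lag ** d) ** 2
                else:                                     # cross lag, sigma-ratio scaled
                    V[idx] = (lambda1 * lambda2 * sigma[eq] /
                              (lag ** d * sigma[var])) ** 2
    return V

v = minnesota_variances(np.array([1.0, 2.0]), p=2)
```

Note how the own-lag variance at lag 1 is \(\lambda_1^2 = 0.04\), shrinking to \(0.01\) at lag 2, while cross lags are tightened further by \(\lambda_2\) and the \(\sigma_i/\sigma_j\) ratio.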

Hyperparameter Interpretation

Parameter        Typical Range   Effect
\(\lambda_1\)    0.1 - 0.5       Smaller → more shrinkage toward the prior
\(\lambda_2\)    0.3 - 0.5       Smaller → own lags dominate cross effects
\(\lambda_3\)    ~100            Large → loose prior on the constant
\(d\)            1 or 2          Larger → faster decay for distant lags
Code
# Illustrate Minnesota prior variance structure
K <- 3
p <- 4
lambda1 <- 0.2
lambda2 <- 0.5
d <- 1

# Prior variance for different coefficient types
lags <- 1:p
own_var <- (lambda1 / lags^d)^2
cross_var <- (lambda1 * lambda2 / lags^d)^2

var_df <- data.frame(
  lag = rep(lags, 2),
  variance = c(own_var, cross_var),
  type = rep(c("Own lag", "Cross lag"), each = p)
)

ggplot(var_df, aes(x = lag, y = sqrt(variance), color = type)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_color_manual(values = c("Own lag" = "#3498db", "Cross lag" = "#e74c3c")) +
  labs(title = "Minnesota Prior: Standard Deviation by Lag",
       subtitle = expression(paste(lambda[1], " = 0.2, ", lambda[2], " = 0.5, d = 1")),
       x = "Lag", y = "Prior Standard Deviation",
       color = "Coefficient Type") +
  theme(legend.position = "top")

Minnesota prior variance structure

The Normal-Inverse-Wishart Prior

The Minnesota prior is typically implemented as a Normal-Inverse-Wishart (NIW) prior:

\[ \begin{aligned} \text{vec}(B) | \Sigma &\sim N(\text{vec}(B_0), \Sigma \otimes V_0) \\ \Sigma &\sim \text{Inverse-Wishart}(S_0, \nu_0) \end{aligned} \]

This is conjugate: the posterior has the same form with updated parameters.

Posterior: \[ \begin{aligned} V_{\text{post}} &= (V_0^{-1} + X'X)^{-1} \\ B_{\text{post}} &= V_{\text{post}}(V_0^{-1} B_0 + X'Y) \\ \nu_{\text{post}} &= \nu_0 + T \\ S_{\text{post}} &= S_0 + (Y - XB_{\text{post}})'(Y - XB_{\text{post}}) + (B_{\text{post}} - B_0)'V_0^{-1}(B_{\text{post}} - B_0) \end{aligned} \]
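The posterior updates above can be computed in a few lines. The NumPy sketch below uses simulated VAR(1) data and a simple diagonal \(V_0\) (loose on the constant, tight on the lags); it illustrates the conjugate update, not a full Minnesota specification.

```python
import numpy as np

# Sketch of the conjugate NIW posterior-mean updates on simulated data.
rng = np.random.default_rng(0)
T, K = 200, 2
M = K + 1                                     # constant + one lag of each variable

A_true = np.array([[0.7, 0.1],
                   [0.0, 0.6]])
data = np.zeros((T + 1, K))
for t in range(1, T + 1):
    data[t] = data[t - 1] @ A_true.T + rng.standard_normal(K)

Y = data[1:]                                  # (T, K)
X = np.column_stack([np.ones(T), data[:-1]])  # (T, M)

B0 = np.zeros((M, K))
B0[1:, :] = np.eye(K)                         # first own lag centered at 1
V0_inv = np.diag([0.01, 25.0, 25.0])          # loose constant, tight lags

# V_post = (V0^{-1} + X'X)^{-1},  B_post = V_post (V0^{-1} B0 + X'Y)
V_post = np.linalg.inv(V0_inv + X.T @ X)
B_post = V_post @ (V0_inv @ B0 + X.T @ Y)

B_ols = np.linalg.lstsq(X, Y, rcond=None)[0]  # B_post pulls B_ols toward B0
```

Since \(B_{\text{post}} - B_0 = V_{\text{post}} X'X (B_{\text{OLS}} - B_0)\), the posterior mean is a matrix-weighted average that shrinks the OLS estimate toward the prior mean.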

GLP: Data-Driven Hyperparameters

Giannone, Lenza & Primiceri (2015)

How do we choose \(\lambda_1, \lambda_2\)? The traditional approach: researcher judgment or cross-validation.

GLP innovation: Choose hyperparameters by maximizing the marginal likelihood:

\[ p(Y | \lambda) = \int p(Y | B, \Sigma) \cdot p(B, \Sigma | \lambda) \, dB \, d\Sigma \]

For the NIW prior, this integral has a closed-form solution!

The Algorithm

  1. Define a grid over \(\lambda = (\lambda_1, \lambda_2, \ldots)\)
  2. For each \(\lambda\), compute \(p(Y | \lambda)\) analytically
  3. Choose \(\lambda^* = \arg\max_\lambda p(Y | \lambda)\)
  4. Estimate BVAR with optimal hyperparameters
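The grid-search logic can be made concrete on a stylized univariate analog where the marginal likelihood is also available in closed form: \(y_i = \beta x_i + e_i\) with \(e_i \sim N(0, \sigma^2)\) and prior \(\beta \sim N(0, \lambda^2)\), so marginally \(y \sim N(0, \sigma^2 I + \lambda^2 x x')\). This is a toy model chosen for transparency, not the matrix-variate formula GLP actually use.

```python
import numpy as np

# GLP idea in miniature: choose the shrinkage hyperparameter lambda
# by maximizing a closed-form log marginal likelihood over a grid.
rng = np.random.default_rng(1)
T, sigma2, beta_true = 100, 1.0, 0.3
x = rng.standard_normal(T)
y = beta_true * x + rng.normal(0, np.sqrt(sigma2), T)

Sxx, Sxy, Syy = x @ x, x @ y, y @ y

def log_marginal(lam):
    # Sherman-Morrison: |Sigma_y| and y' Sigma_y^{-1} y for sigma2*I + lam^2 xx'
    s = sigma2 + lam**2 * Sxx
    quad = (Syy - lam**2 * Sxy**2 / s) / sigma2
    return -0.5 * (T * np.log(2 * np.pi * sigma2) + np.log(s / sigma2) + quad)

grid = np.linspace(0.01, 1.0, 100)
log_ml = np.array([log_marginal(l) for l in grid])
lam_star = grid[np.argmax(log_ml)]   # data-chosen shrinkage
```

The same three steps — grid, closed-form evaluation, argmax — carry over to the NIW case, where the marginal likelihood is a matrix-variate t density.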

Why This Matters

  • Data-driven: No arbitrary hyperparameter choices
  • Automatic: Adapts to different datasets
  • Forecast accuracy: GLP-optimized BVARs consistently outperform OLS VARs out of sample
Code
# Illustrate marginal likelihood as function of lambda
# (Simulated for illustration)
lambda_grid <- seq(0.01, 1, length.out = 50)

# Simulated marginal likelihood (realistic shape)
set.seed(42)
ml_values <- -50 - 20*(lambda_grid - 0.25)^2 + rnorm(50, 0, 2)

ml_df <- data.frame(lambda = lambda_grid, log_ml = ml_values)
optimal_lambda <- lambda_grid[which.max(ml_values)]

ggplot(ml_df, aes(x = lambda, y = log_ml)) +
  geom_line(linewidth = 1, color = "#3498db") +
  geom_vline(xintercept = optimal_lambda, linetype = "dashed", color = "#e74c3c") +
  annotate("point", x = optimal_lambda, y = max(ml_values),
           size = 4, color = "#e74c3c") +
  annotate("text", x = optimal_lambda + 0.05, y = max(ml_values) - 5,
           label = paste0("λ* = ", round(optimal_lambda, 2)), hjust = 0,
           color = "#e74c3c") +
  labs(title = "GLP: Marginal Likelihood Optimization",
       subtitle = "Optimal shrinkage is data-determined",
       x = expression(lambda[1] ~ "(Overall Tightness)"),
       y = "Log Marginal Likelihood")

GLP marginal likelihood optimization

Additional Priors: Sum-of-Coefficients and No-Cointegration

GLP (2015) also optimizes over additional priors that improve forecasting:

Sum-of-Coefficients (Doan-Litterman-Sims)

Belief: If all variables equal their sample means, the forecast should also be the mean.

Implemented by adding dummy observations. Controlled by hyperparameter \(\mu\).

No-Cointegration (Sims-Zha)

Belief: Variables have unit roots but are not cointegrated.

Also implemented via dummy observations. Controlled by hyperparameter \(\delta\).

GLP jointly optimizes \((\lambda_1, \lambda_2, \mu, \delta)\) via marginal likelihood.

Gibbs Sampling for BVAR

While the NIW prior allows analytical posteriors, we often use Gibbs sampling to:

  • Easily compute posterior quantities (IRFs, FEVDs)
  • Extend to non-conjugate priors
  • Handle structural identification

The Gibbs Sampler

Block 1: Draw \(B | \Sigma, Y\) \[ \text{vec}(B) | \Sigma, Y \sim N(\text{vec}(B_{\text{post}}), \Sigma \otimes V_{\text{post}}) \]

Block 2: Draw \(\Sigma | B, Y\) \[ \Sigma | B, Y \sim \text{Inverse-Wishart}(S_{\text{post}}, \nu_{\text{post}}) \]

Iterate for \(G\) draws, discarding the first \(G_{\text{burn}}\) as burn-in.

Code
# Simplified BVAR Gibbs sampler demonstration
library(mvtnorm)  # rmvnorm() used below
set.seed(123)

# Generate synthetic VAR(1) data
T_obs <- 100
K <- 2

# True parameters
A_true <- matrix(c(0.8, 0.1, 0.05, 0.7), K, K)
Sigma_true <- matrix(c(1, 0.3, 0.3, 1), K, K)

# Generate data
Y <- matrix(0, T_obs, K)
Y[1, ] <- rnorm(K)
for (t in 2:T_obs) {
  Y[t, ] <- Y[t-1, ] %*% t(A_true) + rmvnorm(1, sigma = Sigma_true)
}

# Build matrices
y <- Y[2:T_obs, ]
X <- cbind(1, Y[1:(T_obs-1), ])
T_eff <- nrow(y)
M <- ncol(X)

# Minnesota-style prior
B0 <- matrix(0, M, K)
B0[2:3, ] <- diag(K) * 0.9  # Prior: close to random walk

# Prior precision (diagonal, tight) - must be (M*K) x (M*K) for vec(B)
lambda1 <- 0.2
# Build prior variance for each coefficient in vec(B)
# Structure: [const_eq1, y1_eq1, y2_eq1, const_eq2, y1_eq2, y2_eq2]
v0_diag <- numeric(M * K)
for (eq in 1:K) {
  idx_const <- (eq - 1) * M + 1
  v0_diag[idx_const] <- 100  # Loose prior on constant
  for (var in 1:K) {
    idx_var <- (eq - 1) * M + 1 + var
    v0_diag[idx_var] <- lambda1^2  # Tight prior on AR coefficients
  }
}
V0_inv <- diag(1 / v0_diag)

# Prior for Sigma
nu0 <- K + 2
S0 <- diag(K)

# Gibbs sampler
n_draw <- 2000
n_burn <- 500
B_draws <- array(0, dim = c(M, K, n_draw))
Sigma_draws <- array(0, dim = c(K, K, n_draw))

# Initialize
B_curr <- solve(crossprod(X)) %*% crossprod(X, y)
Sigma_curr <- crossprod(y - X %*% B_curr) / T_eff

for (g in 1:(n_draw + n_burn)) {
  # Draw B | Sigma using vectorized form
  # vec(B) | Sigma ~ N(b_post, V_post) where V_post = (V0_inv + Sigma^{-1} ⊗ X'X)^{-1}
  Sigma_inv <- solve(Sigma_curr)
  V_post_inv <- V0_inv + kronecker(Sigma_inv, crossprod(X))
  V_post <- solve(V_post_inv)
  b_post <- V_post %*% (V0_inv %*% as.vector(B0) +
                         kronecker(Sigma_inv, t(X)) %*% as.vector(y))
  B_curr <- matrix(rmvnorm(1, b_post, V_post), M, K)

  # Draw Sigma | B
  U <- y - X %*% B_curr
  S_post <- S0 + crossprod(U)
  Sigma_curr <- solve(rWishart(1, nu0 + T_eff, solve(S_post))[,,1])

  # Store
  if (g > n_burn) {
    B_draws[,,g - n_burn] <- B_curr
    Sigma_draws[,,g - n_burn] <- Sigma_curr
  }
}

# Trace plot for A[1,1]
trace_df <- data.frame(
  iteration = 1:n_draw,
  value = B_draws[2, 1, ]  # A[1,1] coefficient
)

ggplot(trace_df, aes(x = iteration, y = value)) +
  geom_line(alpha = 0.7, color = "#3498db") +
  geom_hline(yintercept = A_true[1,1], linetype = "dashed", color = "#e74c3c") +
  geom_hline(yintercept = mean(trace_df$value), linetype = "dotted", color = "#2ecc71") +
  annotate("text", x = n_draw * 0.8, y = A_true[1,1] + 0.05,
           label = paste0("True = ", A_true[1,1]), color = "#e74c3c") +
  labs(title = "Gibbs Sampler: Trace Plot for VAR Coefficient",
       subtitle = "Green = posterior mean, Red = true value",
       x = "Iteration (post burn-in)", y = expression(A[11]))

Gibbs sampler trace plots for BVAR coefficient

Structural Identification in BVAR

Bayesian VARs estimate reduced-form parameters \((B, \Sigma)\). For structural analysis, we need to identify structural shocks.

Cholesky Identification

The simplest approach: impose a recursive structure.

\[ \Sigma = PP', \quad P = \text{chol}(\Sigma, \text{lower}) \]

For each posterior draw \(\Sigma^{(g)}\), compute \(P^{(g)} = \text{chol}(\Sigma^{(g)})\).

The ordering matters: variable ordered first is not affected contemporaneously by others.
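A minimal NumPy check makes the recursive structure visible: the Cholesky factor of a toy covariance matrix is lower triangular, so the first-ordered variable loads only on the first shock at impact.

```python
import numpy as np

# Cholesky identification on a toy 2x2 covariance matrix.
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
P = np.linalg.cholesky(Sigma)   # lower-triangular factor with Sigma = P P'

# P[0, 1] == 0: variable 1 does not respond to shock 2 on impact,
# which is exactly the recursive ordering assumption.
```

For posterior inference, this factorization is simply applied to each draw \(\Sigma^{(g)}\).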

Sign Restrictions in BVAR

Bayesian framework naturally handles set-identification:

  1. For each posterior draw \((B^{(g)}, \Sigma^{(g)})\):
     a. Compute \(P^{(g)} = \text{chol}(\Sigma^{(g)})\)
     b. Draw an orthogonal rotation \(Q\) uniformly (Haar measure)
     c. Form the candidate impact matrix \(\tilde{P}^{(g)} = P^{(g)} Q\)
     d. Check the sign restrictions on the implied IRFs
     e. If satisfied, keep the draw; otherwise redraw \(Q\)
  2. Report the distribution across accepted draws
Code
# Sign restriction: monetary tightening
# - Raises interest rate (positive)
# - Lowers output (negative)
# - Lowers inflation (negative)

restrictions <- data.frame(
  Variable = c("Interest Rate", "Output", "Inflation"),
  Sign = c("+", "-", "-"),
  Interpretation = c("Policy instrument", "Demand falls", "Prices fall")
)

knitr::kable(restrictions,
             caption = "Example Sign Restrictions for Monetary Policy Shock")
Example Sign Restrictions for Monetary Policy Shock
Variable Sign Interpretation
Interest Rate + Policy instrument
Output - Demand falls
Inflation - Prices fall
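The accept/reject scheme above can be sketched for a single covariance matrix. The covariance below is made up for illustration, and the restrictions are checked only on impact for brevity; a real application would check them over several IRF horizons and over all posterior draws.

```python
import numpy as np

# Sign-restriction accept/reject for a monetary policy shock:
# rate up (+), output down (-), inflation down (-) on impact.
rng = np.random.default_rng(42)
Sigma = np.array([[ 1.0, -0.2, -0.1],
                  [-0.2,  1.0,  0.3],
                  [-0.1,  0.3,  1.0]])
P = np.linalg.cholesky(Sigma)

def draw_haar_q(k, rng):
    """Uniform (Haar) orthogonal matrix via QR with a column-sign fix."""
    W = rng.standard_normal((k, k))
    Q, R = np.linalg.qr(W)
    return Q * np.sign(np.diag(R))   # flip columns so the draw is Haar-uniform

accepted = None
for _ in range(10000):
    Q = draw_haar_q(3, rng)
    impact = P @ Q                    # candidate impact matrix P * Q
    col = impact[:, 0]                # candidate monetary policy shock
    if col[0] > 0 and col[1] < 0 and col[2] < 0:
        accepted = impact
        break
```

Draws failing the sign pattern are simply discarded; the retained impact matrices form the set-identified posterior.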

IRFs and FEVDs from Posterior

Computing Posterior IRFs

For each posterior draw \((B^{(g)}, \Sigma^{(g)})\):

  1. Extract coefficient matrices \(A_1^{(g)}, \ldots, A_p^{(g)}\) from \(B^{(g)}\)
  2. Compute structural impact \(P^{(g)}\) (Cholesky or sign-restricted)
  3. Compute IRF at each horizon using MA representation: \[ \Phi_h = \sum_{j=1}^{\min(h,p)} \Phi_{h-j} A_j, \quad \Phi_0 = I \]
  4. Structural IRF: \(\text{IRF}_h = \Phi_h P\)

Report posterior median and credible bands (68%, 95%).

Code
# Compute IRFs from Gibbs draws
H <- 20  # horizons
K <- 2
n_draws_use <- 500  # use subset for speed

IRF_draws <- array(0, dim = c(K, K, H+1, n_draws_use))

for (g in 1:n_draws_use) {
  # Extract A matrix (coefficients on lagged Y)
  A1 <- t(B_draws[2:3, , g])

  # Cholesky impact
  P <- t(chol(Sigma_draws[,,g]))

  # Compute IRFs
  Phi <- diag(K)
  IRF_draws[,,1,g] <- P

  for (h in 1:H) {
    Phi <- Phi %*% A1
    IRF_draws[,,h+1,g] <- Phi %*% P
  }
}

# Extract IRF of variable 1 to shock 1
irf_11 <- IRF_draws[1, 1, , ]

# Posterior summaries
irf_median <- apply(irf_11, 1, median)
irf_lower <- apply(irf_11, 1, quantile, 0.16)
irf_upper <- apply(irf_11, 1, quantile, 0.84)
irf_lower95 <- apply(irf_11, 1, quantile, 0.025)
irf_upper95 <- apply(irf_11, 1, quantile, 0.975)

irf_df <- data.frame(
  horizon = 0:H,
  median = irf_median,
  lower68 = irf_lower,
  upper68 = irf_upper,
  lower95 = irf_lower95,
  upper95 = irf_upper95
)

ggplot(irf_df, aes(x = horizon)) +
  geom_ribbon(aes(ymin = lower95, ymax = upper95), fill = "#3498db", alpha = 0.2) +
  geom_ribbon(aes(ymin = lower68, ymax = upper68), fill = "#3498db", alpha = 0.4) +
  geom_line(aes(y = median), linewidth = 1.2, color = "#2c3e50") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Bayesian VAR: Impulse Response Function",
       subtitle = "Posterior median with 68% and 95% credible bands",
       x = "Horizon", y = "Response of Y1 to Shock 1")

Impulse responses with posterior credible bands

Forecast Error Variance Decomposition

FEVD at horizon \(h\): \[ \text{FEVD}_{i,j}(h) = \frac{\sum_{s=0}^{h} (\text{IRF}_{i,j,s})^2}{\sum_{s=0}^{h} \sum_{k=1}^{K} (\text{IRF}_{i,k,s})^2} \]

Compute for each posterior draw, report median and bands.
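Given an IRF array for one draw, the FEVD formula above is a cumulative sum of squared responses, normalized over shocks. The NumPy helper below (written for this document, not a package function) computes it for an array indexed as `irf[i, j, s]` = response of variable \(i\) to shock \(j\) at horizon \(s\).

```python
import numpy as np

def fevd_from_irf(irf):
    """FEVD from an IRF array of shape (K, K, H+1)."""
    # cumulative squared responses over horizons s = 0..h
    csq = np.cumsum(irf ** 2, axis=2)           # (K, K, H+1)
    total = csq.sum(axis=1, keepdims=True)      # denominator: sum over shocks k
    return csq / total                          # shares sum to 1 across shocks

# Toy example: 2 variables, 2 shocks, horizons 0..3 of made-up IRFs
rng = np.random.default_rng(7)
irf = rng.standard_normal((2, 2, 4))
fevd = fevd_from_irf(irf)
```

Applying this to every posterior draw and taking quantiles across draws gives the median FEVD with credible bands.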

Implementation in R

Using the BVAR Package

The BVAR package implements Minnesota priors with GLP-style optimization.

library(BVAR)

# Prepare data (T × K matrix)
data_mat <- as.matrix(macro_data[, c("gdp_growth", "inflation", "interest_rate")])

# Minnesota prior with GLP optimization
mn_prior <- bv_minnesota(
  lambda = bv_lambda(mode = 0.2, sd = 0.4, min = 0.0001, max = 5),
  alpha = bv_alpha(mode = 2),   # lag decay
  var = 1e07                     # constant variance
)

# Estimate BVAR
bvar_model <- bvar(
  data = data_mat,
  lags = 4,
  n_draw = 20000,
  n_burn = 5000,
  priors = mn_prior,
  mh = bv_mh(scale_hess = 0.01, adjust_acc = TRUE)
)

# Diagnostics
summary(bvar_model)
plot(bvar_model)  # trace plots

# IRFs (Cholesky identification)
irf_results <- irf(bvar_model, horizon = 20, identification = TRUE)
plot(irf_results)

# FEVD
fevd_results <- fevd(bvar_model, horizon = 20)
plot(fevd_results)

# Forecasting
forecasts <- predict(bvar_model, horizon = 8)
plot(forecasts)

Using the bvartools Package

More flexible, allows custom prior specification.

library(bvartools)

# Generate VAR data matrices
bvar_data <- gen_var(data_ts, p = 4, deterministic = "const")
y <- t(bvar_data$data$Y)
x <- t(bvar_data$data$Z)

# Set up Minnesota prior manually
K <- ncol(data_mat)
M <- K * 4 + 1

# Prior mean: random walk on first lag
a_mu_prior <- matrix(0, M, K)
a_mu_prior[2:(K+1), ] <- diag(K)

# Prior precision (simplified here to a flat diagonal; a full Minnesota
# structure would scale entries by lag and residual variances)
a_v_i_prior <- diag(1, M * K) * 0.01

# Prior for Sigma
u_sigma_df_prior <- K + 2
u_sigma_scale_prior <- diag(1, K)

# Estimate
bvar_est <- bvar(
  y = y, x = x,
  A = list(mu = a_mu_prior, V_i = a_v_i_prior),
  Sigma = list(df = u_sigma_df_prior, scale = u_sigma_scale_prior),
  iterations = 20000, burnin = 5000
)

# IRFs
bvar_irf <- irf(bvar_est,
                impulse = "interest_rate",
                response = "inflation",
                n.ahead = 20,
                type = "feir",  # forecast error IRF
                ci = 0.95)
plot(bvar_irf)

Manual Implementation

For full control and understanding:

bvar_gibbs <- function(Y, p, n_draw = 10000, n_burn = 2000,
                        lambda1 = 0.2, lambda2 = 0.5, lambda3 = 100) {
  T_full <- nrow(Y)
  K <- ncol(Y)
  M <- K * p + 1

  # Build data matrices
  y <- Y[(p + 1):T_full, ]
  T_eff <- nrow(y)
  X <- matrix(1, T_eff, 1)  # constant column
  for (j in 1:p) {
    X <- cbind(X, Y[(p + 1 - j):(T_full - j), ])
  }

  # Minnesota prior mean
  B0 <- matrix(0, M, K)
  B0[2:(K + 1), ] <- diag(K)  # first lag = identity

  # Scaling from AR residuals
  sigma_i <- numeric(K)
  for (i in 1:K) {
    ar_fit <- ar(Y[, i], order.max = p, method = "ols")
    sigma_i[i] <- sqrt(ar_fit$var.pred)
  }

  # Prior variance (Minnesota structure)
  V0_diag <- numeric(M * K)
  for (eq in 1:K) {
    idx <- (eq - 1) * M + 1
    V0_diag[idx] <- (lambda1 * lambda3)^2  # constant

    for (lag in 1:p) {
      for (var in 1:K) {
        idx <- (eq - 1) * M + 1 + (lag - 1) * K + var
        if (var == eq) {
          V0_diag[idx] <- (lambda1 / lag)^2
        } else {
          V0_diag[idx] <- (lambda1 * lambda2 * sigma_i[eq] / (lag * sigma_i[var]))^2
        }
      }
    }
  }
  V0_inv <- diag(1 / V0_diag)

  # Inverse Wishart prior
  v0 <- K + 2
  S0 <- diag(sigma_i^2)

  # Storage and initialization
  B_draws <- array(0, dim = c(M, K, n_draw))
  Sigma_draws <- array(0, dim = c(K, K, n_draw))
  B_curr <- solve(crossprod(X)) %*% crossprod(X, y)
  Sigma_curr <- crossprod(y - X %*% B_curr) / T_eff

  # Gibbs sampler
  for (g in 1:(n_draw + n_burn)) {
    # Block 1: B | Sigma
    Sigma_inv <- solve(Sigma_curr)
    V_post_inv <- V0_inv + kronecker(Sigma_inv, crossprod(X))
    V_post <- solve(V_post_inv)
    b_post <- V_post %*% (V0_inv %*% as.vector(B0) +
                           kronecker(Sigma_inv, t(X)) %*% as.vector(y))
    B_curr <- matrix(mvtnorm::rmvnorm(1, b_post, V_post), M, K)

    # Block 2: Sigma | B
    U <- y - X %*% B_curr
    S_post <- S0 + crossprod(U)
    v_post <- v0 + T_eff
    Sigma_curr <- solve(rWishart(1, v_post, solve(S_post))[,,1])

    if (g > n_burn) {
      B_draws[,,g - n_burn] <- B_curr
      Sigma_draws[,,g - n_burn] <- Sigma_curr
    }
  }

  list(B = B_draws, Sigma = Sigma_draws, K = K, p = p)
}

Implementation in Python

For those who prefer Python, here are equivalent implementations.

Using NumPy/SciPy (Manual)

import numpy as np
from scipy import stats
from scipy.linalg import cholesky, solve

def bvar_gibbs(Y, p, n_draw=10000, n_burn=2000,
               lambda1=0.2, lambda2=0.5, lambda3=100):
    """
    Bayesian VAR with Minnesota prior via Gibbs sampling.

    Parameters
    ----------
    Y : ndarray (T, K)
        Data matrix
    p : int
        Lag order
    n_draw : int
        Number of posterior draws
    n_burn : int
        Burn-in period
    lambda1, lambda2, lambda3 : float
        Minnesota prior hyperparameters

    Returns
    -------
    dict with posterior draws of B and Sigma
    """
    T_full, K = Y.shape
    M = K * p + 1

    # Build data matrices
    y = Y[p:]
    T_eff = len(y)

    X = np.ones((T_eff, 1))
    for j in range(1, p + 1):
        X = np.hstack([X, Y[p-j:T_full-j]])

    # Minnesota prior mean (random walk on first lag)
    B0 = np.zeros((M, K))
    B0[1:K+1, :] = np.eye(K)

    # AR residual scaling
    sigma_i = np.zeros(K)
    for i in range(K):
        ar_resid = Y[p:, i] - 0.9 * Y[p-1:-1, i]  # crude AR(1) proxy
        sigma_i[i] = np.std(ar_resid)

    # Prior variance (Minnesota structure)
    V0_diag = np.zeros(M * K)
    for eq in range(K):
        idx = eq * M
        V0_diag[idx] = (lambda1 * lambda3)**2

        for lag in range(1, p + 1):
            for var in range(K):
                idx = eq * M + 1 + (lag - 1) * K + var
                if var == eq:
                    V0_diag[idx] = (lambda1 / lag)**2
                else:
                    V0_diag[idx] = (lambda1 * lambda2 * sigma_i[eq] /
                                    (lag * sigma_i[var]))**2

    V0_inv = np.diag(1 / V0_diag)

    # Inverse Wishart prior
    v0 = K + 2
    S0 = np.diag(sigma_i**2)

    # Initialize
    B_curr = np.linalg.lstsq(X, y, rcond=None)[0]
    U = y - X @ B_curr
    Sigma_curr = U.T @ U / T_eff

    # Storage
    B_draws = np.zeros((M, K, n_draw))
    Sigma_draws = np.zeros((K, K, n_draw))

    for g in range(n_draw + n_burn):
        # Block 1: Draw B | Sigma
        Sigma_inv = np.linalg.inv(Sigma_curr)
        V_post_inv = V0_inv + np.kron(Sigma_inv, X.T @ X)
        V_post = np.linalg.inv(V_post_inv)

        b0_vec = B0.flatten('F')
        xy_vec = (X.T @ y @ Sigma_inv).flatten('F')
        b_post = V_post @ (V0_inv @ b0_vec + xy_vec)

        B_curr = np.random.multivariate_normal(b_post, V_post).reshape((M, K), order='F')

        # Block 2: Draw Sigma | B
        U = y - X @ B_curr
        S_post = S0 + U.T @ U
        v_post = v0 + T_eff

        # Inverse Wishart draw
        Sigma_curr = stats.invwishart.rvs(df=v_post, scale=S_post)

        if g >= n_burn:
            B_draws[:, :, g - n_burn] = B_curr
            Sigma_draws[:, :, g - n_burn] = Sigma_curr

    return {'B': B_draws, 'Sigma': Sigma_draws, 'K': K, 'p': p}


def compute_irfs(bvar_result, shock_idx, H=20):
    """
    Compute IRFs from BVAR posterior draws.
    """
    B_draws = bvar_result['B']
    Sigma_draws = bvar_result['Sigma']
    K = bvar_result['K']
    p = bvar_result['p']
    n_draw = B_draws.shape[2]

    IRF_draws = np.zeros((K, H + 1, n_draw))

    for g in range(n_draw):
        # Extract A matrices
        A = [B_draws[1 + j*K:1 + (j+1)*K, :, g].T for j in range(p)]

        # Cholesky impact
        P = cholesky(Sigma_draws[:, :, g], lower=True)

        # Compute IRFs via the MA recursion: Phi_h = sum_j Phi_{h-j} A_j
        Phis = [np.eye(K)]
        IRF_draws[:, 0, g] = P[:, shock_idx]

        for h in range(1, H + 1):
            Phi_h = sum(Phis[h - j] @ A[j - 1] for j in range(1, min(h, p) + 1))
            Phis.append(Phi_h)
            IRF_draws[:, h, g] = Phi_h @ P[:, shock_idx]

    return {
        'median': np.median(IRF_draws, axis=2),
        'lower': np.percentile(IRF_draws, 16, axis=2),
        'upper': np.percentile(IRF_draws, 84, axis=2)
    }

Using PyMC (Probabilistic Programming)

import pymc as pm
import numpy as np
import arviz as az

def bvar_pymc(Y, p, n_draw=2000, n_tune=1000):
    """
    Bayesian VAR using PyMC.
    """
    T_full, K = Y.shape

    # Build data matrices
    y = Y[p:]
    T_eff = len(y)

    X = np.ones((T_eff, 1))
    for j in range(1, p + 1):
        X = np.hstack([X, Y[p-j:T_full-j]])
    M = X.shape[1]

    with pm.Model() as bvar_model:
        # Priors
        # Coefficients: simple Normal shrinkage toward 0
        # (a stylized stand-in for the full Minnesota structure)
        B = pm.Normal('B', mu=0, sigma=0.5, shape=(M, K))

        # Covariance: LKJ prior on correlation + half-normal on scales
        chol, corr, stds = pm.LKJCholeskyCov(
            'chol', n=K, eta=2, sd_dist=pm.HalfNormal.dist(sigma=1)
        )
        Sigma = pm.Deterministic('Sigma', chol @ chol.T)

        # Likelihood
        mu = pm.math.dot(X, B)
        y_obs = pm.MvNormal('y', mu=mu, chol=chol, observed=y)

        # Sample
        trace = pm.sample(n_draw, tune=n_tune, cores=2,
                          return_inferencedata=True)

    return trace

# Usage:
# trace = bvar_pymc(Y, p=4)
# az.plot_trace(trace, var_names=['B'])
# az.summary(trace, var_names=['B', 'Sigma'])

Summary

Key takeaways:

  1. Why Bayesian VAR: Shrinkage prevents overfitting when parameters outnumber observations

  2. Minnesota Prior: Encodes beliefs that macro variables are persistent, own lags dominate, recent lags matter more

  3. GLP Optimization: Data-driven hyperparameter selection via marginal likelihood maximization

  4. Gibbs Sampling: Iteratively draw \(B|\Sigma\) and \(\Sigma|B\) from conditional posteriors

  5. Structural Identification: Cholesky (recursive) or sign restrictions, applied to each posterior draw

  6. Reporting: Present posterior median IRFs with 68% and 95% credible bands

  7. Software:

    • R: BVAR (GLP optimization), bvartools (flexible), bvarsv (TVP-VAR)
    • Python: Manual with NumPy/SciPy, or PyMC for probabilistic programming

Key References

Foundational:

  • Litterman (1986). “Forecasting with Bayesian VARs.” JBES
  • Doan, Litterman & Sims (1984). “Forecasting and Conditional Projection.”
  • Sims & Zha (1998). “Bayesian Methods for Dynamic Multivariate Models.” IER

Modern Methods:

  • Giannone, Lenza & Primiceri (2015). “Prior Selection for VARs.” REStat
  • Banbura, Giannone & Reichlin (2010). “Large Bayesian VARs.” JAE
  • Carriero, Clark & Marcellino (2019). “Large BVARs with Stochastic Volatility.” JoE

Textbooks:

  • Koop (2003). Bayesian Econometrics
  • Blake & Mumtaz (2012). Applied Bayesian Econometrics for Central Bankers

R Packages:

  • BVAR: Kuschnig & Vashold — Minnesota prior with GLP
  • bvartools: Mohr — Flexible BVAR toolkit
  • bvarsv: Krueger — TVP-VAR with stochastic volatility

Next: Module 9: Bayesian Panel Methods