The factorselect package implements six estimators for
determining the number of factors in large-dimensional approximate
factor models. The estimators differ in their theoretical assumptions,
computational approach, and finite-sample performance.
The recommended estimator for most applications is the Ahn and Horenstein (2013) eigenvalue ratio estimator, which is robust to perturbations in the eigenvalue spectrum and performs well when only one of N or T is large.
The package includes a helper function, simulate_factor_model, for simulating data from a static approximate factor model:
\[X = F \Lambda' + E\]
where \(F\) is a \(T \times k\) matrix of factors, \(\Lambda\) is an \(N \times k\) matrix of loadings, and \(E\) is a \(T \times N\) matrix of idiosyncratic errors, so that \(X\) is \(T \times N\).
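To make the dimensions concrete, here is a minimal base-R sketch of drawing one panel from this model. It is not the package's simulator: the function name `sim_factor_sketch`, the standard normal factors and loadings, and the i.i.d. Gaussian errors are all assumptions made for illustration.

```r
# Illustrative only: a hand-rolled draw from X = F Lambda' + E.
# sim_factor_sketch is a hypothetical name, not part of factorselect.
sim_factor_sketch <- function(N, TT, k, sd = 1) {
  F_mat  <- matrix(rnorm(TT * k), nrow = TT, ncol = k)           # T x k factors
  Lambda <- matrix(rnorm(N * k),  nrow = N,  ncol = k)           # N x k loadings
  E      <- matrix(rnorm(TT * N, sd = sd), nrow = TT, ncol = N)  # T x N errors
  F_mat %*% t(Lambda) + E                                        # T x N panel
}

set.seed(1)
X_sketch <- sim_factor_sketch(N = 50, TT = 100, k = 3, sd = 0.5)
dim(X_sketch)
#> [1] 100  50
```

Rows index time periods and columns index cross-sectional units, which matches the \(T \times N\) product \(F \Lambda'\).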
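The eigenvalue ratio (ER) and growth ratio (GR) estimators of Ahn and Horenstein (2013) both work with the ordered eigenvalues of the sample covariance matrix. ER selects the number of factors that maximizes the ratio of adjacent eigenvalues, while GR maximizes the ratio of adjacent growth rates of the residual variance. Using ratios provides robustness to perturbations in the eigenvalue spectrum.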
A key advantage over Bai and Ng (2002) is that the Ahn-Horenstein estimator works well when only one of N or T is large, not requiring both dimensions to grow simultaneously.
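The two criteria can be sketched in a few lines of base R. Writing \(\mu_k\) for the \(k\)-th largest eigenvalue of \(XX'/(NT)\) and \(V(k) = \sum_{j > k} \mu_j\), the sketch below computes \(ER(k) = \mu_k / \mu_{k+1}\) and \(GR(k) = \ln[V(k-1)/V(k)] / \ln[V(k)/V(k+1)]\) and takes the maximizer over \(1 \le k \le kmax\). The function name `er_gr_sketch`, the column demeaning, and the toy data are assumptions for illustration, and the package's own implementation may differ in details.

```r
# Illustrative base-R sketch of the ER and GR criteria; not the package's code.
er_gr_sketch <- function(X, kmax) {
  X <- scale(X, center = TRUE, scale = FALSE)   # demean each column (assumption)
  TT <- nrow(X); N <- ncol(X); m <- min(N, TT)
  stopifnot(kmax + 2 <= m)
  # Eigenvalues of X X' / (N T), in decreasing order, via singular values
  mu <- svd(X, nu = 0, nv = 0)$d^2 / (N * TT)
  ks <- seq_len(kmax)
  er <- mu[ks] / mu[ks + 1]
  # tail_sum[i] = mu[i] + ... + mu[m], so V(k) = tail_sum[k + 1]
  tail_sum <- rev(cumsum(rev(mu)))
  gr <- log(tail_sum[ks] / tail_sum[ks + 1]) /
        log(tail_sum[ks + 1] / tail_sum[ks + 2])
  list(ER = which.max(er), GR = which.max(gr))
}

# Toy data: T = 100 observations, N = 50 series, 3 strong factors
set.seed(42)
TT <- 100; N <- 50; k <- 3
F_mat  <- matrix(rnorm(TT * k), TT, k)
Lambda <- matrix(rnorm(N * k), N, k)
X_toy  <- F_mat %*% t(Lambda) + matrix(rnorm(TT * N, sd = 0.5), TT, N)
est <- er_gr_sketch(X_toy, kmax = 8)
est
```

Because only ratios of eigenvalues enter the criteria, rescaling the data by a constant leaves both estimates unchanged.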
All six estimators can be run simultaneously by passing a vector of method names:
result_all <- select_factors(
  X,
  method = c("ahn_horenstein", "bai_ng", "abc",
             "lam_yao", "onatski_2009", "onatski_2010"),
  kmax = 8
)
print(result_all)
#> Factor Number Selection
#> =======================
#> Call: select_factors(X = X, method = c("ahn_horenstein", "bai_ng",
#> "abc", "lam_yao", "onatski_2009", "onatski_2010"), kmax = 8)
#>
#> kmax: 8
#>
#> Estimated number of factors:
#> ahn_horenstein 3
#> bai_ng 3
#> abc 3
#> lam_yao 6
#> onatski_2009 3
#> onatski_2010 3
The plot method produces a scree plot of the leading eigenvalues with the selected number of factors marked for each estimator:
result_ah <- select_factors(X, method = "ahn_horenstein", kmax = 8)
plot(result_ah, main = "Scree Plot — Ahn & Horenstein (2013)")
To illustrate the finite-sample performance of the estimators, we run a small simulation study with 100 replications across three sample-size configurations.
set.seed(123)
n_reps <- 100
k_true <- 3
configs <- list(
  large_both = list(N = 100, TT = 200),
  small_N    = list(N = 25,  TT = 200),
  small_T    = list(N = 200, TT = 25)
)
results <- lapply(configs, function(cfg) {
  estimates <- replicate(n_reps, {
    X <- simulate_factor_model(N = cfg$N, TT = cfg$TT,
                               k = k_true, sd = 0.5)
    res <- select_factors(X,
                          method = c("ahn_horenstein", "bai_ng",
                                     "onatski_2010"),
                          kmax = 8)
    res$k
  })
  rowMeans(estimates == k_true)
})
# Percentage correct for each configuration
do.call(rbind, lapply(names(results), function(nm) {
  data.frame(
    config = nm,
    ahn_horenstein = round(results[[nm]]["ahn_horenstein"] * 100),
    bai_ng = round(results[[nm]]["bai_ng"] * 100),
    onatski_2010 = round(results[[nm]]["onatski_2010"] * 100)
  )
}))
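#>                     config ahn_horenstein bai_ng onatski_2010
#> ahn_horenstein  large_both            100    100           87
#> ahn_horenstein1    small_N            100    100           60
#> ahn_horenstein2    small_T            100      0           96
In this simulation, the Ahn and Horenstein (2013) estimator selects the correct number of factors in every replication under all three configurations, including those where only one dimension is large. Bai and Ng (2002) does equally well in the large_both and small_N configurations but never selects the correct number in small_T, where T = 25. The Onatski (2010) estimator is correct in 60 to 96 percent of replications, depending on the configuration.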
These estimators use unstandardized data internally. The
select_factors function handles this automatically, so users
do not need to preprocess data differently when requesting these
methods.
The Lam and Yao (2012) estimator uses lagged auto-covariance matrices rather than the
contemporaneous covariance matrix. The number of lags h
defaults to 1 but can be adjusted.
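The idea behind the lagged approach can be sketched in base R. The sketch forms \(M = \sum_{h=1}^{h_0} \hat\Sigma(h) \hat\Sigma(h)'\), where \(\hat\Sigma(h)\) is the lag-\(h\) sample auto-covariance matrix, and picks the \(k\) that minimizes the ratio of adjacent eigenvalues \(\lambda_{k+1}/\lambda_k\) of \(M\). The function name `lam_yao_sketch`, its arguments, and the AR(1) toy factors are assumptions for illustration; how the package exposes the lag setting is not shown here.

```r
# Illustrative base-R sketch of a lagged auto-covariance ratio estimator.
# lam_yao_sketch and its arguments are hypothetical, not the package's API.
lam_yao_sketch <- function(X, h0 = 1, kmax = 8) {
  X <- scale(X, center = TRUE, scale = FALSE)   # T x N, demeaned columns
  TT <- nrow(X); N <- ncol(X)
  stopifnot(kmax + 1 <= N, h0 < TT)
  M <- matrix(0, N, N)
  for (h in seq_len(h0)) {
    # Lag-h sample auto-covariance: (1 / (T - h)) * sum_t x_{t+h} x_t'
    S_h <- crossprod(X[(h + 1):TT, , drop = FALSE],
                     X[1:(TT - h), , drop = FALSE]) / (TT - h)
    M <- M + S_h %*% t(S_h)
  }
  lambda <- eigen(M, symmetric = TRUE, only.values = TRUE)$values
  ks <- seq_len(kmax)
  which.min(lambda[ks + 1] / lambda[ks])
}

# Toy data with serially correlated (AR(1)) factors
set.seed(7)
TT <- 300; N <- 30; k <- 2
F_mat <- sapply(seq_len(k), function(j)
  as.numeric(arima.sim(model = list(ar = 0.7), n = TT)))
Lambda <- matrix(rnorm(N * k), N, k)
X_ar <- F_mat %*% t(Lambda) + matrix(rnorm(TT * N, sd = 0.5), TT, N)
k_ly <- lam_yao_sketch(X_ar, h0 = 1, kmax = 8)
k_ly
```

The toy factors are given serial dependence on purpose: the lagged auto-covariances carry information about the factors only when the factors are correlated over time.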
The Onatski (2009) estimator performs a sequential hypothesis test. The
significance level alpha defaults to 0.05 but can be
adjusted.
The Onatski (2010) edge distribution (ED) estimator uses an iterative calibration procedure to estimate the threshold separating systematic from idiosyncratic eigenvalues.
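A rough base-R sketch of such a calibration loop follows, based on a reading of the algorithm in Onatski (2010). Starting from \(j = kmax + 1\), it regresses the five eigenvalues \(\lambda_j, \dots, \lambda_{j+4}\) on \((j-1)^{2/3}, \dots, (j+3)^{2/3}\), sets the threshold \(\delta\) to twice the absolute slope, and takes the estimate to be the largest \(i \le kmax\) with \(\lambda_i - \lambda_{i+1} \ge \delta\), or 0 if there is none. It then sets \(j\) to the estimate plus one and repeats until the estimate stops changing. The function name `ed_sketch`, the iteration cap, and the demeaning step are assumptions, and the package's implementation may differ.

```r
# Illustrative base-R sketch of an edge-distribution style estimator.
# ed_sketch is a hypothetical name, not part of factorselect.
ed_sketch <- function(X, kmax = 8, max_iter = 50) {
  X <- scale(X, center = TRUE, scale = FALSE)   # T x N, demeaned columns
  TT <- nrow(X)
  lambda <- svd(X, nu = 0, nv = 0)$d^2 / TT     # eigenvalues of X'X / T, decreasing
  stopifnot(length(lambda) >= kmax + 5)
  gaps <- lambda[1:kmax] - lambda[2:(kmax + 1)]
  j <- kmax + 1
  k_hat <- NA_integer_
  for (iter in seq_len(max_iter)) {
    idx <- j:(j + 4)
    y <- lambda[idx]
    x <- (idx - 1)^(2 / 3)
    beta <- unname(coef(lm(y ~ x))[2])          # OLS slope, intercept included
    delta <- 2 * abs(beta)                      # calibrated threshold
    hits <- which(gaps >= delta)
    k_new <- if (length(hits) > 0) max(hits) else 0L
    if (identical(k_new, k_hat)) break          # converged
    k_hat <- k_new
    j <- k_hat + 1
  }
  k_hat
}

# Toy data: T = 100 observations, N = 50 series, 3 strong factors
set.seed(42)
TT <- 100; N <- 50; k <- 3
F_mat  <- matrix(rnorm(TT * k), TT, k)
Lambda <- matrix(rnorm(N * k), N, k)
X_toy  <- F_mat %*% t(Lambda) + matrix(rnorm(TT * N, sd = 0.5), TT, N)
k_ed <- ed_sketch(X_toy, kmax = 8)
k_ed
```

The threshold is recalibrated from eigenvalues just below the current estimate, so each pass uses the part of the spectrum currently assumed to be idiosyncratic.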
Ahn, S.C. and Horenstein, A.R. (2013). Eigenvalue Ratio Test for the Number of Factors. Econometrica, 81(3), 1203-1227.
Bai, J. and Ng, S. (2002). Determining the Number of Factors in Approximate Factor Models. Econometrica, 70(1), 191-221.
Alessi, L., Barigozzi, M. and Capasso, M. (2010). Improved Penalization for Determining the Number of Factors in Approximate Factor Models. Statistics and Probability Letters, 80, 1806-1813.
Lam, C. and Yao, Q. (2012). Factor Modelling for High-Dimensional Time Series: Inference for the Number of Factors. The Annals of Statistics, 40(2), 694-726.
Onatski, A. (2009). Testing Hypotheses About the Number of Factors in Large Factor Models. Econometrica, 77(5), 1447-1479.
Onatski, A. (2010). Determining the Number of Factors From Empirical Distribution of Eigenvalues. The Review of Economics and Statistics, 92(4), 1004-1016.