The activity of a visual meteor shower is a smooth, a priori unknown function of solar longitude. The classical evaluation summarises this by computing the Zenithal Hourly Rate (ZHR) per observation interval and plotting it as a noisy stepped curve. Single-interval ZHRs have high variance — Poisson noise plus observer-to-observer scatter — and the underlying activity profile is barely visible.
A spline-based fit across solar longitude collapses this noise into a
smooth, bias-controlled curve, provided that the spline’s complexity
(number and location of interior knots) is chosen wisely.
select_knots() automates that choice: given a candidate
grid of possible interior knots and a user-supplied score function (BIC,
AIC, or a cross-validation criterion), it returns a small subset that
minimises the score.
Conceptually, the supplied knot_candidates form a
grid of intervals over solar longitude: between two
adjacent candidates lies an interval in which the spline can bend
independently. select_knots() decides which interval
boundaries are actually required — i.e. which positions carry enough
curvature to justify a knot under the score. The natural resolution of
the input grid in visual meteor work is the observation granularity
itself: typical VMDB observation intervals are 10–15 minutes long,
corresponding to roughly 0.007–0.010 degrees of solar longitude. A fine
candidate grid at that resolution is the recommended practice;
select_knots() will then reduce it to a handful of
statistically significant knots.
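That arithmetic is easy to verify: the Sun's mean motion in ecliptic longitude is about 360/365.25 degrees per day (a standalone sketch, not package code):

```r
# Mean solar motion: ~360 deg of solar longitude per 365.25 days.
deg_per_min <- 360 / 365.25 / (24 * 60)
# Observation intervals of 10 and 15 minutes:
res <- c(10, 15) * deg_per_min
round(res, 4)
#> [1] 0.0068 0.0103
```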
The point of select_knots() is to keep the knot
positions explicit and few: a handful of values you can plot,
label, and reason about, instead of a dense anonymous knot grid hidden
inside a penalised smoother. The function is model-agnostic — your
score_fun picks the model family, offset, covariates, and
the criterion (BIC, AIC, or a cross-validation score) you optimise.
This function is the right tool when a few explicit, interpretable
knot positions are the goal. Prefer alternatives such as
mgcv::s(bs = "ps"), mgcv::s(bs = "ad"), or earth::earth when a dense
penalised smoother is acceptable; note that select_knots() is greedy
and only finds a local optimum (see ?select_knots).

Important clarification. select_knots()
chooses knots for a good fit, not for detecting extrema. Knots
typically land next to maxima or minima — where the curvature
is high — and not on them. Local extrema of the activity
profile become visible from the shape of the fitted curve, not
from the list of selected knots.
The implementation is generic, but its motivation comes directly from
the visual-meteor use case: activity profiles are the natural setting in
which knot positions can be physically interpreted — main
maximum, sub-maxima, shower onset and end. The same interpretive value
carries over to other meteor-related smooth fits — for instance the
apparent magnitude distribution of a shower. Alternative packages
(mgcv, earth) replace explicit knots with
continuous penalty mechanisms, which gives up that interpretation.
We use the package’s example data set PER_2015_rates
(visual rate observations of the Perseids 2015; see
?PER_2015_rates and ?load_vmdb). The raw
filtering follows standard practice:
data(PER_2015_rates)
obs <- PER_2015_rates$observations
obs <- subset(
obs,
rad_alt > 15 & moon_alt < 0 & sun_alt < -5 &
sl_start >= 139.2 & sl_end <= 140.6
)
deg2rad <- \(x) x * pi / 180
obs$sl <- (obs$sl_start + obs$sl_end) / 2
obs$t <- with(obs, t_eff * sin(deg2rad(rad_alt)) / f)
rates <- data.frame(
sl = obs$sl,
t = obs$t,
limmag = obs$lim_magn,
freq = obs$freq
)
rates <- rates[order(rates$sl), ]
nrow(rates)
#> [1] 1986

What each step does:

- rad_alt > 15 — when the radiant is low above the horizon, the
  geometric correction sin(rad_alt) becomes numerically fragile and
  atmospheric extinction dominates; observations below 15 degrees are
  discarded by convention.
- moon_alt < 0 — the Moon above the horizon brightens the sky and
  reduces the effective limiting magnitude in a way the simple
  limmag correction does not capture; these observations are excluded
  rather than corrected.
- sun_alt < -5 — the Sun at least 5 degrees below the horizon, i.e.
  deep twilight or darker; before that the sky is too bright.
- sl_start >= 139.2 & sl_end <= 140.6 — only for this vignette: a
  tight window centred on the Perseid main peak, chosen narrow enough
  to keep the vignette build fast while still containing the peak
  (≈ 140.0 deg) and visible curvature on both sides. In a real
  analysis you would use the full activity range (roughly 107–155 deg
  for the Perseids).
- sl = (sl_start + sl_end) / 2 — VMDB observations are intervals; we
  summarise each by its midpoint, the standard choice for the
  regressor.
- t = t_eff * sin(rad_alt) / f is the central correction. It rescales
  the raw observing time to what the observer would have spent under
  ideal sky conditions at their own limiting magnitude: radiant in
  the zenith (sin(rad_alt)) and a clear sky (f). t_eff is the
  effective observing time after deducting breaks; f is the
  cloud-cover correction factor (>= 1, larger when more sky is
  obstructed). The correction from each observer’s own limiting
  magnitude to the ZHR reference 6.5 is not built into t; the GLM
  below picks it up via the limmag covariate.

We will call t simply the observing time from now on — not to be
confused with t_eff, the raw effective observing time after breaks;
t is t_eff after the zenith (sin(rad_alt)) and cloud-cover (/ f)
corrections.
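As a quick numeric illustration of this correction (the observation values below are invented, not taken from the data set):

```r
deg2rad <- \(x) x * pi / 180
# Hypothetical observation: 0.75 h effective time after breaks,
# radiant 40 deg above the horizon, cloud-cover factor f = 1.25.
t_eff <- 0.75; rad_alt <- 40; f <- 1.25
t <- t_eff * sin(deg2rad(rad_alt)) / f
round(t, 3)
#> [1] 0.386
```

Less than the raw 0.75 h, as expected: a low radiant and partial cloud cover both shrink the effective exposure.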
fit_glm <- \(rates, knots) {
f <- if (length(knots) == 0L) {
freq ~ splines::ns(sl) + limmag + offset(log(t))
} else {
freq ~ splines::ns(sl, knots = knots) +
limmag + offset(log(t))
}
stats::glm(f, data = rates, family = stats::poisson())
}
score_bic <- \(rates, knots) stats::BIC(fit_glm(rates, knots))

offset(log(t)) makes the linear predictor a log-rate per
observing hour at the observer’s own limiting magnitude. The
limmag term then rescales that rate to the ZHR reference
(limmag = 6.5): exp(coef(limmag)) is the data-driven
population index r of the shower under this model. The
spline term splines::ns() is a natural cubic spline,
well-behaved at the boundaries.
A practical note on the offset: offset() inside a
formula is a formula-special and must stay unqualified — both
stats::offset(...) in the formula and passing the term via
glm()’s offset = argument break
predict() on newdata.
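A minimal, self-contained demonstration on simulated data (not the Perseid set) shows that the unqualified offset(log(t)) survives predict() on newdata:

```r
set.seed(1)
d <- data.frame(x = runif(200), t = runif(200, 0.5, 2))
d$y <- rpois(200, lambda = d$t * exp(1 + d$x))  # true rate per unit t: exp(1 + x)
fit <- glm(y ~ x + offset(log(t)), data = d, family = poisson())
# predict() finds the offset because offset() is part of the formula;
# with t = 1 the response is a rate per unit observing time:
p <- predict(fit, newdata = data.frame(x = 0.5, t = 1), type = "response")
```

With stats::offset(...) inside the formula, or with the term moved to glm(offset = ...), the same predict() call would fail or silently ignore the offset on newdata.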
A regular grid is the simplest choice but ignores the actual
observation density. In practice the candidate grid is built from the
data themselves via freq_quantile() (see
?freq_quantile): partition the observations, ordered by
solar longitude, into bins each containing at least a fixed minimum
number of meteors, then use the meteor-weighted solar-longitude midpoint
of each bin as a candidate. This produces a fine grid in
observation-rich regions and a coarse grid where data is sparse —
exactly the resolution select_knots() benefits from:
rates$q <- vismeteor::freq_quantile(rates$freq, 12) # >= 12 meteors per bin
rates$qsl <- with(
rates,
ave(freq * sl, q, FUN = sum) /
ave(freq, q, FUN = sum)
)
cand <- unique(round(rates$qsl, 2))
length(cand)
#> [1] 103

The round(..., 2) step collapses bins whose centres are
practically indistinguishable (resolution 0.01 deg). The minimum bin
count (12) controls how dense the candidate grid is; a
smaller value gives a finer grid (slower fit), a larger value a coarser
one.
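The idea behind the binning can be sketched in base R; freq_bins below is a hypothetical stand-in, not the vismeteor::freq_quantile implementation:

```r
# Greedy binning: open a new bin once the accumulated meteor count
# reaches min_n. (The trailing bin may fall short of min_n.)
freq_bins <- function(freq, min_n) {
  bin <- integer(length(freq)); b <- 1L; acc <- 0
  for (i in seq_along(freq)) {
    bin[i] <- b
    acc <- acc + freq[i]
    if (acc >= min_n) { b <- b + 1L; acc <- 0 }
  }
  bin
}
freq <- c(5, 4, 6, 2, 9, 1, 3, 8)
b <- freq_bins(freq, 10)
b
#> [1] 1 1 1 2 2 3 3 3
tapply(freq, b, sum)  # each completed bin holds >= 10 meteors
```

High-rate stretches produce many narrow bins, sparse stretches few wide ones, which is exactly the data-adaptive grid described above.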
The fixed_knots argument pins knot positions that must
be present in every fitted model. In forward mode they are set from the
start and the search adds further knots on top; in backward mode they
are never proposed for removal, neither singly nor by bulk removal.
fixed_knots need not be a subset of
knot_candidates — values supplied there are forced into the
working set regardless.
A fixed knot is a constraint on the search, not a finding
from it: the criterion no longer has the freedom to reject it. The
earlier warning that knots typically land next to extrema
rather than on them still applies — forcing a knot exactly on, say, a
presumed maximum overrides that pattern at the chosen position. Use it
when there is a non-data reason a position must be in the model (a
documented event time, an external anchor, a stable reference across
analyses); see ?select_knots for the full semantics.
The vignette runs serially (n_cores = 1L) because the
search window is narrow. For a full Perseids activity range (≈ 107–155
deg) the per-round candidate scoring benefits from
n_cores = parallel::detectCores() - 1L; see
?select_knots for the reproducibility recipe with
mclapply().
res_fwd <- vismeteor::select_knots(rates, cand, score_bic)
sort(c(res_fwd$knots, res_fwd$fixed_knots))
#> [1] 139.56 139.63 139.65 139.70 139.80 139.84 139.86 140.59
res_fwd$score
#> [1] 14106.83

plot(res_fwd$history$n_knots, res_fwd$history$score,
type = "b", pch = 19, col = "blue",
xlab = "number of interior knots", ylab = "BIC",
main = "Forward selection: BIC trajectory"
)
points(tail(res_fwd$history$n_knots, 1L), res_fwd$score,
pch = 19, col = "red", cex = 1.5
)
abline(h = res_fwd$score, lty = 2, col = "grey50")

The red point marks the score minimum — and with the default
n_steps = NULL it is also the endpoint of the
trajectory, because the search stops as soon as no further addition
improves the score. To see what one more step would do, re-run the
algorithm starting from the optimum with a positive
n_steps:
res_fwd_plus1 <- vismeteor::select_knots(rates, cand, score_bic,
fixed_knots = c(
res_fwd$knots,
res_fwd$fixed_knots
),
n_steps = 1L
)
res_fwd_plus1$score - res_fwd$score # always >= 0: how much worse one step makes it
#> [1] 2.593168

n_steps is an exploration knob, not a
model-selection knob: it forces the algorithm to take a fixed number of
further iterations regardless of whether they improve the score, so that
the immediate score landscape past the optimum becomes visible through
history. A shallow rise tells you the BIC optimum is soft
and that a slightly richer model would not cost much; a steep rise tells
you the optimum is sharp and the parsimony is justified. The same recipe
— re-call with
fixed_knots = c(prev$knots, prev$fixed_knots), n_steps = k
— extends to as many follow-up steps as you like; see
?select_knots.
res_bwd <- vismeteor::select_knots(rates, cand, score_bic, backward = TRUE)
sort(c(res_bwd$knots, res_bwd$fixed_knots))
#> [1] 139.68 139.69 139.70 139.71 139.74 139.75 139.77 140.44 140.50 140.51
#> [11] 140.52 140.53
res_bwd$score
#> [1] 14060.24

plot(res_bwd$history$n_knots, res_bwd$history$score,
type = "b", pch = 19, col = "blue",
xlab = "number of interior knots", ylab = "BIC",
main = "Backward elimination: BIC trajectory"
)
points(tail(res_bwd$history$n_knots, 1L), res_bwd$score,
pch = 19, col = "red", cex = 1.5
)
abline(h = res_bwd$score, lty = 2, col = "grey50")

The trajectory starts at the full candidate pool on the right and
moves left as knots are removed; gaps along the x-axis correspond to
bulk removal rounds (see bulk_gap in
?select_knots). The red point again marks the score
minimum.
Both searches walk the lattice of knot subsets greedily — each round commits to the locally best add or drop — but they start at opposite corners: forward selection begins at the empty set and adds knots; backward elimination begins at the full candidate pool and removes them.
That is why the two paths can disagree: at any intermediate
state the score of a candidate move depends on what is currently
selected. Adding k_i to the empty set is a genuinely
different operation from adding k_i to a set that already
contains its neighbours, and the two searches encounter k_i
in incompatible contexts.
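The path dependence is easy to reproduce with a toy score over subsets of three candidate knots; the score table is invented purely for illustration and has nothing to do with the package's score functions:

```r
# Toy score over subsets of {1, 2, 3}: lower is better, chosen so
# that forward and backward greedy searches land in different optima.
score <- function(s) {
  lut <- c(S = 10, S1 = 9, S2 = 9.5, S3 = 9.5,
           S12 = 9.5, S13 = 9.5, S23 = 7, S123 = 8)
  lut[[paste(c("S", sort(s)), collapse = "")]]
}
greedy <- function(start, cand, forward) {
  cur <- start
  repeat {
    moves <- if (forward) lapply(setdiff(cand, cur), \(k) c(cur, k))
             else lapply(seq_along(cur), \(i) cur[-i])
    if (!length(moves)) break
    sc <- vapply(moves, score, numeric(1))
    if (min(sc) >= score(cur)) break   # no move improves: local optimum
    cur <- moves[[which.min(sc)]]
  }
  cur
}
fwd <- greedy(integer(0), 1:3, forward = TRUE)   # stops at {1},    score 9
bwd <- greedy(1:3, integer(0), forward = FALSE)  # stops at {2, 3}, score 7
```

Forward commits to knot 1 first because it is the best single addition, and neither extension of {1} improves; backward never has to evaluate {1} alone and finds the jointly better pair {2, 3}.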
Practical recommendation. For activity-profile
fitting we recommend backward = TRUE for the final
analysis, even though the function’s API default is FALSE
(the function is generic and defaults to the cheaper-per-fit direction).
The reason is qualitative, not performance: starting from the overfit
end ensures that all real curvature is already present in the initial
model, so the search only has to decide what is redundant — which is the
safer mode against the forward “miss-by-combination” failure. Use
backward = FALSE (forward) for a quick, sparse first sketch
or when the full-candidate initial fit is itself too expensive (very
large candidate pool, complex model family).
In practice neither direction is uniformly faster: forward iterates
many small fits one knot at a time, while backward starts with the
single most expensive fit and then collapses fast via the bulk-removal
mechanism (bulk_gap = 4L by default; see
?select_knots).
In any case, running both directions and keeping the lower-BIC result
is a cheap robustness check; large disagreements (in either knot count
or score) are a warning that the chosen criterion is on the flat part of
its landscape and that the data alone may not justify a unique
parsimonious model. The n_steps argument lets either search
continue for a fixed number of iterations past its first local minimum,
which is useful for inspecting how flat or sharp that minimum is.
best <- if (res_fwd$score <= res_bwd$score) {
c(res_fwd$knots, res_fwd$fixed_knots)
} else {
c(res_bwd$knots, res_bwd$fixed_knots)
}
fit <- fit_glm(rates, sort(best))
# Data-driven population index of the shower, with 95% Wald CI
# (log-scale CI on the limmag coefficient, exponentiated).
limmag_row <- summary(fit)$coefficients["limmag", ]
r_hat <- exp(limmag_row["Estimate"])
r_ci <- exp(limmag_row["Estimate"] +
c(-1, 1) * 1.96 * limmag_row["Std. Error"])
c(
estimate = unname(r_hat),
lower = unname(r_ci[1]),
upper = unname(r_ci[2])
)
#> estimate lower upper
#> 1.218461 1.192027 1.245482

The point estimate is the population index r implied by
the rate-vs-limmag slope of this dataset; the CI width tells you how
sharply that slope is pinned down by the observations (a tight band
means a well-determined r, a wide band means the data leave
it essentially free). It enters the ZHR prediction below.
sl_grid <- seq(139.2, 140.6, length.out = 400)
pred_df <- data.frame(sl = sl_grid, limmag = 6.5, t = 1)
link <- predict(fit, newdata = pred_df, type = "link", se.fit = TRUE)
zhr <- exp(link$fit)
zhr_lo <- exp(link$fit - 1.96 * link$se.fit)
zhr_hi <- exp(link$fit + 1.96 * link$se.fit)
plot(sl_grid, zhr,
type = "n",
ylim = range(c(zhr_lo, zhr_hi)),
xlab = "solar longitude (deg)", ylab = "ZHR",
main = "PER 2015 - fitted activity profile (95% CI)"
)
polygon(c(sl_grid, rev(sl_grid)),
c(zhr_lo, rev(zhr_hi)),
col = adjustcolor("blue", alpha.f = 0.2), border = NA
)
lines(sl_grid, zhr, col = "blue", lwd = 2)
rug(best, col = "red")

The prediction is evaluated with t = 1 (one ZHR-hour)
and limmag = 6.5 (the reference limiting magnitude), so the
response is the ZHR per hour on the chosen reference scale. The red rug
marks on the x-axis show the selected interior knots; note that they sit
next to the activity peak rather than on top of it, as expected
— that is where the curvature is highest.
The shaded band is the pointwise 95 % confidence interval of the fit.
We obtain it on the link (log) scale — where the GLM’s
standard errors are approximately Gaussian — and then transform back
with exp(). The band reflects parameter uncertainty of the
spline + linear predictor; it is not a prediction interval for
individual observations, which would also have to account for Poisson
dispersion of the counts themselves.
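The construction in miniature, on simulated Poisson data (assumed purely for illustration):

```r
set.seed(7)
d <- data.frame(x = seq(0, 1, length.out = 100))
d$y <- rpois(100, exp(1 + 2 * d$x))
fit <- glm(y ~ x, data = d, family = poisson())
lk <- predict(fit, newdata = data.frame(x = 0.5), type = "link", se.fit = TRUE)
# Symmetric on the log scale, asymmetric (and strictly positive)
# after transforming back with exp():
band <- exp(lk$fit + c(-1.96, 1.96) * lk$se.fit)
mid  <- exp(lk$fit)
```

Because exp() is convex, the back-transformed band is wider above the fit than below it, and it can never cross zero; a band built symmetrically on the response scale would have neither property.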
Raw residuals of individual Poisson observations with small
expectations are visually uninformative. A cleaner diagnostic groups the
observations into bins of fixed minimum meteor count using
freq_quantile() and compares the observed and expected
totals per bin.
# Use a coarser binning here than for the candidate grid above
# (>= 500 meteors per bin) so that bin uncertainties are small enough
# to read systematic deviations from the diagnostic plot.
rates$qres <- vismeteor::freq_quantile(rates$freq, 500)
rates$pred <- predict(fit, type = "response")
bins <- with(rates, {
obs_n <- tapply(freq, qres, sum)
pred_n <- tapply(pred, qres, sum)
data.frame(
sl = tapply(sl * freq, qres, sum) / obs_n,
ratio = obs_n / pred_n,
se = sqrt(obs_n) / pred_n
)
})
nrow(bins)
#> [1] 40
plot(bins$sl, bins$ratio,
pch = 19, col = "blue",
ylim = range(c(
bins$ratio - 2 * bins$se,
bins$ratio + 2 * bins$se
)),
xlab = "solar longitude (deg)",
ylab = "observed / expected",
main = "PER 2015 - binned residual ratios (+/- 1 sigma)"
)
arrows(bins$sl, bins$ratio - bins$se,
bins$sl, bins$ratio + bins$se,
angle = 90, code = 3, length = 0.03, col = "blue"
)
abline(h = 1, lty = 2, col = "grey50")

A well-fitting model scatters the bin ratios randomly around 1, with about two thirds of the one-sigma error bars crossing the dashed line. Systematic arches over solar longitude would indicate too few knots in a region; visible scatter with no trend is Poisson over-dispersion (not pursued here).
The same freq_quantile() helper drives both the
candidate grid above and the diagnostic here, but with different minimum
bin sizes: a small minimum (12) for a fine candidate grid that
select_knots() can prune from, and a larger minimum (500)
for a small number of stable bins on which residual deviations are
visually detectable. In the narrow vignette window the larger minimum
yields a modest number of bins (see the nrow(bins) output);
a full activity-range analysis would lower this minimum to keep enough
bins for a visible trend.
?select_knots — the function reference.
?freq_quantile — quantile bins used here for the residual diagnostic.
?PER_2015_rates — the example data set.
?load_vmdb — importing rate and magnitude observations from the VMDB API.