| Type: | Package |
| Title: | Similarity Retrieval and Local Learning for Spectral Chemometrics |
| Version: | 3.0.0 |
| Date: | 2026-04-15 |
| BugReports: | https://github.com/l-ramirez-lopez/resemble/issues |
| Description: | Functions for dissimilarity analysis and machine learning in complex spectral data sets, including memory-based learning (MBL), optimal subset search and selection, and retrieval-based modelling with model libraries. Supports local learning, optimisation of spectral libraries, and ensemble prediction from precomputed models. Most of these functions are based on the methods presented in Ramirez-Lopez et al. (2013) <doi:10.1016/j.geoderma.2012.12.014>. |
| License: | MIT + file LICENSE |
| URL: | https://l-ramirez-lopez.github.io/resemble/, https://github.com/l-ramirez-lopez/resemble |
| Depends: | R (≥ 4.2.0) |
| Imports: | foreach, iterators, RhpcBLASctl, Rcpp (≥ 1.0.3), mathjaxr (≥ 1.0), lifecycle (≥ 0.2.0) |
| Suggests: | prospectr, parallel, doParallel, testthat, quarto, knitr |
| LinkingTo: | Rcpp, RcppArmadillo |
| RdMacros: | mathjaxr |
| VignetteBuilder: | quarto |
| NeedsCompilation: | yes |
| Repository: | CRAN |
| RoxygenNote: | 7.3.3 |
| Encoding: | UTF-8 |
| Config/VersionName: | vortex |
| Packaged: | 2026-04-20 18:50:53 UTC; leo |
| Author: | Leonardo Ramirez-Lopez |
| Maintainer: | Leonardo Ramirez-Lopez <ramirez.lopez.leo@gmail.com> |
| Date/Publication: | 2026-04-20 20:40:02 UTC |
Overview of the functions in the resemble package
Description
Functions for dissimilarity assessment, nearest-neighbour search, memory-based learning, local expert libraries, and evolutionary training subset search in spectral chemometrics.
Details
This is version 3.0.0 – vortex of the package. It implements a number of functions useful for modeling complex spectral data (e.g. NIR, IR). The package includes functions for dimensionality reduction, computing spectral dissimilarity matrices, nearest neighbor search, and modeling spectral data using memory-based learning and evolutionary search of optimal training subsets in large and complex data sets. The package builds upon the methods presented in Ramirez-Lopez et al. (2013a) doi:10.1016/j.geoderma.2012.12.014, Ramirez-Lopez et al. (2026a) and Ramirez-Lopez et al. (2026b).
Development versions can be found in the GitHub repository of the package at https://github.com/l-ramirez-lopez/resemble.
The functions available for computing dissimilarity matrices are:
dissimilarity: Computes a dissimilarity matrix based on a specified method.
diss_pca: constructor for the principal components-based dissimilarity method.
diss_pls: constructor for the partial least squares-based dissimilarity method.
diss_correlation: constructor for the correlation-based dissimilarity method.
diss_euclidean: constructor for the Euclidean distance-based dissimilarity method.
diss_mahalanobis: constructor for the Mahalanobis distance-based dissimilarity method.
diss_cosine: constructor for the cosine-based dissimilarity method.
The functions available for evaluating dissimilarity matrices are:
diss_evaluate: Evaluates the effectiveness of a dissimilarity matrix using side information.
The functions available for nearest neighbor search are:
search_neighbors: Search for nearest neighbors of a query spectrum in a reference dataset based on a specified dissimilarity method.
The functions available for modeling spectral data are:
mbl: Memory-based learning for modeling spectral data.
gesearch: An evolutionary method to search for optimal samples in large spectral datasets.
liblex: Builds a library of reusable localized models.
The functions available for dimensionality reduction are:
ortho_projection: Computes an orthogonal projection of the data based on either principal components or partial least squares.
Author(s)
Maintainer / Creator: Leonardo Ramirez-Lopez ramirez.lopez.leo@gmail.com
Authors:
References
Ramirez-Lopez, L., Viscarra Rossel, R., Behrens, T., Orellano, C., Perez-Fernandez, E., Kooijman, L., Wadoux, A. M. J.-C., Breure, T., Summerauer, L., Safanelli, J. L., & Plans, M. (2026a). When spectral libraries are too complex to search: Evolutionary subset selection for domain-adaptive calibration. Analytica Chimica Acta, under review.
Ramirez-Lopez, L., Metz, M., Lesnoff, M., Orellano, C., Perez-Fernandez, E., Plans, M., Breure, T., Behrens, T., Viscarra Rossel, R., & Peng, Y. (2026b). Rethinking local spectral modelling: From per-query refitting to model libraries. Analytica Chimica Acta, under review.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. (2013a). The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.
See Also
Useful links:
Report bugs at https://github.com/l-ramirez-lopez/resemble/issues
Correlation dissimilarity method constructor
Description
Creates a configuration object that fully specifies a correlation (or moving
correlation) dissimilarity method. Pass the result to dissimilarity()
to compute the dissimilarity matrix.
Usage
diss_correlation(ws = NULL, center = TRUE, scale = FALSE)
Arguments
ws |
Either NULL (the default), in which case the standard correlation
dissimilarity is computed, or an odd integer specifying the window size for
the moving correlation dissimilarity. |
center |
Logical. Should the data be mean-centered before computing
dissimilarities? Centering is applied jointly to Xr and Xu. Default is TRUE. |
scale |
Logical. Should the data be scaled (divided by column standard
deviations) before computing dissimilarities? Scaling is applied jointly
to Xr and Xu. Default is FALSE. |
Details
The correlation dissimilarity between two observations x_i and
x_j is:
d(x_i, x_j) = \frac{1}{2}(1 - \rho(x_i, x_j))
where \rho is the Pearson correlation coefficient. This is used when
ws = NULL.
When ws is specified, the moving correlation dissimilarity is:
d(x_i, x_j; ws) = \frac{1}{2\,ws} \sum_{k=1}^{p - ws} \bigl(1 - \rho(x_{i,(k:k+ws)},\, x_{j,(k:k+ws)})\bigr)
where ws is the window size and p is the number of variables.
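The two definitions above can be reproduced in a few lines of base R (an illustrative sketch, independent of the package's compiled implementation):

```r
set.seed(1)
p <- 100
xi <- rnorm(p)                      # spectrum i
xj <- xi + rnorm(p, sd = 0.5)       # spectrum j, correlated with xi

# standard correlation dissimilarity: d = (1 - rho) / 2
d_cor <- 0.5 * (1 - cor(xi, xj))

# moving correlation dissimilarity with window size ws, following the
# formula above: sum of (1 - rho) over windows, divided by 2 * ws
ws <- 11
rhos <- sapply(1:(p - ws), function(k) cor(xi[k:(k + ws)], xj[k:(k + ws)]))
d_mov <- sum(1 - rhos) / (2 * ws)
```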
Value
An object of class c("diss_correlation", "diss_method") — a
list holding the validated method parameters. Intended to be passed to
dissimilarity(), not used directly.
Parallel execution
The underlying C++ implementation uses OpenMP for parallel computation.
Thread count is controlled by the OMP_NUM_THREADS environment
variable. To limit threads (e.g., when calling from within a parallel
backend):
Sys.setenv(OMP_NUM_THREADS = 1)
Author(s)
See Also
dissimilarity, diss_euclidean,
diss_mahalanobis, diss_cosine
Examples
# Standard correlation dissimilarity
m <- diss_correlation()
# Moving correlation with window size 41
m <- diss_correlation(ws = 41)
# Without centering
m <- diss_correlation(center = FALSE)
Cosine dissimilarity method constructor
Description
Creates a configuration object for computing cosine dissimilarity
(also known as spectral angle mapper). Pass the result to
dissimilarity() to compute the dissimilarity matrix.
The cosine dissimilarity between two observations x_i and
x_j is:
c(x_i, x_j) = \cos^{-1}\frac{\sum_{k=1}^{p} x_{i,k}\, x_{j,k}}{\sqrt{\sum_{k=1}^{p} x_{i,k}^{2}}\,\sqrt{\sum_{k=1}^{p} x_{j,k}^{2}}}
where p is the number of variables.
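For illustration, the spectral angle can be computed directly in base R (the clamp on the cosine is a safeguard added here against floating-point overshoot; it is not part of the formula):

```r
# cosine dissimilarity (spectral angle, in radians) between two vectors
cosine_diss <- function(xi, xj) {
  ct <- sum(xi * xj) / (sqrt(sum(xi^2)) * sqrt(sum(xj^2)))
  acos(max(-1, min(1, ct)))  # clamp guards against values like 1 + 1e-16
}

a1 <- cosine_diss(c(1, 0), c(0, 1))   # orthogonal vectors: angle pi / 2
a2 <- cosine_diss(c(1, 2), c(2, 4))   # proportional vectors: angle ~ 0
```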
Usage
diss_cosine(center = TRUE, scale = FALSE)
Arguments
center |
Logical. Center the data before computing dissimilarities?
Applied jointly to Xr and Xu. Default is TRUE. |
scale |
Logical. Scale the data before computing dissimilarities?
Applied jointly to Xr and Xu. Default is FALSE. |
Value
An object of class c("diss_cosine", "diss_method").
Author(s)
See Also
dissimilarity, diss_euclidean,
diss_mahalanobis
Examples
m <- diss_cosine()
m <- diss_cosine(center = FALSE)
Euclidean dissimilarity method constructor
Description
Creates a configuration object for computing Euclidean dissimilarity.
Pass the result to dissimilarity() to compute the dissimilarity
matrix.
The scaled Euclidean dissimilarity between two observations x_i and
x_j is:
d(x_i, x_j) = \sqrt{\frac{1}{p} \sum_{k=1}^{p}(x_{i,k} - x_{j,k})^2}
where p is the number of variables. Results are equivalent to those of
stats::dist() multiplied by \sqrt{1/p}.
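The stated equivalence with stats::dist() can be verified in base R:

```r
set.seed(1)
X <- matrix(rnorm(6 * 50), nrow = 6)   # 6 observations, 50 variables
p <- ncol(X)

# scaled Euclidean dissimilarity between observations 1 and 2
d12 <- sqrt(sum((X[1, ] - X[2, ])^2) / p)

# same result from stats::dist(), rescaled by sqrt(1/p)
d12_dist <- as.matrix(dist(X))[1, 2] * sqrt(1 / p)
```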
Usage
diss_euclidean(center = TRUE, scale = FALSE)
Arguments
center |
Logical. Center the data before computing distances?
Applied jointly to Xr and Xu. Default is TRUE. |
scale |
Logical. Scale the data before computing distances?
Applied jointly to Xr and Xu. Default is FALSE. |
Value
An object of class c("diss_euclidean", "diss_method").
Author(s)
See Also
dissimilarity, diss_mahalanobis,
diss_cosine
Examples
m <- diss_euclidean()
m <- diss_euclidean(center = FALSE, scale = TRUE)
Evaluate dissimilarity matrices
Description
Evaluates a dissimilarity matrix by comparing each observation to its nearest neighbor based on side information. For continuous variables, RMSD and correlation are computed; for categorical variables, the kappa index is used.
Usage
diss_evaluate(diss, side_info)
sim_eval(d, side_info)
Arguments
diss |
A symmetric dissimilarity matrix. Alternatively, a vector
containing the lower triangle values (e.g., as produced by stats::dist). |
side_info |
A matrix of side information corresponding to the observations. Can be numeric (one or more columns) or character (single column for categorical data). |
d |
Deprecated. Use diss instead. |
Details
This function assesses whether a dissimilarity matrix captures meaningful structure by examining the side information of nearest neighbor pairs (Ramirez-Lopez et al., 2013). If observations that are similar in the dissimilarity space also have similar side information values, the dissimilarity is considered effective.
For numeric side_info, the root mean square of differences (RMSD)
between each observation and its nearest neighbor is computed:

\[RMSD = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \bigl(y_i - y_{NN(x_i, X^{-i})}\bigr)^2}\]

where \(NN(x_i, X^{-i})\) returns the index of the nearest neighbor of observation \(i\) (excluding itself), \(y_i\) is the side information value for observation \(i\), and \(m\) is the number of observations.

For categorical side_info, the kappa index is computed:

\[\kappa = \frac{p_o - p_e}{1 - p_e}\]

where \(p_o\) is the observed agreement and \(p_e\) is the agreement expected by chance.
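A minimal base-R sketch of this evaluation idea, using plain Euclidean distances and simulated side information (it illustrates the statistics, not the package internals):

```r
set.seed(1)
m <- 50
X <- matrix(rnorm(m * 20), m, 20)   # toy "spectra"
y <- X[, 1] + rnorm(m, sd = 0.2)    # toy numeric side information

D <- as.matrix(dist(X))             # any dissimilarity matrix works here
diag(D) <- Inf                      # exclude each observation itself
nn <- apply(D, 1, which.min)        # NN(x_i, X^-i): nearest-neighbor indices

# RMSD between each y_i and its neighbor's side information
rmsd <- sqrt(mean((y - y[nn])^2))

# categorical side information: kappa from observed vs chance agreement
cls <- factor(sample(c("a", "b", "c"), m, replace = TRUE))
tab <- table(cls, factor(cls[nn], levels = levels(cls)))
p_o <- sum(diag(tab)) / m
p_e <- sum(rowSums(tab) * colSums(tab)) / m^2
kappa <- (p_o - p_e) / (1 - p_e)
```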
Value
A list with the following components:
- eval
For numeric side information: a matrix with columns rmsd and r (correlation). For categorical side information: a matrix with column kappa.
- global_eval
If multiple numeric side information variables are provided, summary statistics across variables.
- first_nn
A matrix with the original side information and the side information of each observation's nearest neighbor.
Author(s)
References
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J.A.M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
See Also
Examples
library(prospectr)
data(NIRsoil)
sg <- savitzkyGolay(NIRsoil$spc, p = 3, w = 11, m = 0)
NIRsoil$spc <- sg
Yr <- NIRsoil$Nt[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
# Compute PCA-based dissimilarity
d <- dissimilarity(Xr, diss_method = diss_pca(ncomp = 8))
# Evaluate using side information
ev <- diss_evaluate(d$dissimilarity, side_info = as.matrix(Yr))
ev$eval
# Evaluate with multiple side information variables
Yr_2 <- NIRsoil$CEC[as.logical(NIRsoil$train)]
ev_2 <- diss_evaluate(d$dissimilarity, side_info = cbind(Yr, Yr_2))
ev_2$eval
ev_2$global_eval
Mahalanobis dissimilarity method constructor
Description
Creates a configuration object for computing Mahalanobis dissimilarity.
Pass the result to dissimilarity() to compute the dissimilarity
matrix.
The Mahalanobis distance is computed by first transforming the data into
Mahalanobis space via a factorization of the inverse covariance matrix
M^{-1} = W^{T}W (using SVD), then applying Euclidean distance in
that transformed space:
d(x_i, x_j) = \sqrt{\frac{1}{p}(x_i - x_j)M^{-1}(x_i - x_j)^T}
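The equivalence with the classical definition can be checked in base R using stats::mahalanobis(), which returns the squared distance (note n > p so the covariance matrix is invertible):

```r
set.seed(1)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)

M_inv <- solve(cov(X))                 # inverse covariance matrix
delta <- X[1, ] - X[2, ]
d12 <- sqrt(as.numeric(delta %*% M_inv %*% delta) / p)

# stats::mahalanobis() returns the squared distance, so the two agree
d12_stats <- sqrt(as.numeric(
  mahalanobis(X[1, , drop = FALSE], center = X[2, ], cov = cov(X))
) / p)
```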
Usage
diss_mahalanobis(center = TRUE, scale = FALSE)
Arguments
center |
Logical. Center the data before computing distances?
Applied jointly to Xr and Xu. Default is TRUE. |
scale |
Logical. Scale the data before computing distances?
Applied jointly to Xr and Xu. Default is FALSE. |
Value
An object of class c("diss_mahalanobis", "diss_method").
Important limitations
The covariance matrix will be singular — and the distance therefore
uncomputable — when the number of observations is smaller than the number
of variables, or when variables are perfectly collinear. This is common
with raw spectral data; consider using diss_euclidean() on
PCA scores instead.
Author(s)
See Also
dissimilarity, diss_euclidean,
diss_cosine
Examples
m <- diss_mahalanobis()
m <- diss_mahalanobis(center = TRUE, scale = TRUE)
PCA dissimilarity method constructor
Description
Creates a configuration object for computing dissimilarity based on Mahalanobis distance in PCA score space.
Usage
diss_pca(
ncomp = ncomp_by_var(0.01),
method = c("pca", "pca_nipals"),
center = TRUE,
scale = FALSE,
return_projection = FALSE
)
Arguments
ncomp |
Component selection method. Can be:
a single integer (a fixed number of components), or a selector
created with ncomp_fixed, ncomp_by_var, ncomp_by_cumvar or ncomp_by_opc.
Default is ncomp_by_var(0.01). |
method |
Character. PCA algorithm: "pca" (singular value decomposition) or
"pca_nipals" (NIPALS, useful for very large matrices). Default is "pca". |
center |
Logical. Center data before projection? Default is TRUE. |
scale |
Logical. Scale data before projection? Default is FALSE. |
return_projection |
Logical. Return the projection object?
Default is FALSE. |
Value
An object of class c("diss_pca", "diss_method").
See Also
Component selection: ncomp_by_var, ncomp_by_cumvar,
ncomp_by_opc, ncomp_fixed
Other dissimilarity methods: diss_pls,
diss_correlation, diss_euclidean,
diss_cosine, diss_mahalanobis
Examples
# Fixed number of components
diss_pca(ncomp = 10)
diss_pca(ncomp = ncomp_fixed(10))
# Retain components explaining >= 1% variance each (default)
diss_pca(ncomp = ncomp_by_var(0.01))
# Retain components until 99% cumulative variance
diss_pca(ncomp = ncomp_by_cumvar(0.99))
# Optimize using side information (requires Yr)
diss_pca(ncomp = ncomp_by_opc(40))
diss_pca(ncomp = ncomp_by_opc())
# NIPALS algorithm (useful for very large matrices)
diss_pca(ncomp = 10, method = "pca_nipals")
PLS dissimilarity method constructor
Description
Creates a configuration object for computing dissimilarity based on
Mahalanobis distance in PLS score space. Requires Yr in
dissimilarity().
Usage
diss_pls(
ncomp = ncomp_by_opc(),
method = c("pls", "mpls"),
scale = FALSE,
return_projection = FALSE
)
Arguments
ncomp |
Component selection method. Can be:
a single integer (a fixed number of components), or a selector
created with ncomp_fixed, ncomp_by_var, ncomp_by_cumvar or ncomp_by_opc.
Default is ncomp_by_opc(). |
method |
Character. PLS algorithm: "pls" (standard) or "mpls" (modified PLS).
Default is "pls". |
scale |
Logical. Scale data? Default is FALSE. |
return_projection |
Logical. Return projection object?
Default is FALSE. |
Value
An object of class c("diss_pls", "diss_method").
Author(s)
See Also
Component selection: ncomp_by_var, ncomp_by_cumvar,
ncomp_by_opc, ncomp_fixed
Other dissimilarity methods: diss_pca,
diss_correlation, diss_euclidean,
diss_cosine, diss_mahalanobis
Examples
# Default: OPC optimization (recommended)
diss_pls()
# Fixed number of components
diss_pls(ncomp = 15)
# Custom opc settings
diss_pls(ncomp = ncomp_by_opc(max_ncomp = 50))
# Modified PLS
diss_pls(ncomp = 10, method = "mpls")
Compute dissimilarity matrices
Description
Computes dissimilarity matrices between observations using various methods. This is the main interface for dissimilarity computation in the resemble package.
Usage
dissimilarity(Xr, Xu = NULL, diss_method = diss_pca(), Yr = NULL)
Arguments
Xr |
A numeric matrix of reference observations (rows) and variables (columns). |
Xu |
Optional matrix of additional observations with the same variables. |
diss_method |
A dissimilarity method object created by one of:
diss_pca, diss_pls, diss_correlation, diss_euclidean,
diss_mahalanobis or diss_cosine.
Default is diss_pca(). |
Yr |
Optional response matrix. Required for PLS methods and when using
ncomp_by_opc component selection. |
Details
The function dispatches to the appropriate internal computation based on the
class of diss_method. Each method constructor (e.g., diss_pca())
encapsulates all method-specific parameters including component selection,
centering, scaling, and whether to return projections.
Output dimensions
When only Xr is provided, the function computes pairwise dissimilarities
among all observations in Xr, returning a symmetric
nrow(Xr) \(\times\) nrow(Xr) matrix.
When both Xr and Xu are provided, the function computes
dissimilarities between each observation in Xr and each observation
in Xu, returning a nrow(Xr) \(\times\) nrow(Xu)
matrix where element \((i, j)\) is the dissimilarity between the
\(i\)-th observation in Xr and the \(j\)-th observation
in Xu.
Mahalanobis distance
Note that diss_mahalanobis() computes Mahalanobis distance directly on
the input variables. This requires the covariance matrix to be invertible,
which fails when the number of variables exceeds the number of observations
or when variables are highly correlated (common in spectral data). For such
cases, use diss_pca() or diss_pls() instead.
Value
A list of class "dissimilarity" containing:
- dissimilarity
The computed dissimilarity matrix. Dimensions are nrow(Xr) \(\times\) nrow(Xr) when Xu = NULL, or nrow(Xr) \(\times\) nrow(Xu) otherwise.
- diss_method
The diss_* constructor object used for computation.
- center
Vector used to center the data.
- scale
Vector used to scale the data.
- ncomp
Number of components used (for projection methods).
- projection
If return_projection = TRUE in the method constructor, the ortho_projection object.
Author(s)
References
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J.A.M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
See Also
diss_pca, diss_pls,
diss_correlation, diss_euclidean,
diss_mahalanobis, diss_cosine
Examples
library(prospectr)
data(NIRsoil)
# Preprocess
sg <- savitzkyGolay(NIRsoil$spc, m = 1, p = 4, w = 15)
Xr <- sg[as.logical(NIRsoil$train), ]
Xu <- sg[!as.logical(NIRsoil$train), ]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Xu <- Xu[!is.na(Yu), ]
Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr)]
# PCA-based dissimilarity with variance-based selection
d1 <- dissimilarity(Xr, Xu, diss_method = diss_pca())
# PCA with OPC selection (requires Yr)
d2 <- dissimilarity(Xr, Xu,
Yr = Yr,
diss_method = diss_pca(
ncomp = ncomp_by_opc(30),
return_projection = TRUE
)
)
# PLS-based dissimilarity
d3 <- dissimilarity(
Xr, Xu,
Yr = Yr,
diss_method = diss_pls(
ncomp = ncomp_by_opc(30)
)
)
# Euclidean distance
d4 <- dissimilarity(Xr, Xu, diss_method = diss_euclidean())
# Correlation dissimilarity with moving window
d5 <- dissimilarity(Xr, Xu, diss_method = diss_correlation(ws = 41))
# Mahalanobis distance (use only when n > p and low collinearity)
# d6 <- dissimilarity(Xr[, 1:20], Xu[, 1:20],
# diss_method = diss_mahalanobis())
Local fitting method constructors
Description
These functions create configuration objects that specify how local
regression models are fitted within the mbl function.
Usage
fit_pls(ncomp, method = c("pls", "mpls", "simpls"),
scale = FALSE, max_iter = 100L, tol = 1e-6)
fit_wapls(min_ncomp, max_ncomp, method = c("mpls", "pls", "simpls"),
scale = FALSE, max_iter = 100L, tol = 1e-6)
fit_gpr(noise_variance = 0.001, center = TRUE, scale = TRUE)
Arguments
ncomp |
an integer indicating the number of PLS components to use
in local regressions when fit_pls is used. |
min_ncomp |
an integer indicating the minimum number of PLS components
to use in local regressions when fit_wapls is used. |
max_ncomp |
an integer indicating the maximum number of PLS components
to use in local regressions when fit_wapls is used. |
method |
a character string indicating the PLS algorithm to use. Options are:
'pls', 'mpls' and 'simpls'.
Default is 'pls' for fit_pls and 'mpls' for fit_wapls. |
scale |
logical indicating whether predictors must be scaled.
Default is FALSE for fit_pls and fit_wapls, and TRUE for fit_gpr. |
max_iter |
an integer indicating the maximum number of iterations
for convergence in the NIPALS algorithm. Only used when
method is 'pls' or 'mpls'. Default is 100. |
tol |
a numeric value indicating the convergence tolerance for
calculating scores in the NIPALS algorithm. Only used when
method is 'pls' or 'mpls'. Default is 1e-6. |
noise_variance |
a numeric value indicating the variance of the noise
for Gaussian process local regressions (fit_gpr). Default is 0.001. |
center |
logical indicating whether predictors should be centered
before fitting. Only used for fit_gpr. Default is TRUE. |
Details
These functions create configuration objects that are passed to
mbl to specify how local regression models are fitted.
There are three fitting methods available:
Partial least squares (fit_pls)
Uses orthogonal scores partial least squares regression. Three algorithm variants are available:
Standard PLS (method = 'pls'): Uses the NIPALS algorithm with covariance-based weights.
Modified PLS (method = 'mpls'): Uses the NIPALS algorithm with correlation-based weights. Proposed by Shenk and Westerhaus (1991), this approach gives equal influence to all predictors regardless of their variance scale.
SIMPLS (method = 'simpls'): Uses the SIMPLS algorithm (de Jong, 1993), which deflates the cross-product matrix rather than X itself. This is computationally faster, especially for wide matrices, and produces identical predictions to standard PLS.
The only parameter to optimise is the number of PLS components
(ncomp).
Weighted average PLS (fit_wapls)
This method was developed by Shenk et al. (1997) and is used as the
regression method in the LOCAL algorithm. It fits multiple PLS models
using different numbers of components (from min_ncomp to
max_ncomp). The final prediction is a weighted average of
predictions from all models, where the weight for component \(j\)
is:

\[w_{j} = \frac{1}{s_{1:j} \times g_{j}}\]

where \(s_{1:j}\) is the root mean square of the spectral reconstruction error of the target observation(s) when \(j\) PLS components are used, and \(g_{j}\) is the root mean square of the squared regression coefficients for the \(j\)th component.
The same algorithm variants ('pls', 'mpls', 'simpls')
are available. The default is 'mpls' following the original LOCAL
implementation.
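A numeric toy of the weighting scheme (the s and g values below are made up for illustration, not taken from a real fit):

```r
# hypothetical per-component diagnostics for j = 1..4:
# s[j] = reconstruction RMS with j components, g[j] = RMS of the squared
# regression coefficients of component j (illustrative values only)
s <- c(0.90, 0.70, 0.50, 0.45)
g <- c(0.20, 0.35, 0.60, 0.90)

w <- 1 / (s * g)    # w_j = 1 / (s_{1:j} * g_j)
w <- w / sum(w)     # normalize so the weights sum to 1

preds <- c(10.2, 10.0, 9.8, 9.9)   # predictions from the 4 candidate models
y_wapls <- sum(w * preds)          # weighted-average prediction
```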
Gaussian process regression (fit_gpr)
Gaussian process regression is a non-parametric Bayesian method characterised by a mean and covariance function. This implementation uses a dot product covariance.
The prediction vector \(A\) is computed from training data (\(X\), \(Y\)) as:

\[A = (X X^{T} + \sigma^2 I)^{-1} Y\]

where \(\sigma^2\) is the noise variance and \(I\) is the identity matrix. Prediction for a new observation \(x_{u}\) is:

\[\hat{y}_{u} = x_{u} X^{T} A\]

The only parameter is the noise variance (noise_variance).
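Because the covariance is a plain dot product, the predictive equations reduce to a few lines of linear algebra. The base-R sketch below (toy data, not the package's implementation) also checks agreement with the equivalent primal ridge form:

```r
set.seed(1)
n <- 30; p <- 8
X <- matrix(rnorm(n * p), n, p)            # training predictors
Y <- X %*% rnorm(p) + rnorm(n, sd = 0.1)   # training responses
sigma2 <- 0.001                            # noise_variance

A <- solve(X %*% t(X) + sigma2 * diag(n)) %*% Y   # A = (XX' + s2 I)^-1 Y
x_u <- matrix(rnorm(p), nrow = 1)                 # a new observation
y_hat <- as.numeric(x_u %*% t(X) %*% A)           # y_u = x_u X' A

# equivalent primal ridge form: x_u (X'X + s2 I)^-1 X' Y
y_hat_primal <- as.numeric(
  x_u %*% solve(t(X) %*% X + sigma2 * diag(p)) %*% t(X) %*% Y
)
```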
Value
An object of class c("fit_<method>", "fit_method")
containing the specified parameters. This object is passed to
mbl to configure local model fitting.
Author(s)
References
de Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3), 251-263.
Rasmussen, C.E., Williams, C.K. (2006). Gaussian Processes for Machine Learning. MIT Press.
Shenk, J.S., & Westerhaus, M.O. (1991). Population structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.
Shenk, J., Westerhaus, M., & Berzaghi, P. (1997). Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.
Westerhaus, M. (2014). Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.
See Also
Examples
# PLS with 10 components using standard algorithm
fit_pls(ncomp = 10)
# PLS with modified algorithm (correlation-based weights)
fit_pls(ncomp = 10, method = "mpls")
# PLS with SIMPLS (faster, no iteration)
fit_pls(ncomp = 10, method = "simpls")
# Weighted average PLS (LOCAL-style)
fit_wapls(min_ncomp = 3, max_ncomp = 12)
# Weighted average PLS with SIMPLS
fit_wapls(min_ncomp = 3, max_ncomp = 15, method = "simpls")
# Gaussian process regression
fit_gpr()
fit_gpr(noise_variance = 0.01)
Evolutionary sample search for context-specific calibrations
Description
Implements an evolutionary search algorithm that selects a subset from large reference datasets (e.g., spectral libraries) to build context-specific calibrations. The algorithm iteratively removes weak or non-informative samples based on prediction error, spectral reconstruction error, or dissimilarity criteria. This implementation is based on the methods proposed in Ramirez-Lopez et al. (2026a).
Usage
## Default S3 method:
gesearch(Xr, Yr, Xu, Yu = NULL, Yu_lims = NULL,
k, b, retain = 0.95, target_size = k,
fit_method = fit_pls(ncomp = 10),
optimization = "reconstruction",
group = NULL, control = gesearch_control(),
intermediate_models = FALSE,
verbose = TRUE, seed = NULL, pchunks = 1L, ...)
## S3 method for class 'formula'
gesearch(formula, train, test, k, b, target_size, fit_method,
..., na_action = na.pass)
## S3 method for class 'gesearch'
predict(object, newdata, type = "response",
what = c("final", "all_generations"), ...)
## S3 method for class 'gesearch'
plot(x, which = c("weakness", "removed"), ...)
Arguments
Xr |
A numeric matrix of predictor variables for the reference data (observations in rows, variables in columns). |
Yr |
A numeric vector or single-column matrix of response values
corresponding to Xr. |
Xu |
A numeric matrix of predictor variables for target observations
(same structure as Xr). |
Yu |
An optional numeric vector or single-column matrix of response
values for Xu. Required when optimization includes "response". |
Yu_lims |
A numeric vector of length 2 specifying expected response
limits for the target population. Used with the "response" optimization when Yu is not available. |
k |
An integer specifying the number of samples in each resampling subset (gene size). |
b |
An integer specifying the target average number of times each training sample is evaluated per iteration. Higher values (e.g., >40) produce more stable results but increase computation time. |
retain |
A numeric value in (0, 1] specifying the proportion of samples
retained per iteration. Default is 0.95. Values >0.9 are recommended for
stability. See gesearch_control. |
target_size |
An integer specifying the target number of selected
samples (gene pool size). Must be >= k. Default is k. |
fit_method |
A fit method object created with fit_pls or fit_wapls.
Default is fit_pls(ncomp = 10). |
optimization |
A character vector specifying optimization criteria: "response",
"reconstruction" and/or "similarity". Multiple criteria can be combined,
e.g., c("reconstruction", "similarity"). Default is "reconstruction". |
group |
An optional factor assigning group labels to training observations. Used for leave-group-out cross-validation to avoid pseudo-replication. |
control |
A list created with gesearch_control. |
intermediate_models |
A logical indicating whether to store models for
each intermediate generation. Default is FALSE. |
verbose |
A logical indicating whether to print progress information.
Default is TRUE. |
seed |
An integer for random number generation to ensure
reproducibility. Default is NULL. |
pchunks |
An integer specifying the number of chunks used for memory-efficient parallel processing. Larger values divide the workload into smaller pieces, which can help reduce memory pressure. Default is 1L. |
formula |
A formula of the form response ~ predictors. |
train |
A data.frame containing training data with model variables. |
test |
A data.frame containing test data with model variables. |
na_action |
A function for handling missing values in training data.
Default is na.pass. |
object |
A fitted gesearch object. |
newdata |
A matrix or data.frame of new observations. For formula-fitted models, a data.frame containing all predictor variables is accepted. For non-formula models, a matrix is required. |
type |
A character string specifying the prediction type. Currently only
"response" is supported. |
what |
A character string specifying which models to use for prediction:
"final" (the default) or "all_generations". |
x |
A gesearch object. |
which |
Character string specifying what to plot:
"weakness" (the default) or "removed". |
... |
Additional arguments passed to methods. |
Details
The gesearch algorithm requires a large reference dataset (Xr)
where the sample search is conducted, target observations (Xu), and
three tuning parameters: k, b, and retain.
The target observations (Xu) should represent the population of
interest. These may be selected via algorithms like Kennard-Stone when
response values are unavailable.
The algorithm iteratively removes weak samples from Xr based on:
Increased RMSE when predicting Yu
Increased PLS reconstruction error on Xu
Increased dissimilarity to Xu in PLS space
A resampling scheme identifies samples that consistently appear in
high-error subsets. These are labeled weak and removed. The process
continues until approximately target_size samples remain.
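The resampling-and-removal idea can be caricatured in a few lines of base R. The sketch below uses a made-up per-sample error level in place of the prediction, reconstruction and dissimilarity criteria, and a simplified retention rule; it is not the gesearch implementation, only an illustration of the loop structure:

```r
# toy sketch: samples belonging to high-error resampled subsets accumulate
# "weakness"; the weakest samples are dropped until target_size remain
set.seed(42)
n <- 300
latent_error <- runif(n)   # stand-in for per-sample error contribution
pool <- seq_len(n)
k <- 30; b <- 20; retain <- 0.9; target_size <- 100

while (length(pool) > target_size) {
  n_subsets <- ceiling(b * length(pool) / k)   # ~b appearances per sample
  weakness <- setNames(numeric(length(pool)), pool)
  for (i in seq_len(n_subsets)) {
    subset <- sample(pool, k)
    # flag subsets whose mean error exceeds the pool median
    if (mean(latent_error[subset]) > median(latent_error[pool])) {
      weakness[as.character(subset)] <- weakness[as.character(subset)] + 1
    }
  }
  # keep the least-weak fraction, never dropping below target_size
  keep_n <- max(target_size, floor(retain * length(pool)))
  pool <- as.integer(names(sort(weakness)[seq_len(keep_n)]))
}

length(pool)   # exactly target_size samples remain
```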
The gesearch() function also returns a final model fitted on the selected
samples, which can be used for prediction. This model is internally validated
by cross-validation using only the selected samples from the training/reference
set. If Yu is available, a model fitted only on the selected reference samples
is first used to predict the target samples. The final model is then refitted
using both the selected reference samples and the target samples used to guide
the search, provided that response values are available for those target samples.
Parameter guidance
k: Number of samples per resampling subset. See Lobsey et al. (2017) for guidance.
b: Resampling intensity. Higher values increase stability but also computational cost.
retain: Proportion retained per iteration. Values >0.9 recommended.
Prediction
The predict method generates predictions from a fitted
gesearch object. If the model was fitted with a formula,
newdata is validated and transformed to the appropriate model matrix.
When what = "all_generations", the return value is a named list with
one element per generation, where each element contains a prediction
matrix. This option requires intermediate_models = TRUE during
fitting.
Value
For gesearch: A list of class "gesearch" containing:
x_local: Matrix of predictors for selected samples.
y_local: Vector of responses for selected samples.
indices: Indices of selected samples from the original training set.
complete_iter: Number of completed iterations.
iter_weakness: List with iteration-level weakness statistics.
samples: List of sample indices retained at each iteration.
n_removed: data.frame of samples removed per iteration.
control: Copy of control parameters.
fit_method: Fit constructor from fit_method.
validation_results: Cross-validation results on the selected training samples, and validation on the test set using models built only with the samples found.
final_models: Final PLS model containing coefficients, loadings, scores, VIP, and selectivity ratios.
intermediate_models: List of models per generation (if intermediate_models = TRUE).
seed: RNG seed used.
For predict.gesearch:
If what = "final": a prediction matrix with nrow(newdata) rows and one column per PLS component.
If what = "all_generations": a named list of generations, where each generation contains a prediction matrix as above.
Author(s)
Leonardo Ramirez-Lopez, Claudio Orellano, Craig Lobsey, Raphael Viscarra Rossel
References
Lobsey, C.R., Viscarra Rossel, R.A., Roudier, P., Hedley, C.B. 2017. rs-local data-mines information from spectral libraries to improve local calibrations. European Journal of Soil Science 68:840-852.
Kennard, R.W., Stone, L.A. 1969. Computer aided design of experiments. Technometrics 11:137-148.
Rajalahti, T., Arneberg, R., Berven, F.S., Myhr, K.M., Ulvik, R.J., Kvalheim, O.M. 2009. Biomarker discovery in mass spectral profiles by means of selectivity ratio plot. Chemometrics and Intelligent Laboratory Systems 95:35-48.
Ramirez-Lopez, L., Viscarra Rossel, R., Behrens, T., Orellano, C., Perez-Fernandez, E., Kooijman, L., Wadoux, A. M. J.-C., Breure, T., Summerauer, L., Safanelli, J. L., & Plans, M. (2026a). When spectral libraries are too complex to search: Evolutionary subset selection for domain-adaptive calibration. Analytica Chimica Acta, under review.
See Also
fit_pls, gesearch_control, mbl
Examples
## Not run:
library(prospectr)
data(NIRsoil)
# Preprocess
sg_det <- savitzkyGolay(
detrend(NIRsoil$spc, wav = as.numeric(colnames(NIRsoil$spc))),
m = 1, p = 1, w = 7
)
NIRsoil$spc_pr <- sg_det
# Split data
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$Ciso), ]
train_y <- NIRsoil$Ciso[NIRsoil$train == 1 & !is.na(NIRsoil$Ciso)]
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$Ciso), ]
test_y <- NIRsoil$Ciso[NIRsoil$train == 0 & !is.na(NIRsoil$Ciso)]
# Basic search with reconstruction and similarity optimizations
gs <- gesearch(
Xr = train_x, Yr = train_y,
Xu = test_x, Yu = test_y,
k = 50, b = 100, retain = 0.97,
target_size = 200,
fit_method = fit_pls(ncomp = 15, method = "mpls"),
optimization = c("reconstruction", "similarity"),
control = gesearch_control(retain_by = "probability"),
seed = 42
)
# Predict
preds <- predict(gs, test_x)
# Plot progress
plot(gs)
plot(gs, which = "removed")
# With reconstruction and response optimization (requires Yu)
gs_response <- gesearch(
Xr = train_x, Yr = train_y,
Xu = test_x, Yu = test_y,
k = 50, b = 100, retain = 0.97,
target_size = 200,
fit_method = fit_pls(ncomp = 15),
optimization = c("reconstruction", "response"),
seed = 42
)
# Parallel processing
library(doParallel)
n_cores <- min(2, parallel::detectCores() - 1)
cl <- makeCluster(n_cores)
registerDoParallel(cl)
gs_parallel <- gesearch(
Xr = train_x, Yr = train_y,
Xu = test_x,
k = 50, b = 100, retain = 0.97,
target_size = 200,
fit_method = fit_pls(ncomp = 15),
pchunks = 3,
seed = 42
)
stopCluster(cl)
registerDoSEQ()
## End(Not run)
Control parameters for gesearch
Description
Creates a control object specifying algorithm parameters for
gesearch.
Usage
gesearch_control(
retain_by = c("probability", "proportion"),
percentile_type = 7L,
tune = FALSE,
number = 10L,
p = 0.75,
stagnation_limit = 5L,
allow_parallel = TRUE,
blas_threads = 1L
)
Arguments
retain_by |
A character string specifying how training observations are selected at each iteration: "probability" (default) or "proportion". See Details. |
percentile_type |
An integer between 1 and 9 specifying the quantile algorithm (the type argument of quantile) used when retain_by = "probability". Default is 7. |
tune |
A logical indicating whether to tune regression parameters (e.g., number of PLS components) via cross-validation at each iteration. Increases computation time substantially. Default is FALSE. |
number |
An integer specifying the number of groups for leave-group-out cross-validation. Default is 10. This is used for validating the final models built with the samples found. When tune = TRUE, it also controls the number of cross-validation groups used for parameter tuning at each iteration. |
p |
A numeric value in (0, 1) specifying the proportion of observations per group in leave-group-out cross-validation. Default is 0.75. When tune = TRUE, it also controls the training proportion used during tuning. |
stagnation_limit |
An integer specifying the maximum number of consecutive iterations with no change in gene pool size before early termination. Prevents infinite loops when target size cannot be reached. Default is 5. |
allow_parallel |
A logical indicating whether to enable parallel processing for internal resampling and calibration. The parallel backend must be registered by the user. Default is TRUE. |
blas_threads |
An integer specifying the number of BLAS threads to use during computation. Default is 1, which avoids multi-threaded OpenBLAS overhead on Linux. Requires RhpcBLASctl. See Details. |
Details
Retention strategies
When retain_by = "probability" (default), observations with errors
below a percentile threshold are retained. The percentile is computed using
quantile with probs set to the retain
value from gesearch. This approach is more robust when outlier
observations have extreme error values.
When retain_by = "proportion", a fixed fraction of observations
(specified by the retain argument in gesearch) with the
lowest associated errors are kept at each iteration.
Cross-validation for tuning
When tune = TRUE, leave-group-out cross-validation is used to select
optimal regression parameters at each iteration. The number argument
controls how many CV groups are formed, and p controls the proportion
of observations in each group.
BLAS threading
On Linux systems with multi-threaded OpenBLAS, the default thread count
can cause significant overhead for algorithms that perform many small
matrix operations (like the iterative PLS fits in gesearch).
Setting blas_threads = 1 (the default) eliminates this overhead.
This setting requires the RhpcBLASctl package. If not installed,
the parameter is ignored and a message is displayed. The original thread
count is restored when gesearch completes.
Value
A list of class "gesearch_control" containing the specified
parameters.
References
Lobsey, C.R., Viscarra Rossel, R.A., Roudier, P., Hedley, C.B. 2017. rs-local data-mines information from spectral libraries to improve local calibrations. European Journal of Soil Science 68:840-852.
See Also
gesearch
Examples
# Default parameters (probability-based retention)
gesearch_control()
# Proportion-based retention
gesearch_control(retain_by = "proportion")
# Enable parameter tuning with custom CV settings
gesearch_control(tune = TRUE, number = 5, p = 0.8)
Build a precomputed library of localised experts using memory-based learning
Description
Constructs a library of local predictive models based on memory-based learning (MBL). For each anchor observation, a local regression model is fitted using its nearest neighbors from the reference set. This implementation is based on the methods proposed in Ramirez-Lopez et al. (2026b).
Usage
liblex(Xr, Yr, neighbors,
diss_method = diss_pca(ncomp = ncomp_by_opc()),
fit_method = fit_wapls(min_ncomp = 3, max_ncomp = 15),
anchor_indices = NULL, gh = TRUE, group = NULL,
control = liblex_control(), verbose = TRUE, ...)
## S3 method for class 'liblex'
predict(object, newdata, diss_method = NULL,
weighting = c("gaussian", "tricube", "triweight", "triangular",
"quartic", "parabolic", "cauchy", "none"),
adaptive_bandwidth = TRUE, reliability_weighting = TRUE,
range_prediction_limits = FALSE, residual_cutoff = NULL,
enforce_indices = NULL, probs = c(0.05, 0.25, 0.5, 0.75, 0.95),
verbose = TRUE, allow_parallel = TRUE, blas_threads = 1L, ...)
## S3 method for class 'liblex'
plot(x, ...)
Arguments
Xr |
A numeric matrix of predictor variables with observations in rows and variables in columns. |
Yr |
A numeric vector or single-column matrix of length nrow(Xr) with the response values. Missing values are permitted (see Details). |
neighbors |
A neighbor selection object specifying how to select neighbors. Use neighbors_k(). |
diss_method |
For liblex: a dissimilarity method object. Default is diss_pca(ncomp = ncomp_by_opc()). For predict: an optional dissimilarity method; the default NULL reuses the method stored in the fitted library. |
fit_method |
A fit method object created by fit_pls() or fit_wapls(). Default is fit_wapls(min_ncomp = 3, max_ncomp = 15). |
anchor_indices |
An optional integer vector specifying row indices of Xr to be used as anchor observations. Default is NULL, in which case all observations are used as anchors. |
gh |
Logical indicating whether to compute the GH distance (Mahalanobis distance in PLS score space) for each anchor observation. Default is TRUE. |
group |
An optional factor assigning group labels to observations in Xr, used to avoid information leakage during validation. Default is NULL. |
control |
A list of control parameters created by liblex_control(). |
verbose |
Logical indicating whether to display progress messages. Default is TRUE. |
... |
Additional arguments (currently unused). |
object |
A fitted object of class liblex. |
newdata |
A numeric matrix or data frame containing new predictor values. Must include all predictors used in the fitted library. |
weighting |
Character string specifying the kernel weighting function applied to neighbours when combining predictions. Options are: "gaussian" (default), "tricube", "triweight", "triangular", "quartic", "parabolic", "cauchy", and "none". See Details. |
adaptive_bandwidth |
Logical indicating whether to use adaptive bandwidth for kernel weighting. When TRUE (default), dissimilarities are rescaled to [0, 1] within each prediction neighbourhood before the kernel is applied. |
reliability_weighting |
Logical indicating whether to weight expert predictions by their estimated reliability. Default is TRUE. |
probs |
A numeric vector of probabilities in (0, 1) at which weighted quantiles of the expert predictions are computed. Default is c(0.05, 0.25, 0.5, 0.75, 0.95). |
range_prediction_limits |
Logical. If TRUE, predictions are constrained to the range of response values observed in each neighbourhood. Default is FALSE. |
residual_cutoff |
Numeric threshold for excluding models. Models with absolute residuals exceeding this value are penalized during neighbour selection. Default is NULL (no penalization). |
enforce_indices |
Optional integer vector specifying model indices that must always be included in each prediction neighborhood. These models are assigned the minimum dissimilarity of the neighborhood to ensure selection. Default is NULL. |
allow_parallel |
Logical indicating whether parallel computation is permitted if a backend is registered. Default is TRUE. |
blas_threads |
Integer specifying the number of BLAS threads to use. Default is 1. See liblex_control. |
x |
An object of class liblex (for the plot method). |
Details
By default, local models are constructed for all n observations in the
reference set. Alternatively, specify a subset of m observations
(m < n) via anchor_indices to reduce computation.
Each local model uses neighbors selected from the full reference set, but models are only built for anchor observations. This is useful for large datasets where building models for all observations is computationally prohibitive.
When dissimilarity methods depend on Yr (e.g., PLS-based distances), the
response values of anchor observations are excluded during dissimilarity
computation for efficiency. However, anchor response values are always
used when fitting local models.
The number of anchors must not exceed 90% of nrow(Xr); to build models
for all observations, use anchor_indices = NULL.
Relationship between anchors and neighborhood size
The neighbors argument controls the neighborhood size (k) used both
for fitting local models and for retrieving experts during prediction.
When anchor_indices is specified, the number of available experts equals
the number of anchors. If max(k) exceeds the number of anchors and
tuning selects a large optimal k, prediction will retrieve fewer experts
than specified. For reliable predictions, ensure the number of anchors is
at least as large as the maximum k value being evaluated.
Missing values in Yr
Missing values in Yr are permitted. Observations with missing response
values can still serve as neighbors but are excluded from model fitting
as target observations.
GH distance
The GH distance is computed independently from diss_method using a PLS
projection with optimized component selection. This provides a measure of
how far each observation lies from the center of the reference set in the
PLS score space.
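The idea behind a GH-type distance can be sketched in base R. Note that liblex projects with PLS and optimizes the number of components; the sketch below substitutes a fixed five-component PCA projection purely for illustration:

```r
set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100)      # stand-in for spectra

# project onto the first 5 principal components (liblex uses PLS scores)
scores <- prcomp(X, center = TRUE)$x[, 1:5]

# squared Mahalanobis distance of each observation to the centre of the scores
d2 <- mahalanobis(scores, center = colMeans(scores), cov = cov(scores))
gh_like <- sqrt(d2)                           # back to a distance scale
```

Observations with large gh_like values lie far from the centre of the reference set and flag potential extrapolation.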
Validation and tuning
When control$mode = "validate" or control$tune = TRUE, nearest-neighbor
cross-validation is performed. For each anchor observation, its nearest
neighbor is excluded, a model is fitted on remaining neighbors, and the
excluded neighbor's response is predicted. This provides validation
statistics for parameter selection.
Prediction
For each observation in newdata, the predict method:
1. Computes dissimilarities to the anchor observations (or their neighbourhood centres) stored in object.
2. Selects the k nearest neighbours based on the optimal k determined during model fitting.
3. Applies kernel weighting based on dissimilarity.
4. Combines expert predictions using weighted averaging.
Kernel weighting functions
The weighting functions follow Cleveland and Devlin (1988). Let d be
the normalised dissimilarity (scaled to [0, 1] within the neighbourhood
when adaptive_bandwidth = TRUE). The available kernels are:
- "gaussian": w = exp(-d^2)
- "tricube": w = (1 - d^3)^3
- "triweight": w = (1 - d^2)^3
- "triangular": w = 1 - d
- "quartic": w = (1 - d^2)^2
- "parabolic": w = 1 - d^2
- "cauchy": w = 1 / (1 + d^2)
- "none": w = 1 (equal weights)
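A minimal base-R sketch of these kernels and the subsequent weighted averaging (the dissimilarities and expert predictions below are hypothetical, and d / max(d) is used as a simple stand-in for the adaptive [0, 1] scaling):

```r
kernels <- list(
  gaussian   = function(d) exp(-d^2),
  tricube    = function(d) (1 - d^3)^3,
  triweight  = function(d) (1 - d^2)^3,
  triangular = function(d) 1 - d,
  quartic    = function(d) (1 - d^2)^2,
  parabolic  = function(d) 1 - d^2,
  cauchy     = function(d) 1 / (1 + d^2),
  none       = function(d) rep(1, length(d))
)

d <- c(0.2, 0.5, 1.1, 1.6)           # dissimilarities to 4 neighbouring experts
d01 <- d / max(d)                    # scale into [0, 1]

w <- kernels$tricube(d01)            # tricube weights (farthest neighbour -> 0)
preds <- c(10.2, 9.8, 11.0, 10.5)    # hypothetical expert predictions
y_hat <- sum(w * preds) / sum(w)     # weighted combination
```

Closer experts receive larger weights; with "none" all experts contribute equally.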
Value
For liblex: A list of class "liblex" (when
control$mode = "build") or "liblex_validation" (when
control$mode = "validate") containing:
- dissimilarity: List containing the dissimilarity method and matrix.
- fit_method: Fit constructor from fit_method.
- gh: If gh = TRUE, a list with GH distances and the PLS projection.
- results: Data frame of validation statistics for each parameter combination (if validation was performed).
- best: The optimal parameter combination based on control$metric.
- optimal_params: List with optimal k and ncomp values.
- residuals: Residuals from predictions using optimal parameters.
- coefficients: (Build mode only) List of regression coefficients: B0 (intercepts), B (slopes).
- vips: (Build mode only) Variable importance in projection scores.
- selectivity_ratios: (Build mode only) Selectivity ratios for each predictor.
- scaling: (Build mode only) Centering and scaling vectors for prediction.
- neighborhood_stats: Statistics (response quantiles) for each neighborhood size.
- anchor_indices: The anchor indices used.
- neighbors: The object passed to neighbors.
For predict.liblex: A list with the following components:
- predictions: A data frame containing:
  - pred: Weighted mean predictions.
  - pred_sd: Weighted standard deviation of expert predictions.
  - q*: Weighted quantiles at the probabilities specified by probs.
  - gh: Global Mahalanobis distance (if computed during fitting).
  - min_yr: Minimum response value (5th percentile) across neighbours.
  - max_yr: Maximum response value (95th percentile) across neighbours.
  - below_min: Logical indicating a prediction below min_yr.
  - above_max: Logical indicating a prediction above max_yr.
- neighbors: A list with:
  - indices: Matrix of neighbour indices (models) for each observation.
  - dissimilarities: Matrix of corresponding dissimilarity scores.
- expert_predictions: A list with:
  - weights: Matrix of kernel weights applied to each expert.
  - predictions: Matrix of raw predictions from each expert.
  - weighted: Matrix of weighted predictions from each expert.
References
Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403), 596–610.
Naes, T., Isaksson, T., & Kowalski, B. (1990). Locally weighted regression and scatter correction for near-infrared reflectance data. Analytical Chemistry, 62(7), 664–673.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. (2013). The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma, 195-196, 268-279.
Ramirez-Lopez, L., Metz, M., Lesnoff, M., Orellano, C., Perez-Fernandez, E., Plans, M., Breure, T., Behrens, T., Viscarra Rossel, R., & Peng, Y. (2026b). Rethinking local spectral modelling: From per-query refitting to model libraries. Analytica Chimica Acta, under review.
Rajalahti, T., Arneberg, R., Berven, F.S., Myhr, K.M., Ulvik, R.J., Kvalheim, O.M. (2009). Biomarker discovery in mass spectral profiles by means of selectivity ratio plot. Chemometrics and Intelligent Laboratory Systems, 95(1), 35-48.
See Also
liblex_control() for control parameters, neighbors_k() for neighborhood
specification, diss_pca(), diss_pls(), diss_correlation() for
dissimilarity methods, fit_pls(), fit_wapls() for fitting methods.
Examples
## Not run:
library(prospectr)
data(NIRsoil)
# Preprocess spectra
NIRsoil$spc_pr <- savitzkyGolay(
detrend(NIRsoil$spc, wav = as.numeric(colnames(NIRsoil$spc))),
m = 1, p = 1, w = 7
)
# Missing values in the response are allowed
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1, ]
train_y <- NIRsoil$Ciso[NIRsoil$train == 1]
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0, ]
test_y <- NIRsoil$Ciso[NIRsoil$train == 0]
# Build library
model_library <- liblex(
Xr = train_x,
Yr = train_y,
neighbors = neighbors_k(c(30, 40)),
diss_method = diss_correlation(ws = 27, scale = TRUE),
fit_method = fit_wapls(
min_ncomp = 4,
max_ncomp = 17,
scale = FALSE,
method = "mpls"
),
control = liblex_control(tune = TRUE)
)
# Visualise neighborhood centroids and samples to predict
matplot(
as.numeric(colnames(model_library$scaling$local_x_center)),
t(test_x),
col = rgb(1, 0, 0, 0.3),
lty = 1,
type = "l",
xlab = "Wavelength (nm)",
ylab = "First derivative detrended absorbance"
)
matlines(
as.numeric(colnames(model_library$scaling$local_x_center)),
t(model_library$scaling$local_x_center),
col = rgb(0, 0, 1, 0.3),
lty = 1,
type = "l"
)
grid(lty = 1)
legend(
"topright",
legend = c("Samples to predict", "Neighborhood centroids"),
col = c(rgb(1, 0, 0, 0.8), rgb(0, 0, 1, 0.8)),
lty = 1,
lwd = 2,
bty = "n"
)
# Predict new observations
y_hat_liblex <- predict(model_library, test_x)
# Predicted versus observed values
lims <- range(y_hat_liblex$predictions$pred, test_y, na.rm = TRUE)
plot(
y_hat_liblex$predictions$pred,
test_y,
pch = 16,
col = rgb(0, 0, 0, 0.5),
xlab = "Predicted",
ylab = "Observed",
xlim = lims,
ylim = lims
)
abline(a = 0, b = 1, col = "red")
grid(lty = 1)
## run liblex in parallel (requires a parallel backend, e.g., doParallel)
library(doParallel)
n_cores <- min(2, parallel::detectCores() - 1)
clust <- makeCluster(n_cores)
registerDoParallel(clust)
model_library2 <- liblex(
Xr = train_x,
Yr = train_y,
neighbors = neighbors_k(c(30, 40)),
fit_method = fit_wapls(min_ncomp = 4, max_ncomp = 17, method = "simpls")
)
y_hat_liblex2 <- predict(model_library2, test_x)
registerDoSEQ()
try(stopCluster(clust))
## End(Not run)
Control parameters for liblex
Description
Specifies control parameters for the liblex function, including
output options, validation settings, tuning behavior, and parallel execution.
Usage
liblex_control(
return_dissimilarity = FALSE,
mode = c("build", "validate"),
tune = FALSE,
metric = c("rmse", "r2"),
chunk_size = 1L,
allow_parallel = TRUE,
blas_threads = 1L
)
Arguments
return_dissimilarity |
A logical indicating whether the dissimilarity matrix should be returned in the output. Default is FALSE. |
mode |
A character string specifying the operation mode: "build" (default) constructs the library; "validate" only computes validation statistics. See Details. |
tune |
A logical indicating whether to optimize parameters via nearest-neighbor validation. Default is FALSE. |
metric |
A character string specifying the performance metric used for parameter selection when tuning or validating: "rmse" (default) or "r2". Ignored when tune = FALSE and mode = "build". |
chunk_size |
An integer specifying the number of local models to process per parallel task. Default is 1. See Details. |
allow_parallel |
A logical indicating whether parallel execution is permitted. Default is TRUE. |
blas_threads |
An integer specifying the number of threads for BLAS operations. Default is 1. |
Details
Nearest-neighbor validation
When tune = TRUE or mode = "validate", the function performs
nearest-neighbor validation (NNv) to assess model performance. For each
observation in the reference set (or anchor set, if specified), the
procedure:
1. Identifies the k nearest neighbors of the target observation.
2. Excludes the target observation (and any observations in the same group, if group is specified) from the neighbor set.
3. Fits a local model using the remaining neighbors.
4. Predicts the response value for the excluded target observation.
5. Computes prediction errors across all observations.
This leave-one-out style validation provides an estimate of prediction
performance without requiring a separate test set. When tune = TRUE,
the parameter combination (number of neighbors and PLS component range)
yielding the best performance according to metric is selected for
building the final library.
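The NNv loop can be sketched with base R, using Euclidean distances and a neighbourhood mean as a stand-in for the local PLS/WAPLS fits that liblex actually uses (data and k are hypothetical):

```r
set.seed(1)
X <- matrix(rnorm(60 * 5), nrow = 60)   # stand-in predictors
y <- X[, 1] + rnorm(60, sd = 0.1)       # stand-in response
k <- 10
D <- as.matrix(dist(X))                 # pairwise dissimilarities

nnv_errors <- vapply(seq_len(nrow(X)), function(i) {
  nn <- order(D[i, ])[2:(k + 1)]        # k nearest neighbours of anchor i
  target <- nn[1]                       # nearest neighbour: held out
  fit_set <- c(i, nn[-1])               # remaining observations fit the model
  mean(y[fit_set]) - y[target]          # "local model" here: neighbourhood mean
}, numeric(1))

nnv_rmse <- sqrt(mean(nnv_errors^2))
```

Repeating this over a grid of k (and component ranges) and picking the best nnv_rmse is the essence of tune = TRUE.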
Mode and tune combinations
mode | tune | Behavior |
"build" | FALSE | Build library using parameters as provided |
"build" | TRUE | Validate, find optimal parameters, build library |
"validate" | FALSE | Validate only, report performance statistics |
"validate" | TRUE | Validate, report statistics with optimal parameters identified |
Parallel chunk size
The chunk_size parameter controls granularity of parallel work
distribution. When allow_parallel = TRUE and a parallel backend is
registered:
- chunk_size = 1: Each local model is a separate parallel task. Maximum parallelism but higher scheduling overhead.
- chunk_size > 1: Multiple models are processed sequentially within each parallel task. Reduces overhead and improves memory locality, but may cause load imbalance if the number of models is not evenly divisible.
When allow_parallel = FALSE, chunk_size has no effect.
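How chunk_size maps local models to parallel tasks can be sketched as (the model count is hypothetical):

```r
n_models <- 95                       # hypothetical number of local models
chunk_size <- 20

# split model indices into tasks of at most chunk_size models each
tasks <- split(seq_len(n_models), ceiling(seq_len(n_models) / chunk_size))

length(tasks)     # 5 tasks
lengths(tasks)    # 20 20 20 20 15: the last task is smaller (load imbalance)
```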
Value
A list of class liblex_control containing the validated
control parameters.
See Also
liblex
Examples
# Default settings: build library without tuning
liblex_control()
# Tune parameters before building
liblex_control(tune = TRUE)
# Validate only for parameter exploration
liblex_control(mode = "validate", tune = TRUE)
# Include dissimilarity matrix in output
liblex_control(return_dissimilarity = TRUE)
# Larger chunks for reduced parallel overhead
liblex_control(chunk_size = 20L)
# Parallel settings for Linux with OpenBLAS
liblex_control(allow_parallel = TRUE, blas_threads = 1L)
Memory-based learning (mbl)
Description
Memory-based learning (a.k.a. instance-based learning or local regression) is a non-linear lazy learning approach for predicting a response variable from predictor variables. For each observation in a prediction set, a local regression is fitted using a subset of similar observations (nearest neighbors) from a reference set. This function does not produce a global model.
Usage
mbl(Xr, Yr, Xu, Yu = NULL,
neighbors,
diss_method = diss_pca(ncomp = ncomp_by_opc()),
diss_usage = c("none", "predictors", "weights"),
fit_method = fit_wapls(min_ncomp = 3, max_ncomp = 15),
spike = NULL, group = NULL,
gh = FALSE,
control = mbl_control(),
verbose = TRUE, seed = NULL, ...)
## S3 method for class 'mbl'
plot(x, what = c("validation", "gh"), metric = "rmse", ncomp = c(1, 2), ...)
get_predictions(x)
Arguments
Xr |
A matrix of predictor variables for the reference data (observations in rows, variables in columns). Column names are required. |
Yr |
A numeric vector or single-column matrix of response values corresponding to the observations in Xr. |
Xu |
A matrix of predictor variables for the data to be predicted (observations in rows, variables in columns). Must have the same column names as Xr. |
Yu |
An optional numeric vector or single-column matrix of response values corresponding to Xu, used to compute prediction performance statistics. Default is NULL. |
neighbors |
A neighbor selection object specifying how to select neighbors. Use neighbors_k() or neighbors_diss(). |
diss_method |
A dissimilarity method object or a precomputed dissimilarity matrix. Available constructors include diss_pca(), diss_pls(), and diss_correlation(). When a precomputed matrix is passed, it must contain the dissimilarities between the observations in Xr and Xu. |
diss_usage |
How dissimilarity information is used in local models: "none" (dissimilarities are used only for neighbor selection), "predictors" (the local dissimilarity matrix is appended as additional predictors), or "weights" (neighbors are weighted by a tricubic function of their dissimilarity). Default is "none". See Details. |
fit_method |
A local fitting method object. Available constructors: fit_pls(), fit_wapls(), and fit_gpr(). Default is fit_wapls(min_ncomp = 3, max_ncomp = 15). |
spike |
An integer vector indicating indices of observations in Xr that must always be included in (positive indices) or excluded from (negative indices) every neighborhood. Default is NULL. See Details. |
group |
An optional factor assigning group labels to the observations in Xr, used for leave-group-out cross-validation. Default is NULL. See Details. |
gh |
Logical indicating whether to compute global Mahalanobis (GH) distances. Default is FALSE. See Details. |
control |
A list from mbl_control() with control parameters. |
verbose |
Logical indicating whether to display a progress bar. Default is TRUE. |
seed |
An integer for random number generation, enabling reproducible cross-validation results. Default is NULL. |
... |
Additional arguments (currently unused). |
x |
An object of class mbl. |
what |
Character vector specifying what to plot. Options are "validation" and "gh". Default is c("validation", "gh"). |
metric |
Character string specifying which validation statistic to plot. Default is "rmse". |
ncomp |
Integer vector of length 1 or 2 specifying which PLS components to plot. Default is c(1, 2). |
Details
Spiking
The spike argument forces specific reference observations into or out
of neighborhoods. Positive indices are always included; negative indices are
always excluded. When observations are forced in, the most distant neighbors
are displaced to maintain neighborhood size. See Guerrero et al. (2010).
Dissimilarity usage
When diss_usage = "predictors", the local dissimilarity matrix columns
are appended as additional predictor variables, which can improve predictions
(Ramirez-Lopez et al., 2013a).
When diss_usage = "weights", neighbors are weighted using a tricubic
function (Cleveland and Devlin, 1988; Naes et al., 1990):
\(W_j = (1 - v^3)^3\)
where \(v = d(xr_i, xu_j) / \max(d)\).
GH distance
The global Mahalanobis distance (GH) measures how far each observation lies
from the center of the reference set. It is always computed using a PLS
projection with the number of components optimized via
ncomp_by_opc() (maximum 40 components or nrow(Xr),
whichever is smaller). This methodology is fixed and independent of the
diss_method specified for neighbor selection.
GH distances are useful for identifying extrapolation: observations with high GH values lie far from the calibration space and may yield unreliable predictions.
Grouping
The group argument enables leave-group-out cross-validation. When
validation_type = "local_cv" in mbl_control(), the
p parameter refers to the proportion of groups (not observations)
retained per iteration.
Deprecated arguments
The following arguments from previous versions of resemble are no
longer supported and will throw an error if used: k, k_diss,
k_range, method, pc_selection, center,
scale, and documentation. See the current argument list for
their replacements.
Value
mbl
For mbl(), a list of class mbl containing:
- control: control parameters from control
- fit_method: fit constructor from fit_method
- Xu_neighbors: list with neighbor indices and dissimilarities
- dissimilarities: dissimilarity method and matrix (if return_dissimilarity = TRUE in control)
- n_predictions: number of predictions made
- gh: GH distances for Xr and Xu (if gh = TRUE)
- validation_results: validation statistics by method
- results: list of data.frame objects with predictions, one per neighborhood size
- seed: the seed value used
Each results table contains:
- o_index: observation index
- k: number of neighbors used
- k_diss, k_original: (neighbors_diss only) threshold and original count
- ncomp: (fit_pls only) number of PLS components
- min_ncomp, max_ncomp: (fit_wapls only) component range
- yu_obs, pred: observed and predicted values
- yr_min_obs, yr_max_obs: response range in neighborhood
- index_nearest_in_Xr, index_farthest_in_Xr: neighbor indices
- y_nearest, y_farthest: neighbor response values
- diss_nearest, diss_farthest: neighbor dissimilarities
- y_nearest_pred: (NNv validation) leave-one-out prediction
- loc_rmse_cv, loc_st_rmse_cv: (local_cv validation) CV statistics
- loc_ncomp: (local dissimilarity only) components used locally
Get predictions
The get_predictions() function extracts predicted values from an
object of class mbl. It returns a data.frame containing the
predictions.
Author(s)
Leonardo Ramirez-Lopez and Antoine Stevens
References
Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association 83:596-610.
Guerrero, C., Zornoza, R., Gomez, I., Mataix-Beneyto, J. 2010. Spiking of NIR regional models using observations from target sites: Effect of model size on prediction accuracy. Geoderma 158:66-77.
Naes, T., Isaksson, T., Kowalski, B. 1990. Locally weighted regression and scatter correction for near-infrared reflectance data. Analytical Chemistry 62:664-673.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196:268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J.A.M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199:43-53.
Rasmussen, C.E., Williams, C.K. 2006. Gaussian Processes for Machine Learning. MIT Press.
Shenk, J., Westerhaus, M., Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy 5:223-232.
See Also
mbl_control, neighbors_k,
neighbors_diss, diss_pca, diss_pls,
fit_pls, fit_wapls, fit_gpr,
search_neighbors
Examples
## Not run:
library(prospectr)
data(NIRsoil)
# Preprocess: detrend + first derivative with Savitzky-Golay
sg_det <- savitzkyGolay(
detrend(NIRsoil$spc, wav = as.numeric(colnames(NIRsoil$spc))),
m = 1, p = 1, w = 7
)
NIRsoil$spc_pr <- sg_det
# Split data
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]
train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
# Example 1: Spectrum-based learner (Ramirez-Lopez et al., 2013)
ctrl <- mbl_control(validation_type = "NNv")
sbl <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
neighbors = neighbors_k(seq(40, 140, by = 20)),
diss_method = diss_pca(ncomp = ncomp_by_opc(40)),
fit_method = fit_gpr(),
control = ctrl
)
sbl
plot(sbl)
get_predictions(sbl)
# Example 2: With known Yu
sbl_2 <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
neighbors = neighbors_k(seq(40, 140, by = 20)),
fit_method = fit_gpr(),
control = ctrl
)
plot(sbl_2)
# Example 3: LOCAL algorithm (Shenk et al., 1997)
local_algo <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
neighbors = neighbors_k(seq(40, 140, by = 20)),
diss_method = diss_correlation(),
diss_usage = "none",
fit_method = fit_wapls(min_ncomp = 3, max_ncomp = 15),
control = ctrl
)
plot(local_algo)
# Example 4: Using dissimilarity as predictors
local_algo_2 <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
neighbors = neighbors_k(seq(40, 140, by = 20)),
diss_method = diss_pca(ncomp = ncomp_by_opc(40)),
diss_usage = "predictors",
fit_method = fit_wapls(min_ncomp = 3, max_ncomp = 15),
control = ctrl
)
plot(local_algo_2)
# Example 5: Parallel execution
library(doParallel)
n_cores <- min(2, parallel::detectCores() - 1)
clust <- makeCluster(n_cores)
registerDoParallel(clust)
local_algo_par <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
neighbors = neighbors_k(seq(40, 140, by = 20)),
diss_method = diss_correlation(),
fit_method = fit_wapls(min_ncomp = 3, max_ncomp = 15),
control = ctrl
)
registerDoSEQ()
try(stopCluster(clust))
## End(Not run)
Control parameters for memory-based learning
Description
This function controls various aspects of the memory-based learning process
in the mbl function.
Usage
mbl_control(
return_dissimilarity = FALSE,
validation_type = "NNv",
tune_locally = TRUE,
number = 10,
p = 0.75,
range_prediction_limits = TRUE,
allow_parallel = TRUE,
blas_threads = 1L
)
Arguments
return_dissimilarity |
Logical indicating whether to return the dissimilarity matrix between Xr and Xu. Default is FALSE. |
validation_type |
Character vector specifying validation method(s): "NNv" (default), "local_cv", or "none". Multiple methods can be specified (e.g., c("NNv", "local_cv")). See Details. |
tune_locally |
Logical indicating whether to tune PLS components locally when validation_type includes "local_cv". Default is TRUE. |
number |
Integer specifying the number of sampling iterations for "local_cv". Default is 10. |
p |
Numeric value between 0 and 1 indicating the proportion of observations retained at each "local_cv" sampling iteration. Default is 0.75. |
range_prediction_limits |
Logical indicating whether predictions should be constrained to the range of response values in each neighborhood. Default is TRUE. |
allow_parallel |
Logical indicating whether parallel execution is allowed via the foreach package. Default is TRUE. |
blas_threads |
Integer specifying the number of BLAS threads to use during mbl() computations. Default is 1. See Details. |
Details
Validation methods
Leave-nearest-neighbor-out cross-validation ("NNv"):
For each target observation, the nearest neighbor is excluded from the
local model, which then predicts that neighbor's value. This is faster
than "local_cv". If the nearest neighbor belongs to a group
(specified via the group argument in mbl), all
group members are excluded.
Local leave-group-out cross-validation ("local_cv"):
The neighborhood is partitioned into subsets via stratified random
sampling. Each subset serves as validation data while the remainder
fits the model. This repeats number times, with p
controlling the training proportion. The final error is the average
local RMSE.
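A compact base-R sketch of the local_cv resampling within a single neighbourhood (plain random rather than stratified sampling, and a neighbourhood mean instead of the local regression fit, for brevity):

```r
set.seed(42)
y_nbr <- rnorm(50)                   # responses within one neighbourhood
number <- 10                         # sampling iterations
p <- 0.75                            # training proportion per iteration

rmse <- replicate(number, {
  train <- sample(length(y_nbr), size = floor(p * length(y_nbr)))
  held <- setdiff(seq_along(y_nbr), train)
  pred <- mean(y_nbr[train])         # stand-in for the local regression fit
  sqrt(mean((y_nbr[held] - pred)^2))
})

local_error <- mean(rmse)            # final error: average local RMSE
```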
BLAS threading
On Linux systems with multi-threaded OpenBLAS, the default thread count
can cause significant overhead for algorithms like mbl() that
perform many small matrix operations. Setting blas_threads = 1
(the default) eliminates this overhead.
This setting requires the RhpcBLASctl package. If not installed,
the parameter is ignored and a message is displayed. The original
thread count is restored when mbl() completes.
Windows systems typically use single-threaded BLAS by default, so this setting has no effect there.
Value
A list of class mbl_control with the specified control parameters.
Author(s)
Leonardo Ramirez-Lopez and Antoine Stevens
References
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196:268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J.A.M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199:43-53.
See Also
mbl, neighbors_k,
neighbors_diss
Examples
# Default control parameters (NNv validation)
mbl_control()
# Both validation methods
mbl_control(validation_type = c("NNv", "local_cv"))
# No validation
mbl_control(validation_type = "none")
# NNv validation only, no parallel
mbl_control(validation_type = "NNv", allow_parallel = FALSE)
# Allow more BLAS threads (if needed for other computations)
mbl_control(blas_threads = 4)
Global spectral calibration model
Description
Fits a global calibration model for spectral data using partial least
squares (PLS) or Gaussian process regression (GPR). Unlike
mbl, which builds local models for each prediction,
model() fits a single global model to the reference data.
Usage
model(Xr, Yr, fit_method, control = model_control(), verbose = TRUE)
## S3 method for class 'resemble_model'
predict(object, newdata, ncomp = NULL, ...)
Arguments
Xr |
A numeric matrix of predictor variables, with observations in rows and variables in columns. Typically spectral data. |
Yr |
A numeric matrix with one column containing the response variable
values corresponding to the observations in |
fit_method |
An object created by |
control |
A list created by |
verbose |
Logical indicating whether progress information should be
printed. Default is |
object |
A |
newdata |
A numeric matrix of new observations with the same number of columns as the training data. |
ncomp |
For PLS models, the number of components to use for prediction. The default is the number used during fitting. Ignored for GPR models. This can also be a vector of integers, in which case predictions are returned for the model with each requested number of components. |
... |
Additional arguments, currently unused. |
Details
model() provides a straightforward interface for fitting global
calibration models, in contrast to the local modelling approach implemented
in mbl. This is useful when:
the relationship between spectra and the response is approximately linear across the full dataset
a single portable model is needed for deployment
computational efficiency is prioritised over predictive performance in heterogeneous datasets
Cross-validation
When validation_type = "lgo", stratified random sampling is used
to create training-validation splits that preserve the distribution of the
response variable. Cross-validation results include RMSE, R², and
standardised RMSE for each component in PLS models, or overall in GPR
models.
PLS models
When fit_pls() is used, the function fits a PLS model with the
specified number of components. If cross-validation is enabled, the optimal
number of components is selected based on the minimum RMSE.
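Selecting the optimal number of components from per-component cross-validated RMSE values can be illustrated as follows (the cv_rmse values here are made up):

```r
# Pick the component count with the minimum cross-validated RMSE.
cv_rmse <- c(`1` = 1.9, `2` = 1.4, `3` = 1.1, `4` = 1.15, `5` = 1.2)
best_ncomp <- as.integer(names(which.min(cv_rmse)))
best_ncomp  # 3
```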
GPR models
When fit_gpr() is used, the function fits a Gaussian process
regression model with a dot-product covariance function. The noise variance
parameter controls regularisation.
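A dot-product GPR prediction can be sketched as below. This is the textbook form; the exact internal treatment (e.g. centring and scaling) is handled by fit_gpr() and is not reproduced here:

```r
# GPR with a linear (dot-product) kernel; noise_variance regularizes the solve.
gpr_dot_predict <- function(X, y, Xnew, noise_variance = 0.001) {
  K <- tcrossprod(X)                                   # n x n kernel matrix
  alpha <- solve(K + noise_variance * diag(nrow(X)), y)
  tcrossprod(Xnew, X) %*% alpha                        # k(new, train) %*% alpha
}
```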
Value
For model(), an object of class resemble_model containing:
- fit_method: Fitting method object used.
- control: Model control object used.
- model: Fitted model object.
- cv_results: Cross-validation results, if validation_type = "lgo".
- n_obs: Number of observations used.
- n_vars: Number of predictor variables.
For predict.resemble_model(), a numeric matrix of predictions. For
PLS models, columns correspond to different numbers of components.
Author(s)
References
de Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3), 251–263.
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
Shenk, J. S., & Westerhaus, M. O. (1991). Population structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548–1555.
See Also
fit_pls and fit_gpr for specifying the fitting
method;
model_control for controlling aspects of the modelling
process;
mbl for memory-based (local) learning.
Examples
library(prospectr)
data(NIRsoil)
# Preprocess spectra
Xr <- savitzkyGolay(NIRsoil$spc, m = 1, p = 2, w = 11)
Yr <- NIRsoil$CEC
# Remove missing values
ok <- !is.na(Yr)
Xr <- Xr[ok, ]
Yr <- as.matrix(Yr[ok])
# Fit a PLS model with 10 components and cross-validation
# Scaling is controlled via fit_pls()
pls_mod <- model(
Xr = Xr,
Yr = Yr,
fit_method = fit_pls(ncomp = 10, scale = FALSE),
control = model_control(validation_type = "lgo", number = 10)
)
# View cross-validation results
pls_mod$cv_results
# Fit a GPR model (centring/scaling controlled via fit_gpr())
gpr_mod <- model(
Xr = Xr,
Yr = Yr,
fit_method = fit_gpr(noise_variance = 0.001, scale = TRUE),
control = model_control(validation_type = "lgo")
)
Control parameters for global model fitting
Description
Specifies cross-validation settings for the model function.
Usage
model_control(validation_type = c("lgo", "none"), number = 10L, p = 0.75)
Arguments
validation_type |
a character string specifying the validation method:
|
number |
an integer indicating the number of cross-validation
iterations. Only used when |
p |
a numeric value between 0 and 1 indicating the proportion of
observations to retain for training at each cross-validation iteration.
Only used when |
Value
A list of class "model_control" containing the specified
parameters.
Author(s)
See Also
Examples
# Default settings (leave-group-out CV with 10 iterations)
model_control()
# No cross-validation
model_control(validation_type = "none")
# Custom CV settings
model_control(validation_type = "lgo", number = 20, p = 0.80)
Component selection methods
Description
Constructor functions for specifying how to select the number of components
in projection-based dissimilarity methods (diss_pca(), diss_pls()).
Usage
ncomp_by_var(min_var = 0.01, max_ncomp = 40L)
ncomp_by_cumvar(min_cumvar = 0.99, max_ncomp = 40L)
ncomp_by_opc(max_ncomp = 40L)
ncomp_fixed(ncomp)
Arguments
min_var |
Numeric in (0, 1]. Minimum variance a single component must explain to be retained. |
max_ncomp |
Positive integer. Maximum number of components to compute or evaluate. |
min_cumvar |
Numeric in (0, 1]. Minimum cumulative variance that the retained components must explain. |
ncomp |
Positive integer. Exact number of components to use. |
Details
Four selection methods are available:
ncomp_by_var(): Retains components that individually explain at least min_var proportion of variance.
ncomp_by_cumvar(): Retains the minimum number of components whose combined explained variance reaches min_cumvar.
ncomp_by_opc(): Optimized principal component selection based on side information (Ramirez-Lopez et al., 2013). The optimal number of components minimizes the RMSD between each observation's response and its nearest neighbor's response in the projected space. Requires Yr.
ncomp_fixed(): Uses exactly ncomp components with no automatic selection. Equivalent to passing an integer directly.
At runtime, max_ncomp is capped at min(max_ncomp, nrow(X), ncol(X)).
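The cumulative-variance rule (ncomp_by_cumvar()) can be illustrated with prcomp(); this mirrors the selection logic, not the package internals:

```r
set.seed(42)
x <- matrix(rnorm(100 * 8), 100, 8)
pc <- prcomp(x, center = TRUE)
expl <- pc$sdev^2 / sum(pc$sdev^2)       # per-component explained variance
ncomp <- which(cumsum(expl) >= 0.99)[1]  # smallest ncomp reaching 99%
```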
Value
An object of class "ncomp_selection" with a subclass indicating the
method:
- ncomp_by_var: class c("ncomp_by_var", "ncomp_selection")
- ncomp_by_cumvar: class c("ncomp_by_cumvar", "ncomp_selection")
- ncomp_by_opc: class c("ncomp_by_opc", "ncomp_selection")
- ncomp_fixed: class c("ncomp_fixed", "ncomp_selection")
Author(s)
References
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.
See Also
diss_pca(), diss_pls(), dissimilarity()
Examples
# Retain components explaining >= 1% variance each
ncomp_by_var(0.01)
# Retain enough components for 99% cumulative variance
ncomp_by_cumvar(0.99)
# Optimize using side information (requires Yr)
ncomp_by_opc(max_ncomp = 40)
# Fix at exactly 10 components
ncomp_fixed(10)
# Usage in dissimilarity constructors
diss_pca(ncomp = ncomp_by_var(0.01))
diss_pca(ncomp = ncomp_by_opc())
diss_pca(ncomp = 10)
Neighbor selection methods
Description
These functions create configuration objects that specify how neighbors
are selected for memory-based learning in mbl and liblex.
Usage
neighbors_k(k)
neighbors_diss(threshold, k_min = 4L, k_max = Inf)
Arguments
k |
Integer vector. One or more neighborhood sizes to evaluate. Values will be sorted in ascending order. Minimum allowed value is 4. |
threshold |
Numeric vector. One or more dissimilarity thresholds. Neighbors are selected if their dissimilarity to the target observation is below the threshold. Values will be sorted in ascending order. |
k_min |
Integer. Minimum number of neighbors to retain, regardless of
threshold. Default |
k_max |
Integer or |
Details
Two strategies are available for neighbor selection:
Fixed-k selection (neighbors_k)
A fixed number of nearest neighbors is selected for each target observation.
Multiple values of k can be provided to evaluate different
neighborhood sizes.
Dissimilarity-threshold selection (neighbors_diss)
Neighbors are selected based on a dissimilarity threshold. All reference
observations with dissimilarity below the threshold are included. The
k_min and k_max arguments provide bounds to ensure a reasonable
neighborhood size regardless of the threshold. Multiple thresholds can be
provided to evaluate different settings.
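The interplay of threshold, k_min and k_max can be traced by hand (illustrative logic, consistent with the description above):

```r
# Sorted dissimilarities from a target to its candidate neighbors.
d <- c(0.05, 0.12, 0.18, 0.25, 0.31, 0.40)
threshold <- 0.2; k_min <- 4; k_max <- 5
k <- sum(d < threshold)           # 3 neighbors fall below the threshold
k <- min(max(k, k_min), k_max)    # bounded to [k_min, k_max] -> 4
selected <- order(d)[seq_len(k)]  # indices of the 4 nearest neighbors
```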
Value
An object of class c("neighbors_k", "neighbors") or
c("neighbors_diss", "neighbors"), containing the validated
parameters. Intended to be passed to mbl.
See Also
Examples
# Fixed neighborhood sizes
neighbors_k(k = 50)
neighbors_k(k = c(40, 60, 80, 100, 120))
# Dissimilarity threshold with default bounds
neighbors_diss(threshold = 0.3)
# Dissimilarity threshold with custom bounds
neighbors_diss(threshold = c(0.1, 0.2, 0.3), k_min = 10, k_max = 150)
Orthogonal projections
Description
Performs orthogonal projections of high-dimensional data matrices using principal component analysis (PCA) or partial least squares (PLS).
Usage
ortho_projection(
Xr, Xu = NULL, Yr = NULL,
ncomp = ncomp_by_var(0.01),
method = c("pca", "pca_nipals", "pls", "mpls", "simpls"),
center = TRUE,
scale = FALSE,
tol = 1e-6,
max_iter = 1000L,
pc_selection = deprecated(),
...
)
## S3 method for class 'ortho_projection'
predict(object, newdata, ...)
## S3 method for class 'ortho_projection'
plot(x, col = "#3B82F6", ...)
Arguments
Xr |
A numeric matrix of reference observations (rows) and variables (columns). |
Xu |
An optional matrix of additional observations to project. |
Yr |
An optional response matrix. Required for PLS methods
( |
ncomp |
Component selection method. Either:
Default is |
method |
A character string specifying the projection method:
|
center |
A logical indicating whether to center the data. Default is
|
scale |
A logical indicating whether to scale the data to unit
variance. Default is |
tol |
Convergence tolerance for the NIPALS algorithm. Default is
|
max_iter |
Maximum number of iterations for NIPALS. Default is
|
pc_selection |
|
... |
Additional arguments (currently unused). |
object |
Object of class |
newdata |
Matrix of new observations to project. |
x |
An object of class |
col |
Color for the plot elements. Default is |
Details
PCA methods
When method = "pca", singular value decomposition factorizes the
data matrix \(X\) as:
\[X = U D V^{T}\]
where \(U\) and \(V\) are orthogonal matrices (left and right singular vectors), and \(D\) is a diagonal matrix of singular values. The score matrix is \(UD\) and the loadings are \(V\).
When method = "pca_nipals", the non-linear iterative partial least
squares (NIPALS) algorithm is used instead.
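A one-component NIPALS iteration can be sketched as follows (textbook form for illustration; the package's own computation is done in compiled code):

```r
# NIPALS for the first principal component: alternate score/loading updates.
nipals_pc1 <- function(X, tol = 1e-6, max_iter = 1000L) {
  X <- scale(X, center = TRUE, scale = FALSE)
  t_scores <- X[, 1]                               # initial score vector
  for (i in seq_len(max_iter)) {
    p <- crossprod(X, t_scores) / sum(t_scores^2)  # project scores -> loadings
    p <- p / sqrt(sum(p^2))                        # normalize loadings
    t_new <- drop(X %*% p)                         # project loadings -> scores
    if (sum((t_new - t_scores)^2) < tol) {
      t_scores <- t_new
      break
    }
    t_scores <- t_new
  }
  list(scores = t_scores, loadings = drop(p))
}
```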
PLS methods
Three PLS variants are available:
- "pls": Standard PLS using the NIPALS algorithm with covariance-based weights.
- "mpls": Modified PLS using the NIPALS algorithm with correlation-based weights, giving equal influence to all predictors regardless of variance (Shenk and Westerhaus, 1991).
- "simpls": SIMPLS algorithm (de Jong, 1993), which deflates the cross-product matrix rather than X itself. Computationally faster than NIPALS, especially for wide matrices.
Component selection
When ncomp_by_opc() is used, component selection minimizes
RMSD (for continuous Yr) or maximizes kappa (for categorical
Yr) between observations and their nearest neighbors. See
diss_evaluate.
Value
An object of class "ortho_projection" containing:
- scores: Matrix of projected scores for Xr (and Xu).
- X_loadings: Matrix of X loadings.
- Y_loadings: Matrix of Y loadings (PLS only).
- weights: Matrix of PLS weights (PLS only).
- projection_mat: Projection matrix for new data (PLS only).
- variance: List with original and explained variance.
- scores_sd: Standard deviation of scores.
- ncomp: Number of components retained.
- center: Centering vector used.
- scale: Scaling vector used.
- method: Projection method used.
- ncomp_method: The value passed to the ncomp argument.
- opc_evaluation: opc optimization results (if applicable).
Author(s)
References
de Jong, S. 1993. SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems 18:251-263.
Martens, H. 1991. Multivariate calibration. John Wiley & Sons.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196:268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J.A.M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199:43-53.
Shenk, J.S., Westerhaus, M.O. 1991. Population structuring of near infrared spectra and modified partial least squares regression. Crop Science 31:1548-1555.
See Also
ncomp_by_var, ncomp_by_opc,
diss_evaluate, mbl
Examples
library(prospectr)
data(NIRsoil)
# Preprocess
sg_det <- savitzkyGolay(
detrend(NIRsoil$spc, wav = as.numeric(colnames(NIRsoil$spc))),
m = 1, p = 1, w = 7
)
# Split data
train_x <- sg_det[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]
train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
test_x <- sg_det[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
# PCA with fixed components
proj <- ortho_projection(train_x, ncomp = 5)
plot(proj)
# PCA with variance-based selection
proj <- ortho_projection(train_x, ncomp = ncomp_by_var(0.01))
# PCA with OPC optimization
proj <- ortho_projection(train_x, Xu = test_x, Yr = train_y,
ncomp = ncomp_by_opc(40))
plot(proj)
# PLS projection (NIPALS)
proj <- ortho_projection(train_x, Xu = test_x, Yr = train_y,
method = "pls", ncomp = ncomp_by_opc(40))
# Modified PLS
proj <- ortho_projection(train_x, Yr = train_y,
method = "mpls", ncomp = 10)
# SIMPLS (faster for wide matrices)
proj <- ortho_projection(train_x, Yr = train_y,
method = "simpls", ncomp = 10)
Get the package version info
Description
Returns the package version information.
Usage
pkg_info(pkg = "resemble")
Arguments
pkg |
the package name, i.e. "resemble". |
Print method for an object of class ortho_projection
Description
Prints the contents of an object of class ortho_projection
Usage
## S3 method for class 'ortho_projection'
print(x, ...)
Arguments
x |
an object of class |
... |
arguments to be passed to methods (not yet functional). |
Author(s)
Leonardo Ramirez-Lopez
Search neighbors in a reference set
Description
Searches for the nearest neighbors of observations in a reference set or between two sets of observations.
Usage
search_neighbors(Xr, Xu = NULL,
diss_method = diss_pca(), Yr = NULL,
neighbors, spike = NULL,
return_dissimilarity = FALSE,
k, k_diss, k_range, pc_selection,
center, scale, documentation, ...
)
Arguments
Xr |
A numeric matrix of reference observations (rows) and variables (columns) where the neighbor search is conducted. |
Xu |
Optional matrix of observations for which neighbors are to be
searched in |
diss_method |
A dissimilarity method object created by one of:
Default is |
Yr |
Optional response matrix. Required for PLS methods and when using
|
neighbors |
A neighbor selection object created by:
|
spike |
Optional integer vector indicating observations in |
return_dissimilarity |
Logical indicating whether to return the
dissimilarity matrix. Default is |
k |
Deprecated. |
k_diss |
Deprecated. |
k_range |
Deprecated. |
pc_selection |
Deprecated. |
center |
Deprecated. |
scale |
Deprecated. |
documentation |
Deprecated. |
... |
Additional arguments (currently unused). |
Details
This function is useful for reducing large reference sets by identifying
only relevant neighbors before running mbl.
If Xu is not provided, the function searches for neighbors within
Xr itself (excluding self-matches). If Xu is provided,
neighbors of each observation in Xu are searched in Xr.
The spike argument allows forcing specific observations into or out
of all neighborhoods. Positive indices are always included; negative indices
are always excluded.
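The sign convention for spike can be illustrated directly (hypothetical indices):

```r
# Positive indices are forced into every neighborhood; negative are excluded.
spike <- c(3, 15, -7)
always_in  <- spike[spike > 0]   # observations 3 and 15
always_out <- -spike[spike < 0]  # observation 7
```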
Value
A list containing:
- neighbors: Matrix of Xr indices for each query observation's neighbors, sorted by dissimilarity (columns = query observations).
- neighbors_diss: Matrix of dissimilarity scores corresponding to neighbors.
- unique_neighbors: Vector of unique Xr indices that appear in any neighborhood.
- k_diss_info: If neighbors_diss() was used, a data.frame with columns for observation index, number of neighbors found, and final number after applying bounds.
- dissimilarity: If return_dissimilarity = TRUE, the full dissimilarity object.
- projection: If the dissimilarity method includes return_projection = TRUE, the projection object.
- gh: If the dissimilarity method includes gh = TRUE, the GH distances.
Author(s)
References
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J.A.M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
See Also
dissimilarity, mbl,
neighbors_k, neighbors_diss
Examples
library(prospectr)
data(NIRsoil)
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu)]
Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr)]
# Correlation-based neighbor search with k neighbors
ex1 <- search_neighbors(
Xr = Xr, Xu = Xu,
diss_method = diss_correlation(),
neighbors = neighbors_k(40)
)
# PCA-based with OPC selection
ex2 <- search_neighbors(
Xr = Xr, Xu = Xu,
diss_method = diss_pca(
ncomp = ncomp_by_opc(40),
scale = TRUE,
return_projection = TRUE
),
Yr = Yr,
neighbors = neighbors_k(50)
)
# Observations not in any neighborhood
setdiff(seq_len(nrow(Xr)), ex2$unique_neighbors)
# Dissimilarity threshold-based selection
ex3 <- search_neighbors(
Xr = Xr, Xu = Xu,
diss_method = diss_pls(
ncomp = ncomp_by_opc(40),
scale = TRUE
),
Yr = Yr,
neighbors = neighbors_diss(threshold = 0.5, k_min = 10, k_max = 100)
)
A function for computing the spectral information divergence between spectra (sid)
Description
This function computes the spectral information divergence/dissimilarity between spectra based on the Kullback-Leibler divergence algorithm (see details).
Usage
sid(Xr, Xu = NULL,
mode = "density",
center = FALSE, scale = FALSE,
kernel = "gaussian",
n = if(mode == "density") round(0.5 * ncol(Xr)),
bw = "nrd0",
reg = 1e-04,
...)
Arguments
Xr |
a matrix containing the spectral (reference) data. |
Xu |
an optional matrix containing the spectral data of a second set of observations. |
mode |
the method to be used for computing the spectral information
divergence. Options are |
center |
a logical indicating if the computations must be carried out on
the centred |
scale |
a logical indicating if the computations must be carried out on
the variance scaled |
kernel |
if |
n |
if |
bw |
if |
reg |
a numerical value larger than 0 which indicates a regularization parameter. Values (probabilities) below this threshold are replaced by this value for numerical stability. Default is 1e-4. |
... |
additional arguments to be passed to the
|
Details
This function computes the spectral information divergence (distance)
between spectra.
When mode = "density", the function first computes the probability
distribution of each spectrum, which results in a matrix of density
distribution estimates. The density distributions of all the observations in
the datasets are compared based on the Kullback-Leibler divergence algorithm.
When mode = "feature", the Kullback-Leibler divergence between all
the observations is computed directly on the spectral variables.
The spectral information divergence (SID) algorithm (Chang, 2000) uses the
Kullback-Leibler divergence (\(KL\)) or relative entropy
(Kullback and Leibler, 1951) to account for the vis-NIR information provided
by each spectrum. The SID between two spectra (\(x_{i}\) and
\(x_{j}\)) is computed as follows:
\[SID(x_i, x_j) = KL(x_i \parallel x_j) + KL(x_j \parallel x_i)\] \[SID(x_i, x_j) = \sum_{l=1}^{k} p_l \log\left(\frac{p_l}{q_l}\right) + \sum_{l=1}^{k} q_l \log\left(\frac{q_l}{p_l}\right)\]
where \(k\) represents the number of variables or spectral features, \(p\) and \(q\) are the probability vectors of \(x_{i}\) and \(x_{j}\) respectively, which are calculated as:
\[p = \frac{x_i}{\sum_{l=1}^{k} x_{i,l}}\] \[q = \frac{x_j}{\sum_{l=1}^{k} x_{j,l}}\]
From the above equations it can be seen that the original SID algorithm
assumes that all the components in the data matrices are nonnegative.
Therefore centering cannot be applied when mode = "feature". If a
data matrix with negative values is provided and mode = "feature",
the sid function automatically scales the matrix as follows:
\[X_{s} = \frac{X-min(X)}{max(X)-min(X)}\]
or
\[X_{s} = \frac{X-min(X, Xu)}{max(X, Xu)-min(X, Xu)}\] \[Xu_{s} = \frac{Xu-min(X, Xu)}{max(X, Xu)-min(X, Xu)}\]
if Xu is specified. The 0 values are replaced by a regularization
parameter (reg argument) for numerical stability.
The default of the sid function is to compute the SID based on the
density distributions of the spectra (mode = "density"). For each
spectrum in X the density distribution is computed using the
density function of the stats package.
The 0 values of the estimated density distributions of the spectra are
replaced by a regularization parameter ("reg" argument) for numerical
stability. Finally, the divergence between the computed spectral histograms
is computed using the SID algorithm. Note that if mode = "density",
the sid function will accept negative values and matrix centering
will be possible.
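A minimal feature-mode SID between two nonnegative spectra, following the equations above (illustrative; the package computes the full pairwise matrix in compiled code):

```r
sid_pair <- function(xi, xj, reg = 1e-4) {
  p <- xi / sum(xi)                          # probability vector of spectrum i
  q <- xj / sum(xj)                          # probability vector of spectrum j
  p[p < reg] <- reg                          # regularize small probabilities
  q[q < reg] <- reg
  sum(p * log(p / q)) + sum(q * log(q / p))  # KL(p||q) + KL(q||p)
}
```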
Value
a list with the following components:
- sid: if only Xr is specified (i.e. Xu = NULL), a square symmetric matrix of SID distances between all the observations in Xr. If both Xr and Xu are specified, a matrix of SID distances between the observations in Xr and the observations in Xu, where the rows represent the objects in Xr and the columns represent the objects in Xu.
- Xr: the (centered and/or scaled, if specified) spectral Xr matrix.
- Xu: the (centered and/or scaled, if specified) spectral Xu matrix.
- densityDisXr: if mode = "density", the computed density distributions of Xr.
- densityDisXu: if mode = "density", the computed density distributions of Xu.
Author(s)
Leonardo Ramirez-Lopez
References
Chang, C.I. 2000. An information theoretic-based approach to spectral variability, similarity and discriminability for hyperspectral image analysis. IEEE Transactions on Information Theory 46, 1927-1932.
See Also
Examples
library(prospectr)
data(NIRsoil)
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
Xu <- Xu[!is.na(Yu), ]
Xr <- Xr[!is.na(Yr), ]
# Example 1
# Compute the SID distance between all the observations in Xr
xr_sid <- sid(Xr)
xr_sid
# Example 2
# Compute the SID distance between the observations in Xr and the observations
# in Xu
xr_xu_sid <- sid(Xr, Xu)
xr_xu_sid