| Type: | Package |
| Title: | Methods and Reproducible Workflows for Partial Least Squares with Missing Data |
| Version: | 0.2.0 |
| Date: | 2026-04-07 |
| Depends: | R (≥ 4.1.0) |
| Imports: | mice, plsRglm, stats, utils, VIM |
| Suggests: | bcv, knitr, mlbench, plsdof, rmarkdown, testthat (≥ 3.0.0) |
| Author: | Titin Agustin Nengsih [aut], Frederic Bertrand [aut, cre], Myriam Maumy-Bertrand [aut] |
| Maintainer: | Frederic Bertrand <frederic.bertrand@lecnam.net> |
| Description: | Methods-first tooling for reproducing and extending the partial least squares regression studies on incomplete data described in Nengsih et al. (2019) <doi:10.1515/sagmb-2018-0059>. The package provides simulation helpers, missingness generators, imputation wrappers, component-selection utilities, real-data diagnostics, and reproducible study orchestration for Nonlinear Iterative Partial Least Squares (NIPALS)-Partial Least Squares (PLS) workflows. |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| LazyData: | true |
| VignetteBuilder: | knitr |
| Config/testthat/edition: | 3 |
| RoxygenNote: | 7.3.3 |
| URL: | https://fbertran.github.io/missPLS/, https://github.com/fbertran/missPLS |
| BugReports: | https://github.com/fbertran/missPLS/issues |
| NeedsCompilation: | no |
| Packaged: | 2026-04-08 09:17:07 UTC; bertran7 |
| Repository: | CRAN |
| Date/Publication: | 2026-04-13 15:00:02 UTC |
missPLS: Methods and Reproducible Workflows for Partial Least Squares with Missing Data
Description
Methods-first tooling for reproducing and extending the partial least squares regression studies on incomplete data described in Nengsih et al. (2019) doi:10.1515/sagmb-2018-0059. The package provides simulation helpers, missingness generators, imputation wrappers, component-selection utilities, real-data diagnostics, and reproducible study orchestration for Nonlinear Iterative Partial Least Squares (NIPALS)-Partial Least Squares (PLS) workflows.
Author(s)
Maintainer: Frederic Bertrand <frederic.bertrand@lecnam.net>
Authors:
Titin Agustin Nengsih
Myriam Maumy-Bertrand
See Also
Useful links:
Report bugs at https://github.com/fbertran/missPLS/issues
Add missing values to a predictor matrix
Description
Create MCAR or MAR missingness on the predictor matrix x. Missingness is
generated column-wise so that each predictor receives approximately the same
missing-data proportion, matching the simulation strategy used in the
original work.
Usage
add_missingness(
x,
y,
mechanism = c("MCAR", "MAR"),
missing_prop,
seed = NULL,
mar_y_bias = 0.8
)
Arguments
x |
Predictor matrix or data frame. |
y |
Numeric response vector. |
mechanism |
Missingness mechanism: "MCAR" or "MAR". |
missing_prop |
Missing-data proportion as a fraction ( |
seed |
Optional random seed. If supplied, it is used only for this call. |
mar_y_bias |
Proportion of missing values assigned to the upper
half of the observed |
Value
A list with components x_incomplete, missing_mask,
missing_prop, mechanism, and seed.
Examples
sim <- simulate_pls_data(n = 20, p = 10, true_ncomp = 2, seed = 1)
miss <- add_missingness(sim$x, sim$y, mechanism = "MCAR", missing_prop = 10, seed = 2)
mean(is.na(miss$x_incomplete))
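The column-wise strategy described above can be illustrated with a small base-R sketch. It is not the package's implementation: the helper name add_na_columnwise and its arguments are hypothetical, and the MAR branch assumes that mar_y_bias is the share of each column's missing cells placed in rows whose response lies above the median of y.

```r
# Hypothetical helper (not part of missPLS): remove the same number of cells
# from every column of x, either at random (MCAR) or depending on y (MAR).
add_na_columnwise <- function(x, y, prop, mechanism = c("MCAR", "MAR"),
                              y_bias = 0.8) {
  mechanism <- match.arg(mechanism)
  x <- as.matrix(x)
  n <- nrow(x)
  n_miss <- round(prop * n)             # cells removed per column
  upper <- which(y > stats::median(y))  # rows in the upper half of y
  lower <- setdiff(seq_len(n), upper)
  for (j in seq_len(ncol(x))) {
    if (mechanism == "MCAR") {
      rows <- sample.int(n, n_miss)
    } else {
      # Assumption: a share y_bias of the missing cells goes to the upper half
      n_up <- min(round(y_bias * n_miss), length(upper))
      n_lo <- min(n_miss - n_up, length(lower))
      rows <- c(upper[sample.int(length(upper), n_up)],
                lower[sample.int(length(lower), n_lo)])
    }
    x[rows, j] <- NA
  }
  x
}

set.seed(1)
x <- matrix(rnorm(200), nrow = 20, ncol = 10)
y <- rnorm(20)
x_mcar <- add_na_columnwise(x, y, prop = 0.10, mechanism = "MCAR")
x_mar  <- add_na_columnwise(x, y, prop = 0.30, mechanism = "MAR")
colMeans(is.na(x_mcar))  # 0.10 in every column
```

Because the count of removed cells is fixed per column, every predictor ends up with the same missing-data proportion, which a single draw over the whole matrix would not guarantee.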
Bromhexine dataset
Description
Bromhexine in pharmaceutical syrup used in the article and thesis.
Usage
bromhexine
Format
A misspls_dataset list with components:
- name
Dataset name.
- x
A numeric 23 x 64 predictor matrix.
- y
A numeric response vector of length 23.
- data
A data frame with response y and predictors x1 to x64.
- source
A short source reference.
- preprocessing
Dataset preprocessing notes.
- notes
Additional study notes.
Source
Goicoechea and Olivieri (1999a), calibration and test files bundled in
extra_docs/pls_data.
Diagnose a real dataset
Description
Compute correlation summaries and VIF-style diagnostics for a packaged real dataset.
Usage
diagnose_real_data(dataset, cor_threshold = 0.7)
Arguments
dataset |
A packaged dataset name or |
cor_threshold |
Absolute-correlation threshold used when reporting predictor pairs and predictor-response associations. |
Value
A list with correlation and VIF summaries.
Examples
diag_bromhexine <- diagnose_real_data("bromhexine")
names(diag_bromhexine)
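For readers unfamiliar with these diagnostics, the following base-R sketch shows one common way to compute thresholded correlation pairs and variance inflation factors on a small synthetic matrix. It is only an illustration: the data and all object names are made up, and the exact quantities returned by diagnose_real_data may differ. In particular, the matrix inversion used here requires more observations than predictors, which is not the case for the packaged spectral datasets.

```r
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly collinear with x1
x3 <- rnorm(n)
x  <- cbind(x1 = x1, x2 = x2, x3 = x3)

cor_threshold <- 0.7
r <- cor(x)

# Predictor pairs whose absolute correlation exceeds the threshold
idx <- which(abs(r) > cor_threshold & upper.tri(r), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  cor  = r[idx]
)

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of the
# inverse correlation matrix when that matrix is invertible
vif <- diag(solve(r))

high_pairs
vif
```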
Impute a predictor matrix
Description
Apply one of the imputation strategies used in the article and thesis.
Usage
impute_pls_data(
x,
method = c("mice", "knn", "svd"),
seed = NULL,
m,
k = 15L,
svd_rank = 10L,
svd_maxiter = 1000L
)
Arguments
x |
Incomplete predictor matrix or data frame. |
method |
Imputation method: "mice", "knn", or "svd". |
seed |
Optional random seed forwarded to stochastic imputers when supported. |
m |
Number of imputations for |
k |
Number of neighbours for |
svd_rank |
Target rank for |
svd_maxiter |
Maximum number of iterations for the fallback SVD routine. |
Value
A misspls_imputation object.
Examples
sim <- simulate_pls_data(n = 20, p = 10, true_ncomp = 2, seed = 1)
miss <- add_missingness(sim$x, sim$y, mechanism = "MCAR", missing_prop = 10, seed = 2)
imp <- impute_pls_data(miss$x_incomplete, method = "knn", seed = 3)
length(imp$datasets)
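The roles of svd_rank and svd_maxiter can be pictured with a generic iterative low-rank imputation loop. The sketch below is an assumption about what such a routine typically looks like, not the package's fallback code; the helper svd_impute_sketch and its tol argument are hypothetical.

```r
# Hypothetical helper (not part of missPLS): iterative low-rank SVD imputation.
svd_impute_sketch <- function(x, rank = 2L, maxiter = 1000L, tol = 1e-6) {
  x <- as.matrix(x)
  miss <- is.na(x)
  # Start from the column means of the observed values
  col_means <- colMeans(x, na.rm = TRUE)
  filled <- x
  filled[miss] <- col_means[col(x)[miss]]
  for (iter in seq_len(maxiter)) {
    s <- svd(filled, nu = rank, nv = rank)
    d <- s$d[seq_len(rank)]
    low_rank <- s$u %*% (d * t(s$v))  # rank-truncated reconstruction
    change <- sqrt(sum((filled[miss] - low_rank[miss])^2))
    filled[miss] <- low_rank[miss]    # observed cells are never overwritten
    if (change < tol) break
  }
  filled
}

set.seed(3)
truth <- matrix(rnorm(30 * 2), 30, 2) %*% matrix(rnorm(2 * 8), 2, 8)
x_na <- truth
x_na[sample.int(length(truth), 10)] <- NA
imp <- svd_impute_sketch(x_na, rank = 2L)
anyNA(imp)  # FALSE
```

Only the missing cells are updated at each pass, so the observed entries of the input are returned unchanged.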
Octane dataset
Description
Octane in gasoline from NIR data used in the article and thesis.
Usage
octane
Format
A misspls_dataset list with components:
- name
Dataset name.
- x
A numeric 68 x 493 predictor matrix.
- y
A numeric response vector of length 68.
- data
A data frame with response y and predictors x1 to x493.
- source
A short source reference.
- preprocessing
Dataset preprocessing notes.
- notes
Additional study notes.
Source
Goicoechea and Olivieri (2003), calibration and test files bundled in
extra_docs/pls_data.
Complete-case ozone dataset
Description
Los Angeles ozone pollution complete-case dataset used in the article and thesis.
Usage
ozone_complete
Format
A misspls_dataset list with components:
- name
Dataset name.
- x
A numeric 203 x 12 predictor matrix.
- y
A numeric response vector of length 203.
- data
A data frame with response y and predictors x1 to x12.
- source
A short source reference.
- preprocessing
Dataset preprocessing notes.
- notes
Additional study notes.
Source
mlbench::Ozone, restricted to the 203 complete observations used
in the published analysis.
Run a real-data study
Description
Run a real-data study
Usage
run_real_data_study(
dataset,
seed = NULL,
missing_props = seq(5, 50, 5),
mechanisms = c("MCAR", "MAR"),
reps = 1L,
baseline_reps = 100L,
max_ncomp = 12L,
criteria = c("Q2-LOO", "Q2-10fold", "AIC", "AIC-DoF", "BIC", "BIC-DoF"),
incomplete_methods = c("nipals_standard", "nipals_adaptative"),
imputation_methods = c("mice", "knn", "svd"),
folds = 10L,
mar_y_bias = 0.8
)
Arguments
dataset |
A packaged dataset name or |
seed |
Optional base random seed. |
missing_props |
Missing-data proportions as fractions or percentages. |
mechanisms |
Missing-data mechanisms. |
reps |
Number of replicate missingness draws for each mechanism and proportion. |
baseline_reps |
Number of repeated complete-data |
max_ncomp |
Maximum number of extracted components. |
criteria |
Criteria evaluated on incomplete and imputed data. |
incomplete_methods |
Incomplete-data NIPALS workflows. |
imputation_methods |
Imputation methods. |
folds |
Number of folds used by |
mar_y_bias |
MAR bias parameter passed to |
Value
A data frame with one row per study run.
Run a simulation study
Description
Run the simulation workflows used in the article and thesis.
Usage
run_simulation_study(
dimensions = list(c(500L, 100L), c(500L, 20L), c(100L, 20L), c(80L, 25L), c(60L, 33L),
c(40L, 50L), c(20L, 100L)),
true_ncomp = c(2L, 4L, 6L),
missing_props = seq(5, 50, 5),
mechanisms = c("MCAR", "MAR"),
reps = 1L,
seed = NULL,
max_ncomp = 8L,
criteria = c("Q2-LOO", "Q2-10fold", "AIC", "AIC-DoF", "BIC", "BIC-DoF"),
incomplete_methods = c("nipals_standard", "nipals_adaptative"),
imputation_methods = c("mice", "knn", "svd"),
folds = 10L,
mar_y_bias = 0.8
)
Arguments
dimensions |
List of |
true_ncomp |
Vector of true component counts. |
missing_props |
Missing-data proportions as fractions or percentages. |
mechanisms |
Missing-data mechanisms. |
reps |
Number of replicates. |
seed |
Optional base random seed. |
max_ncomp |
Maximum number of extracted components. |
criteria |
Criteria evaluated on complete and imputed data. |
incomplete_methods |
Incomplete-data NIPALS workflows. |
imputation_methods |
Imputation methods. |
folds |
Number of folds used by |
mar_y_bias |
MAR bias parameter passed to |
Value
A data frame with one row per study run.
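Before launching the defaults, it can help to count the scenarios they imply. The base-R sketch below enumerates the grid spanned by the default dimensions, true_ncomp, missing_props, mechanisms, and reps values from the Usage block. It assumes each dimensions entry is a c(n, p) pair, and it does not reproduce how the package turns scenarios into result rows, since each scenario is further combined with methods and criteria.

```r
dimensions <- list(c(500L, 100L), c(500L, 20L), c(100L, 20L), c(80L, 25L),
                   c(60L, 33L), c(40L, 50L), c(20L, 100L))

grid <- expand.grid(
  dim_id       = seq_along(dimensions),
  true_ncomp   = c(2L, 4L, 6L),
  missing_prop = seq(5, 50, 5),
  mechanism    = c("MCAR", "MAR"),
  rep          = seq_len(1L),
  stringsAsFactors = FALSE
)

# Assumption: first element is n (observations), second is p (predictors)
grid$n <- vapply(dimensions[grid$dim_id], `[`, integer(1), 1L)
grid$p <- vapply(dimensions[grid$dim_id], `[`, integer(1), 2L)

nrow(grid)  # 7 * 3 * 10 * 2 * 1 = 420 scenarios
```

Restricting any of these vectors, as the summarize_simulation_study example does, shrinks the grid multiplicatively.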
Select the number of PLS components
Description
Select the number of components for complete, imputed, or incomplete-data PLS workflows.
Usage
select_ncomp(
x,
y,
method = c("complete", "nipals_standard", "nipals_adaptative"),
criterion = c("Q2-LOO", "Q2-10fold", "AIC", "AIC-DoF", "BIC", "BIC-DoF"),
max_ncomp,
seed = NULL,
folds = 10L,
threshold = 0.0975
)
Arguments
x |
Predictor matrix, dataset object, or |
y |
Numeric response vector. This may be omitted when |
method |
Selection workflow: "complete", "nipals_standard", or "nipals_adaptative". |
criterion |
Selection criterion: "Q2-LOO", "Q2-10fold", "AIC", "AIC-DoF", "BIC", or "BIC-DoF". |
max_ncomp |
Maximum number of components to consider. |
seed |
Optional random seed used by the cross-validation and imputation aggregation steps. |
folds |
Number of cross-validation folds used by |
threshold |
Threshold applied to |
Value
A one-row data frame describing the selected component count.
Examples
sim <- simulate_pls_data(n = 25, p = 10, true_ncomp = 2, seed = 1)
select_ncomp(sim$x, sim$y, method = "complete", criterion = "AIC", max_ncomp = 4, seed = 2)
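The default threshold of 0.0975 equals 1 - 0.95^2, the usual cut-off for the Q2 rule in PLS regression. Assuming the threshold is applied to the Q2 criteria in that conventional way, the rule can be sketched as follows. The helper select_by_q2 is hypothetical and the PRESS and RSS values are invented for illustration.

```r
# Hypothetical helper (not part of missPLS): keep adding components while
# Q2_h = 1 - PRESS_h / RSS_{h-1} stays at or above the threshold.
select_by_q2 <- function(press, rss, threshold = 0.0975) {
  # press[h]: cross-validated PRESS with h components
  # rss[h]:   residual sum of squares with h - 1 components
  #           (rss[1] is the total sum of squares)
  q2 <- 1 - press / rss
  below <- which(q2 < threshold)
  ncomp <- if (length(below) == 0L) length(q2) else below[1L] - 1L
  list(q2 = q2, ncomp = ncomp)
}

press <- c(40, 18, 9.5, 9.2)  # illustrative values only
rss   <- c(100, 35, 15, 9)
res <- select_by_q2(press, rss)
res$ncomp  # 3: the fourth component has Q2 < 0.0975
```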
Simulate PLS data
Description
Simulate a univariate-response PLS dataset using the Li et al.-style
generator available in plsRglm.
Usage
simulate_pls_data(n, p, true_ncomp, seed = NULL, model = "li2002")
Arguments
n |
Number of observations. |
p |
Number of predictors. |
true_ncomp |
True number of latent components. |
seed |
Optional random seed. If supplied, it is used only for this call. |
model |
Simulation model. Only "li2002" is supported. |
Value
A list with components x, y, data, true_ncomp, seed, and
model.
Examples
sim <- simulate_pls_data(n = 20, p = 10, true_ncomp = 2, seed = 42)
str(sim)
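The seed argument is documented as being used only for the call in question. One common way to obtain that behaviour in R is to save the global RNG state, set the seed, and restore the state on exit. The sketch below shows the pattern with a hypothetical helper, with_local_seed; whether missPLS implements it this way is not stated here.

```r
# Hypothetical helper (not part of missPLS): evaluate expr under a given seed
# and restore the caller's RNG state afterwards.
with_local_seed <- function(seed, expr) {
  if (is.null(seed)) return(expr)
  has_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_seed <- if (has_seed) get(".Random.seed", envir = globalenv()) else NULL
  on.exit({
    if (has_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else {
      rm(".Random.seed", envir = globalenv())
    }
  }, add = TRUE)
  set.seed(seed)
  expr  # lazily evaluated here, after the seed has been set
}

set.seed(123)
a <- runif(1)

set.seed(123)
draws <- with_local_seed(42, rnorm(5))  # seeded call in between
b <- runif(1)

identical(a, b)  # TRUE: the outer random stream is unaffected
```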
Summarize simulation or real-data study results
Description
Summarize simulation or real-data study results
Usage
summarize_simulation_study(results)
Arguments
results |
A results data frame returned by |
Value
A grouped summary data frame.
Examples
sim_results <- run_simulation_study(
dimensions = list(c(30, 12)),
true_ncomp = 2,
missing_props = numeric(0),
mechanisms = character(0),
reps = 2,
seed = 1
)
summarize_simulation_study(sim_results)
Tetracycline dataset
Description
Tetracycline in serum used in the article and thesis.
Usage
tetracycline
Format
A misspls_dataset list with components:
- name
Dataset name.
- x
A numeric 107 x 101 predictor matrix.
- y
A numeric response vector of length 107.
- data
A data frame with response y and predictors x1 to x101.
- source
A short source reference.
- preprocessing
Dataset preprocessing notes.
- notes
Additional study notes.
Source
Goicoechea and Olivieri (1999b), calibration and test files bundled in
extra_docs/pls_data.