This vignette illustrates how to work with HDF5-backed matrices using the S3 interface provided by the BigDataStatMeth package. It covers matrix creation, data import, subsetting, algebraic and statistical operations, matrix decompositions, global options, compression, HDF5 file-space management, and resource handling.
BigDataStatMeth provides statistical and linear algebra operations for
matrices stored in HDF5 files. The package is designed for workflows in
which matrices may be too large to be held entirely in memory, while
still allowing users to work with familiar R functions.
The examples in this vignette use small matrices so that they can be run quickly and compared with standard in-memory R results. The same interface is intended for larger HDF5-backed matrices. All files created in the examples are written to temporary locations.
library(BigDataStatMeth)
The recommended user-facing interface is based on HDF5Matrix objects
and standard R generics. As a result, HDF5-backed matrices can be
manipulated using calls such as dim(), [, %*%, crossprod(), scale(),
cor(), svd(), qr(), and prcomp().
Internally, BigDataStatMeth uses an R6 and C++ backend. Advanced users
may interact with lower-level interfaces when needed, but this vignette
focuses on the S3 interface because it allows code to remain close to
ordinary R matrix workflows.
The following table summarizes the main user-facing functionality
available for HDF5Matrix objects. It is not an exhaustive list of all
exported functions; rather, it highlights the standard R-style methods
and high-level helpers used throughout this vignette. Additional
domain-specific, lower-level, and compatibility functions are documented
in their corresponding help pages.
| Category | Representative calls |
|---|---|
| Core object handling | hdf5_create_matrix(), hdf5_matrix(), dim(), nrow(), ncol(), is_open(), close() |
| HDF5 inspection and I/O | list_datasets(), hdf5_import(), hdf5_import_multiple(), as.matrix(), as.data.frame() |
| Subsetting and assignment | X[i, j], X[i, j] <- value |
| Dimension names | rownames(), colnames(), dimnames() |
| Element-wise arithmetic | X + Y, X - Y, X * Y, X / Y |
| Matrix algebra | %*%, crossprod(), tcrossprod() |
| Binding | cbind(), rbind() |
| Aggregations | colSums(), rowSums(), colMeans(), rowMeans(), colVars(), rowVars(), colSds(), rowSds(), colMins(), rowMins(), colMaxs(), rowMaxs() |
| Scalar summaries | mean(), var(), sd() |
| Normalization and transformations | scale(), sweep() |
| Correlation | cor() |
| Matrix decompositions | svd(), prcomp(), eigen(), pseudoinverse() |
| Factorizations and solvers | qr(), chol(), solve() |
| Diagonal operations | diag(), diag<-(), diag_op(), diag_scale() |
| Split, reduce, and apply | split_dataset(), split(), reduce(), apply_function() |
| Resource management and options | hdf5matrix_options(), show_hdf5matrix_options(), hdf5_close_all() |
| Specialized high-level helpers | selected bd* functions for utilities without a direct standard R generic, such as bdCreate_hdf5_group(), bdmove_hdf5_dataset(), and bdWrite_hdf5_dimnames() |
Most examples in this vignette use the standard HDF5Matrix/S3
interface. Some high-level helper functions keep the bd* prefix
because they provide specialized functionality that does not map
directly to an existing base R generic. These functions remain part of
the package API and are documented in their corresponding help pages.
Some S3 operators, such as element-wise arithmetic, follow the standard R syntax and do not expose additional arguments in the function call. Global options make it possible to configure block-wise processing, parallelization, the number of threads, and compression for such operations.
old_opts <- hdf5matrix_options()
old_opts
$paral
NULL
$block_size
NULL
$threads
NULL
$compression
NULL
For example, the following settings enable parallel execution with two threads, use a fixed block size, and set the compression level used for new output datasets when not explicitly specified by a method call.
hdf5matrix_options(
paral = TRUE,
threads = 2L,
block_size = 512L,
compression = 6L
)
hdf5matrix_options()
$paral
[1] TRUE
$block_size
[1] 512
$threads
[1] 2
$compression
[1] 6
When an individual method provides explicit arguments, those arguments
should be used for local control of that operation. The global options
are especially useful for generic operators where the standard R call is
kept unchanged. Additional operation-specific parameters are documented
in the corresponding help pages, for example ?svd.HDF5Matrix,
?prcomp.HDF5Matrix, ?qr.HDF5Matrix, and ?hdf5matrix_options.
The function hdf5_create_matrix() creates an HDF5 dataset and returns
an HDF5Matrix object pointing to it. In the example below, a standard
R matrix is written to an HDF5 file and then manipulated through the S3
interface.
h5file <- tempfile(fileext = ".h5")
set.seed(123)
X <- matrix(rnorm(500 * 100), nrow = 500, ncol = 100)
X_h5 <- hdf5_create_matrix(
filename = h5file,
dataset = "data/X",
data = X,
overwrite = TRUE
)
X_h5
HDF5Matrix object
File: /var/folders/qf/1vjbzhlx1lv7nsddk8ky3p_r0000gn/T//RtmpUjKk6e/filef27d4ed7d48.h5
Path: data/X
Dimensions: 500 x 100
Type:
Status: OPEN
dim(X_h5)
[1] 500 100
nrow(X_h5)
[1] 500
ncol(X_h5)
[1] 100
The datasets stored in an HDF5 file can be listed with
list_datasets(). Existing datasets can be reopened with
hdf5_matrix().
list_datasets(h5file)
[1] "data/X"
X_h5_reopened <- hdf5_matrix(
filename = h5file,
path = "data/X"
)
dim(X_h5_reopened)
[1] 500 100
Operations that return an HDF5Matrix object write their results to an
HDF5 dataset. The printed object shows the file and dataset path. Some
methods, such as crossprod() and tcrossprod(), allow the output group
and dataset name to be specified explicitly. Other S3 operators, such as
element-wise arithmetic, currently use package-defined output names.
Delimited text files can be imported directly into HDF5 with
hdf5_import(). The following example uses the small cholesterol data
file distributed with the package.
In the final package layout, the file should be available under
inst/extdata/colesterol.csv. The fallback below is included only to
allow this vignette to be tested against older source layouts in which
the same file was stored elsewhere.
csv_file <- system.file(
"extdata", "colesterol.csv",
package = "BigDataStatMeth"
)
if (!nzchar(csv_file)) {
csv_file <- system.file(
"data", "colesterol.csv",
package = "BigDataStatMeth"
)
}
if (!nzchar(csv_file) && file.exists("colesterol.csv")) {
csv_file <- "colesterol.csv"
}
stopifnot(nzchar(csv_file))
h5_csv <- tempfile(fileext = ".h5")
cholesterol_h5 <- hdf5_import(
source = csv_file,
filename = h5_csv,
dataset = "cholesterol/data",
sep = ",",
header = TRUE,
overwrite = TRUE
)
cholesterol_h5
HDF5Matrix object
File: /var/folders/qf/1vjbzhlx1lv7nsddk8ky3p_r0000gn/T//RtmpUjKk6e/filef27d62902b18.h5
Path: cholesterol/data
Dimensions: 1000 x 10
Type:
Status: OPEN
dim(cholesterol_h5)
[1] 1000 10
cholesterol_h5[1:5, 1:min(6, ncol(cholesterol_h5))]
TCholesterol Age Insulin Creatinine BUN LLDR
[1,] 223.7348 55.25039 14.90246 0.9026095 10.22067 1.450970
[2,] 248.1820 53.47404 25.12592 0.8710345 17.10225 1.002655
[3,] 180.2071 58.54268 12.95197 0.8882435 11.21530 1.073924
[4,] 200.1522 62.58284 28.27819 0.9361357 12.79399 1.129373
[5,] 234.3281 68.70254 13.21404 0.7775238 11.39689 1.235655
The [ operator can be used to read subsets from an HDF5Matrix
object. Only the requested rows and columns are read.
X_h5[1:5, 1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.56047565 -0.6018928 -0.99579872 -0.8209867 -0.5116037 -0.6788076
[2,] -0.23017749 -0.9936986 -1.03995504 -0.3072572 0.2369379 0.5743127
[3,] 1.55870831 1.0267851 -0.01798024 -0.9020980 -0.5415892 -0.7045145
[4,] 0.07050839 0.7510613 -0.13217513 0.6270687 1.2192276 -0.5339841
[5,] 0.12928774 -1.5091665 -2.54934277 1.1203550 0.1741359 0.7743846
sub_X <- X_h5[1:20, 1:10]
dim(sub_X)
[1] 20 10
The replacement form [<- writes values back to the HDF5 dataset.
Changes are applied directly to the data stored on disk.
X_edit <- hdf5_create_matrix(
h5file,
"data/X_edit",
data = X[1:10, 1:6],
overwrite = TRUE
)
X_edit[1, 1] <- 999
X_edit[1:3, 1:3]
[,1] [,2] [,3]
[1,] 999.0000000 -0.6018928 -0.99579872
[2,] -0.2301775 -0.9936986 -1.03995504
[3,] 1.5587083 1.0267851 -0.01798024
Dimension names can also be stored and retrieved with the usual R accessors.
DN_h5 <- hdf5_create_matrix(
h5file,
"data/dimnames_example",
data = matrix(seq_len(12), nrow = 4, ncol = 3),
overwrite = TRUE
)
rownames(DN_h5) <- paste0("sample_", seq_len(nrow(DN_h5)))
colnames(DN_h5) <- paste0("feature_", seq_len(ncol(DN_h5)))
rownames(DN_h5)
[1] "sample_1" "sample_2" "sample_3" "sample_4"
colnames(DN_h5)
[1] "feature_1" "feature_2" "feature_3"
dimnames(DN_h5)
[[1]]
[1] "sample_1" "sample_2" "sample_3" "sample_4"
[[2]]
[1] "feature_1" "feature_2" "feature_3"
The function as.matrix() reads an HDF5-backed matrix into memory. It
is useful for small examples or final results, but should be used with
care for very large datasets.
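Before materializing a large dataset, a rough size check helps avoid accidental large reads. The following is a minimal base-R sketch, assuming double-precision storage (8 bytes per element); the dimensions are those of the example matrix above.

```r
# Estimate the in-memory size of a full as.matrix() read, assuming
# double-precision (8 bytes per element); plain base R arithmetic.
dims <- c(500, 100)                # dim(X_h5) for the example matrix
est_mb <- prod(dims) * 8 / 1024^2  # approximate megabytes once in memory
est_mb
```

For matrices whose estimated size approaches available memory, reading a subset with [ before converting is usually the safer choice.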
X_small <- as.matrix(X_h5[1:10, 1:5])
X_small
[,1] [,2] [,3] [,4] [,5]
[1,] -0.56047565 -0.60189285 -0.99579872 -0.8209867 -0.5116037
[2,] -0.23017749 -0.99369859 -1.03995504 -0.3072572 0.2369379
[3,] 1.55870831 1.02678506 -0.01798024 -0.9020980 -0.5415892
[4,] 0.07050839 0.75106130 -0.13217513 0.6270687 1.2192276
[5,] 0.12928774 -1.50916654 -2.54934277 1.1203550 0.1741359
[6,] 1.71506499 -0.09514745 1.04057346 2.1272136 -0.6152683
[7,] 0.46091621 -0.89594782 0.24972574 0.3661144 -1.8068930
[8,] -1.26506123 -2.07075107 2.41620737 -0.8747814 -0.6436811
[9,] -0.68685285 0.15012013 0.68519824 1.0244749 2.0460189
[10,] -0.44566197 -0.07921171 -0.44695931 0.9047589 -0.5607624
Element-wise arithmetic between HDF5Matrix objects can be performed
using the standard arithmetic operators. The results are stored as new
HDF5-backed matrices.
set.seed(1)
A <- matrix(rnorm(300 * 80), nrow = 300, ncol = 80)
B <- matrix(rnorm(300 * 80), nrow = 300, ncol = 80)
C <- matrix(rnorm(80 * 40), nrow = 80, ncol = 40)
A_h5 <- hdf5_create_matrix(
h5file, "data/A", data = A,
overwrite = TRUE
)
B_h5 <- hdf5_create_matrix(
h5file, "data/B", data = B,
overwrite = TRUE
)
C_h5 <- hdf5_create_matrix(
h5file, "data/C", data = C,
overwrite = TRUE
)
S_h5 <- A_h5 + B_h5
D_h5 <- A_h5 - B_h5
S_h5
HDF5Matrix object
File: /var/folders/qf/1vjbzhlx1lv7nsddk8ky3p_r0000gn/T//RtmpUjKk6e/filef27d4ed7d48.h5
Path: OUTPUT/A_plus_B
Dimensions: 300 x 80
Type:
Status: OPEN
dim(S_h5)
[1] 300 80
all.equal(as.matrix(S_h5), A + B)
[1] TRUE
all.equal(as.matrix(D_h5), A - B)
[1] TRUE
The output datasets can be inspected directly from the file. This is useful when results need to be reopened later in the workflow.
list_datasets(h5file, group = "OUTPUT", recursive = TRUE)
[1] "A_minus_B" "A_plus_B"
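Because operator results are ordinary HDF5 datasets, they can be reopened later by their printed path. A short sketch using hdf5_matrix() with the dataset name listed above:

```r
# Reopen the addition result by its package-defined output path.
S_reopened <- hdf5_matrix(
  filename = h5file,
  path     = "OUTPUT/A_plus_B"
)
dim(S_reopened)
```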
Matrix multiplication uses the same %*% syntax as ordinary R matrices,
while the computation is performed block-wise on the HDF5-backed data.
M_h5 <- A_h5 %*% C_h5
M_h5
HDF5Matrix object
File: /var/folders/qf/1vjbzhlx1lv7nsddk8ky3p_r0000gn/T//RtmpUjKk6e/filef27d4ed7d48.h5
Path: OUTPUT/A_x_C
Dimensions: 300 x 40
Type:
Status: OPEN
dim(M_h5)
[1] 300 40
all.equal(as.matrix(M_h5), A %*% C)
[1] TRUE
Crossproducts are available through the standard R calls. These methods also allow the output group and dataset name to be specified explicitly.
XtX_h5 <- crossprod(
A_h5,
outgroup = "RESULTS",
outdataset = "A_crossprod"
)
XXt_h5 <- tcrossprod(
A_h5,
outgroup = "RESULTS",
outdataset = "A_tcrossprod"
)
XtX_h5
HDF5Matrix object
File: /var/folders/qf/1vjbzhlx1lv7nsddk8ky3p_r0000gn/T//RtmpUjKk6e/filef27d4ed7d48.h5
Path: RESULTS/A_crossprod
Dimensions: 80 x 80
Type:
Status: OPEN
list_datasets(h5file, group = "RESULTS", recursive = TRUE)
[1] "A_crossprod" "A_tcrossprod"
all.equal(as.matrix(XtX_h5), crossprod(A))
[1] TRUE
all.equal(as.matrix(XXt_h5), tcrossprod(A))
[1] TRUE
The functions cbind() and rbind() can be used to combine compatible
HDF5-backed matrices by columns or rows.
A1_h5 <- hdf5_create_matrix(
h5file,
"bind/A1",
data = A[1:50, 1:10],
overwrite = TRUE
)
A2_h5 <- hdf5_create_matrix(
h5file,
"bind/A2",
data = A[1:50, 11:20],
overwrite = TRUE
)
Cbind_h5 <- cbind(
A1_h5, A2_h5,
out_group = "BIND_RESULTS",
out_dataset = "A1_A2_cbind",
overwrite = TRUE
)
Rbind_h5 <- rbind(
A1_h5, A1_h5,
out_group = "BIND_RESULTS",
out_dataset = "A1_A1_rbind",
overwrite = TRUE
)
dim(Cbind_h5)
[1] 50 20
dim(Rbind_h5)
[1] 100 10
all.equal(as.matrix(Cbind_h5), cbind(A[1:50, 1:10], A[1:50, 11:20]))
[1] TRUE
all.equal(as.matrix(Rbind_h5), rbind(A[1:50, 1:10], A[1:50, 1:10]))
[1] TRUE
Many summary operations return standard R vectors or scalars. This is convenient when the output is small relative to the input matrix.
all.equal(colMeans(A_h5), colMeans(A))
[1] TRUE
all.equal(rowSums(A_h5), rowSums(A))
[1] TRUE
all.equal(colVars(A_h5), apply(A, 2, var))
[1] TRUE
all.equal(rowSds(A_h5), apply(A, 1, sd))
[1] TRUE
The scale() method centers and/or scales an HDF5-backed matrix by
blocks and stores the result on disk.
A_scaled_h5 <- scale(A_h5)
A_scaled <- scale(A)
all.equal(
as.matrix(A_scaled_h5),
A_scaled,
check.attributes = FALSE
)
[1] TRUE
Correlation matrices can be computed directly from an HDF5Matrix.
Cor_h5 <- cor(A_h5)
all.equal(
as.matrix(Cor_h5),
cor(A),
tolerance = 1e-6
)
[1] TRUE
Some operations accept small HDF5-backed vectors as auxiliary inputs.
The following example subtracts column means from each column using
sweep().
col_means_h5 <- hdf5_create_matrix(
h5file,
"stats/A_col_means",
data = matrix(colMeans(A), nrow = 1),
overwrite = TRUE
)
A_centered_h5 <- sweep(A_h5, MARGIN = 2, STATS = col_means_h5, FUN = "-")
all.equal(
as.matrix(A_centered_h5),
sweep(A, MARGIN = 2, STATS = colMeans(A), FUN = "-"),
check.attributes = FALSE
)
[1] TRUE
BigDataStatMeth implements several matrix decompositions for
HDF5-backed matrices. The examples below use small matrices to validate
results against standard R computations.
The SVD method can center and scale the input before decomposition. The returned singular values are stored in memory, while singular vectors are stored as HDF5-backed matrices.
set.seed(123)
X_svd <- matrix(rnorm(120 * 300), nrow = 120, ncol = 300)
X_svd_h5 <- hdf5_create_matrix(
h5file,
"decomp/X_svd",
data = X_svd,
overwrite = TRUE
)
svd_h5 <- svd(
X_svd_h5,
nu = 10,
nv = 10,
center = TRUE,
scale = TRUE,
overwrite = TRUE
)
head(svd_h5$d)
[1] 27.48157 27.27928 26.93254 26.60646 26.34913 25.95616
dim(svd_h5$u)
[1] 120 10
dim(svd_h5$v)
[1] 300 10
The first singular values can be compared with the corresponding in-memory R calculation.
svd_r <- svd(scale(X_svd), nu = 10, nv = 10)
all.equal(
svd_h5$d[1:10],
svd_r$d[1:10],
tolerance = 1e-6
)
[1] TRUE
For large matrices, svd() can use a block-wise strategy. The parameter
k controls the number of local SVDs per incremental level, whereas
q controls the number of incremental levels used to combine
intermediate decompositions. Larger values of k reduce the size of each
local problem, while larger values of q add combination steps; both can
increase overhead. These parameters should therefore be selected
according to the matrix dimensions and the available hardware. The
argument threads controls the number of threads used by the parallel
parts of the computation.
The arguments nu and nv follow the same idea as in base::svd():
nu controls how many left singular vectors are returned, and nv
controls how many right singular vectors are returned. Requesting only
the leading components is useful when the complete set of singular
vectors is not needed.
svd_blk_h5 <- svd(
X_svd_h5,
nu = 5,
nv = 5,
k = 4,
q = 1,
threads = 2,
overwrite = TRUE
)
head(svd_blk_h5$d)
[1] 27.48157 27.27928 26.93254 26.60646 26.34913
dim(svd_blk_h5$u)
[1] 120 5
dim(svd_blk_h5$v)
[1] 300 5
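The incremental idea behind the block-wise strategy can be illustrated with a plain base-R sketch. This is an assumption about the general approach, not the package's exact algorithm: each row block is reduced to D %*% t(V), which preserves the cross-product t(X) %*% X, and a second, much smaller SVD combines the pieces.

```r
# One-level block SVD sketch in base R: the singular values of X can be
# recovered from per-block SVD summaries, because t(X) %*% X is preserved.
set.seed(1)
X <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)
row_blocks <- split(seq_len(nrow(X)), rep(1:4, each = 50))  # k = 4 blocks
pieces <- lapply(row_blocks, function(i) {
  s <- svd(X[i, , drop = FALSE])
  diag(s$d) %*% t(s$v)   # keeps t(Xi) %*% Xi = t(piece) %*% piece
})
d_block <- svd(do.call(rbind, pieces))$d  # small combining SVD
all.equal(d_block, svd(X)$d)
```

Each local SVD here operates on a 50 x 20 block instead of the full 200 x 20 matrix, and the combining SVD only sees an 80 x 20 stack of block summaries.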
The prcomp() method provides a PCA interface for HDF5-backed matrices.
The result includes standard PCA summaries together with HDF5-backed
matrices for quantities such as scores and loadings. The argument
ncomponents controls the number of principal components computed; a
value of 0 requests all available components.
set.seed(124)
n <- 180
p <- 40
group <- rep(c("Group 1", "Group 2", "Group 3"), each = n / 3)
X_pca <- matrix(rnorm(n * p), nrow = n, ncol = p)
X_pca[group == "Group 2", 1:8] <- X_pca[group == "Group 2", 1:8] + 1.5
X_pca[group == "Group 3", 9:16] <- X_pca[group == "Group 3", 9:16] - 1.5
X_pca_h5 <- hdf5_create_matrix(
h5file,
"decomp/X_pca",
data = X_pca,
overwrite = TRUE
)
pca_h5 <- prcomp(
X_pca_h5,
center = TRUE,
scale. = TRUE,
ncomponents = 5,
overwrite = TRUE
)
pca_h5
HDF5PCA - Principal Component Analysis (on-disk)
File : /var/folders/qf/1vjbzhlx1lv7nsddk8ky3p_r0000gn/T//RtmpUjKk6e/filef27d4ed7d48.h5
PCs : 5
center : TRUE | scale: TRUE
sdev[1:3]: 2.1775, 1.4867, 1.3418
cumvar% : 11.85, 17.38, 21.88 ...
rotation : 40 x 40 [HDF5Matrix]
x (ind.) : 180 x 40 [HDF5Matrix]
head(pca_h5$sdev)
[1] 2.177534 1.486692 1.341799 1.305983 1.264360
The PCA scores are stored in the x component of the returned object.
When x is an HDF5-backed matrix, the required columns can be subsetted
and converted to memory for visualization. Here the synthetic groups are
used only to make the low-dimensional structure visible.
class(pca_h5$x)
[1] "HDF5Matrix" "HDF5Matrix" "R6"
dim(pca_h5$x)
[1] 180 40
pca_scores <- as.matrix(pca_h5$x[, 1:2])
pca_df <- data.frame(
PC1 = pca_scores[, 1],
PC2 = pca_scores[, 2],
group = group
)
if (requireNamespace("ggplot2", quietly = TRUE)) {
ggplot2::ggplot(pca_df, ggplot2::aes(PC1, PC2, colour = group)) +
ggplot2::geom_point(size = 2.4, alpha = 0.85) +
ggplot2::stat_ellipse(linewidth = 0.6, alpha = 0.7) +
ggplot2::theme_minimal(base_size = 12) +
ggplot2::theme(
legend.position = "top",
panel.grid.minor = ggplot2::element_blank()
) +
ggplot2::labs(
title = "PCA of an HDF5-backed matrix",
subtitle = "Scores returned by prcomp.HDF5Matrix()",
x = "PC1",
y = "PC2",
colour = "Synthetic group"
)
} else {
plot(
pca_df$PC1,
pca_df$PC2,
pch = 19,
xlab = "PC1",
ylab = "PC2",
main = "PCA of an HDF5-backed matrix"
)
}
Figure 1: PCA scores computed from an HDF5-backed matrix
The QR decomposition returns HDF5-backed Q and R matrices. The
argument thin controls whether a reduced decomposition is returned.
With thin = TRUE, Q has only the columns needed to reconstruct the
input matrix, which is usually preferable for tall matrices because it
avoids storing unnecessary columns. With thin = FALSE, a full Q
matrix is returned.
Because the columns of Q (and the corresponding rows of R) are only
determined up to sign, the example validates the factorization and the
orthogonality of Q rather than comparing individual columns.
QR_h5 <- qr(A_h5, thin = TRUE, overwrite = TRUE)
Q <- as.matrix(QR_h5$Q)
R <- as.matrix(QR_h5$R)
all.equal(Q %*% R, A, tolerance = 1e-6)
[1] TRUE
all.equal(crossprod(Q), diag(ncol(Q)), tolerance = 1e-6)
[1] TRUE
For tall-skinny matrices, the QR method can use the TSQR path. The
method = "auto" setting selects the backend according to the matrix
shape, while method = "tsqr" requests the tall-skinny algorithm
explicitly.
X_tsqr <- matrix(rnorm(600 * 30), nrow = 600, ncol = 30)
X_tsqr_h5 <- hdf5_create_matrix(
h5file,
"decomp/X_tsqr",
data = X_tsqr,
overwrite = TRUE
)
QR_tsqr_h5 <- qr(
X_tsqr_h5,
thin = TRUE,
method = "tsqr",
threads = 2L,
overwrite = TRUE
)
dim(QR_tsqr_h5$Q)
[1] 600 30
dim(QR_tsqr_h5$R)
[1] 30 30
For symmetric positive-definite matrices, chol() and solve() can be
used directly on HDF5Matrix objects.
set.seed(321)
Z <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)
SPD <- crossprod(Z) + diag(1e-6, 50)
SPD_h5 <- hdf5_create_matrix(
h5file,
"decomp/SPD",
data = SPD,
overwrite = TRUE
)
Ch_h5 <- chol(SPD_h5, overwrite = TRUE)
Inv_h5 <- solve(SPD_h5, overwrite = TRUE)
Ch <- as.matrix(Ch_h5)
chol_error <- min(
max(abs(crossprod(Ch) - SPD)),
max(abs(tcrossprod(Ch) - SPD))
)
chol_error < 1e-6
[1] TRUE
all.equal(as.matrix(Inv_h5), solve(SPD), tolerance = 1e-6)
[1] TRUE
Additional decompositions are available through S3 methods. For example,
eigen() can be applied to symmetric HDF5-backed matrices, and
pseudoinverse() computes the Moore–Penrose pseudoinverse.
Eig_h5 <- eigen(
SPD_h5,
symmetric = TRUE,
k = 5L,
overwrite = TRUE
)
head(Eig_h5$values)
[1] 430.0639 406.9944 392.0909 374.7714 365.4983
dim(Eig_h5$vectors)
[1] 50 5
Pinv_h5 <- pseudoinverse(
A_h5,
overwrite = TRUE
)
dim(Pinv_h5)
[1] 80 300
HDF5 datasets can be stored with different compression levels. Lower
compression levels generally reduce writing time but use more disk
space. Higher compression levels may reduce file size but can increase
the time needed to write and read data. The default value in
BigDataStatMeth is compression = 6, which provides a balanced
setting for many workflows.
The following small example illustrates the trade-off between writing time and file size. It is not intended as a formal benchmark.
set.seed(123)
X_cmp <- round(matrix(rnorm(2500 * 250), nrow = 2500, ncol = 250), 2)
f0 <- tempfile(fileext = ".h5")
f6 <- tempfile(fileext = ".h5")
t0 <- system.time(
hdf5_create_matrix(
f0, "data/X",
data = X_cmp,
compression = 0,
overwrite = TRUE
)
)
t6 <- system.time(
hdf5_create_matrix(
f6, "data/X",
data = X_cmp,
compression = 6,
overwrite = TRUE
)
)
data.frame(
compression = c(0, 6),
elapsed = c(t0[["elapsed"]], t6[["elapsed"]]),
file_size_MB = round(file.info(c(f0, f6))$size / 1024^2, 3)
)
compression elapsed file_size_MB
1 0 0.002 4.771
2 6 0.175 1.151
Formal performance comparisons should be run on the target hardware and with representative datasets.
When datasets are deleted or overwritten in an HDF5 file, the physical
file size on disk does not always decrease immediately. BigDataStatMeth
creates HDF5 files using a file space management strategy that allows
free space inside the file to be tracked and reused by subsequent writes.
As a result, space released by removed datasets can be reused within the
same file, so repeated operations do not unnecessarily increase the file
size.
This behavior helps control file growth during workflows that create,
overwrite, or remove intermediate datasets. Nevertheless, reducing the
physical size of an existing HDF5 file may require repacking the file.
The HDF Group provides external command-line tools for this purpose. In
particular, the h5repack user guide
describes how to rewrite an HDF5 file into a new compacted or
reconfigured file. This is an external HDF5 utility and is not executed
from this vignette.
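As an illustration only, a typical h5repack invocation rewrites a file into a new, compacted copy. The command syntax is that of the HDF5 command-line tools; the file names are placeholders, and the command is not executed by this vignette.

```shell
# Rewrite old.h5 into a compacted copy, applying GZIP level 6 to its
# datasets; the original old.h5 is left untouched.
h5repack -f GZIP=6 old.h5 repacked.h5
```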
HDF5-backed objects keep file handles open while they are in use. In ordinary interactive workflows these handles are released when objects are garbage-collected. Users can also close objects explicitly.
close(X_h5_reopened)
The function hdf5_close_all() closes the HDF5 handles currently tracked
by the package, including open file handles. After this call,
HDF5-backed objects that pointed to those handles should be reopened
before further use.
Calling gc() may help trigger R finalizers for objects that are no
longer referenced, which can be useful when repeatedly re-running code
during development.
hdf5matrix_options(
paral = old_opts$paral,
block_size = old_opts$block_size,
threads = old_opts$threads,
compression = old_opts$compression
)
hdf5_close_all()
gc()
used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 1600235 85.5 2780921 148.6 NA 2780921 148.6
Vcells 3688708 28.2 8388608 64.0 36864 5181876 39.6
The examples above use the standard R interface, which is the recommended
entry point for most users. However, BigDataStatMeth also provides a
C++ infrastructure for developers who need to implement new scalable
methods.
Internally, the package defines C++ classes for managing HDF5 files, groups, and datasets, together with block-wise routines for linear algebra and statistical operations. These are the same computational building blocks used by the R/S3 interface. As a result, developers can reuse the package infrastructure from Rcpp-based code instead of implementing HDF5 file management, block iteration, and numerical operations from scratch.
This makes the package useful not only as an end-user R package, but also as a development framework for extending HDF5-backed statistical computing in C++. Since the purpose of this vignette is to introduce the standard R interface, the C++ API is not discussed in detail here. Nevertheless, it is an important part of the package architecture and provides the computational infrastructure behind the R/S3 methods shown above.
sessionInfo()
R version 4.5.3 (2026-03-11)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Madrid
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BigDataStatMeth_2.0.0 knitr_1.50 BiocStyle_2.36.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 jsonlite_2.0.0 dplyr_1.1.4
[4] compiler_4.5.3 BiocManager_1.30.26 tidyselect_1.2.1
[7] Rcpp_1.1.1 bitops_1.0-9 jquerylib_0.1.4
[10] scales_1.4.0 yaml_2.3.10 fastmap_1.2.0
[13] ggplot2_4.0.1 R6_2.6.1 labeling_0.4.3
[16] generics_0.1.4 MASS_7.3-65 tibble_3.3.0
[19] bookdown_0.43 bslib_0.9.0 pillar_1.11.1
[22] RColorBrewer_1.1-3 rlang_1.1.6 cachem_1.1.0
[25] xfun_0.52 sass_0.4.10 S7_0.2.1
[28] cli_3.6.5 withr_3.0.2 magrittr_2.0.4
[31] digest_0.6.37 grid_4.5.3 rstudioapi_0.17.1
[34] lifecycle_1.0.4 vctrs_0.6.5 evaluate_1.0.4
[37] glue_1.8.0 data.table_1.18.2.1 farver_2.1.2
[40] RCurl_1.98-1.18 rmarkdown_2.29 tools_4.5.3
[43] pkgconfig_2.0.3 htmltools_0.5.8.1