| Title: | Dataset Dependency Graphs for Leakage-Aware Evaluation |
| Version: | 0.3.0 |
| Description: | Represent biomedical dataset structure as typed dependency graphs so that sample provenance, repeated-measure structure, study design, batch effects, and temporal relationships are explicit and inspectable. Validates dataset structure, detects sample-level overlap, derives deterministic split constraints, and produces a tool-agnostic split specification for leakage-aware evaluation workflows. |
| License: | MIT + file LICENSE |
| URL: | https://github.com/selcukorkmaz/splitGraph |
| BugReports: | https://github.com/selcukorkmaz/splitGraph/issues |
| Encoding: | UTF-8 |
| Depends: | R (≥ 4.1.0) |
| Imports: | graphics, igraph, stats, utils |
| Suggests: | bioLeak, jsonlite, knitr, pkgload, rmarkdown, testthat (≥ 3.0.0) |
| VignetteBuilder: | knitr |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| RoxygenNote: | 7.3.3 |
| Packaged: | 2026-07-03 11:48:54 UTC; selcuk |
| Author: | Selcuk Korkmaz |
| Maintainer: | Selcuk Korkmaz <selcukorkmaz@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-07-03 12:40:02 UTC |
splitGraph: Dataset Dependency Graphs for Leakage-Aware Evaluation
Description
The splitGraph package provides typed graph objects for representing
dataset structure, sample provenance, and leakage-relevant dependencies in
biomedical evaluation workflows. It makes dataset dependency structure
explicit enough to validate, query, and convert into a stable, tool-agnostic
split specification (split_spec) for leakage-aware evaluation.
Scope (what splitGraph does)
Model dataset dependency structure as a typed graph.
Validate that structure (structural, semantic, leakage-relevant).
Derive deterministic split constraints from the structure.
Emit and validate the tool-agnostic split_spec interchange format (with a formal JSON Schema and a Python reference consumer).
Non-goals (what downstream consumers own)
splitGraph deliberately stops at the constraint / split_spec boundary.
It does not generate resamples or folds, perform stratified splitting,
apply purge/embargo, fit or tune models, or produce statistical leakage
evidence. Those belong to downstream consumers. The reference consumer is
bioLeak, whose as_leaksplits() turns a split_spec into an
executable split plan; split_spec is neutral, so other tools (an
rsample adapter, the shipped Python reader, etc.) can consume it
equally. See the split_spec contract in ?as_split_spec and the
"Scope & relationship to bioLeak" section of the README.
Author(s)
Maintainer: Selcuk Korkmaz selcukorkmaz@gmail.com (ORCID)
See Also
Useful links:
Report bugs at https://github.com/selcukorkmaz/splitGraph/issues
Translate splitGraph Constraints into Stable Split Specifications
Description
Translate graph-derived split constraints into a stable, inspectable structure for sample-level grouping, blocking, and ordering, perform preflight structural checks on that translation, and summarize structural leakage risks.
Usage
as_split_spec(constraint, graph = NULL)
validate_split_spec(x)
summarize_leakage_risks(
graph,
constraint = NULL,
split_spec = NULL,
validation = NULL
)
Arguments
constraint |
A |
graph |
A |
x |
A |
split_spec |
An optional |
validation |
An optional |
Details
The translation layer always produces canonical sample-level columns
including sample_id, sample_node_id, group_id, and
primary_group. When available, it also carries batch_group,
study_group, timepoint_id, time_index, and
order_rank. Missing but relevant fields are retained as NA
columns rather than omitted.
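As a minimal base-R sketch of the "retain as NA rather than omit" convention described above: the helper name ensure_columns() is hypothetical and not part of splitGraph, and the use of a plain logical NA as the fill value is an assumption made for illustration.

```r
# Hypothetical helper (not a splitGraph function): guarantee that every
# expected column exists, filling absent ones with NA instead of dropping them.
ensure_columns <- function(df, cols) {
  for (col in setdiff(cols, names(df))) {
    df[[col]] <- rep(NA, nrow(df))
  }
  df
}

sample_data <- data.frame(
  sample_id = c("S1", "S2"),
  group_id = c("subject:P1", "subject:P2"),
  stringsAsFactors = FALSE
)

# Optional column names taken from the paragraph above.
optional_cols <- c("batch_group", "study_group", "timepoint_id",
                   "time_index", "order_rank")
sample_data <- ensure_columns(sample_data, optional_cols)
names(sample_data)
```

A fixed column set of this kind lets a consumer index columns by name without first testing for their presence.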
When only a subset of samples has ordering metadata, the translated split
spec still exposes that partial ordering through time_var, but
ordering_required remains FALSE. Ordering is only marked as
required when the constraint implies complete ordering coverage.
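The distinction between partial and complete ordering coverage can be sketched in base R as follows. The function ordering_status() is hypothetical, and equating "complete ordering coverage" with "no missing order_rank values" is an assumption made for illustration rather than the package's documented rule.

```r
# Hypothetical helper (not a splitGraph function): expose partial ordering
# through time_var, but require ordering only when every sample has a rank.
ordering_status <- function(order_rank) {
  has_rank <- !is.na(order_rank)
  list(
    time_var = if (any(has_rank)) "order_rank" else NULL,
    ordering_required = length(order_rank) > 0 && all(has_rank)
  )
}

partial <- ordering_status(c(1, 2, NA, NA))   # only some samples are ordered
complete <- ordering_status(c(1, 2, 1, 2))    # every sample is ordered

partial$ordering_required
complete$ordering_required
```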
The split-spec validator checks:
missing required columns
missing or duplicated sample identifiers
missing grouping assignments
singleton-only grouping structures
missing ordering when ordering is required
invalid or empty block variables
Repeated validation of the same split spec yields deterministic issue IDs and diagnostics, which makes the returned validation object stable across runs.
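A few of the checks listed above can be illustrated with a small base-R sketch. It operates on a plain data frame, not on a split_spec object; the function name check_sample_table() and the issue codes it returns are hypothetical and chosen only for this example.

```r
# Hypothetical checker (not validate_split_spec()): returns a sorted character
# vector of issue codes so that repeated runs give identical output.
check_sample_table <- function(sample_data, group_var = "group_id") {
  issues <- character()
  missing_cols <- setdiff(c("sample_id", group_var), names(sample_data))
  if (length(missing_cols) > 0) {
    return(sort(paste0("missing_column:", missing_cols)))
  }
  ids <- sample_data$sample_id
  groups <- sample_data[[group_var]]
  if (anyNA(ids)) issues <- c(issues, "missing_sample_id")
  if (anyDuplicated(ids[!is.na(ids)]) > 0) issues <- c(issues, "duplicated_sample_id")
  if (anyNA(groups)) issues <- c(issues, "missing_group")
  sizes <- table(groups)  # NA groups are excluded from the tabulation
  if (length(sizes) > 0 && all(sizes == 1)) {
    issues <- c(issues, "singleton_only_groups")
  }
  sort(issues)
}

bad <- data.frame(
  sample_id = c("S1", "S1", "S3"),
  group_id = c("g1", "g2", NA),
  stringsAsFactors = FALSE
)
issues_first <- check_sample_table(bad)
issues_second <- check_sample_table(bad)
issues_first
```

Sorting the codes is one simple way to obtain run-to-run stability; how splitGraph derives its issue IDs is not specified here.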
The produced split_spec is tool-agnostic. Downstream consumers are
expected to provide their own adapters to convert a split_spec into
their native split representation, so splitGraph has no runtime
dependency on any of them.
summarize_leakage_risks() reuses validate_graph() and
split_constraint metadata rather than duplicating downstream
evaluation logic.
Value
as_split_spec() returns a split_spec.
validate_split_spec() returns a split_spec_validation.
summarize_leakage_risks() returns a leakage_risk_summary.
Examples
meta <- data.frame(
sample_id = c("S1", "S2", "S3", "S4"),
subject_id = c("P1", "P1", "P2", "P2")
)
g <- graph_from_metadata(meta)
constraint <- derive_split_constraints(g, mode = "subject")
spec <- as_split_spec(constraint, graph = g)
validate_split_spec(spec)
summarize_leakage_risks(g, constraint = constraint, split_spec = spec)
Assemble and Validate Dependency Graphs
Description
Combine canonical node and edge tables into a typed dependency graph and perform structural, semantic, and graph-local leakage-aware validation.
Usage
build_dependency_graph(
nodes,
edges,
graph_name = NULL,
dataset_name = NULL,
validate = TRUE,
validation_overrides = list()
)
build_depgraph(
nodes,
edges,
graph_name = NULL,
dataset_name = NULL,
validate = TRUE,
validation_overrides = list()
)
as_igraph(x)
validate_graph(
graph,
checks = c("ids", "references", "cardinality", "schema", "time"),
error_on_fail = FALSE,
levels = NULL,
severities = NULL,
validation_overrides = NULL
)
validate_depgraph(
graph,
checks = c("ids", "references", "cardinality", "schema", "time"),
error_on_fail = FALSE,
levels = NULL,
severities = NULL,
validation_overrides = NULL
)
Arguments
nodes, edges |
Lists of |
graph_name, dataset_name |
Optional metadata labels. |
validate |
If |
validation_overrides |
Optional named list of explicit validation exceptions. Currently supported keys:
When passed to |
x |
A |
graph |
A |
checks |
Deprecated. Use |
error_on_fail |
If |
levels |
Optional validation layers to run. |
severities |
Optional severities to retain in the returned
|
Value
For build_dependency_graph(), a dependency_graph. For
validate_graph() and validate_depgraph(), a
depgraph_validation_report. For as_igraph(), the underlying
igraph object.
Examples
meta <- data.frame(
sample_id = c("S1", "S2"),
subject_id = c("P1", "P2")
)
samples <- create_nodes(meta, type = "Sample", id_col = "sample_id")
subjects <- create_nodes(meta, type = "Subject", id_col = "subject_id")
edges <- create_edges(
meta,
"sample_id",
"subject_id",
"Sample",
"Subject",
"sample_belongs_to_subject"
)
g <- build_dependency_graph(list(samples, subjects), list(edges))
validate_graph(g)
Create Canonical Node and Edge Tables
Description
Build canonical node and edge tables from ordinary metadata frames.
Usage
create_nodes(
data,
type,
id_col,
label_col = NULL,
attr_cols = NULL,
prefix = TRUE,
dedupe = TRUE
)
create_edges(
data,
from_col,
to_col,
from_type,
to_type,
relation,
attr_cols = NULL,
allow_missing = FALSE,
dedupe = TRUE,
from_prefix = TRUE,
to_prefix = TRUE
)
Arguments
data |
A |
type, from_type, to_type |
Supported node types such as |
id_col |
Column containing the source identifier for the node type. |
label_col |
Optional column used for node labels. |
attr_cols |
Optional columns stored in the |
prefix |
If |
dedupe |
If |
from_col, to_col |
Source and target identifier columns for edge creation. |
relation |
Canonical edge type. |
allow_missing |
If |
from_prefix, to_prefix |
Whether to prepend typed prefixes when constructing the edge endpoint identifiers. Defaults preserve the canonical prefixed-ID format. |
Details
The package uses typed node identifiers such as sample:S1 as the
canonical graph representation. If you create node sets with
prefix = FALSE, the corresponding edge endpoints must use matching
prefix settings via from_prefix and to_prefix.
When dedupe = TRUE, exact duplicate node or edge definitions are
collapsed, but conflicting definitions for the same canonical node
identifier or edge relation are rejected with an error.
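The prefixing and de-duplication behavior described above can be sketched in base R. Both helpers are hypothetical stand-ins, not the package's internals. The lower-cased type prefix is inferred from identifiers such as sample:S1 and subject:P1 that appear in this manual, and how multi-word types are prefixed is not assumed.

```r
# Hypothetical helpers (not splitGraph functions).
make_node_id <- function(type, key, prefix = TRUE) {
  if (prefix) paste0(tolower(type), ":", key) else as.character(key)
}

dedupe_nodes <- function(nodes) {
  nodes <- unique(nodes)  # collapse exact duplicate rows
  conflicting <- unique(nodes$node_id[duplicated(nodes$node_id)])
  if (length(conflicting) > 0) {
    # Same identifier, different definition: reject instead of guessing.
    stop("Conflicting definitions for: ", paste(conflicting, collapse = ", "))
  }
  nodes
}

nodes <- data.frame(
  node_id = make_node_id("Sample", c("S1", "S1", "S2")),
  label = c("S1", "S1", "S2"),
  stringsAsFactors = FALSE
)
deduped <- dedupe_nodes(nodes)  # the exact duplicate of S1 is collapsed

conflict <- rbind(
  nodes,
  data.frame(node_id = "sample:S1", label = "other", stringsAsFactors = FALSE)
)
result <- tryCatch(dedupe_nodes(conflict), error = function(e) "rejected")
```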
Value
For create_nodes(), a graph_node_set. For
create_edges(), a graph_edge_set.
Examples
meta <- data.frame(
sample_id = c("S1", "S2"),
subject_id = c("P1", "P2")
)
samples <- create_nodes(meta, type = "Sample", id_col = "sample_id")
edges <- create_edges(
meta,
from_col = "sample_id",
to_col = "subject_id",
from_type = "Sample",
to_type = "Subject",
relation = "sample_belongs_to_subject"
)
Validation Report Object for splitGraph Graphs
Description
depgraph_validation_report is the structured return type produced by
validate_graph() and validate_depgraph().
Usage
depgraph_validation_report(
graph_name = NULL,
issues = NULL,
metrics = list(),
metadata = list(),
valid = NULL,
errors = NULL,
warnings = NULL,
advisories = NULL
)
split_spec(
sample_data = NULL,
group_var = "group_id",
block_vars = character(),
time_var = NULL,
ordering_required = FALSE,
constraint_mode = NULL,
constraint_strategy = NULL,
recommended_resampling = NULL,
metadata = list()
)
split_spec_validation(issues = NULL, metadata = list())
leakage_risk_summary(
overview = character(),
diagnostics = NULL,
validation_summary = list(),
constraint_summary = list(),
split_spec_summary = list(),
metadata = list()
)
Arguments
graph_name |
Graph label stored on the report. |
issues |
Canonical issue table. When |
metrics |
Named list of graph- and issue-level counts. |
metadata |
Named list of report metadata. |
valid |
Optional logical override for the overall validity flag. |
errors, warnings, advisories |
Optional character vectors of severity-specific messages. |
sample_data |
Sample-level mapping table carried by a
|
group_var |
Name of the grouping column. |
block_vars |
Optional blocking variable names. |
time_var |
Optional ordering column name. |
ordering_required |
Whether ordering is required for downstream evaluation. |
constraint_mode, constraint_strategy |
Constraint-derivation metadata. |
recommended_resampling |
Optional recommended resampling routine. |
overview |
Character vector of human-readable overview lines. |
diagnostics |
Diagnostics data frame for leakage risks. |
validation_summary, constraint_summary, split_spec_summary |
Named lists carrying pre-computed summaries. |
Details
The report contains:
graph_name: graph label when available
valid: whether any error-severity issues were found
issues: canonical issue table
summary: counts by level, severity, and code
metadata: report metadata
errors, warnings, advisories: backward-compatible message vectors
metrics: graph and issue counts
The canonical issue table includes the columns:
issue_id, level, severity, code, message,
node_ids, edge_ids, and details.
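To make the shape of such a report concrete, the following base-R sketch builds a toy issue table and derives a validity flag and summary counts from it. The issue IDs, level names, codes, and messages are invented for illustration, and treating valid as "no error-severity rows" is an assumption, not a statement about the package's implementation.

```r
# Toy issue table with a subset of the canonical columns; all values are made up.
issues <- data.frame(
  issue_id = c("issue_001", "issue_002", "issue_003"),
  level = c("structural", "semantic", "leakage"),
  severity = c("warning", "advisory", "warning"),
  code = c("example_code_a", "example_code_b", "example_code_a"),
  message = c("first example", "second example", "third example"),
  stringsAsFactors = FALSE
)

# Assumed rule: the report is valid when no issue has severity "error".
valid <- !any(issues$severity == "error")

# Counts by severity (with empty levels kept) and by level and code.
by_severity <- table(factor(issues$severity,
                            levels = c("error", "warning", "advisory")))
by_level_and_code <- table(issues$level, issues$code)
by_severity
```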
Value
An S3 object corresponding to the constructor that was called.
Examples
meta <- data.frame(
sample_id = c("S1", "S2"),
subject_id = c("P1", "P2")
)
g <- graph_from_metadata(meta)
report <- validate_graph(g)
report$valid
summary(report)
Derive Split Constraints from Dependency Graphs
Description
Convert dataset dependency structure into deterministic sample-level grouping constraints suitable for leakage-aware evaluation design.
Usage
derive_split_constraints(
graph,
mode = c("subject", "batch", "study", "time", "site", "region", "platform", "assay",
"relatedness", "spatial", "composite"),
samples = NULL,
strategy = c("strict", "rule_based"),
via = NULL,
priority = NULL,
include_warnings = TRUE
)
grouping_vector(x)
Arguments
graph |
A |
mode |
Constraint derivation mode. |
samples |
Optional sample identifiers or sample node IDs used to
restrict the returned |
strategy |
Composite grouping strategy. Ignored for non-composite modes. |
via |
Optional dependency sources used for composite grouping. May be
given as lower-case modes such as |
priority |
Optional priority order used for
|
include_warnings |
Whether to retain human-readable warnings in the returned metadata. |
x |
A |
Details
Constraint derivation rules:
mode = "subject"
Groups samples by the target of sample_belongs_to_subject. All samples linked to the same Subject receive the same group_id.
mode = "batch"
Groups samples by the target of sample_processed_in_batch. Samples with no batch assignment are retained as singleton unlinked groups and recorded in metadata warnings.
mode = "study"
Groups samples by the target of sample_from_study.
mode = "site"
Groups samples by the target of sample_collected_at_site. Samples with no site assignment are retained as singleton unlinked groups and recorded in metadata warnings.
mode = "region"
Groups samples by the target of sample_located_in_region (e.g. a categorical tissue or anatomical region). Samples with no region assignment are retained as singleton unlinked groups and recorded in metadata warnings.
mode = "platform"
Groups samples by the target of sample_run_on_platform (the sequencing / measurement platform or instrument). Samples with no platform assignment are retained as singleton unlinked groups and recorded in metadata warnings.
mode = "assay"
Groups samples by the target of sample_measured_by_assay (the assay / modality). Samples with no assay assignment are retained as singleton unlinked groups and recorded in metadata warnings.
mode = "relatedness"
Groups samples by transitive closure over thresholded subject_related_to edges (genetic relatedness). Samples that share a subject, or whose subjects are directly or indirectly related above threshold, land in the same connected-component group. Build the edges with relatedness_edges_from_kinship(). Samples with no subject are retained as singleton groups (recorded in metadata warnings).
mode = "spatial"
Groups samples by transitive closure over thresholded sample_adjacent_to edges (spatial proximity). Build the edges with spatial_edges_from_coords(). Isolated samples form singleton groups.
mode = "time"
Groups samples by the target of sample_collected_at_timepoint. When Timepoint nodes have time_index metadata, that value is used to derive order_rank. If time_index is unavailable, the function attempts to derive ordering from timepoint_precedes edges over the timepoint subgraph.
mode = "composite", strategy = "strict"
Projects the selected dependency relations onto a sample graph and assigns one group_id per connected component. This is the transitive-closure interpretation of composite dependency grouping.
mode = "composite", strategy = "rule_based"
Evaluates dependency assignments in deterministic priority order and groups each sample by the highest-priority available dependency source. Lower-priority available dependencies are retained in the explanation field.
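The two composite strategies can be contrasted with a small base-R sketch. Neither function is part of splitGraph. The input layouts, the component:N and unlinked:ID labels, and the union-find approach are assumptions chosen to illustrate connected-component grouping and priority-ordered grouping respectively.

```r
# Strict: samples that share any dependency, directly or through a chain,
# end up in one connected component (simple union-find over sample IDs).
connected_groups <- function(samples, dependencies) {
  parent <- setNames(samples, samples)
  find_root <- function(x) {
    while (parent[[x]] != x) x <- parent[[x]]
    x
  }
  for (dep in unique(dependencies$dependency_id)) {
    members <- dependencies$sample_id[dependencies$dependency_id == dep]
    root <- find_root(members[1])
    for (m in members[-1]) parent[[find_root(m)]] <- root
  }
  roots <- vapply(samples, find_root, character(1))
  setNames(paste0("component:", match(roots, unique(roots))), samples)
}

# Rule-based: each sample takes the first non-missing source in priority order.
rule_based_groups <- function(assignments, priority) {
  vapply(seq_len(nrow(assignments)), function(i) {
    for (src in priority) {
      value <- assignments[[src]][i]
      if (!is.na(value)) return(paste0(src, ":", value))
    }
    paste0("unlinked:", assignments$sample_id[i])
  }, character(1))
}

deps <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4", "S2", "S3", "S4"),
  dependency_id = c("subject:P1", "subject:P1", "subject:P2", "subject:P3",
                    "batch:B1", "batch:B1", "batch:B2"),
  stringsAsFactors = FALSE
)
strict <- connected_groups(c("S1", "S2", "S3", "S4"), deps)

assignments <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  subject = c("P1", NA, NA),
  batch = c("B1", "B1", NA),
  stringsAsFactors = FALSE
)
ruled <- rule_based_groups(assignments, priority = c("subject", "batch"))
```

In the strict result, S1 and S3 share a group even though they have no dependency in common, because S2 links them (same subject as S1, same batch as S3). The rule-based result never merges across sources in this way.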
The returned split_constraint$sample_map always contains
sample_id, sample_node_id, group_id,
constraint_type, group_label, and explanation.
Time-aware constraints also include time_index, timepoint_id,
and order_rank when available.
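How an order_rank might be obtained from either source of ordering information can be sketched in base R. Both helpers are hypothetical. Dense ranking of time_index and layer-by-layer ranking over precedence edges are assumptions made for illustration, not the package's documented algorithm.

```r
# Dense rank of a numeric time index; samples without an index keep NA.
order_rank_from_index <- function(time_index) {
  match(time_index, sort(unique(time_index)))
}

# Fallback sketch: rank timepoints by repeatedly peeling off those that have
# no remaining predecessor. precedes has columns from and to (from is earlier).
order_rank_from_precedence <- function(timepoints, precedes) {
  rank <- setNames(rep(NA_integer_, length(timepoints)), timepoints)
  remaining <- timepoints
  current <- 1L
  while (length(remaining) > 0) {
    blocked <- unique(precedes$to[precedes$from %in% remaining])
    ready <- setdiff(remaining, blocked)
    if (length(ready) == 0) stop("precedence edges contain a cycle")
    rank[ready] <- current
    remaining <- setdiff(remaining, ready)
    current <- current + 1L
  }
  rank
}

by_index <- order_rank_from_index(c(10, 20, 10, NA, 30))

precedes <- data.frame(from = c("T1", "T2"), to = c("T2", "T3"),
                       stringsAsFactors = FALSE)
by_edges <- order_rank_from_precedence(c("T1", "T2", "T3"), precedes)
```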
Ambiguous direct assignments are rejected. A sample cannot be assigned to multiple batches, studies, or timepoints when deriving direct split constraints.
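The ambiguity rule can be illustrated with a short base-R check over an edge table. The function assert_unambiguous(), its column names, and its error message are hypothetical.

```r
# Hypothetical check (not a splitGraph function): a sample may point at only
# one target of a given dependency type.
assert_unambiguous <- function(edges) {
  pairs <- unique(edges[, c("sample_id", "target_id")])
  counts <- table(pairs$sample_id)
  ambiguous <- names(counts)[counts > 1]
  if (length(ambiguous) > 0) {
    stop("Ambiguous assignment for sample(s): ",
         paste(ambiguous, collapse = ", "))
  }
  invisible(TRUE)
}

batch_edges <- data.frame(
  sample_id = c("S1", "S2", "S2"),
  target_id = c("batch:B1", "batch:B1", "batch:B2"),
  stringsAsFactors = FALSE
)

# S2 is linked to two batches, so the check fails with an error.
outcome <- tryCatch(assert_unambiguous(batch_edges),
                    error = function(e) conditionMessage(e))
outcome
```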
Value
derive_split_constraints() returns a split_constraint
whose sample_map contains grouping assignments and, for time-aware
constraints, ordering metadata. grouping_vector() returns a named
character vector of group_id values keyed by sample_id.
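The shape of that return value, and one way a downstream tool might use it, can be shown without the package. The sample_map below is a hand-written stand-in with only two of the documented columns, and the group labels follow the subject:P1 pattern seen in the split_spec JSON example.

```r
# Stand-in for split_constraint$sample_map (only two columns shown).
sample_map <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  group_id = c("subject:P1", "subject:P1", "subject:P2", "subject:P2"),
  stringsAsFactors = FALSE
)

# Named character vector of group IDs keyed by sample ID.
groups <- setNames(sample_map$group_id, sample_map$sample_id)

# Samples per group, e.g. to keep each group on one side of a split.
members <- split(names(groups), groups)
members
```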
Examples
meta <- data.frame(
sample_id = c("S1", "S2", "S3", "S4"),
subject_id = c("P1", "P1", "P2", "P2"),
batch_id = c("B1", "B2", "B1", "B2")
)
g <- graph_from_metadata(meta)
constraint <- derive_split_constraints(g, mode = "subject")
grouping_vector(constraint)
Build a Dependency Graph Directly from a Metadata Table
Description
One-shot convenience builder that auto-detects canonical columns in a
metadata table, creates the corresponding node and edge sets, optionally
derives timepoint ordering from time_index, and assembles a
dependency_graph. Columns that are absent or entirely missing are
silently skipped.
Usage
graph_from_metadata(
meta,
columns = NULL,
dataset_name = NULL,
graph_name = NULL,
outcome_scope = c("sample", "subject"),
time_precedence = TRUE,
validate = TRUE,
validation_overrides = list()
)
Arguments
meta |
A |
columns |
Optional named character vector passed to
|
dataset_name, graph_name |
Optional metadata labels. |
outcome_scope |
Either |
time_precedence |
If |
validate |
Forwarded to |
validation_overrides |
Forwarded to |
Value
A validated dependency_graph.
Examples
meta <- data.frame(
sample_id = c("S1", "S2", "S3", "S4"),
subject_id = c("P1", "P1", "P2", "P2"),
batch_id = c("B1", "B2", "B1", "B2"),
timepoint_id = c("T1", "T2", "T1", "T2"),
time_index = c(1, 2, 1, 2),
outcome_id = c("ctrl", "case", "ctrl", "case")
)
g <- graph_from_metadata(meta, graph_name = "demo")
g
Construct Core splitGraph S3 Objects
Description
Low-level constructors for the core S3 classes used throughout splitGraph.
Usage
graph_node_set(
data = NULL,
schema_version = .depgraph_schema_version,
source = list()
)
graph_edge_set(
data = NULL,
schema_version = .depgraph_schema_version,
source = list()
)
dependency_graph(nodes, edges, graph, metadata = list(), caches = list())
new_depgraph_nodes(
data = NULL,
schema_version = .depgraph_schema_version,
source = list()
)
new_depgraph_edges(
data = NULL,
schema_version = .depgraph_schema_version,
source = list()
)
new_depgraph(nodes, edges, graph = NULL, metadata = list(), caches = list())
graph_query_result(
query = "",
params = list(),
nodes = NULL,
edges = NULL,
table = NULL,
metadata = list()
)
dependency_constraint(
constraint_id,
relation_types,
sample_map,
transitive = TRUE,
metadata = list()
)
split_constraint(
strategy,
sample_map,
recommended_downstream_args = list(),
metadata = list()
)
Arguments
data |
A data frame matching the canonical schema for nodes or edges. |
schema_version |
Schema version string stored on the object. |
source |
Optional source metadata. |
nodes, edges |
A |
graph |
An internal |
metadata, caches, params, recommended_downstream_args |
Named lists with auxiliary metadata. |
query |
Query label stored on a |
table |
Tabular query result payload. |
constraint_id, relation_types, transitive |
Fields describing a dependency constraint. |
sample_map |
Sample-level mapping table for constraints. |
strategy |
Split strategy identifier. |
Value
An S3 object corresponding to the constructor that was called.
Examples
meta <- data.frame(
sample_id = c("S1", "S2"),
subject_id = c("P1", "P2")
)
samples <- create_nodes(meta, type = "Sample", id_col = "sample_id")
subjects <- create_nodes(meta, type = "Subject", id_col = "subject_id")
edges <- create_edges(
meta,
from_col = "sample_id",
to_col = "subject_id",
from_type = "Sample",
to_type = "Subject",
relation = "sample_belongs_to_subject"
)
nodes_set <- graph_node_set(rbind(samples$data, subjects$data))
edges_set <- graph_edge_set(edges$data)
nodes_set
edges_set
Standardize Sample Metadata
Description
Normalize user-provided metadata into the canonical column contract used by splitGraph.
Usage
ingest_metadata(data, col_map = NULL, dataset_name = NULL, strict = TRUE)
Arguments
data |
A sample-level |
col_map |
Optional named character vector mapping canonical names to user-provided columns. |
dataset_name |
Optional dataset label stored as an attribute on the returned table. |
strict |
If |
Value
A standardized data.frame with canonical identifier columns
coerced to character.
Examples
meta <- ingest_metadata(
data.frame(sample_id = c("S1", "S2"), subject_id = c("P1", "P2"))
)
Upgrade Serialized splitGraph JSON to the Current Schema Version
Description
Read a dependency_graph or split_spec JSON file written under
an older schema_version and rewrite it at the installed version. The
round-trip fills any field introduced since the file was written with its
default (NA for missing sample_data columns), stamps the
current schema_version, and adds the $schema reference. Files
already at the current version are rewritten unchanged.
Usage
migrate_dependency_graph_json(path, out = path)
migrate_split_spec_json(path, out = path)
Arguments
path |
Path to the JSON file to upgrade. |
out |
Path to write the upgraded file to. Defaults to |
Value
The output path, invisibly.
Examples
if (requireNamespace("jsonlite", quietly = TRUE)) {
meta <- data.frame(sample_id = c("S1", "S2"), subject_id = c("P1", "P2"))
g <- graph_from_metadata(meta)
tmp <- tempfile(fileext = ".json")
write_dependency_graph(g, tmp)
migrate_dependency_graph_json(tmp)
unlink(tmp)
}
Build Pairwise Leakage Edges from Continuous Similarity
Description
Helpers that turn a continuous, pairwise similarity signal into the
thresholded, undirected edges consumed by
derive_split_constraints(mode = "relatedness") and
derive_split_constraints(mode = "spatial"). Only pairs that pass the
threshold become edges; the derivation modes then form groups as connected
components over those edges (transitive closure), so two samples that are not
directly linked still land in one group when a chain of passing pairs connects them.
Usage
relatedness_edges_from_kinship(
pairs,
threshold,
id1 = "id1",
id2 = "id2",
kinship = "kinship"
)
spatial_edges_from_coords(coords, radius, id = "sample_id", coord_cols = NULL)
Arguments
pairs |
A data.frame of subject pairs with two id columns and a metric column. |
threshold |
Minimum kinship value (inclusive) for a pair to be kept. |
id1, id2 |
Column names in |
kinship |
Column name in |
coords |
A data.frame with one row per sample: a sample id column plus the numeric coordinate columns. |
radius |
Maximum distance (inclusive) for two samples to be adjacent. |
id |
Column name in |
coord_cols |
Character vector of coordinate columns in |
Details
relatedness_edges_from_kinship() keeps subject pairs whose kinship (or
relatedness) coefficient is at least threshold and emits
subject_related_to edges (Subject -> Subject).
spatial_edges_from_coords() keeps sample pairs whose Euclidean
distance over the coordinate columns is at most radius and
emits sample_adjacent_to edges (Sample -> Sample).
Both return a graph_edge_set that can be combined with the other node
and edge sets in build_dependency_graph(). The passing metric value is
carried on each edge as an attribute (kinship / distance).
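The two thresholding rules can be reproduced on plain data frames in base R. The helpers below are hypothetical and return ordinary data frames, not graph_edge_set objects. Treating every non-id column as a coordinate is an assumption made for this sketch.

```r
# Keep subject pairs whose kinship is at least the threshold (inclusive).
kinship_pairs_above <- function(pairs, threshold) {
  pairs[pairs$kinship >= threshold, , drop = FALSE]
}

# Keep sample pairs whose Euclidean distance is at most the radius (inclusive).
pairs_within_radius <- function(coords, radius, id = "sample_id") {
  coord_cols <- setdiff(names(coords), id)
  d <- as.matrix(dist(coords[, coord_cols, drop = FALSE]))
  # upper.tri() keeps each unordered pair once and drops the diagonal.
  idx <- which(d <= radius & upper.tri(d), arr.ind = TRUE)
  data.frame(
    from = coords[[id]][idx[, "row"]],
    to = coords[[id]][idx[, "col"]],
    distance = d[idx],
    stringsAsFactors = FALSE
  )
}

pairs <- data.frame(
  id1 = c("P1", "P1", "P2"),
  id2 = c("P2", "P3", "P3"),
  kinship = c(0.25, 0.02, 0.30),
  stringsAsFactors = FALSE
)
related <- kinship_pairs_above(pairs, threshold = 0.1)

coords <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  x = c(0, 1, 9),
  y = c(0, 1, 9),
  stringsAsFactors = FALSE
)
adjacent <- pairs_within_radius(coords, radius = 2)
```

With this input, the P1/P3 pair falls below the kinship threshold, yet P1, P2, and P3 would still form one connected component because P2 links the other two.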
Value
A graph_edge_set.
Examples
pairs <- data.frame(
id1 = c("P1", "P1", "P2"),
id2 = c("P2", "P3", "P3"),
kinship = c(0.25, 0.02, 0.30)
)
relatedness_edges_from_kinship(pairs, threshold = 0.1)
coords <- data.frame(
sample_id = c("S1", "S2", "S3"),
x = c(0, 1, 9),
y = c(0, 1, 9)
)
spatial_edges_from_coords(coords, radius = 2)
Query Dependency Graph Structure
Description
Query graph neighborhoods, typed nodes and edges, path structure, projected
sample dependency components, and direct shared dependencies within a
dependency_graph.
Usage
query_node_type(graph, node_types, ids = NULL)
query_edge_type(graph, edge_types, node_ids = NULL)
query_neighbors(
graph,
node_ids,
edge_types = NULL,
node_types = NULL,
direction = c("out", "in", "all")
)
query_paths(
graph,
from,
to,
edge_types = NULL,
node_types = NULL,
mode = c("out", "in", "all"),
max_length = NULL
)
query_shortest_paths(
graph,
from,
to,
edge_types = NULL,
node_types = NULL,
mode = c("out", "in", "all")
)
detect_dependency_components(
graph,
via = c("Subject", "Batch", "Study", "Timepoint", "Assay", "FeatureSet", "Outcome"),
edge_types = NULL,
min_size = 1
)
detect_shared_dependencies(
graph,
via = c("Subject", "Batch", "Study", "Timepoint"),
samples = NULL
)
Arguments
graph |
A |
node_types |
Optional node types used to filter node results or allowed path members. |
ids |
Optional node identifiers used to further restrict
|
edge_types |
Optional edge types used to filter the traversal graph or edge table. |
node_ids, from, to |
Node identifiers to use as query seeds or endpoints. |
direction, mode |
Traversal direction. |
max_length |
Maximum path length (number of edges) for
|
via |
Dependency node types used for sample-level dependency detection. |
min_size |
Minimum component size retained by
|
samples |
Optional sample identifiers or sample node IDs used to restrict direct shared-dependency detection. All requested samples must resolve successfully. |
Details
When a samples subset is supplied, partial matching is not allowed:
unknown sample identifiers raise an error rather than being silently
dropped.
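The all-or-nothing resolution rule can be sketched in base R. The helper resolve_samples() is hypothetical. Accepting either bare identifiers or sample: prefixed node IDs mirrors the argument description above, but the exact matching logic is an assumption.

```r
# Hypothetical resolver (not a splitGraph function): every requested sample
# must map to a known sample node, otherwise the call fails.
resolve_samples <- function(requested, known_node_ids) {
  node_ids <- ifelse(startsWith(requested, "sample:"),
                     requested,
                     paste0("sample:", requested))
  unknown <- requested[!(node_ids %in% known_node_ids)]
  if (length(unknown) > 0) {
    stop("Unknown sample identifiers: ", paste(unknown, collapse = ", "))
  }
  node_ids
}

known <- c("sample:S1", "sample:S2", "sample:S3")
resolved <- resolve_samples(c("S1", "sample:S2"), known)

# S9 does not exist, so nothing is returned for S1 either.
failure <- tryCatch(resolve_samples(c("S1", "S9"), known),
                    error = function(e) conditionMessage(e))
```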
Value
Each function returns a graph_query_result. Use
as.data.frame() to obtain the tidy result table.
Examples
meta <- data.frame(
sample_id = c("S1", "S2", "S3"),
subject_id = c("P1", "P1", "P2"),
batch_id = c("B1", "B2", "B1")
)
g <- graph_from_metadata(meta)
query_node_type(g, "Sample")
query_neighbors(g, "sample:S1", direction = "out")
detect_shared_dependencies(g, via = "Subject")
Validate Serialized splitGraph JSON Against the Shipped Schema
Description
Check that a JSON file written by write_dependency_graph() or
write_split_spec() conforms to the splitGraph on-disk contract. The
formal JSON Schemas (Draft 2020-12) ship in inst/schema/ and are
referenced from the written JSON via the $schema key; these functions
apply a dependency-free structural check of the same invariants (required
fields, value types, node/edge-type enumerations, and referential integrity
of edge endpoints) so a handoff file can be validated without a JSON Schema
engine.
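A dependency-free structural check of this kind can be sketched over an already parsed payload (a nested R list), so no JSON library is needed. The function check_graph_payload() is hypothetical, and the set of fields it treats as required is an assumption based on the JSON layout shown for write_dependency_graph().

```r
# Hypothetical structural check (not validate_graph_json()).
check_graph_payload <- function(payload) {
  issues <- character()
  for (field in c("splitGraph_object", "schema_version", "nodes", "edges")) {
    if (is.null(payload[[field]])) {
      issues <- c(issues, paste0("missing field: ", field))
    }
  }
  node_ids <- vapply(payload$nodes, function(n) n$node_id, character(1))
  # Referential integrity: every edge endpoint must be a declared node.
  for (edge in payload$edges) {
    for (end in c("from", "to")) {
      if (!(edge[[end]] %in% node_ids)) {
        issues <- c(issues, paste0("edge ", edge$edge_id, ": unknown ",
                                   end, " node ", edge[[end]]))
      }
    }
  }
  list(valid = length(issues) == 0, issues = issues)
}

payload <- list(
  splitGraph_object = "dependency_graph",
  schema_version = "0.2.0",
  nodes = list(
    list(node_id = "sample:S1", node_type = "Sample"),
    list(node_id = "subject:P1", node_type = "Subject")
  ),
  edges = list(
    list(edge_id = "sample_belongs_to_subject:1",
         from = "sample:S1", to = "subject:P1",
         edge_type = "sample_belongs_to_subject"),
    list(edge_id = "sample_belongs_to_subject:2",
         from = "sample:S2", to = "subject:P1",
         edge_type = "sample_belongs_to_subject")
  )
)
report <- check_graph_payload(payload)  # second edge references a missing node
```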
Usage
validate_graph_json(path)
validate_split_spec_json(path)
Arguments
path |
Path to a serialized |
Value
A splitgraph_json_report: a list with valid (logical),
issues (character vector of failures), the detected
object_type, and the schema $id.
Examples
if (requireNamespace("jsonlite", quietly = TRUE)) {
meta <- data.frame(sample_id = c("S1", "S2"), subject_id = c("P1", "P2"))
g <- graph_from_metadata(meta)
tmp <- tempfile(fileext = ".json")
write_dependency_graph(g, tmp)
validate_graph_json(tmp)
unlink(tmp)
}
Serialize a Dependency Graph to JSON
Description
Write a dependency_graph to a JSON file and read it back. The on-disk
format is intentionally simple and stable: it captures the canonical node
table, the canonical edge table (each with their list-column of
attributes), the graph metadata (including validation_overrides),
and the data-model schema_version. The internal igraph
representation is not stored; it is rebuilt on read via
dependency_graph().
Usage
write_dependency_graph(graph, path, pretty = TRUE)
read_dependency_graph(path)
Arguments
graph |
A |
path |
Path to write to or read from. |
pretty |
If |
Details
This makes split_spec/dependency_graph objects portable
across R sessions, and across language boundaries (any consumer that can
read JSON can interpret the format).
Value
write_dependency_graph() invisibly returns path.
read_dependency_graph() returns a validated
dependency_graph.
JSON format
{
"$schema": "https://.../inst/schema/dependency_graph.schema.json",
"splitGraph_object": "dependency_graph",
"schema_version": "0.2.0",
"metadata": {
"graph_name": "...",
"dataset_name": "...",
"created_at": "2026-04-29T10:11:12.000000+0000",
"schema_version": "0.2.0",
"validation_overrides": { ... }
},
"nodes": [
{ "node_id": "sample:S1", "node_type": "Sample",
"node_key": "S1", "label": "S1", "attrs": { ... } },
...
],
"edges": [
{ "edge_id": "sample_belongs_to_subject:1",
"from": "sample:S1", "to": "subject:P1",
"edge_type": "sample_belongs_to_subject", "attrs": { ... } },
...
]
}
Reading a file whose schema_version shares the installed major
version loads silently (additive-only differences); a differing major
version loads with a warning suggesting migrate_dependency_graph_json().
The written JSON also carries a $schema reference to the formal JSON
Schema shipped in inst/schema/; validate a file against it with
validate_graph_json().
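The major-version rule can be expressed as a short base-R sketch. The helper names and the returned labels are hypothetical, and the version strings in the calls are illustrative values, not a statement about which schema version any release ships.

```r
# Hypothetical helpers (not splitGraph functions).
major_version <- function(v) {
  as.integer(strsplit(v, ".", fixed = TRUE)[[1]][1])
}

schema_load_mode <- function(file_version, installed_version) {
  if (major_version(file_version) == major_version(installed_version)) {
    "load silently"       # same major version: additive-only differences
  } else {
    "load with warning"   # different major version: suggest migration
  }
}

same_major <- schema_load_mode("0.2.0", "0.3.0")
other_major <- schema_load_mode("0.2.0", "1.0.0")
```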
Examples
if (requireNamespace("jsonlite", quietly = TRUE)) {
meta <- data.frame(
sample_id = c("S1", "S2"),
subject_id = c("P1", "P2")
)
g <- graph_from_metadata(meta, graph_name = "demo")
tmp <- tempfile(fileext = ".json")
write_dependency_graph(g, tmp)
g2 <- read_dependency_graph(tmp)
identical(g$nodes$data$node_id, g2$nodes$data$node_id)
unlink(tmp)
}
Serialize a Split Specification to JSON
Description
Write a split_spec to a JSON file and read it back. The on-disk
format captures the canonical sample-level table (sample_data) plus
all spec-level fields needed by a downstream resampling adapter
(group_var, block_vars, time_var,
ordering_required, constraint_mode,
constraint_strategy, recommended_resampling) and the spec
metadata.
Usage
write_split_spec(spec, path, pretty = TRUE)
read_split_spec(path)
Arguments
spec |
A |
path |
Path to write to or read from. |
pretty |
If |
Details
NA values in sample_data are written as JSON null and
read back as NA.
Value
write_split_spec() invisibly returns path.
read_split_spec() returns a split_spec.
JSON format
{
"$schema": "https://.../inst/schema/split_spec.schema.json",
"splitGraph_object": "split_spec",
"schema_version": "0.2.0",
"group_var": "group_id",
"block_vars": ["batch_group", "study_group"],
"time_var": "order_rank",
"ordering_required": false,
"constraint_mode": "subject",
"constraint_strategy": "subject",
"recommended_resampling": "grouped_cv",
"metadata": { ... },
"sample_data": [
{ "sample_id": "S1", "group_id": "subject:P1", ... },
...
]
}
Examples
if (requireNamespace("jsonlite", quietly = TRUE)) {
meta <- data.frame(
sample_id = c("S1", "S2"),
subject_id = c("P1", "P2")
)
g <- graph_from_metadata(meta)
constraint <- derive_split_constraints(g, mode = "subject")
spec <- as_split_spec(constraint, graph = g)
tmp <- tempfile(fileext = ".json")
write_split_spec(spec, tmp)
spec2 <- read_split_spec(tmp)
identical(spec$sample_data$group_id, spec2$sample_data$group_id)
unlink(tmp)
}