datadiff compares two datasets — a reference and a candidate — using validation rules defined in a YAML file. It is built on top of pointblank and supports exact matching, tolerance-based numeric comparisons, text normalization, and row count validation.

The typical workflow is: generate a rules template with write_rules_template(), tune it, run compare_datasets_from_yaml(), and inspect the result.

```r
library(datadiff)

ref <- data.frame(
  id       = 1:4,
  revenue  = c(1000.00, 2000.00, 3000.00, 4000.00),
  category = c("A", "B", "C", "D"),
  active   = c(TRUE, TRUE, FALSE, TRUE)
)
cand <- data.frame(
  id       = 1:4,
  revenue  = c(1000.005, 2000.001, 3000.009, 4000.00),  # tiny differences
  category = c("a", "b", "c", "D"),                     # lowercase
  active   = c(TRUE, TRUE, FALSE, TRUE)
)
```

Generate a rules template and tune it:
```r
rules_path <- tempfile(fileext = ".yaml")
write_rules_template(
  ref,
  key = "id",
  path = rules_path,
  numeric_abs = 0.01,                # accept differences up to 0.01
  character_case_insensitive = TRUE  # ignore case for all char columns
)
```

Run the comparison:
```r
result <- compare_datasets_from_yaml(ref, cand, key = "id", path = rules_path)
result$all_passed
#> [1] TRUE
```

compare_datasets_from_yaml() returns a list with six elements:

```r
names(result)
#> [1] "all_passed"           "agent"                "reponse"
#> [4] "missing_in_candidate" "extra_in_candidate"   "applied_rules"
```

| Element | Description |
|---|---|
| `all_passed` | `TRUE` if every validation step passed |
| `agent` | The configured pointblank agent (before interrogation) |
| `reponse` | The interrogated pointblank agent (full results) |
| `missing_in_candidate` | Columns present in reference but absent from candidate |
| `extra_in_candidate` | Columns present in candidate but absent from reference |
| `applied_rules` | The effective per-column rules that were applied |

applied_rules shows the exact rules used for each column — useful to verify that by_name overrides were applied correctly.
When all_passed is FALSE, use
pointblank::get_sundered_data() to extract the rows that
failed at least one validation step:
```r
ref_fail  <- data.frame(id = 1:5, value = c(1, 2, 3, 4, 5))
cand_fail <- data.frame(id = 1:5, value = c(1, 2, 99, 4, 99))  # rows 3 and 5 wrong

result_fail <- compare_datasets_from_yaml(ref_fail, cand_fail, key = "id")
result_fail$all_passed
#> [1] FALSE

# Rows that failed at least one step
failed_rows <- pointblank::get_sundered_data(result_fail$reponse, type = "fail")
failed_rows
#>   id value value__reference value__absdiff value__thresh value__ok row_count_ok
#> 1  3    99                3             96         1e-09     FALSE         TRUE
#> 2  5    99                5             94         1e-09     FALSE         TRUE
```

The type = "pass" variant returns rows that passed all steps. This is useful to understand the scope of the problem before investigating further.
For large datasets, extracting all failing rows can consume significant memory. Three mutually exclusive parameters cap this:
```r
# Keep only the first 100 failing rows per validation step
result <- compare_datasets_from_yaml(ref, cand, key = "id",
                                     get_first_n = 100)

# Random sample of 50 failing rows per step
result <- compare_datasets_from_yaml(ref, cand, key = "id",
                                     sample_n = 50)

# 10% of failing rows, capped at 500
result <- compare_datasets_from_yaml(ref, cand, key = "id",
                                     sample_frac = 0.1, sample_limit = 500)

# Disable extraction entirely (fastest — only pass/fail counts are kept)
result <- compare_datasets_from_yaml(ref, cand, key = "id",
                                     extract_failed = FALSE)
```

When path = NULL (the default), datadiff auto-generates rules from the reference dataset structure. This is useful for a quick sanity check without any configuration:

```r
ref_quick  <- data.frame(id = 1:3, x = c(1.0, 2.0, 3.0), label = c("A", "B", "C"))
cand_quick <- data.frame(id = 1:3, x = c(1.0, 2.0, 3.0), label = c("A", "B", "C"))

# No path needed — rules are generated on the fly
result_quick <- compare_datasets_from_yaml(ref_quick, cand_quick, key = "id")
result_quick$all_passed
#> [1] TRUE
```

The auto-generated rules use near-exact numeric tolerance (abs = 1e-9) and exact character matching — equivalent to calling write_rules_template() with all defaults.
write_rules_template() generates a fully annotated YAML file. Here is a complete example with all sections explained:

```yaml
version: 1
defaults:
  na_equal: yes        # treat NA == NA as a pass
  ignore_columns:      # columns excluded from comparison entirely
    - documentation
    - updated_at
  keys: id             # join key (single or composite)
  label: ref vs cand   # label shown in the pointblank report
row_validation:
  check_count: yes
  expected_count: ~    # null = use reference row count
  tolerance: 0         # exact match required
by_type:               # rules applied to all columns of a given type
  numeric:
    abs: 1.0e-09       # near-exact by default
    rel: 0
  integer:
    abs: 0             # integers must match exactly
  character:
    equal_mode: exact
    case_insensitive: no
    trim: no
  date:
    equal_mode: exact
  datetime:
    equal_mode: exact
  logical:
    equal_mode: exact
by_name:               # column-specific overrides (take precedence over by_type)
  id: []               # no override — inherits integer rule
  revenue:
    abs: 0.01          # accept differences up to 0.01
  category:
    case_insensitive: yes
    trim: yes
```

Rules are merged: by_name entries extend or override by_type entries. A field not listed in by_name keeps its by_type default.
| Column | Effective rule | Source |
|---|---|---|
| `id` | `abs: 0` | `by_type.integer` |
| `revenue` | `abs: 0.01, rel: 0` | `by_name` overrides `by_type.numeric` |
| `category` | `case_insensitive: yes, trim: yes` | `by_name` overrides `by_type.character` |
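Conceptually, the merge behaves like utils::modifyList(): fields present in by_name replace the matching by_type fields, and everything else is inherited. A minimal sketch (illustrative, not the package's internal code):

```r
# Conceptual sketch of the rule merge for the category column
by_type_character <- list(equal_mode = "exact", case_insensitive = FALSE, trim = FALSE)
by_name_category  <- list(case_insensitive = TRUE, trim = TRUE)

utils::modifyList(by_type_character, by_name_category)
#> $equal_mode
#> [1] "exact"
#>
#> $case_insensitive
#> [1] TRUE
#>
#> $trim
#> [1] TRUE
```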
Use read_rules() to inspect what was actually loaded —
useful for debugging or building tooling on top of
datadiff:
```r
loaded <- read_rules(rules_path)
loaded$defaults$na_equal
#> [1] TRUE

loaded$by_type$numeric
#> $abs
#> [1] 0.01
#>
#> $rel
#> [1] 0

loaded$by_type$character
#> $equal_mode
#> [1] "exact"
#>
#> $case_insensitive
#> [1] TRUE
#>
#> $trim
#> [1] FALSE
```

## by_name example

The following dataset mixes several column types, each requiring a different validation strategy:
```r
ref_full <- data.frame(
  id          = 1:4,
  price       = c(9.99, 19.99, 4.50, 149.00),  # numeric: small absolute tolerance
  quantity    = c(10L, 5L, 20L, 1L),           # integer: exact
  description = c("Widget A", "Widget B", " Gadget", "TOOL"),  # needs trim + case
  in_stock    = c(TRUE, TRUE, FALSE, TRUE),    # logical: exact
  created     = as.Date(c("2024-01-01", "2024-01-02",
                          "2024-01-03", "2024-01-04"))
)
cand_full <- data.frame(
  id          = 1:4,
  price       = c(9.995, 19.99, 4.50, 149.00),  # row 1: diff = 0.005 < 0.01
  quantity    = c(10L, 5L, 20L, 1L),
  description = c("widget a", "Widget B", "Gadget", "tool"),  # case + spaces
  in_stock    = c(TRUE, TRUE, FALSE, TRUE),
  created     = as.Date(c("2024-01-01", "2024-01-02",
                          "2024-01-03", "2024-01-04"))
)
```

Build the YAML and write column-specific overrides:
```r
rules_full <- tempfile(fileext = ".yaml")
write_rules_template(
  ref_full,
  key = "id",
  path = rules_full,
  numeric_abs = 1e-9,                  # conservative default
  character_case_insensitive = FALSE,  # strict default for character
  character_trim = FALSE
)

# Read, patch by_name, write back
rules_obj <- read_rules(rules_full)
rules_obj$by_name$price <- list(abs = 0.01)  # ±0.01 for price
rules_obj$by_name$description <- list(case_insensitive = TRUE, trim = TRUE)
yaml::write_yaml(rules_obj, rules_full)
```

```r
result_full <- compare_datasets_from_yaml(ref_full, cand_full,
                                          key = "id", path = rules_full)
result_full$all_passed
#> [1] TRUE

# Verify the effective rules for each column
result_full$applied_rules$price
#> $abs
#> [1] 0.01
#>
#> $rel
#> [1] 0

result_full$applied_rules$description
#> $equal_mode
#> [1] "exact"
#>
#> $case_insensitive
#> [1] TRUE
#>
#> $trim
#> [1] TRUE

result_full$applied_rules$quantity
#> $abs
#> [1] 0
```

For every numeric column, the comparison uses a single combined threshold:

```
threshold = abs + rel × |reference_value|
PASS if |candidate − reference| ≤ threshold
```
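Written out in plain R, the check looks like this — an illustrative sketch of the formula, not the package's internal code:

```r
# Illustrative sketch of the combined tolerance check
passes_tolerance <- function(candidate, reference, abs_tol = 0, rel_tol = 0) {
  threshold <- abs_tol + rel_tol * abs(reference)
  abs(candidate - reference) <= threshold
}

passes_tolerance(1000.005, 1000, abs_tol = 0.01)  # diff 0.005 <= 0.01 -> TRUE
passes_tolerance(1009, 1000, rel_tol = 0.01)      # diff 9 <= 10       -> TRUE
passes_tolerance(1011, 1000, rel_tol = 0.01)      # diff 11 > 10       -> FALSE
```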
## Absolute tolerance (abs)

The threshold is constant, independent of the magnitude of the values:

```r
ref_num  <- data.frame(id = 1:3, price = c(1.00, 1000.00, 1e6))
cand_ok  <- data.frame(id = 1:3, price = c(1.005, 1000.005, 1e6 + 0.005))
cand_nok <- data.frame(id = 1:3, price = c(1.02, 1000.02, 1e6 + 0.02))

rules_abs <- tempfile(fileext = ".yaml")
write_rules_template(ref_num, key = "id", path = rules_abs, numeric_abs = 0.01)

compare_datasets_from_yaml(ref_num, cand_ok, key = "id", path = rules_abs)$all_passed
#> [1] TRUE
compare_datasets_from_yaml(ref_num, cand_nok, key = "id", path = rules_abs)$all_passed
#> [1] FALSE
```

The same threshold of 0.01 applies whether the value is 1 or 1 000 000.
## Relative tolerance (rel)

The threshold is proportional to the reference value — useful when you want to accept a percentage deviation:

```r
rules_rel <- tempfile(fileext = ".yaml")
write_rules_template(ref_num, key = "id", path = rules_rel,
                     numeric_abs = 0, numeric_rel = 0.01)

# ref = 1000, diff = 9, threshold = 0.01 × 1000 = 10 → PASS
cand_pct <- data.frame(id = 1:3, price = c(1.009, 1009.0, 1e6 * 1.009))
compare_datasets_from_yaml(ref_num, cand_pct, key = "id", path = rules_rel)$all_passed
#> [1] TRUE
```

> **Warning:** if a reference value is 0, the relative threshold is 0 and any difference will be flagged as an error. Use `abs` as a safety floor.
Combine both parameters when values span a wide range including near-zero:

```yaml
by_type:
  numeric:
    abs: 0.001   # floor: protects against false positives when ref ≈ 0
    rel: 0.005   # +0.5% for larger values
```

For ref = 1 000 000:

```
threshold = 0.001 + 0.005 × 1 000 000 = 5000.001
```

> **Rule of thumb:** keep `rel: 0` (the default) unless you explicitly need a tolerance proportional to the magnitude of the data.
Floating-point subtraction can introduce rounding errors. datadiff automatically adds a correction of `8 × .Machine$double.eps × |ref|` to the threshold to absorb these representation errors without meaningfully widening the user-specified tolerance.
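A quick base-R illustration of why such a correction is needed:

```r
# Decimal fractions are not exactly representable in binary floating point,
# so a subtraction that "should" equal the tolerance can land just outside it
(1000.1 - 1000) == 0.1
#> [1] FALSE

# The discrepancy is a few machine epsilons, which the correction absorbs
abs((1000.1 - 1000) - 0.1) < 8 * .Machine$double.eps * 1000
#> [1] TRUE
```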
## warn_at and stop_at

These two parameters control the pointblank action thresholds, expressed as the fraction of rows that fail a validation step:

```r
result <- compare_datasets_from_yaml(
  ref, cand,
  key = "id",
  warn_at = 0.05,  # warn if > 5% of rows fail any step
  stop_at = 0.20   # stop (error) if > 20% of rows fail any step
)
```

The default (1e-14) means that any single failing row triggers the threshold, which is appropriate for data validation where zero differences are expected. Raise these values if you want the report to remain green while a small fraction of rows diverge.
Three independent options control character column comparison:

| Option | Effect |
|---|---|
| `case_insensitive: yes` | Convert both values to lowercase before comparing |
| `trim: yes` | Strip leading/trailing whitespace before comparing |
| `equal_mode: normalized` | Apply both transformations |
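In base-R terms, the two transformations amount to trimws() and tolower(). A conceptual sketch (not the package's internal code):

```r
# Conceptual sketch of the normalization applied before comparing
normalize_text <- function(x, case_insensitive = FALSE, trim = FALSE) {
  if (trim) x <- trimws(x)               # trim: yes
  if (case_insensitive) x <- tolower(x)  # case_insensitive: yes
  x
}

normalize_text(" Hello ", case_insensitive = TRUE, trim = TRUE)
#> [1] "hello"
```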
```r
ref_txt  <- data.frame(id = 1:4, label = c("Hello", "World", "Foo", "Bar"))
cand_txt <- data.frame(
  id = 1:4,
  label = c("hello", " World ", "FOO", "Baz")  # case, spaces, mismatch
)

# Strict: rows 1, 2, 3 fail
rules_strict <- tempfile(fileext = ".yaml")
write_rules_template(ref_txt, key = "id", path = rules_strict)
compare_datasets_from_yaml(ref_txt, cand_txt, key = "id",
                           path = rules_strict)$all_passed
#> [1] FALSE

# Relaxed: case + trim — only row 4 ("Baz" vs "Bar") fails
rules_relax <- tempfile(fileext = ".yaml")
write_rules_template(ref_txt, key = "id", path = rules_relax,
                     character_case_insensitive = TRUE,
                     character_trim = TRUE)
compare_datasets_from_yaml(ref_txt, cand_txt, key = "id",
                           path = rules_relax)$all_passed
#> [1] FALSE
```

Column-level overrides in by_name apply only to the specified column, leaving all other character columns unaffected.
The row_validation section checks that the candidate has the expected number of rows:

```r
ref_rows  <- data.frame(id = 1:5, value = 1:5)
cand_ok   <- data.frame(id = 1:5, value = 1:5)  # 5 rows — exact match
cand_more <- data.frame(id = 1:7, value = 1:7)  # 7 rows — 2 extra

rules_count <- tempfile(fileext = ".yaml")
write_rules_template(ref_rows, key = "id", path = rules_count,
                     check_count_default = TRUE,
                     expected_count_default = 5,
                     row_count_tolerance_default = 0)

compare_datasets_from_yaml(ref_rows, cand_ok, key = "id",
                           path = rules_count)$all_passed
#> [1] TRUE
compare_datasets_from_yaml(ref_rows, cand_more, key = "id",
                           path = rules_count)$all_passed
#> [1] FALSE
```

With a tolerance:
```r
rules_tol <- tempfile(fileext = ".yaml")
write_rules_template(ref_rows, key = "id", path = rules_tol,
                     check_count_default = TRUE,
                     expected_count_default = 5,
                     row_count_tolerance_default = 3)  # accept 5 ± 3

# 7 rows: |7 - 5| = 2 ≤ 3 → PASS
compare_datasets_from_yaml(ref_rows, cand_more, key = "id",
                           path = rules_tol)$all_passed
#> [1] TRUE
```

When expected_count is null in the YAML (or expected_count_default = NULL in write_rules_template()), the reference row count is used as the target.
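For example, reusing the datasets above, leaving expected_count_default unset targets the reference's 5 rows (a sketch built from the parameters already shown):

```r
# expected_count_default = NULL: the reference row count becomes the target
rules_null <- tempfile(fileext = ".yaml")
write_rules_template(ref_rows, key = "id", path = rules_null,
                     check_count_default = TRUE,
                     expected_count_default = NULL,
                     row_count_tolerance_default = 0)

# ref_rows has 5 rows, so the candidate must also have 5
compare_datasets_from_yaml(ref_rows, cand_ok, key = "id",
                           path = rules_null)$all_passed
```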
The na_equal setting controls whether NA == NA is treated as a pass:

```r
ref_na  <- data.frame(id = 1:3, value = c(1.0, NA, 3.0))
cand_na <- data.frame(id = 1:3, value = c(1.0, NA, 3.0))  # identical NAs

# na_equal: yes (default) — NA == NA passes
rules_na_yes <- tempfile(fileext = ".yaml")
write_rules_template(ref_na, key = "id", path = rules_na_yes,
                     na_equal_default = TRUE)
compare_datasets_from_yaml(ref_na, cand_na, key = "id",
                           path = rules_na_yes)$all_passed
#> [1] TRUE

# na_equal: no — NA == NA fails
rules_na_no <- tempfile(fileext = ".yaml")
write_rules_template(ref_na, key = "id", path = rules_na_no,
                     na_equal_default = FALSE)
compare_datasets_from_yaml(ref_na, cand_na, key = "id",
                           path = rules_na_no)$all_passed
#> [1] FALSE
```

na_equal applies to all column types, including numeric (with tolerance), character, logical, and date columns.
Columns listed in ignore_columns_default are excluded from comparison. Presence/absence checks for those columns are also skipped:

```r
ref_ign  <- data.frame(id = 1:3, value = 1:3, updated_at = Sys.time())
cand_ign <- data.frame(id = 1:3, value = 1:3,
                       updated_at = Sys.time() + 3600)  # different timestamp

rules_ign <- tempfile(fileext = ".yaml")
write_rules_template(ref_ign, key = "id", path = rules_ign,
                     ignore_columns_default = "updated_at")
compare_datasets_from_yaml(ref_ign, cand_ign, key = "id",
                           path = rules_ign)$all_passed
#> [1] TRUE
```

Columns present in the reference but absent from the candidate generate a dedicated failing step. Extra columns in the candidate are reported but do not cause a failure:
```r
ref_cols  <- data.frame(id = 1:2, a = 1:2, b = 1:2)
cand_cols <- data.frame(id = 1:2, a = 1:2, c = 1:2)  # b missing, c extra

result_cols <- compare_datasets_from_yaml(ref_cols, cand_cols, key = "id")
result_cols$missing_in_candidate
#> [1] "b"
result_cols$extra_in_candidate
#> [1] "c"
result_cols$all_passed  # FALSE: b is missing
#> [1] FALSE
```

## analyze_columns()

analyze_columns() exposes the column comparison logic independently — useful for pre-flight checks before running the full validation:

```r
analysis <- analyze_columns(ref_cols, cand_cols,
                            ignore_columns = character(0))
str(analysis)
#> List of 6
#>  $ cols_reference      : chr [1:3] "id" "a" "b"
#>  $ cols_candidate      : chr [1:3] "id" "a" "c"
#>  $ missing_in_candidate: chr "b"
#>  $ extra_in_candidate  : chr "c"
#>  $ common_cols         : chr [1:2] "id" "a"
#>  $ ignored_cols        : chr(0)
```

A key column joins the candidate to the reference, handling different row orders and unequal row counts gracefully. Without a key, rows are compared position by position, and both datasets must have the same number of rows.
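A minimal sketch of the positional mode, assuming it is selected simply by omitting key (that mechanism is an assumption, not shown elsewhere in this document):

```r
# Hypothetical sketch — assumes positional mode applies when `key` is omitted
ref_pos  <- data.frame(value = c(10, 20, 30))
cand_pos <- data.frame(value = c(10, 20, 30))
compare_datasets_from_yaml(ref_pos, cand_pos)$all_passed
```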
Multiple columns can form a composite key:

```r
ref_comp <- data.frame(
  year  = c(2023, 2023, 2024),
  month = c(1, 2, 1),
  value = c(100, 200, 300)
)
cand_comp <- data.frame(
  year  = c(2024, 2023, 2023),
  month = c(1, 2, 1),
  value = c(300, 200, 100)
)

result_comp <- compare_datasets_from_yaml(ref_comp, cand_comp,
                                          key = c("year", "month"))
result_comp$all_passed
#> [1] TRUE
```

The key parameter to compare_datasets_from_yaml() takes precedence over the keys field in the YAML defaults section. This lets you reuse a shared YAML file while overriding the join key programmatically:
```r
rules_key <- tempfile(fileext = ".yaml")
write_rules_template(ref_comp, key = "year", path = rules_key)  # YAML says year

# Override at call time with the composite key
result_override <- compare_datasets_from_yaml(
  ref_comp, cand_comp,
  key = c("year", "month"),  # overrides YAML
  path = rules_key
)
result_override$all_passed
```

If key values are not unique, datadiff warns before running the comparison:

```r
ref_dup  <- data.frame(id = c(1, 1, 2), value = c(10, 11, 20))
cand_dup <- data.frame(id = c(1, 2), value = c(10, 20))
tryCatch(
  compare_datasets_from_yaml(ref_dup, cand_dup, key = "id"),
  warning = function(w) message("Warning: ", conditionMessage(w))
)
#> Warning: Duplicate keys detected! The key column(s) [id] must be unique in both datasets.
#> - data_reference: 1 duplicate key value(s) affecting 2 rows (examples: id = 1)
#> Comparison results will be unreliable: the join will produce multiple rows per key, leading to incorrect or non-deterministic validation results.
#> Please ensure your key column(s) uniquely identify each row, or choose different key column(s).
```

When a column has incompatible types in the reference and candidate, datadiff warns and adds a dedicated failing step — instead of silently coercing or crashing:
```r
ref_type  <- data.frame(id = 1:2, year = c(2023L, 2024L))    # integer
cand_type <- data.frame(id = 1:2, year = c("2023", "2024"))  # character
tryCatch(
  compare_datasets_from_yaml(ref_type, cand_type, key = "id"),
  warning = function(w) message("Warning: ", conditionMessage(w))
)
#> Warning: Type mismatch detected in 1 column(s): 'year' (reference: integer, candidate: character). Each will be reported as a validation error.
```

integer and numeric are treated as compatible types — tolerance arithmetic works correctly across them and no mismatch is raised.
## detect_column_types()

Returns the datadiff type inferred for each column ("integer", "numeric", "character", "date", "datetime", "logical"):

```r
df_types <- data.frame(
  id        = 1L,
  amount    = 1.5,
  label     = "x",
  flag      = TRUE,
  day       = Sys.Date(),
  timestamp = Sys.time()
)
detect_column_types(df_types)
#>          id      amount       label        flag         day   timestamp
#>   "integer"   "numeric" "character"   "logical"      "date"  "datetime"
```

These are the same types used to match columns against by_type rules in the YAML.
## derive_column_rules()

Shows the merged per-column rules for a given dataset and rules object — equivalent to result$applied_rules but callable without running the full comparison:
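A plausible call, reusing the objects from the quick-start example; the (data, rules) argument order is an assumption, not documented above:

```r
# Hypothetical usage — the (data, rules) argument order is an assumption
rules_obj <- read_rules(rules_path)
derive_column_rules(ref, rules_obj)
```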
## analyze_columns()

Already shown above. Useful to quickly check which columns are common, missing, or extra before committing to a full validation run.
## preprocess_dataframe()

Applies text normalization rules to a dataframe. Useful for inspecting what the data looks like after normalization, before comparing:
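A plausible call, reusing the text-normalization objects from earlier; the (data, rules) argument order is an assumption, not documented above:

```r
# Hypothetical usage — the (data, rules) argument order is an assumption
preprocess_dataframe(cand_txt, read_rules(rules_relax))
```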
## add_tolerance_columns()

Adds the __absdiff, __thresh, and __ok diagnostic columns to a comparison dataframe. Useful for debugging which rows are right on the edge of the tolerance threshold:

```r
cmp <- data.frame(
  value            = c(1.005, 1.02, 1.0),
  value__reference = c(1.000, 1.00, 1.0)
)
rules_debug <- list(value = list(abs = 0.01, rel = 0))
cmp_annotated <- add_tolerance_columns(cmp, "value", rules_debug,
                                       ref_suffix = "__reference",
                                       na_equal = TRUE)
cmp_annotated[, c("value__absdiff", "value__thresh", "value__ok")]
#>   value__absdiff value__thresh value__ok
#> 1          0.005          0.01      TRUE
#> 2          0.020          0.01     FALSE
#> 3          0.000          0.01      TRUE
```

By default, pointblank reports are rendered in English. You can change the language per call or globally for a session.
Set once in your script or .Rprofile and all subsequent calls will use it:

```r
options(datadiff.lang = "fr",
        datadiff.locale = "fr_FR")

# All calls now produce French reports without passing lang/locale every time
result <- compare_datasets_from_yaml(ref, cand, key = "id", path = rules_path)
```

Supported languages include "en", "fr", "de", "it", "es", "pt", "zh", "ja", and "ru". See the pointblank documentation for the full list.
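For a single call, pass the settings directly — the lang/locale parameter names here are taken from the option names above and are otherwise an assumption:

```r
# Per-call override of the report language and locale (parameter names assumed)
result_fr <- compare_datasets_from_yaml(ref, cand, key = "id", path = rules_path,
                                        lang = "fr", locale = "fr_FR")
```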
Any SQL-backed table wrapped in dplyr::tbl() can be passed directly. The join, normalization, and boolean expressions are pushed down to SQL — no data is loaded into R until the final slim result table:

```r
library(DBI)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "reference", ref)
DBI::dbWriteTable(con, "candidate", cand)

tbl_ref  <- dplyr::tbl(con, "reference")
tbl_cand <- dplyr::tbl(con, "candidate")

result_lazy <- compare_datasets_from_yaml(
  tbl_ref, tbl_cand,
  key = "id",
  path = rules_path
)
result_lazy$all_passed

DBI::dbDisconnect(con)
```

This works with any DBI-compatible backend: SQLite, PostgreSQL, Snowflake, etc.
For files too large to fit in RAM, pass an arrow::open_dataset() directly — the package handles the Arrow → DuckDB conversion internally with a single private connection:

```r
library(arrow)

ds_ref  <- arrow::open_dataset("path/to/reference/")
ds_cand <- arrow::open_dataset("path/to/candidate/")

# Generate a template from the schema (no data loaded into RAM)
write_rules_template(ds_ref, key = "id", path = "rules.yaml")

result <- compare_datasets_from_yaml(
  data_reference = ds_ref,
  data_candidate = ds_cand,
  key = "id",
  path = "rules.yaml",
  duckdb_memory_limit = "8GB"  # tune to your machine's RAM
)
result$all_passed
```

> **Warning:** do not call `arrow::to_duckdb()` yourself before passing tables to datadiff. The package opens its own private DuckDB connection; passing pre-converted tables from a different connection will cause a cross-connection join error.
| Machine RAM | Recommended duckdb_memory_limit |
|---|---|
| 8 GB | "3GB" |
| 16 GB | "6GB" |
| 32 GB | "8GB" (default) |
| 64 GB+ | "20GB" |
| Function | Role |
|---|---|
| `write_rules_template()` | Generate a YAML rules template from a reference dataset |
| `read_rules()` | Load and validate a YAML rules file |
| `compare_datasets_from_yaml()` | Compare reference and candidate datasets |
| `detect_column_types()` | Inspect the type inferred for each column |
| `derive_column_rules()` | See the merged per-column rules for a dataset + rules pair |
| `analyze_columns()` | Compare column structure between two datasets |
| `preprocess_dataframe()` | Apply text normalization rules to a dataframe |
| `add_tolerance_columns()` | Add `__absdiff`, `__thresh`, `__ok` columns for debugging |