Diagnostic Functions Guide

library(tidyaudit)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

tidyaudit includes tidyverse ports of the diagnostic functions from dtaudit. These functions help you understand joins, validate keys, compare tables, diagnose missing values and string quality, and filter with full visibility.

Join diagnostics

validate_join() analyzes a potential join without performing it, reporting match rates, relationship type, duplicate keys, and unmatched rows.

orders <- data.frame(
  id     = c(1L, 2L, 3L, 3L, 4L, 5L),
  amount = c(100, 200, 150, 175, 300, 50)
)
customers <- data.frame(
  id   = c(2L, 3L, 6L),
  name = c("Alice", "Bob", "Carol")
)

validate_join(orders, customers, by = "id")
#> 
#> ── Join Validation: orders ↔ customers ─────────────────────────────────────────
#> Keys in orders: id
#> Keys in customers: id
#> 
#>   Item                                               Value
#>   ───────────────────────────────────────────  ───────────
#>   Relationship                                 many-to-one
#>   Key(s) in orders   [id]                          (1 col)
#>   Key(s) in customers   [id]                       (1 col)
#>   Rows in orders                                         6
#>   Distinct key combos in orders                          5
#>   Rows in customers                                      3
#>   Distinct key combos in customers                       3
#>   Overlapping distinct key combos                        2
#>   Matched row pairs (cartesian)                          3
#>   Match rate from orders                            50.00%
#>   Match rate from customers                         66.67%
#>   Rows only in orders (no match in customers)            3
#>   Rows only in customers (no match in orders)            1
#> 
#> Duplicates: orders=yes customers=no

Different key names

When the key columns have different names, use a named vector:

products <- data.frame(prod_id = 1:3, price = c(10, 20, 30))
sales    <- data.frame(item_id = c(1L, 1L, 2L), qty = c(5, 3, 7))

validate_join(products, sales, by = c("prod_id" = "item_id"))
#> 
#> ── Join Validation: products ↔ sales ───────────────────────────────────────────
#> Keys in products: prod_id
#> Keys in sales: item_id
#> 
#>   Item                                             Value
#>   ─────────────────────────────────────────  ───────────
#>   Relationship                               one-to-many
#>   Key(s) in products   [prod_id]                 (1 col)
#>   Key(s) in sales   [item_id]                    (1 col)
#>   Rows in products                                     3
#>   Distinct key combos in products                      3
#>   Rows in sales                                        3
#>   Distinct key combos in sales                         2
#>   Overlapping distinct key combos                      2
#>   Matched row pairs (cartesian)                        3
#>   Match rate from products                        66.67%
#>   Match rate from sales                          100.00%
#>   Rows only in products (no match in sales)            1
#>   Rows only in sales (no match in products)            0
#> 
#> Duplicates: products=no sales=yes

Stat tracking

Track the impact on a numeric column with stat (same column name in both tables) or stat_x/stat_y (different column names):

x <- data.frame(id = 1:4, revenue = c(100, 200, 300, 400))
y <- data.frame(id = c(2L, 3L, 5L), cost = c(10, 20, 30))

validate_join(x, y, by = "id", stat_x = "revenue", stat_y = "cost")
#> 
#> ── Join Validation: x ↔ y ──────────────────────────────────────────────────────
#> Keys in x: id
#> Keys in y: id
#> 
#>   Item                                  Value
#>   ───────────────────────────────  ──────────
#>   Relationship                     one-to-one
#>   Key(s) in x   [id]                  (1 col)
#>   Key(s) in y   [id]                  (1 col)
#>   Rows in x                                 4
#>   Distinct key combos in x                  4
#>   Rows in y                                 3
#>   Distinct key combos in y                  3
#>   Overlapping distinct key combos           2
#>   Matched row pairs (cartesian)             2
#>   Match rate from x                    50.00%
#>   Match rate from y                    66.67%
#>   Rows only in x (no match in y)            2
#>   Rows only in y (no match in x)            1
#> 
#> ── Stat diagnostics ────────────────────────────────────────────────────────────
#> 
#> revenue in x:
#> • Total: 1,000
#> • Matched: 500 (50.00%)
#> • Unmatched: 500 (50.00%)
#> 
#> cost in y:
#> • Total: 60
#> • Matched: 30 (50.00%)
#> • Unmatched: 30 (50.00%)
#> 
#> Duplicates: x=no y=no

Key validation

Primary keys

validate_primary_keys() tests whether a set of columns uniquely identifies every row:

df <- data.frame(
  id    = c(1L, 2L, 3L, 3L, 4L),
  group = c("A", "A", "B", "C", "A"),
  value = c(10, 20, 30, 40, 50)
)

# Single column — not unique
validate_primary_keys(df, "id")
#> 
#> ── Primary Key Validation ──────────────────────────────────────────────────────
#> Table: df
#> Key column: id
#> 
#>   Metric                   Value
#>   ───────────────────────  ─────
#>   Total rows                   5
#>   Unique key combinations      4
#>   Duplicate key combos         1
#> 
#> ✖ NO - Keys do NOT uniquely identify all rows.
#> 
#> Duplicate keys (showing up to 10):
#>   id n
#> 1  3 2

# Composite key — unique
validate_primary_keys(df, c("id", "group"))
#> 
#> ── Primary Key Validation ──────────────────────────────────────────────────────
#> Table: df
#> Key columns: id and group
#> 
#>   Metric                   Value
#>   ───────────────────────  ─────
#>   Total rows                   5
#>   Unique key combinations      5
#>   Duplicate key combos         0
#> 
#> ✔ YES - Keys uniquely identify all rows.

Variable relationships

validate_var_relationship() determines the relationship between two columns:

df2 <- data.frame(
  dept    = c("Sales", "Sales", "Engineering", "Engineering"),
  manager = c("Ann", "Ann", "Bob", "Bob")
)
validate_var_relationship(df2, "dept", "manager")
#> 
#> ── Variable Relationship Validation ────────────────────────────────────────────
#> Table: df2
#> Variables: dept ↔ manager
#> 
#>   Metric                        Value
#>   ────────────────────────────  ─────
#>   Unique values in dept             2
#>   Unique values in manager          2
#>   Unique (dept, manager) pairs      2
#> 
#> dept → manager: one-to-one
#> manager → dept: one-to-one
#> 
#> Relationship: ONE-TO-ONE

Table comparison

compare_tables() compares two data.frames by examining columns, row counts, key overlap, and numeric discrepancies:

before <- data.frame(id = 1:5, value = c(10.0, 20.0, 30.0, 40.0, 50.0))
after  <- data.frame(id = 1:5, value = c(10.0, 22.5, 30.0, 40.0, 55.0))

compare_tables(before, after)
#> 
#> ── Table Comparison: before vs after ───────────────────────────────────────────
#> 1. Row counts
#> before: 5 rows
#> after: 5 rows
#> Difference: =
#> 
#> 2. Column names
#> Matching columns: 2
#> Only in before: 0
#> Only in after: 0
#> Type mismatches: 0
#> 
#> 3. Key columns
#> Key columns: id (auto-detected)
#> Distinct combos in before: 5
#> Distinct combos in after: 5
#> Matching combos: 5
#> Only in before: 0
#> Only in after: 0
#> 
#> 4. Numeric discrepancies (absolute differences)
#> Comparing after merging on keys.
#> Rows matched: 5
#> 
#>     Column  N  Min  Q25  Median  Q75  Max
#>     ──────  ─  ───  ───  ──────  ───  ───
#>     value   5    0    0       0  2.5    5

Filter diagnostics

filter_keep() and filter_drop() filter data while printing diagnostics about what was removed.

filter_keep

Keeps rows where the condition is TRUE (same as dplyr::filter()):

sales <- data.frame(
  id     = 1:10,
  amount = c(500, 25, 1200, 80, 3000, 15, 750, 40, 2000, 60),
  status = rep(c("valid", "suspect"), 5)
)

result <- filter_keep(sales, amount > 100, .stat = amount)
#> filter_keep(sales, amount > 100)
#> Dropped 5 of 10 rows (50.00%).
#> Dropped 220 of 7,670 for amount (2.87%).

filter_drop

Drops rows where the condition is TRUE (the inverse of filter_keep()):

result2 <- filter_drop(sales, status == "suspect", .stat = amount)
#> filter_drop(sales, status == "suspect")
#> Dropped 5 of 10 rows (50.00%).
#> Dropped 220 of 7,670 for amount (2.87%).

Warning thresholds

Set .warn_threshold to get a warning when too many rows are dropped:

filter_keep(sales, amount > 1000, .stat = amount, .warn_threshold = 0.5)
#> filter_keep(sales, amount > 1000)
#> Dropped 7 of 10 rows (70.00%).
#> Dropped 1,470 of 7,670 for amount (19.17%).
#> Warning: Dropped 70.0% of rows exceeds threshold (50.0%).
#>   id amount status
#> 1  3   1200  valid
#> 2  5   3000  valid
#> 3  9   2000  valid

Data quality

Missing value diagnosis

diagnose_nas() reports NA counts and percentages for every column:

messy <- data.frame(
  id    = 1:6,
  name  = c("A", NA, "C", "D", NA, "F"),
  score = c(10, 20, NA, NA, 50, NA),
  grade = c("A", "B", "C", NA, "A", "B")
)

diagnose_nas(messy)
#> 
#> ── Missing Value Diagnosis ─────────────────────────────────────────────────────
#> 3 of 4 columns have missing values
#> 
#>   Variable  N NA  Pct NA
#>   ────────  ────  ──────
#>   score        3   50.0%
#>   name         2   33.3%
#>   grade        1   16.7%

Column summaries

summarize_column() gives type-appropriate statistics for a single vector:

summarize_column(c(1, 2, 3, NA, 5, 10, 100))
#>               type           n_unique            missing      most_frequent 
#>          "numeric"                "6"                "1"                 NA 
#>               mean                 sd                min                q25 
#> "20.1666666666667" "39.2398606861271"                "1"             "2.25" 
#>                q50                q75                max           example1 
#>                "4"             "8.75"              "100"                "1" 
#>           example2           example3 
#>                "2"                "3"
summarize_column(c("apple", "banana", "apple", "cherry", NA))
#>          type      n_unique       missing most_frequent          mean 
#>   "character"           "3"           "1"       "apple"            NA 
#>            sd           min           q25           q50           q75 
#>            NA       "apple"            NA            NA            NA 
#>           max      example1      example2      example3 
#>      "cherry"       "apple"      "banana"       "apple"

get_summary_table() applies this to all columns (or selected ones):

get_summary_table(messy)
#>   variable      type n_unique missing most_frequent             mean
#> 1       id   numeric        6       0          <NA>              3.5
#> 2     name character        4       2             A             <NA>
#> 3    score   numeric        3       3          <NA> 26.6666666666667
#> 4    grade character        3       1             A             <NA>
#>                 sd min  q25  q50  q75 max example1 example2 example3
#> 1 1.87082869338697   1 2.25  3.5 4.75   6        1        2        3
#> 2             <NA>   A <NA> <NA> <NA>   F        A        C        D
#> 3 20.8166599946613  10   15   20   35  50       10       20       50
#> 4             <NA>   A <NA> <NA> <NA>   C        A        B        C

String cleaning

These functions require the stringi package (listed in Suggests).

diagnose_strings

diagnose_strings() audits a character vector for common quality issues:

firms <- c("Apple", "APPLE", "apple", "  Microsoft ", "Google", NA, "")
diagnose_strings(firms)
#> 
#> ── String Column Diagnosis: firms ──────────────────────────────────────────────
#> Total elements: 7
#> 
#> Missing & Empty:
#> • NA values: 1 (14.3%)
#> • Empty strings: 1 (14.3%)
#> • Whitespace-only: 0 (0.0%)
#> 
#> Whitespace Issues:
#> • Leading whitespace: 1
#> • Trailing whitespace: 1
#> 
#> Encoding:
#> • Non-ASCII chars: 0
#> 
#> Case Inconsistencies:
#> • Variant groups: 1
#> • Total variants: 3
#> 
#> Case variant examples (up to 5 groups):
#>  lower n_variants            examples
#>  apple          3 Apple, APPLE, apple

audit_transform

audit_transform() shows exactly what a transformation function changes:

audit_transform(firms, trimws)
#> 
#> ── String Transformation Audit: firms ──────────────────────────────────────────
#> Function: trimws
#> 
#>   Metric                        Value
#>   ──────────────  ───────────────────
#>   Total elements                    7
#>   NA values                         1
#>   Changed         1 (16.7% of non-NA)
#>   Unchanged                         6
#> 
#> Examples of changes (showing 1 of 1):
#>        before     after
#>    Microsoft  Microsoft
#> 
#> Access cleaned vector with: `result$cleaned`
audit_transform(firms, tolower)
#> 
#> ── String Transformation Audit: firms ──────────────────────────────────────────
#> Function: tolower
#> 
#>   Metric                        Value
#>   ──────────────  ───────────────────
#>   Total elements                    7
#>   NA values                         1
#>   Changed         4 (66.7% of non-NA)
#>   Unchanged                         3
#> 
#> Examples of changes (showing 4 of 4):
#>        before        after
#>         Apple        apple
#>         APPLE        apple
#>    Microsoft    microsoft 
#>        Google       google
#> 
#> Access cleaned vector with: `result$cleaned`