This vignette describes how to benchmark `CDMConnector::generateCohortSet` (the old, CIRCE-based method) against `atlasCohortGenerator::generateCohortSet2` (the new, DAG-optimized batch method) across multiple database platforms. Supported platforms are PostgreSQL, Redshift, Snowflake, Spark, and SQL Server. You provide a named list of CDM reference objects; the benchmarking script handles timing, comparison, and CSV output.
The new approach (`generateCohortSet2`) uses a DAG-based batch optimizer that shares vocabulary and domain work across cohorts. As a result, wall-clock time typically decreases as the number of cohorts and the overlap in their concept sets increase. The ratio (new time / old time) is often below 1.0, with larger batches showing greater speedups; the exact improvement depends on your cohort definitions, their concept-set overlap, and the database platform.
The benchmarking script records `time_old_sec`, `time_new_sec`, and `ratio_new_over_old` per database so you can measure the speedup on your own data and platforms.
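As a quick illustration of how the ratio column is derived, the sketch below uses made-up timings (borrowed from the example table later in this vignette), not real benchmark output:

```r
# Hypothetical timings in seconds; not real benchmark results.
timings <- data.frame(
  database     = c("postgres", "redshift"),
  time_old_sec = c(120.5, 95.2),
  time_new_sec = c(45.2, 38.0)
)

# ratio_new_over_old < 1 means the new batch approach was faster
timings$ratio_new_over_old <- timings$time_new_sec / timings$time_old_sec
print(timings)
```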
Install both packages (or use `devtools::load_all()` for the latter) and load your cohort definitions with `CDMConnector::readCohortSet("path/to/cohorts")`. For one CDM, use the single-database benchmark and optional equivalence check:
```r
source("extras/benchmark_cohort_generation.R")

cohort_set <- CDMConnector::readCohortSet("path/to/cohorts")
result <- benchmark_cohort_generation(cdm, cohort_set, cohort_path = "path/to/cohorts")

# Compare old vs new cohort tables (identical rows, order ignored)
cmp <- compare_cohort_tables(result$cdm, name_old = "cohort_bench_old", name_new = "cohort_bench_new")
cmp$identical   # TRUE if same set of rows
cmp$per_cohort  # Per-cohort row counts and match status
```

Pass a named list of CDM objects; the names are used as the database identifier in the output CSVs (e.g. `postgres`, `redshift`, `snowflake`, `spark`, `sql_server`):
```r
source("extras/benchmark_cohort_generation.R")
source("extras/benchmark_multi_database.R")

cohort_set <- CDMConnector::readCohortSet("path/to/cohorts")

cdms <- list(
  postgres   = cdm_postgres,
  redshift   = cdm_redshift,
  snowflake  = cdm_snowflake,
  spark      = cdm_spark,
  sql_server = cdm_sqlserver
)

run_benchmark_multi_database(
  cdms = cdms,
  cohort_set = cohort_set,
  cohort_path = "path/to/cohorts",
  results_csv = "benchmark_results.csv",
  equivalence_csv = "benchmark_equivalence.csv"
)
```

This writes two CSVs:

- `benchmark_results.csv` (timing): `database`, `time_old_sec`, `time_new_sec`, `ratio_new_over_old`, `n_cohorts`, `files_included`, `status`.
- `benchmark_equivalence.csv` (equivalence): `database`, `cohort_definition_id`, `n_old`, `n_new`, `rows_identical`, `status`.

The timing CSV has one row per database. Example structure:
| database | time_old_sec | time_new_sec | ratio_new_over_old | n_cohorts | files_included | status |
|---|---|---|---|---|---|---|
| postgres | 120.5 | 45.2 | 0.38 | 4 | cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json | ok |
| redshift | 95.2 | 38.0 | 0.40 | 4 | cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json | ok |
| snowflake | 88.1 | 32.5 | 0.37 | 4 | cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json | ok |
| sql_server | 110.3 | 42.1 | 0.38 | 4 | cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json | ok |
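Once the timing CSV exists, it is easy to summarise the results in R. This sketch builds a data frame mirroring the example table above (the values are illustrative, not real measurements) and converts each ratio into a speedup factor:

```r
# Rows shaped like benchmark_results.csv, using the illustrative ratios above
results <- data.frame(
  database           = c("postgres", "redshift", "snowflake", "sql_server"),
  ratio_new_over_old = c(0.38, 0.40, 0.37, 0.38)
)

# A ratio of 0.38 corresponds to roughly a 2.6x speedup
results$speedup <- round(1 / results$ratio_new_over_old, 1)
results
```

In practice you would replace the inline data frame with `read.csv("benchmark_results.csv")`.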
The equivalence CSV confirms that the old and new cohort tables contain the same rows (order ignored). For each database it contains:

- An overall row (`cohort_definition_id` = NA): total row counts and whether the full tables match.
- One row per cohort: row counts from the old table (`n_old`) and the new table (`n_new`), and whether the set of rows for that cohort is identical (`rows_identical`).

Example:
| database | cohort_definition_id | n_old | n_new | rows_identical | status |
|---|---|---|---|---|---|
| postgres | NA | 15000 | 15000 | TRUE | ok |
| postgres | 1 | 5000 | 5000 | TRUE | ok |
| postgres | 2 | 6000 | 6000 | TRUE | ok |
| postgres | 3 | 4000 | 4000 | TRUE | ok |
| redshift | NA | 15000 | 15000 | TRUE | ok |
| redshift | 1 | 5000 | 5000 | TRUE | ok |
When `rows_identical` is `TRUE` for all cohorts (and the overall row), the new approach produces exactly the same cohort membership and dates as the old CIRCE-based method; only execution strategy and performance differ.
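To assert this programmatically after a run, you can check the `rows_identical` column of the equivalence CSV. A minimal sketch, using inline rows shaped like the example table (swap in `read.csv("benchmark_equivalence.csv")` for real output):

```r
# Rows shaped like benchmark_equivalence.csv; values are illustrative
equiv <- data.frame(
  database             = c("postgres", "postgres", "postgres"),
  cohort_definition_id = c(NA, 1, 2),
  rows_identical       = c(TRUE, TRUE, TRUE)
)

# TRUE only if every cohort (and the overall NA row) matched
all_match <- all(equiv$rows_identical)
stopifnot(all_match)  # fails loudly if any cohort differs
```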
| Aspect | Description |
|---|---|
| Performance | The new batch optimizer typically reduces wall-clock time (ratio < 1) by sharing vocabulary and domain work across cohorts. |
| Correctness | The benchmarking pipeline compares old and new cohort tables row-by-row (order ignored) and writes equivalence results to CSV. |
| Platforms | Run the same cohort set on Postgres, Redshift, Snowflake, Spark, and SQL Server by passing a named list of CDMs to `run_benchmark_multi_database()`. |
Use the generated CSVs to document speedups and to confirm identical results across databases and between the two cohort generation methods.