| Type: | Package |
| Title: | Incremental Feature Engineering with Database Persistence |
| Version: | 0.1.0 |
| Author: | Rudolfs Kregers [aut, cre] |
| Maintainer: | Rudolfs Kregers <rudolfs.kregers@gmail.com> |
| Description: | Define feature logic, compute only new or unprocessed rows, and persist the resulting flat feature table in a database. The package provides an explicit incremental pipeline for fetching source rows, computing feature definitions, and writing computed features to a database table. |
| License: | GPL-3 |
| BugReports: | https://github.com/LordRudolf/featdelta/issues |
| Encoding: | UTF-8 |
| Imports: | DBI, rlang |
| Suggests: | RSQLite, knitr, rmarkdown, testthat (≥ 3.0.0) |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
| RoxygenNote: | 7.3.3 |
| NeedsCompilation: | no |
| Packaged: | 2026-05-08 18:21:14 UTC; Rudolfs |
| Repository: | CRAN |
| Date/Publication: | 2026-05-13 08:00:02 UTC |
Define a multi-column feature block
Description
Creates a block definition object for use inside fd_define().
Usage
fd_block(x, expected_names = NULL, envir = NULL)
Arguments
x |
Block definition input. Must be either a function or a single
expression-like input that later evaluates to a data.frame. |
expected_names |
Optional character vector of expected output column
names. If supplied, names must be non-empty and non-NA. |
envir |
Optional environment used when normalizing expression-like
inputs. If NULL, environments are inherited from where the definition was created. |
Details
A block represents one ordered definition step that may produce multiple
feature columns when later evaluated by fd_compute(). Unlike ordinary
named single-column definitions, a block may return a whole data.frame
whose column names become the produced feature names.
Supported block inputs are:
- an inline expression, including braced expressions such as { data.frame(...) }
- a quosure
- an expression() object of length 1
- a quoted call or symbol
- a function, intended to receive the current working data and return a data.frame
fd_block() does not compute anything by itself. It only stores the block
definition in a normalized form so that fd_define() can include it in the
ordered definitions object.
Value
An object of class "featdelta_block".
See Also
Other featdelta defs helpers:
fd_define(),
print.featdelta_defs(),
summary.featdelta_defs()
Examples
# Inline expression block
blk <- fd_block({
data.frame(
hp_per_cyl = hp / cyl,
disp_per_cyl = disp / cyl
)
})
# Function-based block
make_engine_features <- function(data) {
data.frame(
hp_per_cyl = data$hp / data$cyl,
disp_per_cyl = data$disp / data$cyl
)
}
blk <- fd_block(make_engine_features)
# Optional expected names
blk <- fd_block(
{
data.frame(
hp_per_cyl = hp / cyl
)
},
expected_names = c("hp_per_cyl", "disp_per_cyl")
)
Compute features from featdelta definitions on in-memory data
Description
Evaluates a featdelta_defs object created by fd_define() against an
in-memory data frame/tibble and returns a feature data frame containing the
key column plus computed feature columns.
Usage
fd_compute(
data,
defs,
key,
compute_envir = NULL,
compute_strict = TRUE,
verbose = FALSE,
return_report = FALSE
)
Arguments
data |
A data frame/tibble containing the raw input variables used to compute features. |
defs |
A featdelta_defs object created by fd_define(). |
key |
Character scalar naming the primary key column in data. |
compute_envir |
Optional environment. If supplied, it overrides the stored evaluation environment for expression-based steps. |
compute_strict |
Logical. Strictness flag for step evaluation; defaults to TRUE. |
verbose |
Logical. If TRUE, progress messages are printed. |
return_report |
Logical. If TRUE, a list containing both the computed data and the per-step report is returned. |
Details
fd_compute() is the package's pure in-memory execution step. It does not
access the database and does not use the featdelta context system.
Definition steps are evaluated sequentially in the order stored in defs.
This means later steps may depend on columns created by earlier steps in the
same compute call.
Guardrails implemented by fd_compute() are compute-specific:
- validate the structure of the supplied defs object
- validate result type and row alignment for each step
- preserve 1:1 row alignment with the input data
- reject output-name collisions with key and with previously produced names
- attach a per-step computation report
Guardrails that belong at definition time, such as malformed construction of
individual definition steps, should be handled by fd_define().
Supported definition step types are:
- ordinary single-column steps
- fd_block() multi-column steps
For single-column steps, the result must be a supported vector-like output.
For block steps:
- the result must be a data.frame
- the number of rows must match nrow(data)
- returned column names must be non-empty and unique
- returned names must not collide with key
- returned names must not collide with feature names already produced earlier in the same compute run
If a block declares expected_names, then:
- returned names must be a subset of expected_names
- missing expected names are added as NA
- final output order follows expected_names
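The expected_names alignment can be sketched in base R; align_block() here is a hypothetical helper for illustration, not the package's internal code:

```r
# Hypothetical sketch of expected_names alignment: reject undeclared
# columns, add missing expected names as NA, and order the output.
align_block <- function(block_result, expected_names) {
  extra <- setdiff(names(block_result), expected_names)
  if (length(extra) > 0) {
    stop("undeclared block columns: ", paste(extra, collapse = ", "))
  }
  for (nm in setdiff(expected_names, names(block_result))) {
    block_result[[nm]] <- NA  # missing expected names become NA
  }
  block_result[expected_names]  # final order follows expected_names
}

align_block(data.frame(disp_per_cyl = c(2, 3)),
            c("hp_per_cyl", "disp_per_cyl"))
```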
Because definitions are evaluated sequentially, later steps may use:
- raw input columns from data
- columns produced by earlier single-column steps
- columns produced by earlier block steps
Reordering or removing earlier steps may therefore break downstream
definitions. When evaluation fails with "object not found"-style errors,
fd_compute() augments the message with likely reasons.
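The sequential-evaluation mechanic, though not the package's actual implementation, can be sketched with base R eval():

```r
# Sketch of ordered step evaluation: each quoted step is evaluated
# against the working data, and its result is visible to later steps.
steps <- list(
  hp_per_cyl = quote(hp / cyl),
  double_hpc = quote(hp_per_cyl * 2)  # depends on the step above
)
working <- data.frame(hp = c(110, 175), cyl = c(6, 8))
for (nm in names(steps)) {
  working[[nm]] <- eval(steps[[nm]], envir = working)
}
working$double_hpc  # approx 36.67 and 43.75
```

Removing or reordering the hp_per_cyl step would make the double_hpc step fail with an "object not found" error, which is exactly the dependency hazard described above.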
Value
Either:
- a data frame with class "featdelta_features" and an attached report attribute, or
- a list {data, report} with class "featdelta_compute_result" when return_report = TRUE
The output preserves row order and row count relative to data.
Examples
raw_df <- mtcars
raw_df$car_id <- seq_len(nrow(raw_df))
defs <- fd_define(
hp_per_cyl = hp / cyl,
engine_ratios = fd_block({
data.frame(
disp_per_cyl = disp / cyl,
wt_per_hp = wt / hp
)
}),
double_ratio = disp_per_cyl * 2
)
out <- fd_compute(
data = raw_df,
defs = defs,
key = "car_id",
compute_strict = TRUE
)
head(out)
Connect to a database with featdelta defaults
Description
Creates a featdelta_con context object from a DBI driver and connection
arguments. The returned object stores the live DBI connection plus optional
featdelta defaults such as the raw table, feature table, key column, and
metadata settings.
Usage
fd_connect(
driver,
...,
raw_table = NULL,
raw_table_name = NULL,
feat_table_name = NULL,
key = NULL,
meta_enabled = FALSE,
meta_schema = NULL
)
Arguments
driver |
A DBI driver object, such as RSQLite::SQLite(). |
... |
Additional arguments passed to DBI::dbConnect(). |
raw_table |
Optional character scalar naming the raw/source table. |
raw_table_name |
Deprecated alias for raw_table. |
feat_table_name |
Optional character scalar naming the feature table. |
key |
Optional character scalar naming the primary key column. |
meta_enabled |
Logical. Whether featdelta metadata tracking is enabled. |
meta_schema |
Optional list of metadata schema settings. |
Value
A featdelta_con object.
Define featdelta feature definitions
Description
Creates a featdelta_defs object that stores ordered definition steps in a
normalized internal representation suitable for later evaluation by
fd_compute().
Usage
fd_define(
...,
defs = NULL,
description = NULL,
overwrite = FALSE,
envir = NULL
)
Arguments
... |
Definitions supplied inline.
Ordinary named inputs such as hp_per_cyl = hp / cyl define one output column each. |
defs |
Optional list of programmatically supplied definitions.
Supported inputs are the same as for the ... argument. |
description |
Optional character scalar describing the definitions. |
overwrite |
Logical. If FALSE (the default), duplicate step names raise an error; if TRUE, the last definition wins. |
envir |
Optional environment used as the explicit evaluation
environment for normalized definitions. If NULL, environments are inherited from where the definitions were created. |
Details
fd_define() is the definitions-construction step of the package. It does
not compute features and does not access the database. Its role is to:
capture user definitions,
normalize supported input styles to a canonical internal structure,
enforce definition-level guardrails such as name validity and duplicate handling,
store the resulting ordered definition steps for later execution.
Definitions are evaluated later by fd_compute() in the order they are
stored. This means later definitions may depend on columns created by
earlier steps. Reordering or removing earlier steps may therefore break
downstream definitions.
Supported definition inputs include:
- inline expressions captured from ...
- quosures
- expression() objects of length 1
- quoted calls or names
- direct scalar constants
- fd_block() objects for multi-column feature generation
Ordinary named expressions define one output column per step.
fd_block() defines a multi-column step. At compute time, a block is
expected to return a data.frame with one row per input row. The returned
column names become the produced feature names. Optionally, expected output
names may be declared in advance via fd_block(expected_names = ...).
Direct atomic constants must be scalar. If vectorized behavior is desired, provide an expression instead of embedding a fixed-length atomic vector in the definitions.
Duplicate step handling is part of definitions construction:
- with overwrite = FALSE, duplicate step names raise an error;
- with overwrite = TRUE, the last definition wins.
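The duplicate-handling rule can be illustrated with a small base-R sketch; resolve_steps() is hypothetical, and the surviving order shown is that of the sketch, not necessarily the package's:

```r
# Sketch of duplicate step handling: error on duplicates unless
# overwrite = TRUE, in which case the last definition for a name wins.
resolve_steps <- function(steps, overwrite = FALSE) {
  dup <- duplicated(names(steps))
  if (any(dup) && !overwrite) {
    stop("duplicate step names: ",
         paste(unique(names(steps)[dup]), collapse = ", "))
  }
  steps[!duplicated(names(steps), fromLast = TRUE)]  # keep last occurrence
}

defs <- list(a = quote(hp / cyl), b = quote(wt / hp), a = quote(hp * 2))
resolve_steps(defs, overwrite = TRUE)$a  # the later definition, hp * 2
```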
Value
A list of class "featdelta_defs" containing:
- steps: ordered normalized definition steps
- description: optional description
- created_at: creation time
- defs_version: internal version marker
- envir_policy: whether environments were inherited or explicit
- envir: the explicit environment, if supplied
See Also
Other featdelta defs helpers:
fd_block(),
print.featdelta_defs(),
summary.featdelta_defs()
Examples
# Single-column definitions
defs <- fd_define(
hp_per_cyl = hp / cyl,
wt_per_hp = wt / hp
)
# Definitions can also be supplied programmatically
predef <- expression(log(hp))
defs <- fd_define(
log_hp = predef
)
# Multi-column block using an inline expression
defs <- fd_define(
engine_ratios = fd_block({
data.frame(
hp_per_cyl = hp / cyl,
disp_per_cyl = disp / cyl
)
})
)
# Multi-column block with declared expected names
defs <- fd_define(
engine_ratios = fd_block(
{
data.frame(
hp_per_cyl = hp / cyl,
disp_per_cyl = disp / cyl
)
},
expected_names = c("hp_per_cyl", "disp_per_cyl")
)
)
# Function-based block
make_engine_features <- function(data) {
data.frame(
hp_per_cyl = data$hp / data$cyl,
disp_per_cyl = data$disp / data$cyl
)
}
defs <- fd_define(
engine_ratios = fd_block(make_engine_features)
)
# A block can contain a small feature script with temporary variables
defs <- fd_define(
engine_script = fd_block({
hp_per_cyl <- hp / cyl
disp_per_cyl <- disp / cyl
data.frame(
hp_per_cyl = hp_per_cyl,
disp_per_cyl = disp_per_cyl,
engine_index = hp_per_cyl + disp_per_cyl
)
})
)
# A function-based block can create a variable number of columns in a loop
make_scaled_features <- function(data) {
vars <- c("hp", "disp", "wt")
out <- list()
for (var in vars) {
center <- mean(data[[var]], na.rm = TRUE)
spread <- stats::sd(data[[var]], na.rm = TRUE)
out[[paste0(var, "_scaled")]] <- (data[[var]] - center) / spread
}
as.data.frame(out)
}
defs <- fd_define(
scaled_inputs = fd_block(make_scaled_features)
)
Fetch source rows that are not yet present in the features table
Description
fd_fetch() executes a user-supplied SQL SELECT query against a database and
returns only those rows whose primary key (key) is not present in an
existing features table (feat_table_name).
Usage
fd_fetch(con, sql, key, feat_table_name, use_max_key = FALSE, verbose = FALSE)
Arguments
con |
A live DBI database connection. |
sql |
A single SQL string (typically a SELECT statement). The query must be usable as a derived table. |
key |
Name of the primary key column (character scalar). Must exist in
both the result of sql and in feat_table_name. |
feat_table_name |
Name of the existing features table in the database
(character scalar). Must exist. This table is used only to identify which
key values have already been processed. |
use_max_key |
Logical. If TRUE, the fetch is additionally restricted to key values greater than the current maximum key in feat_table_name. |
verbose |
Logical. If TRUE, progress messages are printed. |
Details
This function is intentionally not a general-purpose query runner. It always
applies a "not yet processed" filter against feat_table_name and errors if that
cannot be done honestly (e.g., missing tables/columns).
Important assumption
fd_fetch() assumes that:
- the sql query is executed on the same database (same con) where feat_table_name is stored, and
- the returned key values are comparable to feat_table_name.key.
Cross-database fetching (e.g., pulling data from one database and comparing to
a features table stored in another database) is not supported, because the
"not present in feat_table_name" filter must be evaluated by the database engine
in a single query context.
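In relational terms the "not yet processed" filter is an anti-join on key. Its semantics, though not the SQL fd_fetch() actually emits, can be sketched in base R:

```r
# Sketch of the key-based filter: keep only source rows whose key
# is absent from the features table (mirrors the fd_fetch() example below).
raw      <- data.frame(id = 1:5, x = c(10, 20, 30, 40, 50))
features <- data.frame(id = c(1L, 2L, 4L))
new_rows <- raw[!(raw$id %in% features$id), ]
new_rows$id  # 3 5
```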
Value
A data.frame containing only rows returned by sql whose key is not
present in feat_table_name. The result includes an attribute attr(x, "fd_fetch")
(a list) with metadata such as key, feat_table_name, use_max_key, max_key
(if computed), executed_sql, and n_rows.
Examples
if (requireNamespace("RSQLite", quietly = TRUE)) {
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
on.exit(DBI::dbDisconnect(con), add = TRUE)
DBI::dbExecute(con, "CREATE TABLE raw (id INTEGER, x INTEGER)")
DBI::dbExecute(con, "CREATE TABLE r_variables_table (id INTEGER)")
DBI::dbExecute(con, "INSERT INTO raw (id, x) VALUES (1,10), (2,20), (3,30), (4,40), (5,50)")
DBI::dbExecute(con, "INSERT INTO r_variables_table (id) VALUES (1), (2), (4)")
# Returns ids 3 and 5 only (rows not yet present in the features table)
new_rows <- fd_fetch(
con = con,
sql = "SELECT * FROM raw",
key = "id",
feat_table_name = "r_variables_table"
)
}
Run the featdelta incremental feature pipeline
Description
Executes the standard featdelta incremental pipeline.
Usage
fd_run(
con,
sql,
defs,
key,
feat_table_name,
verbose = FALSE,
fetch_mode = c("new_only", "all"),
use_max_key = FALSE,
fetch_limit = NULL,
compute_strict = TRUE,
compute_envir = NULL,
create_table = "auto",
alter_table = TRUE,
update_table = TRUE,
dialect = NULL,
chunk_size = NULL,
fail_fast = TRUE,
return_data = c("none", "features", "raw", "both"),
preview_n = 10L,
...
)
Arguments
con |
A live DBI database connection. |
sql |
SQL query used to fetch raw rows. |
defs |
A featdelta_defs object created by fd_define(). |
key |
Character scalar naming the key column shared by raw data and feature table. |
feat_table_name |
Character scalar naming the target feature table. |
verbose |
Logical. If TRUE, progress messages are printed. |
fetch_mode |
Fetch mode. Either "new_only" (the default) or "all". |
use_max_key |
Logical. Passed to fd_fetch(). |
fetch_limit |
Optional positive row limit applied after fetching. This is intended for previews and small dry development runs, not as a SQL optimizer. |
compute_strict |
Logical strictness flag passed to fd_compute(). |
compute_envir |
Optional environment override passed to fd_compute(). |
create_table |
Logical or "auto". Passed to fd_upsert(). |
alter_table |
Logical. Passed to fd_upsert(). |
update_table |
Logical. Passed to fd_upsert(). |
dialect |
Optional dialect override. Supported values include "sqlite" and "postgres". |
chunk_size |
Optional chunk size. Passed to fd_upsert(). |
fail_fast |
Logical. If TRUE, the run aborts at the first stage error. |
return_data |
Controls whether raw and/or computed data should be included in the returned run report. |
preview_n |
Non-negative number of rows to include in report previews. |
... |
Reserved for future context options passed to fd_connect(). |
Details
The pipeline runs fd_fetch() -> fd_compute() -> fd_upsert().
fd_run() is the orchestration entry point that connects database fetching,
in-memory computation, and database upsert into a single incremental run.
fetch_mode = "new_only" is the default incremental mode. If the feature
table already exists, fd_fetch() returns only rows whose key is missing
from the feature table. If the feature table does not exist yet, all rows
returned by sql are fetched because no rows have been processed. This mode
is key-based: it does not recompute existing feature-table rows just because
feature definitions changed or new feature definitions were added.
fetch_mode = "all" is an explicit refresh/backfill mode. It recomputes all
rows returned by sql and passes them to fd_upsert(). With the default
update_table = TRUE, existing keys are updated and new keys are inserted.
Use this mode when existing feature values should be refreshed, for example
after changing a definition or adding a feature that should be backfilled for
already-processed keys.
fd_run() does not currently coordinate concurrent writers. For the MVP,
avoid running multiple fd_run()/fd_upsert() calls against the same feature
table at the same time. Concurrent writes to different feature tables are
independent.
The returned upsert report can include extra_columns: columns that exist in
the target feature table but are not produced by the current definitions.
These columns are left untouched; the package does not drop, rename, or retire
columns automatically.
Value
An object of class "fd_run_report" containing stage summaries,
timings, row counts, the compute report, and the upsert report. Depending
on return_data, it may also contain raw and/or computed feature data.
Examples
if (requireNamespace("RSQLite", quietly = TRUE)) {
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
on.exit(DBI::dbDisconnect(con), add = TRUE)
raw_cars <- mtcars
raw_cars$id <- seq_len(nrow(raw_cars))
DBI::dbWriteTable(con, "raw_cars", raw_cars)
defs <- fd_define(
hp_per_cyl = hp / cyl,
engine_ratios = fd_block({
data.frame(
disp_per_cyl = disp / cyl,
wt_per_hp = wt / hp
)
})
)
res <- fd_run(
con = con,
sql = "SELECT * FROM raw_cars",
defs = defs,
key = "id",
feat_table_name = "features",
verbose = FALSE
)
res$success
}
Upsert computed features into a database features table
Description
Takes a feature data.frame produced by fd_compute() (one row per entity id)
and writes it into a database table, using explicit incremental semantics:
new keys are inserted; existing keys are optionally updated.
Usage
fd_upsert(
con,
features_df,
feat_table_name,
key,
create_table = "auto",
alter_table = TRUE,
update_table = TRUE,
chunk_size = NULL,
verbose = TRUE,
return_report = TRUE,
dialect = NULL
)
Arguments
con |
A DBI database connection. |
features_df |
A data.frame of computed features, typically one row per entity id,
as produced by fd_compute(). |
feat_table_name |
A single string. Name of the database table where features
are stored (the "target" table). For dialects supporting schemas (e.g. Postgres),
this may be schema-qualified (e.g. "my_schema.features"). |
key |
A single string. Name of the primary key column in both features_df and feat_table_name. |
create_table |
Logical, or the character value "auto". |
alter_table |
Logical. If TRUE, missing feature columns are added to the target table via ALTER TABLE ... ADD COLUMN. |
update_table |
Logical. Controls incremental behavior for keys that already exist
in feat_table_name; see Details. |
chunk_size |
Optional integer batch size. If provided, writes are performed in batches of this size. |
verbose |
Logical. If TRUE, progress messages are printed. |
return_report |
Logical. If TRUE (the default), an fd_upsert_report object is returned. |
dialect |
Optional dialect override. Supported values include "sqlite" and "postgres". |
Details
The function is set-based (staging + merge) and avoids "read all ids into R" patterns. Writes are performed in chunks (optional) inside a transaction.
Schema evolution is supported in a restricted, safe form: when alter_table = TRUE,
missing feature columns are added to the target table via ALTER TABLE ... ADD COLUMN ....
No column drops/renames/type changes are performed. Columns that exist in
the target table but are not present in features_df are left untouched and
reported as extra_columns.
Incremental semantics
- Insert new keys (keys in features_df not present in feat_table_name).
- Update existing keys only when update_table = TRUE.
- When update_table = FALSE, any overlap between staged keys and existing keys is treated as a conflict and aborts the write.
Counts
The returned report contains would_insert / would_update counts, computed as
existence-based counts prior to the merge (within the transaction). These are
not "rows whose values changed", only "rows targeted as insert/update".
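These existence-based counts amount to a membership test on staged keys; a minimal sketch:

```r
# Sketch of would_insert / would_update: counts are based on key
# existence before the merge, not on whether any values changed.
staged_keys   <- c(2, 3, 4)  # keys present in features_df
existing_keys <- c(1, 2, 3)  # keys already in feat_table_name
would_update  <- sum(staged_keys %in% existing_keys)     # keys 2 and 3
would_insert  <- sum(!(staged_keys %in% existing_keys))  # key 4
c(would_insert = would_insert, would_update = would_update)
```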
Transactions
The operation runs in a transaction. If any step fails (schema change, staging write,
merge), the transaction is rolled back and the target table is unchanged.
Existing target tables used with update_table = TRUE must have a primary
key or unique constraint/index on key. Tables created by fd_upsert() get
this primary key automatically.
Concurrency
fd_upsert() does not currently take an explicit table-level lock. Avoid
running concurrent writes to the same feature table. Concurrent writes to
different feature tables are independent.
Value
An fd_upsert_report object (S3) with a structured summary of actions performed:
- table_created (logical)
- columns_added (character vector)
- extra_columns (character vector): target-table columns not present in the incoming features data, excluding key
- counts$would_insert (integer)
- counts$would_update (integer; 0 when update_table = FALSE)
- an optional per-chunk breakdown (if chunking is used)
Examples
if (requireNamespace("RSQLite", quietly = TRUE)) {
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
on.exit(DBI::dbDisconnect(con), add = TRUE)
feats <- data.frame(
id = c(1, 2, 3),
f_age = c(10, 20, 30),
f_flag = c(TRUE, FALSE, TRUE)
)
# Create table and insert initial rows
r1 <- fd_upsert(
con = con,
features_df = feats,
feat_table_name = "features_tbl",
key = "id",
create_table = TRUE,
alter_table = FALSE,
update_table = TRUE,
verbose = FALSE
)
# Upsert: update ids 2-3, insert id 4
feats2 <- data.frame(
id = c(2, 3, 4),
f_age = c(21, 31, 40),
f_flag = c(FALSE, TRUE, FALSE)
)
r2 <- fd_upsert(
con = con,
features_df = feats2,
feat_table_name = "features_tbl",
key = "id",
update_table = TRUE,
verbose = FALSE
)
# Schema evolution: add a new feature column
feats3 <- data.frame(
id = c(4, 5),
f_age = c(41, 50),
f_flag = c(TRUE, TRUE),
f_new = c("A", "B")
)
r3 <- fd_upsert(
con = con,
features_df = feats3,
feat_table_name = "features_tbl",
key = "id",
alter_table = TRUE,
update_table = TRUE,
verbose = FALSE
)
}
Print a featdelta run report
Description
Print a featdelta run report
Usage
## S3 method for class 'fd_run_report'
print(x, ...)
Arguments
x |
An fd_run_report object. |
... |
Unused. |
Value
The input object invisibly.
Print a featdelta connection context
Description
Print a featdelta connection context
Usage
## S3 method for class 'featdelta_con'
print(x, ...)
Arguments
x |
A featdelta_con object. |
... |
Unused. |
Value
The input object invisibly.
Print a featdelta definitions object
Description
Prints a concise human-readable summary of a featdelta_defs object.
Usage
## S3 method for class 'featdelta_defs'
print(x, ...)
Arguments
x |
A featdelta_defs object. |
... |
Unused. |
Value
The input object invisibly.
See Also
Other featdelta defs helpers:
fd_block(),
fd_define(),
summary.featdelta_defs()
Summarize a featdelta definitions object
Description
Returns a compact data-frame summary of the ordered definition steps stored
in a featdelta_defs object.
Usage
## S3 method for class 'featdelta_defs'
summary(object, ...)
Arguments
object |
A featdelta_defs object. |
... |
Unused. |
Value
A data frame with one row per definition step and columns such as step name, type, mode, declared outputs, and stored expression text.
See Also
Other featdelta defs helpers:
fd_block(),
fd_define(),
print.featdelta_defs()