Help for package epidm

Version:

1.0.6

Title:

UK Epidemiological Data Management

Description:

Contains utilities and functions for the cleaning, processing and management of patient level public health data for surveillance and analysis held by the UK Health Security Agency, UKHSA.

URL:

https://github.com/ukhsa-collaboration/epidm

BugReports:

https://github.com/ukhsa-collaboration/epidm/issues

License:

GPL (≥ 3)

Depends:

R (≥ 4.4.0)

Imports:

data.table, DBI, odbc, phonics, purrr, readr, stats, stringi, stringr, utils, lubridate

Encoding:

UTF-8

LazyData:

true

RoxygenNote:

7.3.3

Suggests:

knitr, rmarkdown, testthat (≥ 3.0.0), patrick, ggplot2, dplyr, tibble

Config/testthat/edition:

VignetteBuilder:

knitr

NeedsCompilation:

Packaged:

2026-02-20 13:52:56 UTC; Frederick.Sloots

Author:

Alice Graham

[aut], Harshana Liyanage

[aut], Frederick Sloots

[aut, cre], Emma Parker

[aut], Alex Bhattacharya [aut]

Maintainer:

Frederick Sloots <frederick.sloots@ukhsa.gov.uk>

Repository:

CRAN

Date/Publication:

2026-02-25 14:30:03 UTC

Continuous Inpatient (CIP) Spells

Description

Creates Continuous Inpatient (CIP) spells by combining one or more provider spells into a single uninterrupted period of inpatient care. CIP definitions follow the NHS Digital methodology: transfers between providers can be part of the same CIP spell where specific admission, discharge, and timing criteria are met. A CIP spell begins when a patient is admitted under consultant care and ends when they are discharged or die. http://content.digital.nhs.uk/media/11859/Provider-Spells-Methodology/pdf/Spells_Methodology.pdf

Where spells meet the CIP criteria, they are merged into a continuous spell. The output includes a CIP index and the derived start and end dates for the full CIP period.

Usage

cip_spells(
  x,
  group_vars,
  spell_start_date,
  admission_method,
  admission_source,
  spell_end_date,
  discharge_destination,
  patient_classification,
  .forceCopy = FALSE
)

Arguments

x

A data.frame or data.table; will be converted to a data.table (usually HES/SUS data)

group_vars

Character vector of variables used to group records (minimum: a patient identifier).

spell_start_date

Quoted column name containing the provider spell admission date.

admission_method

CDS admission method code.

admission_source

CDS admission source code.

spell_end_date

Quoted column name containing the provider spell discharge date.

discharge_destination

CDS discharge destination code.

patient_classification

CDS patient classification code.

.forceCopy

Logical (default FALSE). If FALSE, the input is converted to a data.table and modified by reference. If TRUE, the input must already be a data.table, and the function will create an explicit copy to avoid modifying the original object.

Value

A data.table containing the original data and three new variables:

cip_indx: Unique identifier for the derived CIP spell.
cip_spell_start: Start date for the continuous inpatient spell.
cip_spell_end: End date for the continuous inpatient spell.

Examples

cip_test <- data.frame(
  id = c('465','465','465','465','8418','8418','8418',
         '8418','8418','8418','8418','8418','26443',
         '26443','26443','33299','33299','33299','33299',
         '33299','33299','33299','33299','33299','33299',
         '52635','52635','52635','52635','52635','52635',
         '52635','52635','52635','52635','52635','52635',
         '52635','52635','52635','52635','52635','52635',
         '52635','52635','52635','52635','52635','52635',
         '52635','52635','52635','78915','78915','78915'),
  provider = c('X1T','X1T','X1T','X1T','KHA','KHA','KHA',
               'KHA','KHA','KHA','KHA','KHA','BX2','BX2',
               'BX2','PXH','PXH','PXH','PXH','PXH','PXH',
               'PXH','PXH','PXH','PXH','9HA','9HA','9HA',
               '9HA','9HA','9HA','9HA','9HA','9HA','9HA',
               '9HA','9HA','9HA','9HA','9HA','9HA','YYT',
               'YYT','YYT','YYT','YYT','YYT','YYT','YYT',
               'YYT','YYT','YYT','ABX','ABX','ABX'),
  spell_start = as.Date(c(
    '2020-03-07','2020-03-07','2020-03-25','2020-04-03','2020-01-25',
    '2020-01-26','2020-07-14','2020-08-02','2020-08-12','2020-08-19',
    '2020-08-19','2020-11-19','2019-11-12','2020-04-17','2020-04-23',
    '2020-07-03','2020-01-17','2020-02-07','2020-03-20','2020-04-27',
    '2020-06-21','2020-07-02','2020-10-17','2020-11-27','2021-01-02',
    '2019-12-31','2020-01-02','2020-01-14','2020-01-16','2020-02-07',
    '2020-02-11','2020-02-14','2020-02-18','2020-02-21','2020-02-25',
    '2020-02-28','2020-03-09','2020-03-11','2020-03-12','2020-03-13',
    '2020-03-14','2020-02-04','2020-02-07','2020-02-11','2020-02-14',
    '2020-02-18','2020-02-21','2020-02-25','2020-02-28','2020-03-09',
    '2020-03-11','2020-03-12','2020-04-16','2020-04-24','2020-05-13')),
  spell_end = as.Date(c(
    '2020-03-07','2020-03-25','2020-04-02','2020-04-27','2020-01-25',
    '2020-01-27','2020-07-17','2020-08-07','2020-08-14','2020-08-19',
    '2020-08-22','2020-12-16','2020-04-17','2020-04-23','2020-05-20',
    '2020-07-24','2020-01-28','2020-02-07','2020-03-23','2020-04-29',
    '2020-06-21','2020-07-03','2020-11-27','2021-01-02','2021-01-10',
    '2019-12-31','2020-01-11','2020-01-14','2020-02-04','2020-02-07',
    '2020-02-11','2020-02-14','2020-02-18','2020-02-21','2020-02-25',
    '2020-02-28','2020-03-09','2020-03-11','2020-03-12','2020-03-13',
    '2020-03-30','2020-02-07','2020-02-11','2020-02-14','2020-02-18',
    '2020-02-21','2020-02-25','2020-02-28','2020-03-09','2020-03-11',
    '2020-03-12','2020-03-13','2020-04-24','2020-05-13','2020-06-11')),
  adm_meth = c('21','81','21','81','21','21','11','21','21','21','21',
               '21','21','81','21','81','21','21','21','21','21','21',
               '21','13','13','12','22','12','2D','13','13','13','13',
               '13','13','13','13','13','13','13','21','81','81','81',
               '81','81','13','81','81','13','13','13','21','11','81'),
  adm_src = c('19','51','19','51','19','51','19','51','19','19','19',
              '51','19','51','19','51','19','19','19','19','19','19',
              '19','51','19','19','19','19','19','19','19','19','19',
              '19','19','19','51','51','51','51','19','51','51','51',
              '51','51','51','51','51','51','51','51','19','51','51'),
  dis_meth = c('1','1','1','1','1','1','1','1','1','1','1','4','1','1',
               '4','1','1','1','1','1','1','1','8','1','4','1','1','1',
               '1','1','1','1','1','1','1','1','1','1','1','1','1','1',
               '1','1','1','1','1','1','1','1','1','1','1','1','2'),
  dis_dest = c('51','51','51','54','51','19','19','19','19','51','19',
               '79','51','51','79','65','19','19','19','19','19','29',
               '98','51','79','19','19','19','51','19','19','19','51',
               '51','51','19','19','51','51','19','51','51','51','51',
               '51','51','51','51','51','51','51','51','29','54','19'),
  patclass = c('1','1','1','1','1','1','1','1','1','1','1','1','1','1',
               '1','1','1','1','1','1','1','1','1','1','1','2','1','2',
               '1','2','2','2','2','2','2','2','2','2','2','2','1','1',
               '1','1','1','1','1','1','1','1','1','1','1','1','1')
)

cip_spells(x=cip_test,
  group_vars = c('id','provider'),
  patient_classification = 'patclass',
  spell_start_date = 'spell_start',
  admission_method = 'adm_meth',
  admission_source = 'adm_src',
  spell_end_date = 'spell_end',
  discharge_destination = 'dis_dest'
)[]

Download a csv from a zip

Description

A convenience function to allow you to pull data from NHS, ONS and ODR assets

Usage

csv_from_zip(x)

Arguments

x

a zip file from the web

Value

a zip file for ingestion into your chosen readr

Examples

## Not run: 
read.csv(csv_from_zip("https://files.digital.nhs.uk/assets/ods/current/pcodeall.zip"))

## End(Not run)

Bacterial Genus Gram Stain Lookup Table

Description

A reference table of bacterial gram stain results by genus to allow faster filtering of bacterial results. This dataset has been maintained manually against the PHE SGSS database. If there are organisms missing, please raise and issue or push request on the epidm GitHub

Usage

genus_gram_stain

Format

A data frame with four columns

organism_genus: The bacterial genus
gram_stain: A character string to indicate POSITIVE or NEGATIVE type
gram_positive: A 0/1 flag to indicate if the genus is gram positive
gram_negative: A 0/1 flag to indicate if the genus is gram negative

A&E attendance discharge destination

Description

In order to group A&E discharge destination from SNOWMED into human readable groups, a lookup table has been created. These work with Emergency Care Dataset (ECDS) data with the destination_code field to show where a patient goes after discharge from A&E.

Usage

group_ecds_discharge_destination

Format

code: the ECDS destination_code
destination_code: the destination grouping as a human readable string

Inpatient admission methods

Description

In order to group hospital inpatient admissions into human readable groups, a lookup table has been created. These work with Hospital Episode Statistics (HES) and Secondary Use Services (SUS) data with the admission_method fields.

Usage

group_inpatient_admission_method

Format

code: the admission_method code
admission_method: the admission_method grouping as a human readable string

Inpatient discharge destination

Description

In order to group hospital inpatient discharge destination into human readable groups, a lookup table has been created. These work with Hospital Episode Statistics (HES) and Secondary Use Services (SUS) data with the discharge_destination fields.

Usage

group_inpatient_discharge_destination

Format

code: the discharge_destination code
discharge_destination: the discharge_destination grouping as a human readable string

Grouping of intervals or events that occur close together in time

Description

A utility function to group together observations that represent overlapping date intervals (e.g., hospital admission spells) or events occurring within a defined time window e.g., specimen dates grouped into infection episodes. The function supports both:

• Interval-based grouping: records have a start and an end date; any overlapping intervals are grouped together. • Event-based grouping: records have only a start date and are grouped using either a static or rolling time window.

The output provides a unique index per group and the minimum/maximum date that define the resulting aggregated episode or interval.

Usage

group_time(
  x,
  date_start,
  date_end,
  window,
  window_type = c("rolling", "static"),
  group_vars,
  indx_varname = "indx",
  min_varname = "date_min",
  max_varname = "date_max",
  .forceCopy = FALSE
)

Arguments

x

A data.frame or data.table containing date variables for grouping. Will be converted to a data.table internally.

date_start

Quoted column name giving the start date for each record.

date_end

column containing the end dates for the interval, quoted

window

an integer representing a time window in days which will be applied to the start date for grouping events

window_type

character, to determine if a 'rolling' or 'static' grouping method should be used when grouping events. A 'static' window will identify the first event, and all records X days from that event will be attributed to the same episode. Eg. in a 14 day window, if first event is on 01 Mar, and events on day 7 Mar and 14 Mar will be grouped, but an event starting 15 Mar days after will start a new episode. A 'rolling' window resets the day counter with each new event. Eg. Events on 01 Mar, 07 Mar, 14 Mar and 15 Mar are all included in a single episode, as will any additional events up until the 29 Mar (assuming a 14-day window).

group_vars

Character vector of quoted column names used to partition the data before grouping.

indx_varname

a character string to set variable name for the index column which provides a grouping key; default is indx

min_varname

a character string to set variable name for the time period minimum

max_varname

a character string set variable name for the time period maximum

.forceCopy

default FALSE; TRUE will force data.table to take a copy instead of editing the data without reference

indx; renamed using indx_varname: an id field for the new aggregated events/intervals; note that where the date_start is NA, an indx value will also be NA
min_date; renamed using min_varname: the start date for the aggregated events/intervals
max_date; renamed using max_varname: the end date for the aggregated events/intervals

Details

How the function works

The behaviour depends on whether date_end is supplied:

1. Interval-based grouping (start + end dates)

If both date_start and date_end are provided, the function identifies overlapping intervals within the same group_vars grouping. Any intervals that overlap are combined into a single episode.

This method is typically used for:

Hospital spells (HES/SUS)
Contact periods or inpatient stays

2. Event-based grouping (single-date events)

If only date_start is supplied, records are grouped using a time window defined by the window argument.

Two approaches are supported:

window_type = "static" A fixed window is applied starting from the first event in the group. All events occurring within the window are grouped until a gap exceeds the threshold, at which point a new episode begins.
window_type = "rolling" A dynamic window where each event extends the episode end point. An event is grouped as long as it occurs within window days of the most recent event in the same episode.

Handling of missing values

Records missing date_start cannot be grouped and are returned with indx = NA. These rows are appended back to the final output.

Value

A data.table containing all original columns plus:

Workflow context

how group_time() might be used in a pipeline 1) SGSS specimen data – infection episode grouping (event-based) After organism/specimen harmonisation (e.g., via lookup_recode()), group_time() groups specimen dates into infection episodes using a defined time window. This helps identify clusters of related positive tests for the same patient and organism.

2) HES/SUS inpatient data – continuous spell grouping (interval-based) When start and end dates of inpatient stays are available, group_time() collapses overlapping intervals into a single continuous hospital spell. This is used before linking SGSS infection episodes to inpatient activity.

3) Integration across datasets The outputs from group_time() are used downstream to determine whether infection events fall within or around periods of hospital care, enabling combined SGSS–HES/SUS–ECDS analyses.

Examples

episode_test <- structure(
  list(
    pat_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
               1L, 1L, 1L, 1L, 2L, 2L, 2L),
    species = c(rep("E. coli",7),rep("K. pneumonia",7)),
    spec_type = c(rep("Blood",7),rep("Blood",4),rep("Sputum",3)),
    sp_date = structure(c(18262, 18263, 18281, 18282, 18262, 18263, 18281,
                          18265, 18270, 18281, 18283, 18259, 18260, 18281),
                        class = "Date")
  ),
  row.names = c(NA, -14L), class = "data.frame")

group_time(x=episode_test,
           date_start='sp_date',
           window=14,
           window_type = 'static',
           indx_varname = 'static_indx',
           group_vars=c('pat_id','species','spec_type'))[]

spell_test <- data.frame(
  id = c(rep(99,6),rep(88,4),rep(3,3)),
  provider = c("YXZ",rep("ZXY",5),rep("XYZ",4),rep("YZX",3)),
  spell_start = as.Date(
    c(
      "2020-03-01",
      "2020-07-07",
      "2020-02-08",
      "2020-04-28",
      "2020-03-15",
      "2020-07-01",
      "2020-01-01",
      "2020-01-12",
      "2019-12-25",
      "2020-03-28",
      "2020-01-01",
      rep(NA,2)
    )
  ),
  spell_end = as.Date(
    c(
      "2020-03-10",
      "2020-07-26",
      "2020-05-22",
      "2020-04-30",
      "2020-05-20",
      "2020-07-08",
      "2020-01-23",
      "2020-03-30",
      "2020-01-02",
      "2020-04-20",
      "2020-01-01",
      rep(NA,2)
    )
  )
)

group_time(x = spell_test,
           date_start = 'spell_start',
           date_end = 'spell_end',
           group_vars = c('id','provider'),
           indx_varname = 'spell_id',
           min_varname = 'spell_min_date',
           max_varname = 'spell_max_date')[]

Hospital IN/OUT dates

Description

Derives per‑patient hospital entry (hospital_in) and exit (hospital_out) dates by reconciling A&E (ECDS) attendances and inpatient (HES/SUS) spells. Applies a simple ranking to determine the most relevant hospital period around an index event date (e.g., a specimen collection date).

"1": Current admissions take priority
"2": When conflicting on the same day, inpatient admissions take priority over A&E emergency care data
"3": Where a patient has a linked A&E admission to a hospital inpatient stay, the A&E admission date is used
"4": Where a patient has a positive test between two hospital stays the most recent completed hospital stay prior to the test is retained except if the time between these events is greater than 14 days, then the first admission following the test is retained

Usage

hospital_in_out_dates(
  data,
  person_id = "id",
  hospital = list(org_code = "organisation_code_of_provider", event_date = "ev_date",
    ae_arrive = "arrival_date", ae_depart = "departure_date", ae_discharge =
    "ecds_discharge", in_spell_start = "spell_start_date", in_spell_end =
    "spell_end_date", in_discharge = "discharge_destination")
)

Arguments

data

A linked table containing A&E and inpatient records (typically the output of link_ae_inpatient()), including person/event identifiers and date fields.

person_id

Quoted column name for the unique patient identifier.

hospital

A named list specifying column names (all quoted) for:

org_code: Organisation code (optional; used to scope grouping).
event_date: Index date to compare against (e.g., specimen_date).
ae_arrive: ECDS arrival date.
ae_depart: ECDS departure date.
ae_discharge: ECDS discharge status (use grouped values if available).
in_spell_start: Inpatient spell start date.
in_spell_end: Inpatient spell end date.
in_discharge: Inpatient discharge destination (grouped recommended).

Value

A data.table equal to data with additional columns:

hospital_in: Derived hospital admission date for the relevant stay.
hospital_out: Derived hospital discharge date for the relevant stay.
hospital_event_rank: Rank of suitability of the hospital window for the given person/event (1 = most suitable).

Workflow context

Use hospital_in_out_dates() after:

Linking A&E to inpatient spells (e.g., via link_ae_inpatient()),
Constructing spells (e.g., group_time() or cip_spells()),
Optional code standardisation (e.g., discharge groups via lookup_recode())

Note

Work in progress — functionality is incomplete.

Inpatient Codes cleanup

Description

When HES/SUS ICD/OPCS codes are provided in wide format you may want to clean them up into long for easier analysis. This function helps by reshaping long as a separate table. Ensuring they're separate allows you to retain source data, and aggregate appropriately later.

Usage

inpatient_codes(
  x,
  field_strings,
  patient_id_vars,
  type = c("icd9", "icd10", "opcs"),
  .forceCopy = FALSE
)

Arguments

x

a data.frame or data.table containing inpatient data

field_strings

a vector or string containing the regex for the the columns

patient_id_vars

a vector containing colnames used to identify a patient episode or spell

type

a string to denote if the codes are diagnostic or procedural

.forceCopy

Value

a separate table with codes and id in long form

Examples


# Example inpatient dataset
inpatient_test <- data.frame(
  id = c(1053L, 5487L, 8180L),
  spell_id = c("dwPDw", "iSpUq", "qpgk5"),
  primary_diagnosis_code = c("K602", "U071-", "I501"),
  procedure_code = c("H201", "H251", NA),
  procedure_date = as.Date(c("2023-01-01", "2023-01-04", NA))
)

# ICD-10 cleaning example
inpatient_codes(
  x = inpatient_test,
  field_strings = "diagnosis",
  patient_id_vars = c("id", "spell_id"),
  type = "icd10"
)

# OPCS cleaning example
inpatient_codes(
  x = inpatient_test,
  field_strings = c("procedure_code", "procedure_date"),
  patient_id_vars = c("id", "spell_id"),
  type = "opcs"
)

Synthetic Lab Data for epidm

Description

A dataset containing synthetic lab data for testing epidemiological data transformation functions.

Usage

data(lab_data)

data(lab_data)

Format

A data frame with the following columns:

nhs_number: NHS number
local_patient_identifier: Patient identifier such as hospital number
patient_birth_date: Date of birth of the patients.
sex: Gender of the patients (Factor with levels: "Female", "Male").
surname: Patient surname
forename: Patient forename
organism_species_name: Organism species name (Factor with levels: "KLEBSIELLA PNEUMONIAE").
specimen_date: Date of specimen collection.
specimen_type: Type of specimen: BLOOD or URINE.
lab_code: Laboratory codes (Factor with unique levels).
local_authority_name: Name of the local authority.
local_authority_code: Code of the local authority.
postcode: Postcode

A data frame with the following columns:

nhs_number: NHS number
local_patient_identifier: Patient identifier such as hospital number
patient_birth_date: Date of birth of the patients.
sex: Gender of the patients (Factor with levels: "Female", "Male").
surname: Patient surname
forename: Patient forename
organism_species_name: Organism species name (Factor with levels: "KLEBSIELLA PNEUMONIAE").
specimen_date: Date of specimen collection.
specimen_type: Type of specimen: BLOOD or URINE.
lab_code: Laboratory codes (Factor with unique levels).
local_authority_name: Name of the local authority.
local_authority_code: Code of the local authority.
postcode: Postcode

Examples

data(lab_data)
head(lab_data)

data(lab_data)
head(lab_data)

Link A&E (ECDS) records to Inpatient (HES/SUS) spells

Description

Links Emergency Care Data Set (ECDS) A&E records to inpatient spells (HES/SUS) using patient identifiers (NHS number, hospital number, date of birth), organisation code, and a link date. The inpatient records should already be aggregated to the desired spell level (e.g., provider spell, CIP, or “mega” spell). The output is a patient-level linked table suitable for downstream pathway analysis.

Usage

link_ae_inpatient(
  ae = list(data, record_id = "unique_record_id", nhs_number = "nhs_number",
    hospital_number = "local_patient_identifier", patient_dob = "patient_birth_date",
    org_code = "organisation_code_of_provider", arrival_date = "arrival_date",
    departure_date = "departure_date"),
  inp = list(data, record_id = "unique_record_id", nhs_number = "nhs_number",
    hospital_number = "local_patient_identifier", patient_dob = "date_birth", org_code =
    "organisation_code_code_of_provider", spell_id = "mega_spell_id", spell_start_date =
    "spell_start_date"),
  .forceCopy = FALSE
)

Arguments

ae

A named list describing the A&E (ECDS) input with quoted column names:

data: ECDS A&E dataset (unquoted object).
record_id: Optional unique row id to retain.
arrival_date: A&E arrival date column.
departure_date: A&E departure date column.
nhs_number: NHS number column.
hospital_number: Local patient identifier column.
patient_dob: Date of birth column.
org_code: Provider organisation code column.

inp

A named list describing the inpatient (HES/SUS) input with quoted column names:

data: Inpatient dataset (unquoted object).
record_id: Optional unique row id to retain.
spell_start_date: Inpatient spell start/admission date column.
spell_id: Spell identifier to carry through (e.g., CIP or mega spell id).
nhs_number: NHS number column.
hospital_number: Local patient identifier column.
patient_dob: Date of birth column.
org_code: Provider organisation code column.

.forceCopy

Value

A linked data.table at patient/spell level containing:

Harmonised identifiers (NHS number / hospital number, DOB, org code)
A&E arrival/departure dates
Inpatient spell identifiers and spell start date
(If supplied) preserved record_id columns (tagged ⁠*_ae⁠ / ⁠*_inp⁠)

Workflow context

Use link_ae_inpatient() after:

ECDS A&E data are cleaned/standardised (e.g., discharge categories via lookup_recode()).
Inpatient records have been aggregated to spells (e.g., cip_spells() for continuous inpatient spells).

Use it before:

Rejoining to SGSS infection episodes (e.g., outputs of group_time()), or deriving hospital_in_out_dates for entry/exit timelines.

Examples

## Not run: 
sample_ae <- data.table::data.table(
  nhs_number = c("645114517",
                 "645114517","645114517","382940103","321908341",
                 "321908341","321908341","599534707",
                 "403454211","349959089","252341591","180554160",
                 "180554160","281980720","473372796","369773534",
                 "474044124","639064842","662468568","115641745",
                 "115641745","821084975","479637024",
                 "527021626","527021626","233168855","702650869",NA,
                 "537235036",NA,"517229961",NA,"480142132",
                 "368288558","554675340"),
  local_patient_identifier = c(NA,NA,NA,
                               "I3348707",NA,NA,NA,"P1350948",NA,NA,
                               "Q4157514",NA,NA,"D1101843","K2440769","E1366499",
                               NA,"K1494229","R4678220","J5206297","J5206297",
                               "S1945338","F2159102",NA,NA,"D6300794",NA,
                               "W1208900","Z4975449","G7439612","T1266485",
                               "N4842033","Q5566884","P2689566",NA),
  patient_birth_date = c("2021-03-03",
                         "2021-03-03","2021-03-03","2003-08-24",
                         "2001-06-21","2001-06-21","2001-06-21","1991-10-08",
                         "1987-02-03","1962-06-17","1991-10-07",
                         "1985-10-16","1985-10-16","1990-09-24","1984-11-14",
                         "1994-05-05","1999-08-11","1983-01-04",
                         "2017-06-01","1975-09-04","1975-09-04","1993-07-13",
                         "2014-01-05","1995-09-30","1995-09-30",
                         "1976-06-25","2000-06-02","2017-06-11","2007-05-03",
                         "1986-08-28","2016-10-14","2016-02-04",
                         "2004-03-02","1979-01-17","1974-06-14"),
  organisation_code = c("P3P","P3P",
                        "P3P","Z4R","A9I","A9I","A9I","Z4R","V5T",
                        "Z9V","P3P","T7N","T7N","V2P","T4H","V9Y",
                        "Z7N","W6Y","G2H","V5T","V5T","W6Y","J6J",
                        "J6J","J6J","L4Q","P3P","F0N","A6C","O1A",
                        "F0N","O2R","W6Y","V0R","O1A"),
  arrival_date = c("2022-05-10",
                   "2022-05-29","2022-08-03","2022-05-17",
                   "2022-05-07","2022-05-07","2022-05-07","2022-05-23",
                   "2022-05-13","2022-06-04","2022-05-14",
                   "2022-05-17","2022-06-05","2022-05-25","2022-05-24",
                   "2022-06-09","2022-06-18","2022-06-11",
                   "2022-06-22","2022-06-20","2022-07-18","2022-06-09",
                   "2022-06-05","2022-06-26","2022-06-11",
                   "2022-06-25","2022-06-10","2022-06-12","2022-06-16",
                   "2022-07-10","2022-06-20","2022-07-10",
                   "2022-07-20","2022-07-20","2022-07-19"),
  departure_date = c("2022-05-10",
                     "2022-05-30","2022-08-03","2022-05-17",
                     "2022-05-07","2022-05-07","2022-05-07","2022-05-23",
                     "2022-05-13","2022-06-04","2022-05-14",
                     "2022-05-17","2022-06-05","2022-05-25","2022-05-24",
                     "2022-06-09","2022-06-18","2022-06-11",
                     "2022-06-22","2022-06-20","2022-07-18","2022-06-09",
                     "2022-06-05","2022-06-26","2022-06-11",
                     "2022-06-25","2022-06-10","2022-06-12","2022-06-16",
                     "2022-07-11","2022-06-20","2022-07-10",
                     "2022-07-20","2022-07-20","2022-07-19")
)
sample_ae$pcd <- paste0(sample(LETTERS,1),sample(1:14,1)," ",
                        sample(LETTERS,1),sample(0:9,1),sample(LETTERS,1))
sample_ae$id = seq_len(nrow(sample_ae))*7

sample_inp <- data.table::data.table(
  nhs_number = c("335661151",
                 "335661151","335661151","335661151","335661151",
                 NA,NA,NA,NA,NA,NA,"645114517","645114517",
                 "645114517","143423716","212261130",
                 "212261130","212261130","212261130","212261130",
                 "349959089","317344169","317344169","317344169",
                 "317344169","317344169","317344169","317344169",
                 "317344169","317344169","317344169",
                 "317344169","317344169","317344169","180554160",
                 "180554160",NA,NA,NA,NA,NA,NA,NA,"230782291",
                 "977111015","977111015","977111015","977111015",
                 "683785606",NA,NA,NA,NA,NA,NA,NA,NA,NA,
                 NA,NA,"281980720","270646497",NA,
                 "387252583","639064842","836297039","836297039",
                 "348614531","348614531","662468568","112340924",
                 "112340924","381361439","493239044",NA,NA,NA,
                 "115641745","115641745","233761482","233761482",
                 "479637024","527021626","527021626",
                 "294666415","233168855","702650869","460180094",
                 "561169746","517229961",NA,"480142132","554675340",
                 "135888675",NA,"684718902"),
  local_patient_identifier = c(NA,NA,NA,NA,
                               NA,"D4809270","D4809270","D4809270",
                               "D4809270","D4809270","D4809270",NA,NA,NA,
                               "J2098200","D2139084","D2139084","D2139084","D2139084",
                               "D2139084",NA,NA,NA,NA,NA,NA,NA,NA,NA,
                               NA,NA,NA,NA,NA,NA,NA,"A1706089",
                               "A1706089","A1706089","A1706089","A1706089",
                               "A1706089","A1706089","Z3093435","I1605735","I1888797",
                               "I1888797","I1888797","J2901593","U1111563",
                               "U1111563","U1111563","U1111563","V2246708",
                               "V2246708","V2246708","V2246708","V2246708",
                               "V2246708","V2246708","D1101843",NA,"O2700100",
                               "I5040881","K1494229","I1222012","I1222012",
                               NA,NA,"R4678220","P2632883","P2632883",
                               "J6723431","Y1506318","F3501197","F3501197",
                               "F3501197","J5206297","J5206297","B2651449",
                               "B2651449","F2159102",NA,NA,"W5097806","D6300794",
                               NA,"U2715517","O5278248","T1266485",
                               "N4842033","Q5566884",NA,"X2768295","H3196212",
                               "J9365439"),
  date_birth = c("2021-08-14",
                         "2021-08-14","2021-08-14","2021-08-14",
                         "2021-08-14","1960-05-20","1960-05-20","1960-05-20",
                         "1960-05-20","1960-05-20","1960-05-20",
                         "2021-03-03","2021-03-03","2021-03-03","2019-11-09",
                         "1953-04-05","1953-04-05","1953-04-05",
                         "1953-04-05","1953-04-05","1962-06-17","1952-04-03",
                         "1952-04-03","1952-04-03","1952-04-03",
                         "1952-04-03","1952-04-03","1952-04-03","1952-04-03",
                         "1952-04-03","1952-04-03","1952-04-03",
                         "1952-04-03","1952-04-03","1985-10-16","1985-10-16",
                         "1993-07-09","1993-07-09","1993-07-09",
                         "1993-07-09","1993-07-09","1993-07-09","1993-07-09",
                         "1980-10-14","1976-08-03","1976-08-03",
                         "1976-08-03","1976-08-03","1981-08-27","2017-08-20",
                         "2017-08-20","2017-08-20","2017-08-20",
                         "2017-08-20","2017-08-20","2017-08-20","2017-08-20",
                         "2017-08-20","2017-08-20","2017-08-20",
                         "1989-07-11","1964-04-30","1991-12-25","1961-08-16",
                         "1983-01-04","1957-01-29","1957-01-29",
                         "1982-12-05","1982-12-05","2017-06-01","1989-09-21",
                         "1989-09-21","1986-10-06","1995-03-01",
                         "1964-04-25","1964-04-25","1964-04-25","1975-09-04",
                         "1975-09-04","1995-06-17","1995-06-17",
                         "2014-01-05","1995-09-30","1995-09-30","1993-06-09",
                         "1976-06-25","2000-06-02","1986-09-14",
                         "2016-11-19","2016-10-14","2016-02-04","2004-03-02",
                         "1974-06-14","1945-05-14","2001-09-16",
                         "1987-08-19"),
  organisation_code = c("L4Q","L4Q",
                        "L4Q","P3P","P3P","U6X","U6X","U6X","U6X",
                        "U6X","U6X","P3P","P3P","P3P","L4Q","O2B",
                        "O2B","O2B","O2B","O2B","Z9V","U8V","U8V",
                        "U8V","U8V","U8V","U8V","U8V","U8V","U8V",
                        "U8V","U8V","U8V","U8V","T7N","T7N","V5T",
                        "V5T","V5T","V5T","V5T","V5T","V5T","V7E",
                        "J6J","Y9V","Y9V","Y9V","V7E","B1A","B1A",
                        "B1A","B1A","J2W","J2W","J2W","J2W","J2W",
                        "J2W","J2W","V2P","O1A","O2A","F1O","W6Y",
                        "T2Y","T2Y","G2H","G2H","G2H","J6J","J6J",
                        "J6J","V5T","G2H","G2H","G2H","V5T","V5T",
                        "T4H","T4H","J6J","J6J","J6J","G7H","L4Q",
                        "P3P","L4Q","U8V","F0N","O2R","W6Y","O1A",
                        "A9V","G9V","L4Q"),
  mega_spell_id = c("10.3.0",
                    "10.3.1","10.3.2","4.2.0","4.2.1","7.12.0",
                    "7.12.1","7.12.2","7.12.3","7.12.4","7.12.5",
                    "14.3.0","14.3.1","14.3.2","22.2.1","49.6.0",
                    "49.6.1","49.6.2","49.6.3","49.6.4","69.1.0",
                    "76.20.0","76.20.1","76.20.2","76.20.3",
                    "76.20.4","76.20.5","76.20.6","76.20.7","76.20.8",
                    "76.20.9","76.20.10","76.20.11","76.20.12",
                    "77.7.0","77.7.1","79.7.0","79.7.1","79.7.2",
                    "79.7.3","79.7.4","79.7.5","79.7.6","83.1.0",
                    "157.1.0","85.5.0","85.5.1","85.5.2","90.1.0",
                    "167.4.0","167.4.1","167.4.2","167.4.3",
                    "91.9.0","91.9.1","91.9.2","91.9.3","91.9.4",
                    "91.9.5","91.9.6","101.2.0","111.5.0","122.1.0",
                    "151.1.0","154.1.0","161.3.0","161.3.1",
                    "181.4.0","181.4.1","184.1.0","185.2.0","185.2.1",
                    "201.1.0","214.1.0","226.3.0","226.3.1",
                    "226.3.2","247.4.0","247.4.1","266.4.0","266.4.1",
                    "269.2.0","270.2.0","270.2.1","284.1.0",
                    "299.2.0","307.1.0","314.3.0","345.1.0",
                    "400.1.0","419.1.0","430.3.1","494.3.0","498.1.0",
                    "501.1.0","535.1.0"),
  spell_start_date = c("2022-05-20",
                       "2022-06-14","2022-06-20","2022-05-01",
                       "2022-05-07","2022-05-16","2022-05-29","2022-05-18",
                       "2022-05-21","2022-06-27","2022-07-18",
                       "2022-05-10","2022-05-29","2022-08-03","2022-05-13",
                       "2022-05-27","2022-05-20","2022-06-09",
                       "2022-06-27","2022-07-27","2022-06-04","2022-05-21",
                       "2022-06-18","2022-06-24","2022-07-16",
                       "2022-07-10","2022-07-17","2022-07-20","2022-07-15",
                       "2022-07-08","2022-08-01","2022-08-04",
                       "2022-07-27","2022-07-14","2022-05-17","2022-06-05",
                       "2022-06-01","2022-06-05","2022-06-09",
                       "2022-05-23","2022-05-27","2022-06-10","2022-06-12",
                       "2022-05-29","2022-05-29","2022-06-02",
                       "2022-05-29","2022-06-15","2022-06-09","2022-06-21",
                       "2022-05-29","2022-07-18","2022-07-26",
                       "2022-05-12","2022-06-12","2022-06-11","2022-06-28",
                       "2022-06-29","2022-06-22","2022-07-09",
                       "2022-05-19","2022-05-25","2022-05-18","2022-05-26",
                       "2022-06-11","2022-06-21","2022-06-13",
                       "2022-05-27","2022-06-27","2022-06-22","2022-06-08",
                       "2022-07-04","2022-06-26","2022-06-26",
                       "2022-06-11","2022-06-17","2022-07-13","2022-06-20",
                       "2022-07-18","2022-06-15","2022-07-03",
                       "2022-06-05","2022-06-26","2022-06-11","2022-06-24",
                       "2022-06-25","2022-06-10","2022-07-01",
                       "2022-07-04","2022-06-20","2022-07-10","2022-07-20",
                       "2022-07-19","2022-07-28","2022-07-27",
                       "2022-07-10"),
  spell_end_date = c("2022-06-17",
                     "2022-06-16","2022-06-20","2022-05-01",
                     "2022-05-11","2022-05-16","2022-05-29","2022-05-18",
                     "2022-05-21","2022-07-05","2022-07-18",
                     "2022-05-11","2022-06-01","2022-08-04","2022-05-13",
                     "2022-05-29","2022-05-21","2022-06-10",
                     "2022-06-30","2022-07-28","2022-06-05","2022-06-25",
                     "2022-06-18","2022-06-24","2022-07-16",
                     "2022-07-10","2022-07-17","2022-07-20","2022-07-15",
                     "2022-07-08","2022-08-01","2022-08-04",
                     "2022-07-27","2022-07-14","2022-05-30","2022-06-10",
                     "2022-06-01","2022-06-05","2022-06-09",
                     "2022-05-23","2022-05-27","2022-06-10","2022-06-12",
                     "2022-05-30","2022-06-05","2022-06-02",
                     "2022-06-01","2022-06-15","2022-06-15","2022-06-21",
                     "2022-05-29","2022-07-18","2022-07-26",
                     "2022-05-15","2022-06-12","2022-06-17","2022-06-28",
                     "2022-06-29","2022-06-23","2022-07-11",
                     "2022-05-19","2022-06-09","2022-05-23","2022-05-26",
                     "2022-06-20","2022-06-21","2022-06-14",
                     "2022-06-02","2022-06-27","2022-06-23","2022-06-08",
                     "2022-07-04","2022-06-26","2022-06-28",
                     "2022-06-11","2022-06-17","2022-07-13","2022-06-21",
                     "2022-07-20","2022-06-18","2022-07-03",
                     "2022-06-11","2022-06-26","2022-06-11","2022-06-24",
                     "2022-07-01","2022-06-10","2022-07-06",
                     "2022-07-06","2022-06-23","2022-07-11","2022-07-22",
                     "2022-07-22","2022-07-28","2022-07-27",
                     "2022-07-10")
)

sample_inp$pcd <- paste0(sample(LETTERS,1),sample(1:14,1)," ",
                         sample(LETTERS,1),sample(0:9,1),sample(LETTERS,1))
sample_inp$id = seq_len(nrow(sample_inp))*3

link_ae_inpatient(
  ae = list(
    data = sample_ae,
    arrival_date = 'arrival_date',
    departure_date = 'departure_date',
    nhs_number = 'nhs_number',
    hospital_number = 'local_patient_identifier',
    patient_dob = 'patient_birth_date',
    org_code = 'organisation_code'
  ),
  inp = list(
    data = sample_inp,
    spell_id = 'mega_spell_id',
    spell_start_date = 'spell_start_date',
    nhs_number = 'nhs_number',
    hospital_number = 'local_patient_identifier',
    patient_dob = 'date_birth',
    org_code = 'organisation_code'
  )
)[]

## End(Not run)

Lookup table switch handler

Description

A function to recode values via named lookup tables (i.e call an epidm lookup table and recode where we are aware of a new value). It routes to a specific lookup based on type, returning a character vector where each input value has been mapped to its corresponding replacement. If a value is not found in the lookup then the original value is returned.

Built‑in lookups include:

species: Uses the respeciate_organism dataset to standardise and reclassify organism names (e.g., historic → current nomenclature). This supports consistent reporting across SGSS and other laboratory datasets.
specimen: Uses the specimen_type_grouping dataset to assign raw laboratory specimen types into harmonised specimen groups. This enables consistent grouping for reporting, aggregation, and filtering.
genus_gram_stain: Uses the genus_gram_stain lookup table, which provides Gram stain classifications by bacterial genus. This reference is manually maintained against the UKHSA SGSS database and supports rapid filtering and high‑level organism categorisation. Users should raise an issue or submit a pull request to the epidm GitHub repository if an organism/genus is missing.
lab_data: Uses the lab_data lookup dataset for harmonising laboratory code systems and internal SGSS mappings, supporting standardised laboratory result interpretation within surveillance pipelines.
inpatient_admission_method: Uses the internal lookup table epidm:::group_inpatient_admission_method to categorise raw hospital admission method codes into operationally meaningful groups.
inpatient_discharge_destination: Uses the internal table epidm:::group_inpatient_discharge_destination to group hospital discharge destination codes into standardised categories for inpatient pathway analysis.
ecds_destination_code: Uses the internal table epidm:::group_ecds_discharge_destination, providing grouped mappings for ECDS (Emergency Care Data Set) discharge codes.
manual: Allows the user to supply their own lookup through .import = list(new, old). This is useful when working with local, provisional, or evolving code sets not yet included in the package’s centralised lookup tables.

Usage

lookup_recode(
  src,
  type = c("species", "specimen", "inpatient_admission_method",
    "inpatient_discharge_destination", "ecds_destination_code", "manual"),
  .import = NULL
)

Arguments

src

Character vector (or column) of values to recode. Coerced to character if needed.

type

Character scalar specifying the lookup to use. One of: 'species', 'specimen', 'inpatient_admission_method', 'inpatient_discharge_destination', 'ecds_destination_code', 'manual'.

.import

A two‑element list in the format list(new, old) used only when type = 'manual'. Each element must be a vector of equal length.

Value

A character vector containing the recoded values, aligned 1:1 with src. Values not present in the lookup are returned unchanged.

Examples

df <- data.frame(
  spec = c(
    sample(grep(")",
                respeciate_organism$previous_organism_name,
                value=TRUE,
                invert = TRUE),
           9),
    "ESCHERICHIA COLI","SARS-COV-2","CANDIDA AUREUS"),
  type = sample(specimen_type_grouping$specimen_type,12),
  date = sample(seq.Date(from = Sys.Date()-365,
                         to = Sys.Date(),
                         by = "day"),12)
)
df <- df[order(df$date),]

# show the data before the changes
df

# check the lookup tables
# observe the changes
head(respeciate_organism[1:2])
df$species <- lookup_recode(df$spec,'species')
df[,c('spec','species','date')]

head(specimen_type_grouping)
df$grp <- lookup_recode(df$type,'specimen')
df[,c('species','type','grp','date')]

# for a tidyverse use
# df %>% mutate(spec=lookup_recode(spec,'species))

# manual input of your own lookup
# .import=list(new,old)
lookup_recode(
  "ALCALIGENES DENITRIFICANS",
  type = 'manual',
  .import=list(respeciate_organism$organism_species_name,
               respeciate_organism$previous_organism_name)
  )

Clean and Impute HES/SUS Episode Start and End Dates

Description

A utility for cleaning and imputing missing or inconsistent episode end dates in HES/SUS–style inpatient data. The function identifies missing, invalid, or overlapping spell dates within patient/provider groups and applies deterministic rules to correct them. It also assigns a flag (proxy_missing) indicating whether a value was modified and why.

Usage

proxy_episode_dates(
  x,
  group_vars,
  spell_start_date,
  spell_end_date,
  discharge_destination,
  .dropTmp = TRUE,
  .forceCopy = FALSE
)

Arguments

x

A data.frame or data.table. Will be converted to a data.table if not already.

group_vars

Character vector of grouping variables (e.g., patient ID, provider). At least one identifier must be supplied.

spell_start_date

Name of the column containing the episode or spell start date.

spell_end_date

Name of the column containing the episode or spell end date.

discharge_destination

Name of the column containing the CDS discharge destination code.

.dropTmp

Logical (default TRUE). If TRUE, temporary processing columns are removed before returning the result.

.forceCopy

Value

A data.table containing:

Cleaned spell start and end dates.
A flag variable (proxy_missing) indicating whether a date was modified and the rule applied (0–4).

Examples


proxy_test <- data.frame(
  id = c(
    rep(3051, 4),
    rep(7835,3),
    rep(9891,3),
    rep(1236,3)
  ),
  provider = c(
    rep("QKJ", 4),
    rep("JSD",3),
    rep("YJG",3),
    rep("LJG",3)
  ),
  spell_start = as.Date(c(
    "2020-07-03", "2020-07-14", "2020-07-23", "2020-08-05",
    "2020-11-01", "2020-11-13", "2020-12-01",
    "2020-03-28", "2020-04-06", "2020-04-09",
    "2020-10-06", "2020-11-05", "2020-12-25"
  )),
  spell_end = as.Date(c(
    "2020-07-11", "2020-07-22", "2020-07-30", "2020-07-30",
    "2020-11-11", NA, "2020-12-03",
    "2020-03-28", NA, "2020-04-09",
    "2020-10-06", "2020-11-05", NA
  )),
  disdest = c(
    19, 19, 51, 19,
    19, 19, 19,
    51, 98, 19,
    19, 19, 98
  )
)


proxy_episode_dates(
  x=proxy_test,
  group_vars = c('id','provider'),
  spell_start_date = 'spell_start',
  spell_end_date = 'spell_end',
  discharge_destination = 'disdest'
)[]

Respeciate unspecified samples

Description

Some samples within SGSS are submitted by laboratories as "GENUS SP" or "GENUS UNNAMED". However, they may also have a fully identified sample taken from the same site within a recent time period. This function captures species_col from another sample within X-days of an unspeciated isolate. Respeciation is restricted to organisms of the same genus; species will not be inferred from isolates belonging to a different genus. Trailing "UNNAMED" is normalised to "SP" before any processing.

Usage

respeciate_generic(
  x,
  group_vars,
  species_col,
  date_col,
  window = c(0:Inf),
  .forceCopy = FALSE
)

Arguments

x

a data.frame or data.table object

group_vars

the minimum grouping set of variables for like samples in a character vector; suggest c('patient_id','specimen_type') - genus will automatically be included in the groupby. This is built from the species_col

species_col

a character containing the column with the organism species_col name

date_col

a character containing the column with the specimen/sample date_col

window

an integer representing the number of days for which you will allow a sample to be respeciated

.forceCopy

Value

a data.table with a recharacterised species_col column

Examples


df <- data.frame(
ptid = c(round(runif(25,1,5))),
spec = sample(c("KLEBSIELLA SP",
                "KLEBSIELLA UNNAMED",
                "KLEBSIELLA PNEUMONIAE",
                "KLEBEIELLA OXYTOCA"),
              25,replace = TRUE),
type = "BLOOD",
specdate = sample(seq.Date(Sys.Date()-21,Sys.Date(),"day"),25,replace = TRUE)
)

respeciate_generic(x=df,
                   group_vars=c('ptid','type'),
                   species_col='spec',
                   date_col='specdate',
                   window = 14)[]

Respeciated organisms

Description

Occasionally, research shows that two organisms, previously thought to be different are in fact one and the same. The reverse is also true. This is a manually updated list. If there are organisms missing, or new respeciates to be added, please raise and issue or push request on the epidm GitHub

Usage

respeciate_organism

Format

previous_organism_name: What the organism used to be known as, in the form GENUS SPECIES
organism_species_name: What the organism is known as now, in the form GENUS SPECIES
organism_genus_name: The genus of the recoded organism
genus_change: A 0/1 flag to indicate if the genus has changed
genu_all_species: A 0/1 flag to indicate if all species under that genus should change

Specimen type grouping

Description

In order to help clean up an analysis based on a group of specimen types, a lookup table has been created to help group sampling sites. This is a manually updated list. If there are organisms missing, or new respeciates to be added, please raise and issue or push request on the epidm GitHub

Usage

specimen_type_grouping

Format

specimen_type: The primary specimen type with detail
specimen_group: A simple grouping of like specimen sites

Clean and Read a SQL query

Description

A utility function to read in a SQL query from a character object, clipboard or text file and remove all comments for use with database query packages

Usage

sql_clean(sql)

Arguments

sql

a SQL file or text string

Value

a cleaned SQL query without comments as a character string

Examples

testSQL <- c(
"/********* INTRO HEADER COMMENTS",
"*********/",
"  SELECT ",
"  [VAR 1]  -- with comments",
",[VAR 2]",",[VAR 3]",
"FROM DATASET ","-- output here")
sql_clean(testSQL)

Connect to a SQL database

Description

An function to help setup connections to SQL databases acting as a wrapper for the odbc and DBI packages. Used by other sql_* tools within epidm. This uses the credential manager within the system and assumes you are using a trusted connection.

Usage

sql_connect(server, database)

Arguments

server

a string containing the server connection; note that servers may require the use of double backslash ⁠\\⁠

database

a string containing the database name within the data store

Value

a SQL connection object

Examples

## Not run: 
sql <- list(
  dsn = list(ser = 'covid.ukhsa.gov.uk',
             dbn = 'infections')
)

sgss_con = sql_connect(server = sql$dsn$ser, database = sql$dsn$dbn)

## End(Not run)

Read a table from a SQL database

Description

Read a table object to a SQL database. Acts a wrapper for odbc and DBI packages.

Usage

sql_read(server, database, sql)

Arguments

server

a string containing the server connection

database

a string containing the database name within the data store

sql

a string containing a SQL query or to a .sql/.txt SQL query

Value

a table from a SQL database

Write a table to a SQL database

Description

Write a table object to a SQL database. Acts a wrapper for odbc and DBI packages with additional checks to ensure upload completes.

Usage

sql_write(x, server, database, tablename)

Arguments

x

a data.frame/data.table/tibble object

server

a string containing the server connection

database

a string containing the database name within the data store

tablename

a string containing the chosen SQL database table name

Value

writes a data.frame/data.table/tibble to a SQL database

Patient ID record grouping

Description

Assigns a single integer id to records that belong to the same patient by applying a sequence of deterministic matching stages across common identifiers (NHS number, hospital number, DOB, name, sex, postcode). Identifiers are standardised, validated using NHS checksum function, and fuzzy name keys are used in later stages. ' Matching is performed in order through the following stages (first match is applied):

NHS number + date of birth
Hospital number + date of birth
NHS number + hospital number
NHS number + surname
Hospital number + surname
Date of birth + surname (only where NHS is invalid/absent)
Sex + full name (forename + surname)
Sex + date of birth + fuzzy name (Soundex; surname + initial)
Date of birth (YYYY-MM) + fuzzy name
Surname/forename + postcode
Name swaps (forename/surname reversed) + date of birth

Use .useStages to restrict which stages are applied (default: 1:11). The function generates a reproducible id per patient within the sort order; you can provide .sortOrder (e.g., a date column) to make assignment deterministic.

Validity rules applied:

NHS number validated using the standard checksum (epidm::valid_nhs()).
Hospital number: excludes known placeholders (e.g., "UNKNOWN", "NO PATIENT ID").
DOB: excludes proxy or missing dates ("1900-01-01", "1800-01-01", NA).
Sex: normalised to "M" / "F"; others → NA.
Names: uppercased, Latin characters normalised; Soundex used for fuzzy matching.

Identifiers are copied over where they are missing or invalid to the grouped records.

Usage

uk_patient_id(
  x,
  id = list(nhs_number = "nhs_number", hospital_number = "patient_hospital_number",
    date_of_birth = "date_of_birth", sex_mfu = "sex", forename = "forename", surname =
    "surname", postcode = "postcode"),
  .useStages = c(1:11),
  .keepStages = FALSE,
  .keepValidNHS = FALSE,
  .sortOrder,
  .forceCopy = FALSE
)

Arguments

x

A data.frame or data.table with patient identifiers.

id

A named list of quoted column names:

nhs_number: NHS number.
hospital_number: Local patient identifier (hospital number).
date_of_birth: Date of birth.
sex_mfu: Sex/gender (M/F/Unknown).
forename: Forename / first name.
surname: Surname / last name.
postcode: Patient postcode.

.useStages

optional, default 1:11; set to 1 if you wish patient ID to be assigned cases with the same DOB and NHS number, set to 2 if you wish patient ID to be assigned to cases with the same hospital number (HOS) and DOB, set to 3 if you wish patient ID to be assigned cases with the same NHS and HOS number, set to 4 if you wish patient ID to be assigned cases with the same NHS number and surname, set to 5 if you wish patient ID to be assigned cases with the same hospital number and surname, set to 6 if you wish patient ID to be assigned cases with the same DOB and surname, set to 7 if you wish patient ID to be assigned cases with the same sex and full name, set to 8 if you wish patient ID to be assigned cases with the same sex, DOB and fuzzy name, set to 9 if you wish patient ID to be assigned cases with the same DOB and fuzzy name, set to 10 if you wish patient ID to be assigned cases with the same name and postcode, set to 11 if you wish patient ID to be assigned cases with the same first name or second name in changing order and date of birth.

.keepStages

optional, default FALSE; to generate a new column (stageMatch) to retain the stage information for which the record matched the group.

.keepValidNHS

optional, default FALSE; set TRUE if you wish to retain the column with the NHS checksum result stored as a BOOLEAN

.sortOrder

optional; a column as a character to allow a sorting order on the id generation

.forceCopy

optional, default FALSE; TRUE will force data.table to take a copy instead of editing the data without reference

Value

A data.table with the original columns plus:

id: Integer patient identifier assigned by staged matching.
valid_nhs: (Optional) BOOLEAN NHS checksum flag; included when .keepValidNHS = TRUE.

Workflow context

uk_patient_id() is typically used early to harmonise patient identity across isolates before downstream tasks such as specimen episode grouping (group_time()), dataset linkage (e.g., to HES/SUS/ECDS), and epidemiological reporting.

Examples

id_test <-
  data.frame(
    stringsAsFactors = FALSE,
    record_id = c(1L,2L,3L,4L,
                  5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,
                  16L,17L,18L,19L,20L,21L,22L,23L,24L),
    nhs_number = c(9435754422,
                   9435754422,NA,9435754422,5555555555,NA,
                   9435773982,NA,9999999999,NA,9435773982,NA,
                   9435802508,9435802508,NA,NA,9435802508,9435802508,NA,
                   3333333333,NA,9999999999,9435817777,
                   9435817777),
    local_patient_identifier = c(NA,"IG12067",
                                 NA,NA,"IG12067","IG12067","KR2535","KR2535",
                                 "KR2535",NA,NA,NA,"UK8734","UK8734",NA,NA,
                                 "UK8734","UK8734",NA,NA,"JH45204",
                                 "HS45202","HS45202","JH45204"),
    patient_birth_date = c("1993-07-16",
                           "1993-07-16","1993-07-16","1993-07-16",
                           "1993-07-16",NA,"1967-02-10",NA,"1967-02-10",NA,NA,
                           "1967-02-10",NA,NA,"1952-10-22","1952-10-22",
                           "1952-10-22",NA,"1947-09-14","1947-09-14",
                           "1947-09-14","1947-09-14","1947-09-14",
                           "1947-09-14"),
    sex = c("Male","Male",
            "Male","Male",NA,"Male","Female","Female",
            "Female","Female","Female","Female","Male",
            "Male","Male","Male","Male","Male","Male",
            "Male","Male","Male",NA,"Male"),
    forename = c(NA,"DENNIS",
                 NA,NA,"DENNIS",NA,"ELLIE","ELLIE",NA,
                 "ELLIE","ELLIE","ELLIE","IAN","IAN","MALCOLM",
                 "IAN","IAN",NA,"GRANT","ALAN","ALAN","ALAN",
                 "GRANT","ALAN"),
    surname = c(NA,"NEDRY",
                "NEDRY",NA,"NEDRY","NEDRY","SATTLER","SATTLER",
                NA,"SATTLER","SATTLER","SATTLER","M",NA,
                "IAN","MALCOLM","MALCOLM",NA,"ALAN","GRANT",
                "GRANT","GRANT","ALAN","GRANT"),
    postcode = c("HA4 0FF",
                 "HA4 0FF","HA4 0FF",NA,"HA4 0FF","HA4 0FF",
                 "L3 1DZ","L3 1DZ","L3 1DZ","L3 1DZ",NA,"L3 1DZ",
                 "BN14 9EP",NA,"BN14 9EP",NA,NA,NA,"CW6 9TX",
                 "CW6 9TX",NA,NA,NA,NA),
    specimen_date = c("2024-08-14",
                      "2023-02-03","2023-02-07","2023-02-04",
                      "2023-02-09","2024-08-14","2021-03-28","2021-03-28",
                      "2021-03-28","2021-03-28","2021-03-28",
                      "2021-03-28","2024-07-06","2024-07-06","2024-07-06",
                      "2023-10-31","2023-10-31","2023-10-31",
                      "2022-01-23","2022-01-24","2022-01-25","2022-01-26",
                      "2022-01-27","2022-01-28")
  )

data.table::setDT(id_test)

uk_patient_id(
  x = id_test,
  id = list(
    nhs_number = 'nhs_number',
    hospital_number = 'local_patient_identifier',
    date_of_birth = 'patient_birth_date',
    sex_mfu = 'sex',
    forename = 'forename',
    surname = 'surname',
    postcode = 'postcode'
  ),
  .sortOrder = 'specimen_date',
  .useStages = c(1:11),
  .keepStages = TRUE,
  .forceCopy = TRUE)[]

NHS Number Validity Check

Description

Validates NHS numbers using the NHS checksum applied to the first 9 digits and compared with the 10th digit. Inputs that are not exactly 10 numeric characters are invalid. Numbers made of the same repeated digit (e.g., "1111111111", "0000000000") are also rejected.

Usage

valid_nhs(nhs_number)

Arguments

nhs_number

A vector of values to validate. Each element is coerced to character for length checking and digit extraction.

Details

Algorithm (summary):

Take the first 9 digits and multiply by weights 10 down to 2 (i.e., d1×10 + d2×9 + … + d9×2).
Compute 11 - (sum %% 11) → this is the expected check digit.
If the expected check digit is 11, treat as 0.
Compare with the 10th digit. If they match, the number is valid.

Additional guards implemented:

If the NHS number is NA or not 10 characters, it is invalid.
If all 10 digits are identical (e.g., "1111111111"), it is invalid.

The function is vectorised and returns 1 for valid and 0 for invalid for each element in the input vector.

Value

A numeric vector of the same length as nhs_number containing:

1 if the value is a valid NHS number
0 otherwise

Examples

test <- floor(runif(1000,1000000000,9999999999))
valid_nhs(test)
valid_nhs(9434765919)

Continuous Inpatient (CIP) Spells

Description

Usage

Arguments

Value

Examples

Download a csv from a zip

Description

Usage

Arguments

Value

Examples

Bacterial Genus Gram Stain Lookup Table

Description

Usage

Format

A&E attendance discharge destination

Description

Usage

Format

Inpatient admission methods

Description

Usage

Format

Inpatient discharge destination

Description

Usage

Format

Grouping of intervals or events that occur close together in time

Description

Usage

Arguments

Details

How the function works

1. Interval-based grouping (start + end dates)

2. Event-based grouping (single-date events)

Handling of missing values

Value

Workflow context

Examples

Hospital IN/OUT dates

Description

Usage

Arguments

Value

Workflow context

Note

See Also

Inpatient Codes cleanup

Description

Usage

Arguments

Value

Examples

Synthetic Lab Data for epidm

Description

Usage

Format

Examples

Link A&E (ECDS) records to Inpatient (HES/SUS) spells

Description

Usage

Arguments

Value

Workflow context

See Also

Examples

Lookup table switch handler

Description

Usage

Arguments

Value

Examples

Clean and Impute HES/SUS Episode Start and End Dates

Description

Usage

Arguments

Value

Examples

Respeciate unspecified samples