| Version: | 1.0.6 |
| Title: | UK Epidemiological Data Management |
| Description: | Contains utilities and functions for the cleaning, processing and management of patient level public health data for surveillance and analysis held by the UK Health Security Agency, UKHSA. |
| URL: | https://github.com/ukhsa-collaboration/epidm |
| BugReports: | https://github.com/ukhsa-collaboration/epidm/issues |
| License: | GPL (≥ 3) |
| Depends: | R (≥ 4.4.0) |
| Imports: | data.table, DBI, odbc, phonics, purrr, readr, stats, stringi, stringr, utils, lubridate |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.3 |
| Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0), patrick, ggplot2, dplyr, tibble |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2026-02-20 13:52:56 UTC; Frederick.Sloots |
| Author: | Alice Graham |
| Maintainer: | Frederick Sloots <frederick.sloots@ukhsa.gov.uk> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-25 14:30:03 UTC |
Continuous Inpatient (CIP) Spells
Description
Creates Continuous Inpatient (CIP) spells by combining one or more provider spells into a single uninterrupted period of inpatient care. CIP definitions follow the NHS Digital methodology: transfers between providers can be part of the same CIP spell where specific admission, discharge, and timing criteria are met. A CIP spell begins when a patient is admitted under consultant care and ends when they are discharged or die. http://content.digital.nhs.uk/media/11859/Provider-Spells-Methodology/pdf/Spells_Methodology.pdf
Where spells meet the CIP criteria, they are merged into a continuous spell. The output includes a CIP index and the derived start and end dates for the full CIP period.
Usage
cip_spells(
x,
group_vars,
spell_start_date,
admission_method,
admission_source,
spell_end_date,
discharge_destination,
patient_classification,
.forceCopy = FALSE
)
Arguments
x |
A data.frame or data.table; will be converted to a data.table (usually HES/SUS data) |
group_vars |
Character vector of variables used to group records (minimum: a patient identifier). |
spell_start_date |
Quoted column name containing the provider spell admission date. |
admission_method |
CDS admission method code. |
admission_source |
CDS admission source code. |
spell_end_date |
Quoted column name containing the provider spell discharge date. |
discharge_destination |
CDS discharge destination code. |
patient_classification |
CDS patient classification code. |
.forceCopy |
Logical (default |
Value
A data.table containing the original data and three new variables:
cip_indxUnique identifier for the derived CIP spell.
cip_spell_startStart date for the continuous inpatient spell.
cip_spell_endEnd date for the continuous inpatient spell.
Examples
cip_test <- data.frame(
id = c('465','465','465','465','8418','8418','8418',
'8418','8418','8418','8418','8418','26443',
'26443','26443','33299','33299','33299','33299',
'33299','33299','33299','33299','33299','33299',
'52635','52635','52635','52635','52635','52635',
'52635','52635','52635','52635','52635','52635',
'52635','52635','52635','52635','52635','52635',
'52635','52635','52635','52635','52635','52635',
'52635','52635','52635','78915','78915','78915'),
provider = c('X1T','X1T','X1T','X1T','KHA','KHA','KHA',
'KHA','KHA','KHA','KHA','KHA','BX2','BX2',
'BX2','PXH','PXH','PXH','PXH','PXH','PXH',
'PXH','PXH','PXH','PXH','9HA','9HA','9HA',
'9HA','9HA','9HA','9HA','9HA','9HA','9HA',
'9HA','9HA','9HA','9HA','9HA','9HA','YYT',
'YYT','YYT','YYT','YYT','YYT','YYT','YYT',
'YYT','YYT','YYT','ABX','ABX','ABX'),
spell_start = as.Date(c(
'2020-03-07','2020-03-07','2020-03-25','2020-04-03','2020-01-25',
'2020-01-26','2020-07-14','2020-08-02','2020-08-12','2020-08-19',
'2020-08-19','2020-11-19','2019-11-12','2020-04-17','2020-04-23',
'2020-07-03','2020-01-17','2020-02-07','2020-03-20','2020-04-27',
'2020-06-21','2020-07-02','2020-10-17','2020-11-27','2021-01-02',
'2019-12-31','2020-01-02','2020-01-14','2020-01-16','2020-02-07',
'2020-02-11','2020-02-14','2020-02-18','2020-02-21','2020-02-25',
'2020-02-28','2020-03-09','2020-03-11','2020-03-12','2020-03-13',
'2020-03-14','2020-02-04','2020-02-07','2020-02-11','2020-02-14',
'2020-02-18','2020-02-21','2020-02-25','2020-02-28','2020-03-09',
'2020-03-11','2020-03-12','2020-04-16','2020-04-24','2020-05-13')),
spell_end = as.Date(c(
'2020-03-07','2020-03-25','2020-04-02','2020-04-27','2020-01-25',
'2020-01-27','2020-07-17','2020-08-07','2020-08-14','2020-08-19',
'2020-08-22','2020-12-16','2020-04-17','2020-04-23','2020-05-20',
'2020-07-24','2020-01-28','2020-02-07','2020-03-23','2020-04-29',
'2020-06-21','2020-07-03','2020-11-27','2021-01-02','2021-01-10',
'2019-12-31','2020-01-11','2020-01-14','2020-02-04','2020-02-07',
'2020-02-11','2020-02-14','2020-02-18','2020-02-21','2020-02-25',
'2020-02-28','2020-03-09','2020-03-11','2020-03-12','2020-03-13',
'2020-03-30','2020-02-07','2020-02-11','2020-02-14','2020-02-18',
'2020-02-21','2020-02-25','2020-02-28','2020-03-09','2020-03-11',
'2020-03-12','2020-03-13','2020-04-24','2020-05-13','2020-06-11')),
adm_meth = c('21','81','21','81','21','21','11','21','21','21','21',
'21','21','81','21','81','21','21','21','21','21','21',
'21','13','13','12','22','12','2D','13','13','13','13',
'13','13','13','13','13','13','13','21','81','81','81',
'81','81','13','81','81','13','13','13','21','11','81'),
adm_src = c('19','51','19','51','19','51','19','51','19','19','19',
'51','19','51','19','51','19','19','19','19','19','19',
'19','51','19','19','19','19','19','19','19','19','19',
'19','19','19','51','51','51','51','19','51','51','51',
'51','51','51','51','51','51','51','51','19','51','51'),
dis_meth = c('1','1','1','1','1','1','1','1','1','1','1','4','1','1',
'4','1','1','1','1','1','1','1','8','1','4','1','1','1',
'1','1','1','1','1','1','1','1','1','1','1','1','1','1',
'1','1','1','1','1','1','1','1','1','1','1','1','2'),
dis_dest = c('51','51','51','54','51','19','19','19','19','51','19',
'79','51','51','79','65','19','19','19','19','19','29',
'98','51','79','19','19','19','51','19','19','19','51',
'51','51','19','19','51','51','19','51','51','51','51',
'51','51','51','51','51','51','51','51','29','54','19'),
patclass = c('1','1','1','1','1','1','1','1','1','1','1','1','1','1',
'1','1','1','1','1','1','1','1','1','1','1','2','1','2',
'1','2','2','2','2','2','2','2','2','2','2','2','1','1',
'1','1','1','1','1','1','1','1','1','1','1','1','1')
)
cip_spells(x=cip_test,
group_vars = c('id','provider'),
patient_classification = 'patclass',
spell_start_date = 'spell_start',
admission_method = 'adm_meth',
admission_source = 'adm_src',
spell_end_date = 'spell_end',
discharge_destination = 'dis_dest'
)[]
Download a csv from a zip
Description
A convenience function to allow you to pull data from NHS, ONS and ODR assets
Usage
csv_from_zip(x)
Arguments
x |
a zip file from the web |
Value
a zip file for ingestion into your chosen readr
Examples
## Not run:
read.csv(csv_from_zip("https://files.digital.nhs.uk/assets/ods/current/pcodeall.zip"))
## End(Not run)
Bacterial Genus Gram Stain Lookup Table
Description
A reference table of bacterial gram stain results by genus to allow faster filtering of bacterial results. This dataset has been maintained manually against the PHE SGSS database. If there are organisms missing, please raise and issue or push request on the epidm GitHub
Usage
genus_gram_stain
Format
A data frame with four columns
- organism_genus
The bacterial genus
- gram_stain
A character string to indicate POSITIVE or NEGATIVE type
- gram_positive
A 0/1 flag to indicate if the genus is gram positive
- gram_negative
A 0/1 flag to indicate if the genus is gram negative
A&E attendance discharge destination
Description
In order to group A&E discharge destination from SNOWMED into human readable groups, a lookup table has been created. These work with Emergency Care Dataset (ECDS) data with the destination_code field to show where a patient goes after discharge from A&E.
Usage
group_ecds_discharge_destination
Format
- code
the ECDS destination_code
- destination_code
the destination grouping as a human readable string
Inpatient admission methods
Description
In order to group hospital inpatient admissions into human readable groups, a lookup table has been created. These work with Hospital Episode Statistics (HES) and Secondary Use Services (SUS) data with the admission_method fields.
Usage
group_inpatient_admission_method
Format
- code
the admission_method code
- admission_method
the admission_method grouping as a human readable string
Inpatient discharge destination
Description
In order to group hospital inpatient discharge destination into human readable groups, a lookup table has been created. These work with Hospital Episode Statistics (HES) and Secondary Use Services (SUS) data with the discharge_destination fields.
Usage
group_inpatient_discharge_destination
Format
- code
the discharge_destination code
- discharge_destination
the discharge_destination grouping as a human readable string
Grouping of intervals or events that occur close together in time
Description
A utility function to group together observations that represent overlapping date intervals (e.g., hospital admission spells) or events occurring within a defined time window e.g., specimen dates grouped into infection episodes. The function supports both:
• Interval-based grouping: records have a start and an end date; any overlapping intervals are grouped together. • Event-based grouping: records have only a start date and are grouped using either a static or rolling time window.
The output provides a unique index per group and the minimum/maximum date that define the resulting aggregated episode or interval.
Usage
group_time(
x,
date_start,
date_end,
window,
window_type = c("rolling", "static"),
group_vars,
indx_varname = "indx",
min_varname = "date_min",
max_varname = "date_max",
.forceCopy = FALSE
)
Arguments
x |
A data.frame or data.table containing date variables for grouping. Will be converted to a data.table internally. |
date_start |
Quoted column name giving the start date for each record. |
date_end |
column containing the end dates for the interval, quoted |
window |
an integer representing a time window in days which will be applied to the start date for grouping events |
window_type |
character, to determine if a 'rolling' or 'static' grouping method should be used when grouping events. A 'static' window will identify the first event, and all records X days from that event will be attributed to the same episode. Eg. in a 14 day window, if first event is on 01 Mar, and events on day 7 Mar and 14 Mar will be grouped, but an event starting 15 Mar days after will start a new episode. A 'rolling' window resets the day counter with each new event. Eg. Events on 01 Mar, 07 Mar, 14 Mar and 15 Mar are all included in a single episode, as will any additional events up until the 29 Mar (assuming a 14-day window). |
group_vars |
Character vector of quoted column names used to partition the data before grouping. |
indx_varname |
a character string to set variable name for the index column which provides a grouping key; default is indx |
min_varname |
a character string to set variable name for the time period minimum |
max_varname |
a character string set variable name for the time period maximum |
.forceCopy |
default FALSE; TRUE will force data.table to take a copy instead of editing the data without reference
|
Details
How the function works
The behaviour depends on whether date_end is supplied:
1. Interval-based grouping (start + end dates)
If both date_start and date_end are provided, the function identifies
overlapping intervals within the same group_vars grouping. Any intervals
that overlap are combined into a single episode.
This method is typically used for:
Hospital spells (HES/SUS)
Contact periods or inpatient stays
2. Event-based grouping (single-date events)
If only date_start is supplied, records are grouped using a time window
defined by the window argument.
Two approaches are supported:
-
window_type = "static"A fixed window is applied starting from the first event in the group. All events occurring within the window are grouped until a gap exceeds the threshold, at which point a new episode begins. -
window_type = "rolling"A dynamic window where each event extends the episode end point. An event is grouped as long as it occurs withinwindowdays of the most recent event in the same episode.
Handling of missing values
Records missing date_start cannot be grouped and are returned with indx
= NA. These rows are appended back to the final output.
Value
A data.table containing all original columns plus:
Workflow context
how group_time() might be used in a pipeline
1) SGSS specimen data – infection episode grouping (event-based)
After organism/specimen harmonisation (e.g., via lookup_recode()),
group_time() groups specimen dates into infection episodes using a defined
time window. This helps identify clusters of related positive tests for the
same patient and organism.
2) HES/SUS inpatient data – continuous spell grouping (interval-based)
When start and end dates of inpatient stays are available, group_time()
collapses overlapping intervals into a single continuous hospital spell.
This is used before linking SGSS infection episodes to inpatient activity.
3) Integration across datasets
The outputs from group_time() are used downstream to determine whether
infection events fall within or around periods of hospital care,
enabling combined SGSS–HES/SUS–ECDS analyses.
Examples
episode_test <- structure(
list(
pat_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 2L, 2L, 2L),
species = c(rep("E. coli",7),rep("K. pneumonia",7)),
spec_type = c(rep("Blood",7),rep("Blood",4),rep("Sputum",3)),
sp_date = structure(c(18262, 18263, 18281, 18282, 18262, 18263, 18281,
18265, 18270, 18281, 18283, 18259, 18260, 18281),
class = "Date")
),
row.names = c(NA, -14L), class = "data.frame")
group_time(x=episode_test,
date_start='sp_date',
window=14,
window_type = 'static',
indx_varname = 'static_indx',
group_vars=c('pat_id','species','spec_type'))[]
spell_test <- data.frame(
id = c(rep(99,6),rep(88,4),rep(3,3)),
provider = c("YXZ",rep("ZXY",5),rep("XYZ",4),rep("YZX",3)),
spell_start = as.Date(
c(
"2020-03-01",
"2020-07-07",
"2020-02-08",
"2020-04-28",
"2020-03-15",
"2020-07-01",
"2020-01-01",
"2020-01-12",
"2019-12-25",
"2020-03-28",
"2020-01-01",
rep(NA,2)
)
),
spell_end = as.Date(
c(
"2020-03-10",
"2020-07-26",
"2020-05-22",
"2020-04-30",
"2020-05-20",
"2020-07-08",
"2020-01-23",
"2020-03-30",
"2020-01-02",
"2020-04-20",
"2020-01-01",
rep(NA,2)
)
)
)
group_time(x = spell_test,
date_start = 'spell_start',
date_end = 'spell_end',
group_vars = c('id','provider'),
indx_varname = 'spell_id',
min_varname = 'spell_min_date',
max_varname = 'spell_max_date')[]
Hospital IN/OUT dates
Description
Derives per‑patient hospital entry (hospital_in) and exit (hospital_out)
dates by reconciling A&E (ECDS) attendances and inpatient (HES/SUS) spells.
Applies a simple ranking to determine the most relevant hospital period around
an index event date (e.g., a specimen collection date).
- "1"
Current admissions take priority
- "2"
When conflicting on the same day, inpatient admissions take priority over A&E emergency care data
- "3"
Where a patient has a linked A&E admission to a hospital inpatient stay, the A&E admission date is used
- "4"
Where a patient has a positive test between two hospital stays the most recent completed hospital stay prior to the test is retained except if the time between these events is greater than 14 days, then the first admission following the test is retained
Usage
hospital_in_out_dates(
data,
person_id = "id",
hospital = list(org_code = "organisation_code_of_provider", event_date = "ev_date",
ae_arrive = "arrival_date", ae_depart = "departure_date", ae_discharge =
"ecds_discharge", in_spell_start = "spell_start_date", in_spell_end =
"spell_end_date", in_discharge = "discharge_destination")
)
Arguments
data |
A linked table containing A&E and inpatient records (typically the
output of |
person_id |
Quoted column name for the unique patient identifier. |
hospital |
A named list specifying column names (all quoted) for:
|
Value
A data.table equal to data with additional columns:
hospital_inDerived hospital admission date for the relevant stay.
hospital_outDerived hospital discharge date for the relevant stay.
hospital_event_rankRank of suitability of the hospital window for the given person/event (1 = most suitable).
Workflow context
Use hospital_in_out_dates() after:
Linking A&E to inpatient spells (e.g., via
link_ae_inpatient()),Constructing spells (e.g.,
group_time()orcip_spells()),Optional code standardisation (e.g., discharge groups via
lookup_recode())
Note
Work in progress — functionality is incomplete.
See Also
epidm::lookup_recode()
epidm::group_time()
epidm::cip_spells()
Inpatient Codes cleanup
Description
When HES/SUS ICD/OPCS codes are provided in wide format you may want to clean them up into long for easier analysis. This function helps by reshaping long as a separate table. Ensuring they're separate allows you to retain source data, and aggregate appropriately later.
Usage
inpatient_codes(
x,
field_strings,
patient_id_vars,
type = c("icd9", "icd10", "opcs"),
.forceCopy = FALSE
)
Arguments
x |
a data.frame or data.table containing inpatient data |
field_strings |
a vector or string containing the regex for the the columns |
patient_id_vars |
a vector containing colnames used to identify a patient episode or spell |
type |
a string to denote if the codes are diagnostic or procedural |
.forceCopy |
Logical (default |
Value
a separate table with codes and id in long form
Examples
# Example inpatient dataset
inpatient_test <- data.frame(
id = c(1053L, 5487L, 8180L),
spell_id = c("dwPDw", "iSpUq", "qpgk5"),
primary_diagnosis_code = c("K602", "U071-", "I501"),
procedure_code = c("H201", "H251", NA),
procedure_date = as.Date(c("2023-01-01", "2023-01-04", NA))
)
# ICD-10 cleaning example
inpatient_codes(
x = inpatient_test,
field_strings = "diagnosis",
patient_id_vars = c("id", "spell_id"),
type = "icd10"
)
# OPCS cleaning example
inpatient_codes(
x = inpatient_test,
field_strings = c("procedure_code", "procedure_date"),
patient_id_vars = c("id", "spell_id"),
type = "opcs"
)
Synthetic Lab Data for epidm
Description
A dataset containing synthetic lab data for testing epidemiological data transformation functions.
A dataset containing synthetic lab data for testing epidemiological data transformation functions.
Usage
data(lab_data)
data(lab_data)
Format
A data frame with the following columns:
- nhs_number
NHS number
- local_patient_identifier
Patient identifier such as hospital number
- patient_birth_date
Date of birth of the patients.
- sex
Gender of the patients (Factor with levels: "Female", "Male").
- surname
Patient surname
- forename
Patient forename
- organism_species_name
Organism species name (Factor with levels: "KLEBSIELLA PNEUMONIAE").
- specimen_date
Date of specimen collection.
- specimen_type
Type of specimen: BLOOD or URINE.
- lab_code
Laboratory codes (Factor with unique levels).
- local_authority_name
Name of the local authority.
- local_authority_code
Code of the local authority.
- postcode
Postcode
A data frame with the following columns:
- nhs_number
NHS number
- local_patient_identifier
Patient identifier such as hospital number
- patient_birth_date
Date of birth of the patients.
- sex
Gender of the patients (Factor with levels: "Female", "Male").
- surname
Patient surname
- forename
Patient forename
- organism_species_name
Organism species name (Factor with levels: "KLEBSIELLA PNEUMONIAE").
- specimen_date
Date of specimen collection.
- specimen_type
Type of specimen: BLOOD or URINE.
- lab_code
Laboratory codes (Factor with unique levels).
- local_authority_name
Name of the local authority.
- local_authority_code
Code of the local authority.
- postcode
Postcode
Examples
data(lab_data)
head(lab_data)
data(lab_data)
head(lab_data)
Link A&E (ECDS) records to Inpatient (HES/SUS) spells
Description
Links Emergency Care Data Set (ECDS) A&E records to inpatient spells (HES/SUS) using patient identifiers (NHS number, hospital number, date of birth), organisation code, and a link date. The inpatient records should already be aggregated to the desired spell level (e.g., provider spell, CIP, or “mega” spell). The output is a patient-level linked table suitable for downstream pathway analysis.
Usage
link_ae_inpatient(
ae = list(data, record_id = "unique_record_id", nhs_number = "nhs_number",
hospital_number = "local_patient_identifier", patient_dob = "patient_birth_date",
org_code = "organisation_code_of_provider", arrival_date = "arrival_date",
departure_date = "departure_date"),
inp = list(data, record_id = "unique_record_id", nhs_number = "nhs_number",
hospital_number = "local_patient_identifier", patient_dob = "date_birth", org_code =
"organisation_code_code_of_provider", spell_id = "mega_spell_id", spell_start_date =
"spell_start_date"),
.forceCopy = FALSE
)
Arguments
ae |
A named list describing the A&E (ECDS) input with quoted column names:
|
inp |
A named list describing the inpatient (HES/SUS) input with quoted column names:
|
.forceCopy |
Logical (default |
Value
A linked data.table at patient/spell level containing:
Harmonised identifiers (NHS number / hospital number, DOB, org code)
A&E arrival/departure dates
Inpatient spell identifiers and spell start date
(If supplied) preserved
record_idcolumns (tagged*_ae/*_inp)
Workflow context
Use link_ae_inpatient() after:
ECDS A&E data are cleaned/standardised (e.g., discharge categories via
lookup_recode()).Inpatient records have been aggregated to spells (e.g.,
cip_spells()for continuous inpatient spells).
Use it before:
Rejoining to SGSS infection episodes (e.g., outputs of
group_time()), or derivinghospital_in_out_datesfor entry/exit timelines.
See Also
group_time continuous_inpatient_spells
Examples
## Not run:
sample_ae <- data.table::data.table(
nhs_number = c("645114517",
"645114517","645114517","382940103","321908341",
"321908341","321908341","599534707",
"403454211","349959089","252341591","180554160",
"180554160","281980720","473372796","369773534",
"474044124","639064842","662468568","115641745",
"115641745","821084975","479637024",
"527021626","527021626","233168855","702650869",NA,
"537235036",NA,"517229961",NA,"480142132",
"368288558","554675340"),
local_patient_identifier = c(NA,NA,NA,
"I3348707",NA,NA,NA,"P1350948",NA,NA,
"Q4157514",NA,NA,"D1101843","K2440769","E1366499",
NA,"K1494229","R4678220","J5206297","J5206297",
"S1945338","F2159102",NA,NA,"D6300794",NA,
"W1208900","Z4975449","G7439612","T1266485",
"N4842033","Q5566884","P2689566",NA),
patient_birth_date = c("2021-03-03",
"2021-03-03","2021-03-03","2003-08-24",
"2001-06-21","2001-06-21","2001-06-21","1991-10-08",
"1987-02-03","1962-06-17","1991-10-07",
"1985-10-16","1985-10-16","1990-09-24","1984-11-14",
"1994-05-05","1999-08-11","1983-01-04",
"2017-06-01","1975-09-04","1975-09-04","1993-07-13",
"2014-01-05","1995-09-30","1995-09-30",
"1976-06-25","2000-06-02","2017-06-11","2007-05-03",
"1986-08-28","2016-10-14","2016-02-04",
"2004-03-02","1979-01-17","1974-06-14"),
organisation_code = c("P3P","P3P",
"P3P","Z4R","A9I","A9I","A9I","Z4R","V5T",
"Z9V","P3P","T7N","T7N","V2P","T4H","V9Y",
"Z7N","W6Y","G2H","V5T","V5T","W6Y","J6J",
"J6J","J6J","L4Q","P3P","F0N","A6C","O1A",
"F0N","O2R","W6Y","V0R","O1A"),
arrival_date = c("2022-05-10",
"2022-05-29","2022-08-03","2022-05-17",
"2022-05-07","2022-05-07","2022-05-07","2022-05-23",
"2022-05-13","2022-06-04","2022-05-14",
"2022-05-17","2022-06-05","2022-05-25","2022-05-24",
"2022-06-09","2022-06-18","2022-06-11",
"2022-06-22","2022-06-20","2022-07-18","2022-06-09",
"2022-06-05","2022-06-26","2022-06-11",
"2022-06-25","2022-06-10","2022-06-12","2022-06-16",
"2022-07-10","2022-06-20","2022-07-10",
"2022-07-20","2022-07-20","2022-07-19"),
departure_date = c("2022-05-10",
"2022-05-30","2022-08-03","2022-05-17",
"2022-05-07","2022-05-07","2022-05-07","2022-05-23",
"2022-05-13","2022-06-04","2022-05-14",
"2022-05-17","2022-06-05","2022-05-25","2022-05-24",
"2022-06-09","2022-06-18","2022-06-11",
"2022-06-22","2022-06-20","2022-07-18","2022-06-09",
"2022-06-05","2022-06-26","2022-06-11",
"2022-06-25","2022-06-10","2022-06-12","2022-06-16",
"2022-07-11","2022-06-20","2022-07-10",
"2022-07-20","2022-07-20","2022-07-19")
)
sample_ae$pcd <- paste0(sample(LETTERS,1),sample(1:14,1)," ",
sample(LETTERS,1),sample(0:9,1),sample(LETTERS,1))
sample_ae$id = seq_len(nrow(sample_ae))*7
sample_inp <- data.table::data.table(
nhs_number = c("335661151",
"335661151","335661151","335661151","335661151",
NA,NA,NA,NA,NA,NA,"645114517","645114517",
"645114517","143423716","212261130",
"212261130","212261130","212261130","212261130",
"349959089","317344169","317344169","317344169",
"317344169","317344169","317344169","317344169",
"317344169","317344169","317344169",
"317344169","317344169","317344169","180554160",
"180554160",NA,NA,NA,NA,NA,NA,NA,"230782291",
"977111015","977111015","977111015","977111015",
"683785606",NA,NA,NA,NA,NA,NA,NA,NA,NA,
NA,NA,"281980720","270646497",NA,
"387252583","639064842","836297039","836297039",
"348614531","348614531","662468568","112340924",
"112340924","381361439","493239044",NA,NA,NA,
"115641745","115641745","233761482","233761482",
"479637024","527021626","527021626",
"294666415","233168855","702650869","460180094",
"561169746","517229961",NA,"480142132","554675340",
"135888675",NA,"684718902"),
local_patient_identifier = c(NA,NA,NA,NA,
NA,"D4809270","D4809270","D4809270",
"D4809270","D4809270","D4809270",NA,NA,NA,
"J2098200","D2139084","D2139084","D2139084","D2139084",
"D2139084",NA,NA,NA,NA,NA,NA,NA,NA,NA,
NA,NA,NA,NA,NA,NA,NA,"A1706089",
"A1706089","A1706089","A1706089","A1706089",
"A1706089","A1706089","Z3093435","I1605735","I1888797",
"I1888797","I1888797","J2901593","U1111563",
"U1111563","U1111563","U1111563","V2246708",
"V2246708","V2246708","V2246708","V2246708",
"V2246708","V2246708","D1101843",NA,"O2700100",
"I5040881","K1494229","I1222012","I1222012",
NA,NA,"R4678220","P2632883","P2632883",
"J6723431","Y1506318","F3501197","F3501197",
"F3501197","J5206297","J5206297","B2651449",
"B2651449","F2159102",NA,NA,"W5097806","D6300794",
NA,"U2715517","O5278248","T1266485",
"N4842033","Q5566884",NA,"X2768295","H3196212",
"J9365439"),
date_birth = c("2021-08-14",
"2021-08-14","2021-08-14","2021-08-14",
"2021-08-14","1960-05-20","1960-05-20","1960-05-20",
"1960-05-20","1960-05-20","1960-05-20",
"2021-03-03","2021-03-03","2021-03-03","2019-11-09",
"1953-04-05","1953-04-05","1953-04-05",
"1953-04-05","1953-04-05","1962-06-17","1952-04-03",
"1952-04-03","1952-04-03","1952-04-03",
"1952-04-03","1952-04-03","1952-04-03","1952-04-03",
"1952-04-03","1952-04-03","1952-04-03",
"1952-04-03","1952-04-03","1985-10-16","1985-10-16",
"1993-07-09","1993-07-09","1993-07-09",
"1993-07-09","1993-07-09","1993-07-09","1993-07-09",
"1980-10-14","1976-08-03","1976-08-03",
"1976-08-03","1976-08-03","1981-08-27","2017-08-20",
"2017-08-20","2017-08-20","2017-08-20",
"2017-08-20","2017-08-20","2017-08-20","2017-08-20",
"2017-08-20","2017-08-20","2017-08-20",
"1989-07-11","1964-04-30","1991-12-25","1961-08-16",
"1983-01-04","1957-01-29","1957-01-29",
"1982-12-05","1982-12-05","2017-06-01","1989-09-21",
"1989-09-21","1986-10-06","1995-03-01",
"1964-04-25","1964-04-25","1964-04-25","1975-09-04",
"1975-09-04","1995-06-17","1995-06-17",
"2014-01-05","1995-09-30","1995-09-30","1993-06-09",
"1976-06-25","2000-06-02","1986-09-14",
"2016-11-19","2016-10-14","2016-02-04","2004-03-02",
"1974-06-14","1945-05-14","2001-09-16",
"1987-08-19"),
organisation_code = c("L4Q","L4Q",
"L4Q","P3P","P3P","U6X","U6X","U6X","U6X",
"U6X","U6X","P3P","P3P","P3P","L4Q","O2B",
"O2B","O2B","O2B","O2B","Z9V","U8V","U8V",
"U8V","U8V","U8V","U8V","U8V","U8V","U8V",
"U8V","U8V","U8V","U8V","T7N","T7N","V5T",
"V5T","V5T","V5T","V5T","V5T","V5T","V7E",
"J6J","Y9V","Y9V","Y9V","V7E","B1A","B1A",
"B1A","B1A","J2W","J2W","J2W","J2W","J2W",
"J2W","J2W","V2P","O1A","O2A","F1O","W6Y",
"T2Y","T2Y","G2H","G2H","G2H","J6J","J6J",
"J6J","V5T","G2H","G2H","G2H","V5T","V5T",
"T4H","T4H","J6J","J6J","J6J","G7H","L4Q",
"P3P","L4Q","U8V","F0N","O2R","W6Y","O1A",
"A9V","G9V","L4Q"),
mega_spell_id = c("10.3.0",
"10.3.1","10.3.2","4.2.0","4.2.1","7.12.0",
"7.12.1","7.12.2","7.12.3","7.12.4","7.12.5",
"14.3.0","14.3.1","14.3.2","22.2.1","49.6.0",
"49.6.1","49.6.2","49.6.3","49.6.4","69.1.0",
"76.20.0","76.20.1","76.20.2","76.20.3",
"76.20.4","76.20.5","76.20.6","76.20.7","76.20.8",
"76.20.9","76.20.10","76.20.11","76.20.12",
"77.7.0","77.7.1","79.7.0","79.7.1","79.7.2",
"79.7.3","79.7.4","79.7.5","79.7.6","83.1.0",
"157.1.0","85.5.0","85.5.1","85.5.2","90.1.0",
"167.4.0","167.4.1","167.4.2","167.4.3",
"91.9.0","91.9.1","91.9.2","91.9.3","91.9.4",
"91.9.5","91.9.6","101.2.0","111.5.0","122.1.0",
"151.1.0","154.1.0","161.3.0","161.3.1",
"181.4.0","181.4.1","184.1.0","185.2.0","185.2.1",
"201.1.0","214.1.0","226.3.0","226.3.1",
"226.3.2","247.4.0","247.4.1","266.4.0","266.4.1",
"269.2.0","270.2.0","270.2.1","284.1.0",
"299.2.0","307.1.0","314.3.0","345.1.0",
"400.1.0","419.1.0","430.3.1","494.3.0","498.1.0",
"501.1.0","535.1.0"),
spell_start_date = c("2022-05-20",
"2022-06-14","2022-06-20","2022-05-01",
"2022-05-07","2022-05-16","2022-05-29","2022-05-18",
"2022-05-21","2022-06-27","2022-07-18",
"2022-05-10","2022-05-29","2022-08-03","2022-05-13",
"2022-05-27","2022-05-20","2022-06-09",
"2022-06-27","2022-07-27","2022-06-04","2022-05-21",
"2022-06-18","2022-06-24","2022-07-16",
"2022-07-10","2022-07-17","2022-07-20","2022-07-15",
"2022-07-08","2022-08-01","2022-08-04",
"2022-07-27","2022-07-14","2022-05-17","2022-06-05",
"2022-06-01","2022-06-05","2022-06-09",
"2022-05-23","2022-05-27","2022-06-10","2022-06-12",
"2022-05-29","2022-05-29","2022-06-02",
"2022-05-29","2022-06-15","2022-06-09","2022-06-21",
"2022-05-29","2022-07-18","2022-07-26",
"2022-05-12","2022-06-12","2022-06-11","2022-06-28",
"2022-06-29","2022-06-22","2022-07-09",
"2022-05-19","2022-05-25","2022-05-18","2022-05-26",
"2022-06-11","2022-06-21","2022-06-13",
"2022-05-27","2022-06-27","2022-06-22","2022-06-08",
"2022-07-04","2022-06-26","2022-06-26",
"2022-06-11","2022-06-17","2022-07-13","2022-06-20",
"2022-07-18","2022-06-15","2022-07-03",
"2022-06-05","2022-06-26","2022-06-11","2022-06-24",
"2022-06-25","2022-06-10","2022-07-01",
"2022-07-04","2022-06-20","2022-07-10","2022-07-20",
"2022-07-19","2022-07-28","2022-07-27",
"2022-07-10"),
spell_end_date = c("2022-06-17",
"2022-06-16","2022-06-20","2022-05-01",
"2022-05-11","2022-05-16","2022-05-29","2022-05-18",
"2022-05-21","2022-07-05","2022-07-18",
"2022-05-11","2022-06-01","2022-08-04","2022-05-13",
"2022-05-29","2022-05-21","2022-06-10",
"2022-06-30","2022-07-28","2022-06-05","2022-06-25",
"2022-06-18","2022-06-24","2022-07-16",
"2022-07-10","2022-07-17","2022-07-20","2022-07-15",
"2022-07-08","2022-08-01","2022-08-04",
"2022-07-27","2022-07-14","2022-05-30","2022-06-10",
"2022-06-01","2022-06-05","2022-06-09",
"2022-05-23","2022-05-27","2022-06-10","2022-06-12",
"2022-05-30","2022-06-05","2022-06-02",
"2022-06-01","2022-06-15","2022-06-15","2022-06-21",
"2022-05-29","2022-07-18","2022-07-26",
"2022-05-15","2022-06-12","2022-06-17","2022-06-28",
"2022-06-29","2022-06-23","2022-07-11",
"2022-05-19","2022-06-09","2022-05-23","2022-05-26",
"2022-06-20","2022-06-21","2022-06-14",
"2022-06-02","2022-06-27","2022-06-23","2022-06-08",
"2022-07-04","2022-06-26","2022-06-28",
"2022-06-11","2022-06-17","2022-07-13","2022-06-21",
"2022-07-20","2022-06-18","2022-07-03",
"2022-06-11","2022-06-26","2022-06-11","2022-06-24",
"2022-07-01","2022-06-10","2022-07-06",
"2022-07-06","2022-06-23","2022-07-11","2022-07-22",
"2022-07-22","2022-07-28","2022-07-27",
"2022-07-10")
)
sample_inp$pcd <- paste0(sample(LETTERS,1),sample(1:14,1)," ",
sample(LETTERS,1),sample(0:9,1),sample(LETTERS,1))
sample_inp$id = seq_len(nrow(sample_inp))*3
link_ae_inpatient(
ae = list(
data = sample_ae,
arrival_date = 'arrival_date',
departure_date = 'departure_date',
nhs_number = 'nhs_number',
hospital_number = 'local_patient_identifier',
patient_dob = 'patient_birth_date',
org_code = 'organisation_code'
),
inp = list(
data = sample_inp,
spell_id = 'mega_spell_id',
spell_start_date = 'spell_start_date',
nhs_number = 'nhs_number',
hospital_number = 'local_patient_identifier',
patient_dob = 'date_birth',
org_code = 'organisation_code'
)
)[]
## End(Not run)
Lookup table switch handler
Description
A function to recode values via named lookup tables (i.e call an epidm lookup table and recode where we are aware of a new value). It routes to a specific lookup based on type, returning a character vector where each input value has been mapped to its corresponding replacement. If a value is not found in the lookup then the original value is returned.
Built‑in lookups include:
-
species: Uses therespeciate_organismdataset to standardise and reclassify organism names (e.g., historic → current nomenclature). This supports consistent reporting across SGSS and other laboratory datasets. -
specimen: Uses thespecimen_type_groupingdataset to assign raw laboratory specimen types into harmonised specimen groups. This enables consistent grouping for reporting, aggregation, and filtering. -
genus_gram_stain: Uses thegenus_gram_stainlookup table, which provides Gram stain classifications by bacterial genus. This reference is manually maintained against the UKHSA SGSS database and supports rapid filtering and high‑level organism categorisation. Users should raise an issue or submit a pull request to theepidmGitHub repository if an organism/genus is missing. -
lab_data: Uses thelab_datalookup dataset for harmonising laboratory code systems and internal SGSS mappings, supporting standardised laboratory result interpretation within surveillance pipelines. -
inpatient_admission_method: Uses the internal lookup tableepidm:::group_inpatient_admission_methodto categorise raw hospital admission method codes into operationally meaningful groups. -
inpatient_discharge_destination: Uses the internal tableepidm:::group_inpatient_discharge_destinationto group hospital discharge destination codes into standardised categories for inpatient pathway analysis. -
ecds_destination_code: Uses the internal tableepidm:::group_ecds_discharge_destination, providing grouped mappings for ECDS (Emergency Care Data Set) discharge codes. -
manual: Allows the user to supply their own lookup through.import = list(new, old). This is useful when working with local, provisional, or evolving code sets not yet included in the package’s centralised lookup tables.
Usage
lookup_recode(
src,
type = c("species", "specimen", "inpatient_admission_method",
"inpatient_discharge_destination", "ecds_destination_code", "manual"),
.import = NULL
)
Arguments
src |
Character vector (or column) of values to recode. Coerced to character if needed. |
type |
Character scalar specifying the lookup to use. One of:
|
.import |
A two‑element list in the format |
Value
A character vector containing the recoded values, aligned 1:1 with src.
Values not present in the lookup are returned unchanged.
Examples
df <- data.frame(
spec = c(
sample(grep(")",
respeciate_organism$previous_organism_name,
value=TRUE,
invert = TRUE),
9),
"ESCHERICHIA COLI","SARS-COV-2","CANDIDA AUREUS"),
type = sample(specimen_type_grouping$specimen_type,12),
date = sample(seq.Date(from = Sys.Date()-365,
to = Sys.Date(),
by = "day"),12)
)
df <- df[order(df$date),]
# show the data before the changes
df
# check the lookup tables
# observe the changes
head(respeciate_organism[1:2])
df$species <- lookup_recode(df$spec,'species')
df[,c('spec','species','date')]
head(specimen_type_grouping)
df$grp <- lookup_recode(df$type,'specimen')
df[,c('species','type','grp','date')]
# for a tidyverse use
# df %>% mutate(spec=lookup_recode(spec,'species))
# manual input of your own lookup
# .import=list(new,old)
lookup_recode(
"ALCALIGENES DENITRIFICANS",
type = 'manual',
.import=list(respeciate_organism$organism_species_name,
respeciate_organism$previous_organism_name)
)
Clean and Impute HES/SUS Episode Start and End Dates
Description
A utility for cleaning and imputing missing or inconsistent episode end dates
in HES/SUS–style inpatient data. The function identifies missing, invalid,
or overlapping spell dates within patient/provider groups and applies
deterministic rules to correct them. It also assigns a flag (proxy_missing)
indicating whether a value was modified and why.
Usage
proxy_episode_dates(
x,
group_vars,
spell_start_date,
spell_end_date,
discharge_destination,
.dropTmp = TRUE,
.forceCopy = FALSE
)
Arguments
x |
A |
group_vars |
Character vector of grouping variables (e.g., patient ID, provider). At least one identifier must be supplied. |
spell_start_date |
Name of the column containing the episode or spell start date. |
spell_end_date |
Name of the column containing the episode or spell end date. |
discharge_destination |
Name of the column containing the CDS discharge destination code. |
.dropTmp |
Logical (default |
.forceCopy |
Logical (default |
Value
A data.table containing:
Cleaned spell start and end dates.
A flag variable (
proxy_missing) indicating whether a date was modified and the rule applied (0–4).
Examples
proxy_test <- data.frame(
id = c(
rep(3051, 4),
rep(7835,3),
rep(9891,3),
rep(1236,3)
),
provider = c(
rep("QKJ", 4),
rep("JSD",3),
rep("YJG",3),
rep("LJG",3)
),
spell_start = as.Date(c(
"2020-07-03", "2020-07-14", "2020-07-23", "2020-08-05",
"2020-11-01", "2020-11-13", "2020-12-01",
"2020-03-28", "2020-04-06", "2020-04-09",
"2020-10-06", "2020-11-05", "2020-12-25"
)),
spell_end = as.Date(c(
"2020-07-11", "2020-07-22", "2020-07-30", "2020-07-30",
"2020-11-11", NA, "2020-12-03",
"2020-03-28", NA, "2020-04-09",
"2020-10-06", "2020-11-05", NA
)),
disdest = c(
19, 19, 51, 19,
19, 19, 19,
51, 98, 19,
19, 19, 98
)
)
proxy_episode_dates(
x=proxy_test,
group_vars = c('id','provider'),
spell_start_date = 'spell_start',
spell_end_date = 'spell_end',
discharge_destination = 'disdest'
)[]
Respeciate unspecified samples
Description
Some samples within SGSS are submitted by laboratories as "GENUS SP" or "GENUS UNNAMED". However, they may also have a fully identified sample taken from the same site within a recent time period. This function captures species_col from another sample within X-days of an unspeciated isolate. Respeciation is restricted to organisms of the same genus; species will not be inferred from isolates belonging to a different genus. Trailing "UNNAMED" is normalised to "SP" before any processing.
Usage
respeciate_generic(
x,
group_vars,
species_col,
date_col,
window = c(0:Inf),
.forceCopy = FALSE
)
Arguments
x |
a data.frame or data.table object |
group_vars |
the minimum grouping set of variables for like samples in a character vector; suggest c('patient_id','specimen_type') - genus will automatically be included in the groupby. This is built from the species_col |
species_col |
a character containing the column with the organism species_col name |
date_col |
a character containing the column with the specimen/sample date_col |
window |
an integer representing the number of days for which you will allow a sample to be respeciated |
.forceCopy |
Logical (default |
Value
a data.table with a recharacterised species_col column
Examples
df <- data.frame(
ptid = c(round(runif(25,1,5))),
spec = sample(c("KLEBSIELLA SP",
"KLEBSIELLA UNNAMED",
"KLEBSIELLA PNEUMONIAE",
"KLEBEIELLA OXYTOCA"),
25,replace = TRUE),
type = "BLOOD",
specdate = sample(seq.Date(Sys.Date()-21,Sys.Date(),"day"),25,replace = TRUE)
)
respeciate_generic(x=df,
group_vars=c('ptid','type'),
species_col='spec',
date_col='specdate',
window = 14)[]
Respeciated organisms
Description
Occasionally, research shows that two organisms, previously thought to be different are in fact one and the same. The reverse is also true. This is a manually updated list. If there are organisms missing, or new respeciates to be added, please raise and issue or push request on the epidm GitHub
Usage
respeciate_organism
Format
- previous_organism_name
What the organism used to be known as, in the form GENUS SPECIES
- organism_species_name
What the organism is known as now, in the form GENUS SPECIES
- organism_genus_name
The genus of the recoded organism
- genus_change
A 0/1 flag to indicate if the genus has changed
- genu_all_species
A 0/1 flag to indicate if all species under that genus should change
Specimen type grouping
Description
In order to help clean up an analysis based on a group of specimen types, a lookup table has been created to help group sampling sites. This is a manually updated list. If there are organisms missing, or new respeciates to be added, please raise and issue or push request on the epidm GitHub
Usage
specimen_type_grouping
Format
- specimen_type
The primary specimen type with detail
- specimen_group
A simple grouping of like specimen sites
Clean and Read a SQL query
Description
A utility function to read in a SQL query from a character object, clipboard or text file and remove all comments for use with database query packages
Usage
sql_clean(sql)
Arguments
sql |
a SQL file or text string |
Value
a cleaned SQL query without comments as a character string
Examples
testSQL <- c(
"/********* INTRO HEADER COMMENTS",
"*********/",
" SELECT ",
" [VAR 1] -- with comments",
",[VAR 2]",",[VAR 3]",
"FROM DATASET ","-- output here")
sql_clean(testSQL)
Connect to a SQL database
Description
An function to help setup connections to SQL databases acting as a wrapper for the odbc and DBI packages. Used by other sql_* tools within epidm. This uses the credential manager within the system and assumes you are using a trusted connection.
Usage
sql_connect(server, database)
Arguments
server |
a string containing the server connection;
note that servers may require the use of double backslash |
database |
a string containing the database name within the data store |
Value
a SQL connection object
See Also
sql_clean sql_read sql_write
Examples
## Not run:
sql <- list(
dsn = list(ser = 'covid.ukhsa.gov.uk',
dbn = 'infections')
)
sgss_con = sql_connect(server = sql$dsn$ser, database = sql$dsn$dbn)
## End(Not run)
Read a table from a SQL database
Description
Read a table object to a SQL database. Acts a wrapper for odbc and DBI packages.
Usage
sql_read(server, database, sql)
Arguments
server |
a string containing the server connection |
database |
a string containing the database name within the data store |
sql |
a string containing a SQL query or to a .sql/.txt SQL query |
Value
a table from a SQL database
See Also
sql_clean sql_connect
Write a table to a SQL database
Description
Write a table object to a SQL database. Acts a wrapper for odbc and DBI packages with additional checks to ensure upload completes.
Usage
sql_write(x, server, database, tablename)
Arguments
x |
a data.frame/data.table/tibble object |
server |
a string containing the server connection |
database |
a string containing the database name within the data store |
tablename |
a string containing the chosen SQL database table name |
Value
writes a data.frame/data.table/tibble to a SQL database
Patient ID record grouping
Description
Assigns a single integer id to records that belong to the same patient by
applying a sequence of deterministic matching stages across common identifiers
(NHS number, hospital number, DOB, name, sex, postcode). Identifiers are
standardised, validated using NHS checksum function, and fuzzy name keys are
used in later stages.
'
Matching is performed in order through the following stages (first match is applied):
NHS number + date of birth
Hospital number + date of birth
NHS number + hospital number
NHS number + surname
Hospital number + surname
Date of birth + surname (only where NHS is invalid/absent)
Sex + full name (forename + surname)
Sex + date of birth + fuzzy name (Soundex; surname + initial)
Date of birth (YYYY-MM) + fuzzy name
Surname/forename + postcode
Name swaps (forename/surname reversed) + date of birth
Use .useStages to restrict which stages are applied (default: 1:11).
The function generates a reproducible id per patient within the sort order;
you can provide .sortOrder (e.g., a date column) to make assignment deterministic.
Validity rules applied:
-
NHS number validated using the standard checksum (
epidm::valid_nhs()). -
Hospital number: excludes known placeholders (e.g.,
"UNKNOWN","NO PATIENT ID"). -
DOB: excludes proxy or missing dates (
"1900-01-01","1800-01-01",NA). -
Sex: normalised to
"M"/"F"; others →NA. -
Names: uppercased, Latin characters normalised; Soundex used for fuzzy matching.
Identifiers are copied over where they are missing or invalid to the grouped records.
Usage
uk_patient_id(
x,
id = list(nhs_number = "nhs_number", hospital_number = "patient_hospital_number",
date_of_birth = "date_of_birth", sex_mfu = "sex", forename = "forename", surname =
"surname", postcode = "postcode"),
.useStages = c(1:11),
.keepStages = FALSE,
.keepValidNHS = FALSE,
.sortOrder,
.forceCopy = FALSE
)
Arguments
x |
A |
id |
A named list of quoted column names:
|
.useStages |
optional, default 1:11; set to 1 if you wish patient ID to be assigned cases with the same DOB and NHS number, set to 2 if you wish patient ID to be assigned to cases with the same hospital number (HOS) and DOB, set to 3 if you wish patient ID to be assigned cases with the same NHS and HOS number, set to 4 if you wish patient ID to be assigned cases with the same NHS number and surname, set to 5 if you wish patient ID to be assigned cases with the same hospital number and surname, set to 6 if you wish patient ID to be assigned cases with the same DOB and surname, set to 7 if you wish patient ID to be assigned cases with the same sex and full name, set to 8 if you wish patient ID to be assigned cases with the same sex, DOB and fuzzy name, set to 9 if you wish patient ID to be assigned cases with the same DOB and fuzzy name, set to 10 if you wish patient ID to be assigned cases with the same name and postcode, set to 11 if you wish patient ID to be assigned cases with the same first name or second name in changing order and date of birth. |
.keepStages |
optional, default FALSE; to generate a new column (stageMatch) to retain the stage information for which the record matched the group. |
.keepValidNHS |
optional, default FALSE; set TRUE if you wish to retain the column with the NHS checksum result stored as a BOOLEAN |
.sortOrder |
optional; a column as a character to allow a sorting order on the id generation |
.forceCopy |
optional, default FALSE; TRUE will force data.table to take a copy instead of editing the data without reference |
Value
A data.table with the original columns plus:
idInteger patient identifier assigned by staged matching.
valid_nhs(Optional) BOOLEAN NHS checksum flag; included when
.keepValidNHS = TRUE.
Workflow context
uk_patient_id() is typically used early to harmonise patient identity
across isolates before downstream tasks such as specimen episode grouping
(group_time()), dataset linkage (e.g., to HES/SUS/ECDS), and
epidemiological reporting.
Examples
id_test <-
data.frame(
stringsAsFactors = FALSE,
record_id = c(1L,2L,3L,4L,
5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,
16L,17L,18L,19L,20L,21L,22L,23L,24L),
nhs_number = c(9435754422,
9435754422,NA,9435754422,5555555555,NA,
9435773982,NA,9999999999,NA,9435773982,NA,
9435802508,9435802508,NA,NA,9435802508,9435802508,NA,
3333333333,NA,9999999999,9435817777,
9435817777),
local_patient_identifier = c(NA,"IG12067",
NA,NA,"IG12067","IG12067","KR2535","KR2535",
"KR2535",NA,NA,NA,"UK8734","UK8734",NA,NA,
"UK8734","UK8734",NA,NA,"JH45204",
"HS45202","HS45202","JH45204"),
patient_birth_date = c("1993-07-16",
"1993-07-16","1993-07-16","1993-07-16",
"1993-07-16",NA,"1967-02-10",NA,"1967-02-10",NA,NA,
"1967-02-10",NA,NA,"1952-10-22","1952-10-22",
"1952-10-22",NA,"1947-09-14","1947-09-14",
"1947-09-14","1947-09-14","1947-09-14",
"1947-09-14"),
sex = c("Male","Male",
"Male","Male",NA,"Male","Female","Female",
"Female","Female","Female","Female","Male",
"Male","Male","Male","Male","Male","Male",
"Male","Male","Male",NA,"Male"),
forename = c(NA,"DENNIS",
NA,NA,"DENNIS",NA,"ELLIE","ELLIE",NA,
"ELLIE","ELLIE","ELLIE","IAN","IAN","MALCOLM",
"IAN","IAN",NA,"GRANT","ALAN","ALAN","ALAN",
"GRANT","ALAN"),
surname = c(NA,"NEDRY",
"NEDRY",NA,"NEDRY","NEDRY","SATTLER","SATTLER",
NA,"SATTLER","SATTLER","SATTLER","M",NA,
"IAN","MALCOLM","MALCOLM",NA,"ALAN","GRANT",
"GRANT","GRANT","ALAN","GRANT"),
postcode = c("HA4 0FF",
"HA4 0FF","HA4 0FF",NA,"HA4 0FF","HA4 0FF",
"L3 1DZ","L3 1DZ","L3 1DZ","L3 1DZ",NA,"L3 1DZ",
"BN14 9EP",NA,"BN14 9EP",NA,NA,NA,"CW6 9TX",
"CW6 9TX",NA,NA,NA,NA),
specimen_date = c("2024-08-14",
"2023-02-03","2023-02-07","2023-02-04",
"2023-02-09","2024-08-14","2021-03-28","2021-03-28",
"2021-03-28","2021-03-28","2021-03-28",
"2021-03-28","2024-07-06","2024-07-06","2024-07-06",
"2023-10-31","2023-10-31","2023-10-31",
"2022-01-23","2022-01-24","2022-01-25","2022-01-26",
"2022-01-27","2022-01-28")
)
data.table::setDT(id_test)
uk_patient_id(
x = id_test,
id = list(
nhs_number = 'nhs_number',
hospital_number = 'local_patient_identifier',
date_of_birth = 'patient_birth_date',
sex_mfu = 'sex',
forename = 'forename',
surname = 'surname',
postcode = 'postcode'
),
.sortOrder = 'specimen_date',
.useStages = c(1:11),
.keepStages = TRUE,
.forceCopy = TRUE)[]
NHS Number Validity Check
Description
Validates NHS numbers using the NHS checksum applied to the
first 9 digits and compared with the 10th digit. Inputs that are
not exactly 10 numeric characters are invalid. Numbers made of the same
repeated digit (e.g., "1111111111", "0000000000") are also rejected.
Usage
valid_nhs(nhs_number)
Arguments
nhs_number |
A vector of values to validate. Each element is coerced to character for length checking and digit extraction. |
Details
Algorithm (summary):
Take the first 9 digits and multiply by weights 10 down to 2 (i.e., d1×10 + d2×9 + … + d9×2).
Compute
11 - (sum %% 11)→ this is the expected check digit.If the expected check digit is 11, treat as 0.
Compare with the 10th digit. If they match, the number is valid.
Additional guards implemented:
If the NHS number is
NAor not 10 characters, it is invalid.If all 10 digits are identical (e.g.,
"1111111111"), it is invalid.
The function is vectorised and returns 1 for valid and 0 for invalid for each element in the input vector.
Value
A numeric vector of the same length as nhs_number containing:
-
1if the value is a valid NHS number -
0otherwise
Examples
test <- floor(runif(1000,1000000000,9999999999))
valid_nhs(test)
valid_nhs(9434765919)