Simulation Tools for Small Area Estimation

Tools for the simulation of data in the context of small area estimation. Combine all steps of your simulation - from data generation over drawing samples to model fitting - in one object. This enables easy modification and combination of different scenarios. You can store your results in a folder or start the simulation in parallel.

An initial Example

Consider a linear mixed model. It contains two components. A fixed effects part, and an error component. The error component can be split into a random effects part and a model error.

library(saeSim)
setup <- sim_base() %>% 
  sim_gen_x() %>% 
  sim_gen_e() %>% 
  sim_gen_v() %>%
  sim_resp_eq(y = 100 + 2 * x + v + e) %>% 
  sim_simName("Doku")
setup

##   idD idU          x         e         v         y
## 1   1   1 -2.5058152 -3.217326 0.2353485  92.00639
## 2   1   2  0.7345733 -4.226103 0.2353485  97.47839
## 3   1   3 -3.3425144 -4.141583 0.2353485  89.40874
## 4   1   4  6.3811232 -4.742241 0.2353485 108.25535
## 5   1   5  1.3180311 -2.001758 0.2353485 100.86965
## 6   1   6 -3.2818735 -2.099955 0.2353485  91.57165

sim_base() is responsible to supply a data.frame to which variables can be added. The default is to create a data.frame with indicator variables idD and idU (2-level-model), which uniquely identify observations. ‘D’ stands for the domain, i.e. the grouping variable. ‘U’ stands for unit and is the identifier of single observations within domains. sim_resp will add a variable y as response.

The setup itself does not contain the simulated data but the functions to process the data. To start a simulation use the function sim. It will return a list containing data.frames as elements:

dataList <- sim(setup, R = 10)

You can coerce a simulation setup to a data.frame with as.data.frame.

simData <- sim_base() %>% 
  sim_gen_x() %>% 
  sim_gen_e() %>% 
  as.data.frame
simData

Naming and structure

Components in a simulation setup should be added using the pipe operator %>% from magrittr. A component defines ‘when’ a specific function will be applied in a chain of functions. To add a component you can use one of: sim_gen, sim_resp, sim_comp_pop, sim_sample, sim_comp_sample, sim_agg and sim_comp_agg. They all expect a simulation setup as first argument and a function as second and will take care of the order in which the functions are called. The reason for this is the following:

setup <- sim_base() %>% 
  sim_gen_x() %>% 
  sim_gen_e() %>% 
  sim_resp_eq(y = 100 + 2 * x + e)

setup1 <- setup %>% sim_sample(sample_fraction(0.05))
setup2 <- setup %>% sim_sample(sample_number(5))

You can define a rudimentary scenario and only have to explain how scenarios differ. You do not have to redefine them. setup1 and setup2 only differ in the way samples are drawn. sim_sample will take care, that the sampling will take place at the appropriate place in the chain of functions no matter how setup was composed.

If you can’t remember all function names, use auto-complete. All functions to add components start with sim_. And all functions meant to be used in a given phase will start with the corresponding prefix, i.e. if you set the sampling scheme you use sim_sample – all functions to control sampling have the prefix sample.

Exploring the setup

You will want to check your results regularly when working with sim_setup objects. Some methods are supplied to do that:

show - this is the print method for S4-Classes. You don’t have to call show explicitly. You may have noticed that only a few lines of data are printed to the console if you evaluate a simulation setup. show will print the head of the resulting data of one simulation run.
plot - for visualizing the data
autoplot - Will imitate smoothScatter with ggplot2

setup <- sim_base_lmm()

plot(setup)
autoplot(setup)
autoplot(setup, "e")
autoplot(setup %>% sim_gen_vc())

sim_gen

Semi-custom data

saeSim has an interface to standard random number generators in R. If things get more complex you can always define new generator functions.

base_id(2, 3) %>% 
  sim_gen(gen_generic(rnorm, mean = 5, sd = 10, name = "x", groupVars = "idD"))

##   idD idU         x
## 1   1   1 -3.477177
## 2   1   2 -3.477177
## 3   1   3 -3.477177
## 4   2   1 13.219727
## 5   2   2 13.219727
## 6   2   3 13.219727

You can supply any random number generator to gen_generic and since we are in small area estimation you have an optional group variable to generate ‘area-level’ variables. Some short cuts for data generation are sim_gen_x, sim_gen_v and sim_gen_e which add normally distributed variables named ‘x’, ‘e’ and ‘v’ respectively. Also there are some function with the prefix ‘gen’ which will be extended in the future.

library(saeSim)
setup <- sim_base() %>% 
  sim_gen_x() %>% # Variable 'x'
  sim_gen_e() %>% # Variable 'e'
  sim_gen_v() %>% # Variable 'v' as a random-effect
  sim_gen(gen_v_sar(name = "vSp")) %>% # random-effect following a SAR(1)
  sim_resp_eq(y = 100 + x + v + vSp + e) # Computing 'y'
setup

##   idD idU         x         e         v      vSp         y
## 1   1   1 -3.261018 -5.797709 -1.293091 1.250643  90.89882
## 2   1   2  4.256857  4.773686 -1.293091 1.250643 108.98809
## 3   1   3 -3.071645  1.024299 -1.293091 1.250643  97.91021
## 4   1   4  5.208898  2.322837 -1.293091 1.250643 107.48929
## 5   1   5  1.908363  8.533583 -1.293091 1.250643 110.39950
## 6   1   6 -2.660212  3.759640 -1.293091 1.250643 101.05698

Contaminated data

For contaminated data you can use the same generator functions, however, instead of using sim_gen you use sim_gen_cont which will have some extra arguments to control the contamination. To extend the above setup with a contaminated spatially correlated error component you can add the following:

contSetup <- setup %>% 
  sim_gen_cont(
    gen_v_sar(sd = 40, name = "vSp"), # defining the model
    nCont = 0.05, # 5 per cent outliers
    type = "area", # whole areas are outliers, i.e. all obs within
    areaVar = "idD", # var name to identify domain
    fixed = TRUE # if in each iteration the same area is an outlier
  )

Note that the generator is the same but with a higher standard deviation. The argument nCont controls how much observations are contaminated. Values < 1 are interpreted as probability. A single number as the number of contaminated units (can be areas or observations in each area or observations). When given with length(nCont) > 1 it will be interpreted as number of contaminated observations in each area. Use the following example to see how these things play together:

base_id(3, 4) %>% 
  sim_gen_x() %>% 
  sim_gen_e() %>% 
  sim_gen_ec(mean = 0, sd = 150, name = "eCont", nCont = c(1, 2, 3)) %>%
  as.data.frame

##    idD idU          x           e      eCont   idC
## 1    1   1  2.1238928  2.85552209    0.00000 FALSE
## 2    1   2  1.8224797 -3.70320125    0.00000 FALSE
## 3    1   3 -0.7466757 -4.63509843    0.00000 FALSE
## 4    1   4  0.2864107 -0.51238608  174.16195  TRUE
## 5    2   1 -4.1711873 -1.64389880    0.00000 FALSE
## 6    2   2 -0.9948926  2.68706944    0.00000 FALSE
## 7    2   3 -4.4030395  0.08884468  131.30190  TRUE
## 8    2   4  1.4103819  0.77040266 -111.81580  TRUE
## 9    3   1 -0.3921937  0.88877112    0.00000 FALSE
## 10   3   2 -5.7884702 -1.54674283   47.70671  TRUE
## 11   3   3 -4.8635870  1.35621664 -306.40766  TRUE
## 12   3   4 -5.7893455  4.26584633  305.57991  TRUE

sim_comp

Here follow some examples how to add components for computation to a sim_setup. Three points can be accessed with

sim_comp_pop - add a computation before sampling
sim_comp_sample - add a computation after sampling
sim_comp_agg - add a computation after aggregation

base_id(2, 3) %>% 
  sim_gen_x() %>% 
  sim_gen_e() %>% 
  sim_gen_ec() %>% 
  sim_resp_eq(y = 100 + x + e) %>%
   # the mean in each domain:
  sim_comp_pop(comp_var(popMean = mean(y)), by = "idD")

##   idD idU          x         e   idC        y  popMean
## 1   1   1  0.5746668 6.6169891 FALSE 107.1917 108.2884
## 2   1   2  3.7688622 0.2257547 FALSE 103.9946 108.2884
## 3   1   3 12.6044689 1.0745909 FALSE 113.6791 108.2884
## 4   2   1  6.6417051 5.0116040 FALSE 111.6533 106.3608
## 5   2   2 -0.8421809 3.4132565 FALSE 102.5711 106.3608
## 6   2   3  4.4991934 0.3587781 FALSE 104.8580 106.3608

The function comp_var is a wrapper around dplyr::mutate so you can add simple data manipulations. The argument by is a little extension and lets you define operations in the scope of groups identified by a variable in the data. In this case the mean of the variable ‘y’ is computed for every group identified with the variable ‘idD’. This is done before sampling, hence the prefix ‘pop’ for population.

Add custom computation functions

By adding computation functions you can extend the functionality to wrap your whole simulation. This will separate the utility of this package from only generating data.

comp_linearPredictor <- function(dat) {
  dat$linearPredictor <- lm(y ~ x, dat) %>% predict
  dat
}

sim_base_lm() %>% 
  sim_comp_pop(comp_linearPredictor)

##   idD idU          x          e         y linearPredictor
## 1   1   1 -8.5041814 -0.6369861  90.85883        91.55255
## 2   1   2 -0.7317040  3.0870898 102.35539        99.25294
## 3   1   3  1.1077078  6.9927853 108.10049       101.07530
## 4   1   4  3.8226415  4.3782846 108.20093       103.76505
## 5   1   5  0.6150609 -0.1913978 100.42366       100.58722
## 6   1   6  0.3820305 -3.6001313  96.78190       100.35635

Or, should this be desirable, directly produce a list of lm objects or add them as attribute to the data. The intended way of writing functions is that they will return the modified data of class ‘data.frame’.

sim_base_lm() %>% 
  sim_comp_pop(function(dat) lm(y ~ x, dat)) %>%
  sim(R = 1)

## [[1]]
## 
## Call:
## lm(formula = y ~ x, data = dat)
## 
## Coefficients:
## (Intercept)            x  
##    100.0421       0.9872

comp_linearModelAsAttr <- function(dat) {
  attr(dat, "linearModel") <- lm(y ~ x, dat)
  dat
}

dat <- sim_base_lm() %>% 
  sim_comp_pop(comp_linearModelAsAttr) %>%
  as.data.frame

attr(dat, "linearModel")

## 
## Call:
## lm(formula = y ~ x, data = dat)
## 
## Coefficients:
## (Intercept)            x  
##     100.017        1.032

If you use any kind of sampling, the ‘linearPredictor’ can be added after sampling. This is where small area models are supposed to be applied.

sim_base_lm() %>% 
  sim_sample() %>%
  sim_comp_sample(comp_linearPredictor)

##   idD idU          x         e         y linearPredictor
## 1   1   1 -2.3521855 -2.075083  95.57273        97.97613
## 2   1  94 -6.8884013 -2.397076  90.71452        93.20934
## 3   1  68  5.0285137 -3.870309 101.15820       105.73200
## 4   1  19  3.0967386  4.419374 107.51611       103.70203
## 5   1  27 -0.4615954  8.400955 107.93936        99.96282
## 6   2  95  5.0186243  1.586789 106.60541       105.72161

Should you want to apply area level models, use sim_comp_agg instead.

sim_base_lm() %>% 
  sim_sample() %>%
  sim_agg() %>% 
  sim_comp_agg(comp_linearPredictor)

##   idD          x          e         y linearPredictor
## 1   1 -2.9161403  1.7102560  98.79412        96.76717
## 2   2 -0.2108174 -0.1187314  99.67045        99.52888
## 3   3  2.6877059 -1.5974012 101.09030       102.48782
## 4   4  0.8245103 -4.4691303  96.35538       100.58579
## 5   5  1.3990762  0.5101492 101.90923       101.17233
## 6   6  1.1361814  2.2870048 103.42319       100.90395

sim_sample

After the data generation you may want to draw a sample from the population. Use the function sim_sample() to add a sampling component to your sim_setup.

sample_number - wrapper around dplyr::sample_n
sample_fraction - wrapper around dplyr::sample_frac

base_id(3, 4) %>% 
  sim_gen_x() %>% 
  sim_sample(sample_number(1L))

##      idD idU         x
## 1      2   3 -2.117038
## NA    NA  NA        NA
## NA.1  NA  NA        NA
## NA.2  NA  NA        NA
## NA.3  NA  NA        NA
## NA.4  NA  NA        NA

base_id(3, 4) %>% 
  sim_gen_x() %>% 
  sim_sample(sample_number(1L, groupVars = "idD"))

##      idD idU         x
## 1      1   3 -5.349785
## 2      2   1 -9.711665
## 3      3   2  1.483272
## NA    NA  NA        NA
## NA.1  NA  NA        NA
## NA.2  NA  NA        NA

# simple random sampling:
sim_base_lm() %>% sim_sample(sample_number(size = 10L))

##   idD idU          x         e         y
## 1  79  72  2.3706682  2.014138 104.38481
## 2  22  15  2.9998000  3.032253 106.03205
## 3  64  64 -0.2648422 -9.362967  90.37219
## 4  57  31  6.0910778 -3.184816 102.90626
## 5  53  27 -1.8359776  2.840800 101.00482
## 6  45  26  0.5487789 -9.372022  91.17676

sim_base_lm() %>% sim_sample(sample_fraction(size = 0.05))

##   idD idU          x         e         y
## 1  80   7 -3.7332804  7.142498 103.40922
## 2 100  74  9.6384784  1.852246 111.49072
## 3  95  49 -1.7732996 -6.654482  91.57222
## 4  62  95  5.9321181  1.967965 107.90008
## 5  76  80 -0.9007322  3.792310 102.89158
## 6   3  75  4.6494503 -3.962043 100.68741

# srs in each domain/cluster
sim_base_lm() %>% sim_sample(sample_number(size = 10L, groupVars = "idD"))

##   idD idU          x          e         y
## 1   1  13 -1.0937285   3.430671 102.33694
## 2   1  23 -5.0083024 -12.542080  82.44962
## 3   1  26 -0.6701565   1.254721 100.58456
## 4   1  68  5.7348578   5.134796 110.86965
## 5   1  47  0.7213554   6.893006 107.61436
## 6   1  15 -3.4577262   3.535397 100.07767

sim_base_lm() %>% sim_sample(sample_fraction(size = 0.05, groupVars = "idD"))

##   idD idU          x         e         y
## 1   1  18  2.4307740 -2.500055  99.93072
## 2   1  25 -0.3409896 -1.197272  98.46174
## 3   1  51  3.5957433  3.326504 106.92225
## 4   1  33 -1.0225583  3.621166 102.59861
## 5   1  16 -0.5729424 -8.278747  91.14831
## 6   2  49  6.0450908  1.109298 107.15439