The paneldesc package provides a comprehensive set of
tools for analyzing panel (longitudinal) data. It helps you explore the
structure of your panel, examine missing value patterns, decompose
numeric variables into between‑ and within‑entity components, and
analyze transitions in categorical variables. The package is designed to
work seamlessly with data frames that have been marked with panel
structure using make_panel(), reducing repetitive
specification of entity and time identifiers.
This vignette walks you through the basic workflow using the built‑in production dataset, a simulated unbalanced panel of firms over six years.
For a comprehensive guide with detailed examples, case studies, and extended tutorials, please visit the package web-book: https://dtereshch.github.io/paneldesc-guides/.
If you haven’t installed the package yet, you can get the stable version from CRAN with install.packages("paneldesc"), or install the development version from GitHub.
The package includes a simulated dataset called
production. It contains information on 30 firms over up to
6 years, with variables such as sales,
capital, labor, industry, and
ownership. Missing values are present in some variables to
mimic real‑world data.
To avoid repeatedly specifying the entity and time variables (firm
and year), we create a panel_data object using
make_panel(). This adds metadata that many subsequent
functions will automatically use.
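To give a feel for what "adding metadata" can mean, here is a minimal base R sketch that stores the identifiers as attributes on a data frame. The helper names mark_panel() and count_entities() and the attribute names are hypothetical and chosen only for illustration; this is not how make_panel() is necessarily implemented.

```r
# Hypothetical helper (not the package's implementation): store the panel
# identifiers as attributes so that later code can look them up.
mark_panel <- function(data, entity, time) {
  stopifnot(is.data.frame(data), entity %in% names(data), time %in% names(data))
  attr(data, "panel_entity") <- entity
  attr(data, "panel_time") <- time
  data
}

toy <- data.frame(
  firm  = c(1, 1, 2, 2),
  year  = c(1, 2, 1, 2),
  sales = c(10, 12, 7, 9)
)
toy_panel <- mark_panel(toy, entity = "firm", time = "year")

# A downstream function reads the identifiers instead of asking for them again.
count_entities <- function(data) {
  length(unique(data[[attr(data, "panel_entity")]]))
}
count_entities(toy_panel)
```

The point of the design is that the identifiers travel with the data, so each later call only needs the data object itself.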
The first group of functions is designed to analyze the structure of the panel.
describe_dimensions() returns the number of rows,
distinct entities, distinct time periods, and substantive variables.
describe_periods() shows, for each time period, how many
entities have non‑missing data in any substantive variable, along with
their share in the total number of entities.
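If you want to see the logic behind this table, the following base R sketch reproduces the idea on a small made-up panel. The toy data and the object names are assumptions for illustration; the package may handle edge cases differently.

```r
toy <- data.frame(
  firm  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  year  = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  sales = c(10, 12, NA, 7, NA, 9, NA, 5, 6),
  labor = c(3, 4, NA, 2, 2, 3, NA, 1, 2)
)
substantive <- c("sales", "labor")

# A row counts as "present" if at least one substantive variable is observed.
present <- rowSums(!is.na(toy[substantive])) > 0

# Number of distinct entities present in each period.
count <- tapply(toy$firm[present], toy$year[present],
                function(x) length(unique(x)))

periods <- data.frame(
  year  = as.numeric(names(count)),
  count = as.vector(count),
  share = round(as.vector(count) / length(unique(toy$firm)), 3)
)
periods
```

In this toy panel, firm 1 drops out in year 3 and firm 3 is absent in year 1 because all of their substantive variables are missing there, while firm 2 still counts in year 2 because labor is observed.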
describe_periods(panel)
#>   year count share
#> 1    1    25 0.833
#> 2    2    28 0.933
#> 3    3    30 1.000
#> 4    4    29 0.967
#> 5    5    26 0.867
#> 6    6    19 0.633
describe_balance() provides summary statistics for the
distribution of entities per period and periods per entity.
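The two distributions can be sketched in base R as follows. The toy data are made up, the sketch assumes one row per observed entity-period pair, and summarize_counts() is a hypothetical helper rather than a package function.

```r
toy <- data.frame(
  firm = c(1, 1, 1, 2, 2, 3),
  year = c(1, 2, 3, 1, 2, 2)
)

# With one row per observed entity-period pair, simple counts are enough.
entities_per_period <- as.vector(table(toy$year))  # 2, 3, 1
periods_per_entity  <- as.vector(table(toy$firm))  # 3, 2, 1

summarize_counts <- function(x) {
  c(mean = mean(x), std = sd(x), min = min(x), max = max(x))
}

balance <- rbind(
  entities = summarize_counts(entities_per_period),
  periods  = summarize_counts(periods_per_entity)
)
round(balance, 3)
```

In a balanced panel both rows would have zero standard deviation and equal minimum and maximum, so any spread here is a quick signal of unbalancedness.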
describe_balance(panel)
#>   dimension   mean   std min max
#> 1  entities 26.167 3.971  19  30
#> 2   periods  5.233 0.935   3   6
plot_periods() creates a histogram of the number of time
periods covered by each entity.
describe_patterns() tabulates the distinct patterns of
presence/absence across time (e.g., which entities appear in which
years).
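As a rough illustration of how such patterns can be tabulated, the base R sketch below builds an entity-by-period presence matrix and counts identical rows. The toy data and object names are assumptions for illustration only.

```r
toy <- data.frame(
  firm = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
  year = c(1, 2, 3, 1, 2, 1, 2, 3, 2, 3)
)

# Entity-by-period presence matrix: 1 if the entity is observed, else 0.
tab <- table(toy$firm, toy$year)
presence <- matrix(as.integer(tab > 0), nrow = nrow(tab),
                   dimnames = dimnames(tab))

# Collapse each entity's row into a pattern string and tabulate the strings.
pattern_key <- apply(presence, 1, paste, collapse = "")
counts <- sort(table(pattern_key), decreasing = TRUE)

patterns <- data.frame(
  pattern = names(counts),
  count   = as.vector(counts),
  share   = round(as.vector(counts) / nrow(presence), 3)
)
patterns
```

Here firms 1 and 3 share the complete pattern "111", while firms 2 and 4 each have a pattern of their own.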
describe_patterns(panel)
#>   pattern 1 2 3 4 5 6 count share
#> 1       1 1 1 1 1 1 1    16 0.533
#> 2       2 1 1 1 1 1 0     5 0.167
#> 3       3 1 1 1 1 0 0     3 0.100
#> 4       4 0 0 1 1 1 1     2 0.067
#> 5       5 0 1 1 1 1 0     2 0.067
#> 6       6 0 1 1 1 1 1     1 0.033
#> 7       7 1 1 1 0 0 0     1 0.033
You can also visualize these patterns with a heatmap using
plot_patterns().
The second group of functions is aimed at analyzing missing values, taking into account the nature of panel data.
plot_missing() creates a heatmap showing the number of
missing values for each variable across all time periods. Darker cells
indicate more missing values.
summarize_missing() returns a table with overall missing
counts, shares, and the number of entities and periods affected per
variable.
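The columns of such a table can be approximated in base R as shown below. The sketch uses made-up data and assumes that missing values appear as explicit NA cells in existing rows; how the package treats absent rows is not covered here.

```r
toy <- data.frame(
  firm  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  year  = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  sales = c(10, NA, NA, 7, 8, 9, 4, 5, 6),
  labor = c(3, 4, 5, NA, 2, 3, 1, NA, 2)
)
vars <- c("sales", "labor")

missing_summary <- do.call(rbind, lapply(vars, function(v) {
  is_na <- is.na(toy[[v]])
  data.frame(
    variable = v,
    na_count = sum(is_na),
    na_share = round(mean(is_na), 3),
    entities = length(unique(toy$firm[is_na])),  # entities with any NA
    periods  = length(unique(toy$year[is_na]))   # periods with any NA
  )
}))
missing_summary
```

Both toy variables have two missing values, but they are concentrated in one firm for sales and spread over two firms for labor, which is exactly the distinction the entities and periods columns are meant to reveal.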
summarize_missing(panel)
#> Analyzing all variables: sales, capital, labor, industry, ownership
#>    variable na_count na_share entities periods
#> 1     sales       26    0.144       15       5
#> 2   capital       26    0.144       17       5
#> 3     labor       26    0.144       16       6
#> 4  industry       23    0.128       14       5
#> 5 ownership       23    0.128       14       5
describe_incomplete() lists entities that have at least
one missing value, with details on which variables are incomplete.
The third group of functions is aimed at analyzing numeric variables, taking into account the nature of panel data.
summarize_numeric() calculates basic statistics (count,
mean, std, min, max) for numeric variables.
summarize_numeric(panel)
#> Analyzing all numeric variables: sales, capital, labor
#>   variable count   mean    std   min     max
#> 1    sales   154 69.756 46.804 8.321 336.853
#> 2  capital   154 32.490 31.053 0.968 194.719
#> 3    labor   154 79.329 73.687 4.097 419.848
You can optionally group by another variable, which does not
necessarily have to be a panel identifier. Here we use
year.
summarize_numeric(panel, group = "year")
#> Analyzing all numeric variables: sales, capital, labor
#>    year variable count   mean     std    min     max
#> 1     1    sales    25 58.491  44.590  8.321 190.100
#> 2     1  capital    24 24.862  16.273  0.968  65.950
#> 3     1    labor    25 68.871  66.941  4.097 246.852
#> 4     2    sales    28 56.099  37.944 17.803 186.349
#> 5     2  capital    27 28.790  31.053  3.150 151.464
#> 6     2    labor    27 60.463  48.484 11.692 222.761
#> 7     3    sales    30 76.660  47.574 20.580 219.513
#> 8     3  capital    30 35.464  39.174  4.729 194.719
#> 9     3    labor    29 90.437  82.628  9.284 414.844
#> 10    4    sales    28 73.104  33.238 19.455 135.118
#> 11    4  capital    29 44.522  35.375  5.080 132.898
#> 12    4    labor    29 73.967  54.005 16.327 240.726
#> 13    5    sales    24 75.398  43.091 20.161 211.092
#> 14    5  capital    25 28.351  23.127  5.339  86.078
#> 15    5    labor    26 90.604  85.026 21.063 413.784
#> 16    6    sales    19 81.744  73.320 20.352 336.853
#> 17    6  capital    19 29.767  30.908  2.288 108.787
#> 18    6    labor    18 96.609 103.777 20.507 419.848
plot_heterogeneity() visualizes the distribution of a
numeric variable across groups. We use select = "sales" to
look at sales, and the function automatically uses the
entity and time variables as groups because panel has panel
attributes.
decompose_numeric() splits the total variance of numeric
variables into between‑entity and within‑entity components.
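To make the between/within idea concrete, here is a minimal base R sketch on made-up data. It follows one common convention (the one used by Stata's xtsum), in which the within component is the deviation from the entity mean with the overall mean added back; whether decompose_numeric() uses exactly this convention is an assumption. The sketch also ignores missing values.

```r
toy <- data.frame(
  firm  = c(1, 1, 1, 2, 2, 2, 3, 3),
  sales = c(10, 12, 14, 20, 22, 24, 30, 34)
)

grand_mean   <- mean(toy$sales)
entity_mean  <- ave(toy$sales, toy$firm)           # firm mean, aligned to rows
between_vals <- tapply(toy$sales, toy$firm, mean)  # one value per firm
within_vals  <- toy$sales - entity_mean + grand_mean

decomposition <- data.frame(
  dimension = c("overall", "between", "within"),
  std = c(sd(toy$sales), sd(between_vals), sd(within_vals)),
  min = c(min(toy$sales), min(between_vals), min(within_vals)),
  max = c(max(toy$sales), max(between_vals), max(within_vals))
)
decomposition
```

In this toy panel the firm means are 12, 22 and 32, so most of the variation is between firms, while each firm moves only two units around its own mean.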
decompose_numeric(panel)
#> Analyzing all numeric variables: sales, capital, labor
#>   variable dimension   mean    std     min     max   count
#> 1    sales   overall 69.756 46.804   8.321 336.853 154.000
#> 2    sales   between     NA 29.776  25.772 159.197  30.000
#> 3    sales    within     NA 35.862 -28.397 247.412   5.133
#> 4  capital   overall 32.490 31.053   0.968 194.719 154.000
#> 5  capital   between     NA 13.969   8.671  75.083  30.000
#> 6  capital    within     NA 27.701 -22.444 152.126   5.133
#> 7    labor   overall 79.329 73.687   4.097 419.848 154.000
#> 8    labor   between     NA 44.023  24.606 175.731  30.000
#> 9    labor    within     NA 59.561 -77.709 323.445   5.133
The last group of functions is aimed at analyzing factor (categorical) variables, taking into account the nature of panel data.
decompose_factor() breaks down the overall frequency of
each category into between‑entity (how many entities ever have that
category) and within‑entity (average share of time an entity spends in
that category) components.
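The three components can be sketched in base R as follows. The data are made up, and the within share is averaged only over entities that ever hold the category (the convention of Stata's xttab); treating that as the package's definition is an assumption.

```r
toy <- data.frame(
  firm      = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  ownership = c("private", "private", "public",
                "public", "public", "public",
                "private", "private", "private")
)

tab <- table(toy$firm, toy$ownership)     # entity-by-category counts
row_share <- prop.table(tab, margin = 1)  # share of each entity's periods
ever <- tab > 0                           # did the entity ever hold the category?

factor_decomp <- data.frame(
  category      = colnames(tab),
  count_overall = as.vector(colSums(tab)),
  share_overall = as.vector(colSums(tab)) / sum(tab),
  count_between = as.vector(colSums(ever)),
  share_between = as.vector(colSums(ever)) / nrow(tab),
  # Average time share, taken over entities that ever hold the category.
  share_within  = vapply(colnames(tab),
                         function(k) mean(row_share[ever[, k], k]),
                         numeric(1), USE.NAMES = FALSE)
)
factor_decomp
```

Note that the between shares can sum to more than one, because an entity that changes category over time is counted under every category it ever holds (firm 1 in this toy panel).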
decompose_factor(panel)
#> Analyzing all factor variables: industry, ownership
#>    variable   category count_overall share_overall count_between share_between
#> 1  industry Industry 1            63         0.401            13         0.433
#> 2  industry Industry 2            45         0.287            11         0.367
#> 3  industry Industry 3            49         0.312            10         0.333
#> 4 ownership    private            76         0.484            16         0.533
#> 5 ownership     public            55         0.350            13         0.433
#> 6 ownership      mixed            26         0.166             7         0.233
#>   share_within
#> 1        0.918
#> 2        0.809
#> 3        0.917
#> 4        0.898
#> 5        0.813
#> 6        0.724
summarize_transition() computes transition counts and
shares between states of a factor variable over consecutive time
periods. Here we analyze transitions in ownership.
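The underlying computation can be illustrated with a short base R sketch on made-up data. It pairs each observation with the next one for the same firm and keeps only pairs that are exactly one year apart; whether the package skips gaps in the same way is an assumption.

```r
toy <- data.frame(
  firm      = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  year      = c(1, 2, 3, 4, 1, 2, 3, 1, 3),
  ownership = c("private", "private", "public", "public",
                "public", "public", "private",
                "private", "public")
)
toy <- toy[order(toy$firm, toy$year), ]
n <- nrow(toy)

# Pair each row with the next row; keep pairs for the same firm one year apart.
same_firm   <- toy$firm[-1] == toy$firm[-n]
consecutive <- toy$year[-1] - toy$year[-n] == 1
keep <- same_firm & consecutive

states <- sort(unique(toy$ownership))
from <- factor(toy$ownership[-n][keep], levels = states)
to   <- factor(toy$ownership[-1][keep], levels = states)

transition_counts <- table(from = from, to = to)
transition_shares <- prop.table(transition_counts, margin = 1)  # rows sum to 1
transition_counts
round(transition_shares, 3)
```

Firm 3 contributes no transition here because its two observations are two years apart, so the table is built from five consecutive pairs.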