A `Tplyr` table is constructed of two main objects: a `tplyr_table` object, and the `tplyr_layer` objects that make up the different summaries to be performed.
The `tplyr_table` Object
The `tplyr_table` object is the conceptual “table” that contains all of the logic necessary to construct and display the data. `Tplyr` tables are made up of one or more layers. Each layer contains an instruction for a summary to be performed. The `tplyr_table` object contains those layers, along with the general data, metadata, and logic necessary to prepare the data before any layers are constructed.
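When a `tplyr_table` is created, it will contain the following bindings:
- `target`: the dataset upon which summaries will be performed
- `pop_data`: the data containing population information (defaults to the target dataset)
- `cols`: a categorical variable to present summaries grouped by column (in addition to `treat_var`)
- `table_where`: the `where` parameter provided, used to subset the `target` data
- `treat_var`: the variable used to distinguish treatment groups
- `header_n`: default header N values based on `treat_var` and any `cols` variables
- `pop_treat_var`: the treatment variable for `pop_data` (if different)
- `layers`: the container for individual layers of a `tplyr_table`
- `treat_grps`: additional treatment groups to be added to the summary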
`tplyr_table` gives you a basic interface to instantiate the object. Modifier functions are available to change individual parameters to suit your analysis.
    library(Tplyr)
    library(dplyr)  # provides %>% and vars()
    library(knitr)  # provides kable()
    t <- tplyr_table(adsl, TRT01P, where = SAFFL == "Y")
    t
    #> *** tplyr_table ***
    #> Target (data.frame):
    #>  Name: adsl
    #>  Rows: 254
    #>  Columns: 49
    #> pop_data (data.frame)
    #>  Name: target
    #>  Rows: 254
    #>  Columns: 49
    #> treat_var variable (quosure)
    #>  TRT01P
    #> header_n: header groups
    #> treat_grps groupings (list)
    #> Table Columns (cols):
    #> where: == SAFFL Y
    #> Number of layer(s): 0
    #> layer_output: 0
The `tplyr_layer` Object
Users of `Tplyr` interface with `tplyr_layer` objects using the `group_<type>` family of functions. This family specifies the type of summary that is to be performed within a layer. `count` layers are used to create summary counts of some discrete variable, `desc` layers create descriptive statistics, and `shift` layers summarize the counts for different changes in state.
A number of descriptive statistics summaries are already built into `Tplyr`, including n, mean, median, standard deviation, variance, min, max, interquartile range, Q1, Q3, and missing value counts. From these available summaries, the default presentation of a descriptive statistics layer will output ‘n’, ‘Mean (SD)’, ‘Median’, ‘Q1, Q3’, ‘Min, Max’, and ‘Missing’. You can change these summaries using `set_format_strings`, and you can also add your own summaries using `set_custom_summaries`. This allows you to easily implement any additional summary statistics you want presented.

    cnt <- group_count(t, AGEGR1)
    cnt
    #> *** count_layer ***
    #> Self: count_layer < 0x558d54adba30 >
    #> Parent: tplyr_table < 0x558d542f1830 >
    #> target_var:
    #>  AGEGR1
    #> by:
    #> where: TRUE
    #> Layer(s): 0
    dsc <- group_desc(t, AGE)
    dsc
    #> *** desc_layer ***
    #> Self: desc_layer < 0x558d54d759d0 >
    #> Parent: tplyr_table < 0x558d542f1830 >
    #> target_var:
    #>  AGE
    #> by:
    #> where: TRUE
    #> Layer(s): 0
    shf <- group_shift(t, vars(row=COMP8FL, column=COMP24FL))
    shf
    #> *** shift_layer ***
    #> Self: shift_layer < 0x558d528b1fd0 >
    #> Parent: tplyr_table < 0x558d542f1830 >
    #> target_var:
    #>  COMP8FL
    #>  COMP24FL
    #> by:
    #> where: TRUE
    #> Layer(s): 0
Everyone has their own style of coding, so we’ve tried to be flexible to an extent. Overall, `Tplyr` is built around tidy syntax, so all of our object construction supports piping with magrittr (i.e. `%>%`).
There are two ways to add layers to a `tplyr_table`: `add_layer` and `add_layers`. The difference is that `add_layer` allows you to construct the layer within the call to `add_layer`, whereas with `add_layers` you can attach multiple layers that have already been constructed upfront:
    t <- tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_count(AGEGR1, by = "Age categories n (%)")
      )
Within `add_layer`, the syntax for constructing the count layer for Age Categories was written on the fly. `add_layer` is special in that it also allows you to pipe modifier functions onto the layer being constructed:
    t <- tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_count(AGEGR1, by = "Age categories n (%)") %>%
          set_format_strings(f_str("xx (xx.x%)", n, pct)) %>%
          add_total_row()
      )
`add_layers`, on the other hand, lets you isolate the code to construct a particular layer if you want to separate things out more. Some might find this cleaner to work with if you have a large number of layers being constructed.
    t <- tplyr_table(adsl, TRT01P)
    l1 <- group_count(t, AGEGR1, by = "Age categories n (%)")
    l2 <- group_desc(t, AGE, by = "Age (years)")
    t <- add_layers(t, l1, l2)
Notice that when you construct the layers separately, you need to specify the table to which they belong. `add_layer` does this automatically. `tplyr_table` and `tplyr_layer` objects are built on environments, and the parent/child relationships are very important. This is why, even though the layer knows who its table parent is, the layers still need to be attached to the table (as the table doesn’t know who its children are). Advanced R does a very good job of explaining what environments in R are, their benefits, and how to use them.
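To make the parent/child point concrete, here is a minimal base R sketch. It does not use Tplyr itself: `tbl` and `lyr` are hypothetical stand-ins built from plain environments, and the `layers` list on `tbl` is only an assumed way of recording children.

```r
# Hypothetical stand-ins for a table and a layer; these are plain base R
# environments, not real Tplyr objects.
tbl <- new.env()
tbl$treat_var <- "TRT01P"
tbl$layers <- list()

# The child is created with the table as its enclosing (parent) environment.
lyr <- new.env(parent = tbl)
lyr$target_var <- "AGEGR1"

# The layer can see its parent and the parent's bindings...
knows_parent <- identical(parent.env(lyr), tbl)
inherited <- get("treat_var", envir = lyr)  # found by walking up to tbl

# ...but the table has no record of the layer until it is attached.
n_before <- length(tbl$layers)
tbl$layers <- c(tbl$layers, list(lyr))
n_after <- length(tbl$layers)
```

The lookup from child to parent comes for free with environments, while the link from parent to child has to be stored explicitly, which is the step the attachment performs in this sketch.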
Notice that when you construct a `tplyr_table` or a `tplyr_layer`, what displays is a summary of information about the table or layer. That’s because creating these objects constructs the metadata but does not process the actual data. This allows you to construct your table and make sure its pieces fit together before you do the data processing, and it gives you a container to hold all of this metadata and use it later if necessary.
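The idea of describing work first and computing later can be sketched in base R. This is only an illustration of deferred evaluation, not how Tplyr is implemented; `spec` and `run_spec` are hypothetical names, and `mtcars` is used simply because it ships with R.

```r
# A specification holds the data and an unevaluated summary expression.
spec <- list(
  data    = mtcars,
  summary = quote(mean(mpg))  # captured, not computed
)

# Nothing has been calculated yet; the expression is still a call object.
is_unevaluated <- is.call(spec$summary)

# Evaluation happens only when explicitly requested.
run_spec <- function(s) eval(s$summary, envir = s$data)
result <- run_spec(spec)
```

Because the specification is cheap to create and inspect, it can be checked and adjusted freely before any computation is triggered.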
To generate the data from a `tplyr_table` object, you use the function `build`:
    t <- tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_count(AGEGR1, by = "Age categories n (%)")
      )
    t %>%
      build() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
Age categories n (%) | <65 | 14 ( 16.3%) | 11 ( 13.1%) | 8 ( 9.5%) | 1 | 1 | 1 |
Age categories n (%) | >80 | 30 ( 34.9%) | 18 ( 21.4%) | 29 ( 34.5%) | 1 | 1 | 2 |
Age categories n (%) | 65-80 | 42 ( 48.8%) | 55 ( 65.5%) | 47 ( 56.0%) | 1 | 1 | 3 |
But there’s more you can get from `Tplyr`. It’s great to have the formatted numbers, but what about the numeric data behind the scenes? What if you want to calculate your own statistics based off of the counts? You can get that information as well using `get_numeric_data`. This returns the numeric data from each layer as a list of data frames:
    get_numeric_data(t)
    #> [[1]]
    #> # A tibble: 9 x 5
    #>   TRT01P               `"Age categories n (%)"` summary_var     n total
    #>   <chr>                <chr>                    <chr>       <dbl> <int>
    #> 1 Placebo              Age categories n (%)     <65            14    86
    #> 2 Placebo              Age categories n (%)     >80            30    86
    #> 3 Placebo              Age categories n (%)     65-80          42    86
    #> 4 Xanomeline High Dose Age categories n (%)     <65            11    84
    #> 5 Xanomeline High Dose Age categories n (%)     >80            18    84
    #> 6 Xanomeline High Dose Age categories n (%)     65-80          55    84
    #> 7 Xanomeline Low Dose  Age categories n (%)     <65             8    84
    #> 8 Xanomeline Low Dose  Age categories n (%)     >80            29    84
    #> 9 Xanomeline Low Dose  Age categories n (%)     65-80          47    84
By storing pertinent information, you can get more out of a `Tplyr` object than processed data for display. And by specifying when you want to get data out of `Tplyr`, we can save you from repeatedly processing data while you’re constructing your outputs, which is particularly useful when that computation starts taking time.
The bulk of `Tplyr` coding comes from constructing your layers and specifying the work you want to be done. Before we get into this, it’s important to discuss how `Tplyr` handles string formatting.
String Formatting in `Tplyr`
String formatting in `Tplyr` is controlled by an object called an `f_str`, which is also the name of the function you use to create these formats. To set these format strings into a `tplyr_layer`, you use the function `set_format_strings`, and this usage varies slightly between layer types (which is covered in other vignettes).
So why is this object necessary? Consider this example:
    t <- tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_desc(AGE, by = "Age (years)") %>%
          set_format_strings(
            'n'         = f_str('xx', n),
            'Mean (SD)' = f_str('xx.xx (xx.xxx)', mean, sd)
          )
      )
    t %>%
      build() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
Age (years) | n | 86 | 84 | 84 | 1 | 1 | 1 |
Age (years) | Mean (SD) | 75.21 ( 8.590) | 74.38 ( 7.886) | 75.67 ( 8.286) | 1 | 1 | 2 |
In a perfect world, the `f_str` calls wouldn’t be necessary, but in reality they allow us to infer a great deal of information from very few user inputs. In the calls that you see above:
- The row labels in the `row_label2` column are taken from the left side of each `=` in `set_format_strings`
- The string formats, including integer width and decimal precision, are taken from the format string given as the first parameter of each `f_str` call
- The remaining parameters of each `f_str` call determine the descriptive statistic summaries that will be performed. This is connected to a number of default summaries available within `Tplyr`, but you can also create your own summaries (covered in other vignettes). The default summaries that are built in include:
  - `n` = Number of observations
  - `mean` = Mean
  - `sd` = Standard Deviation
  - `var` = Variance
  - `iqr` = Inter Quartile Range
  - `q1` = 1st quartile
  - `q3` = 3rd quartile
  - `min` = Minimum value
  - `max` = Maximum value
  - `missing` = Count of NA values
- When two summaries are placed in the same `f_str` call, then those two summaries are formatted into the same string. This allows you to do a `Mean (SD)` type format where both numbers appear.
This simple user input controls a significant amount of work in the back end of the data processing, and the `f_str` object allows that metadata to be collected.
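As a rough illustration of how much a format string encodes, the sketch below fills `x` placeholders using base R. `apply_format` is a hypothetical helper, not a Tplyr function, and Tplyr's own parsing may differ. The sketch only assumes that each run of `x` characters sets a field width and that the `x` characters after a decimal point set the precision.

```r
# Hypothetical helper (not part of Tplyr): fill each "xx.x"-style group in
# `fmt` with the corresponding number, padded to the group's width.
apply_format <- function(fmt, ...) {
  values <- c(...)
  m <- gregexpr("x+(\\.x+)?", fmt)
  groups <- regmatches(fmt, m)[[1]]
  stopifnot(length(groups) == length(values))
  filled <- vapply(seq_along(groups), function(i) {
    g <- groups[i]
    # Decimal places = number of x's after the dot, or 0 if there is no dot.
    decimals <- if (grepl(".", g, fixed = TRUE)) nchar(sub("^x+\\.", "", g)) else 0L
    formatC(values[i], width = nchar(g), digits = decimals, format = "f")
  }, character(1))
  regmatches(fmt, m) <- list(filled)
  fmt
}

mean_sd <- apply_format("xx.xx (xx.xxx)", 75.21, 8.590)
n_pct   <- apply_format("xx (xx.x)", 8, 9.5)
```

Under these assumptions, one short string carries the field widths, the decimal precision, and the literal text (parentheses, spaces) that surrounds the numbers.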
`f_str` objects are also used with count layers to control the data presentation. Instead of specifying the summaries performed, you use `n`, `pct`, `distinct`, and `distinct_pct` for your parameters and specify how you would like the values displayed. Using `distinct` and `distinct_pct` must be combined with specifying a `distinct_by` variable using `set_distinct_by`.
    tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_count(AGEGR1, by = "Age categories") %>%
          set_format_strings(f_str('xx (xx.x)', n, pct))
      ) %>%
      build() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
Age categories | <65 | 14 (16.3) | 11 (13.1) | 8 ( 9.5) | 1 | 1 | 1 |
Age categories | >80 | 30 (34.9) | 18 (21.4) | 29 (34.5) | 1 | 1 | 2 |
Age categories | 65-80 | 42 (48.8) | 55 (65.5) | 47 (56.0) | 1 | 1 | 3 |
    tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_count(AGEGR1, by = "Age categories") %>%
          set_format_strings(f_str('xx', n))
      ) %>%
      build() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
Age categories | <65 | 14 | 11 | 8 | 1 | 1 | 1 |
Age categories | >80 | 30 | 18 | 29 | 1 | 1 | 2 |
Age categories | 65-80 | 42 | 55 | 47 | 1 | 1 | 3 |
Really, format strings allow you to present your data however you like.
    tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_count(AGEGR1, by = "Age categories") %>%
          set_format_strings(f_str('xx (•◡•) xx.x%', n, pct))
      ) %>%
      build() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
Age categories | <65 | 14 (•◡•) 16.3% | 11 (•◡•) 13.1% | 8 (•◡•) 9.5% | 1 | 1 | 1 |
Age categories | >80 | 30 (•◡•) 34.9% | 18 (•◡•) 21.4% | 29 (•◡•) 34.5% | 1 | 1 | 2 |
Age categories | 65-80 | 42 (•◡•) 48.8% | 55 (•◡•) 65.5% | 47 (•◡•) 56.0% | 1 | 1 | 3 |
But should you? Probably not.
As covered under string formatting, `set_format_strings` controls a great deal of what happens within a descriptive statistics layer. Note that there are some built-in defaults for what’s output:
    tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_desc(AGE, by = "Age (years)")
      ) %>%
      build() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
Age (years) | n | 86 | 84 | 84 | 1 | 1 | 1 |
Age (years) | Mean (SD) | 75.2 ( 8.59) | 74.4 ( 7.89) | 75.7 ( 8.29) | 1 | 1 | 2 |
Age (years) | Median | 76.0 | 76.0 | 77.5 | 1 | 1 | 3 |
Age (years) | Q1, Q3 | 69, 82 | 71, 80 | 71, 82 | 1 | 1 | 4 |
Age (years) | Min, Max | 52, 89 | 56, 88 | 51, 88 | 1 | 1 | 5 |
Age (years) | Missing | 0 | 0 | 0 | 1 | 1 | 6 |
To override these defaults, just specify the summaries that you want to be performed using `set_format_strings` as described above. But what if `Tplyr` doesn’t have a built-in function to do the summary statistic that you want to see? Well, you can make your own! This is where `set_custom_summaries` comes into play. Let’s say you want to derive a geometric mean.
    tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_desc(AGE, by = "Age (years)") %>%
          set_custom_summaries(
            geometric_mean = exp(sum(log(.var[.var > 0]), na.rm=TRUE) / length(.var))
          ) %>%
          set_format_strings(
            'Geometric Mean (SD)' = f_str('xx.xx (xx.xxx)', geometric_mean, sd)
          )
      ) %>%
      build() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
Age (years) | Geometric Mean (SD) | 74.70 ( 8.590) | 73.94 ( 7.886) | 75.18 ( 8.286) | 1 | 1 | 1 |
In `set_custom_summaries`, first you name the summary being performed. This is important: that name is what you use in the `f_str` call to incorporate it into a format. Next, you program or call the function desired. In the background, this is used in a call to `dplyr::summarize`, so use similar syntax. Use the variable name `.var` in your custom summary function. This is necessary because it allows a generic variable name to be used when multiple target variables are specified, so the function can be applied to each target variable.
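The geometric mean expression can be tried on plain vectors outside of a layer. The sketch below wraps the same expression in a function (`geo_mean` is a hypothetical name) and uses made-up inputs; it also shows one behaviour worth knowing, namely that non-positive values are dropped from the sum but still counted in `length(.var)`.

```r
# Same expression as the custom summary, wrapped in a function so it can be
# tried on a plain vector (geo_mean is a hypothetical name).
geo_mean <- function(.var) {
  exp(sum(log(.var[.var > 0]), na.rm = TRUE) / length(.var))
}

x <- c(1, 10, 100)
gm <- geo_mean(x)               # geometric mean of 1, 10, 100
reference <- exp(mean(log(x)))  # textbook definition for positive values

# With a non-positive value present, the zero is dropped from the sum
# but still counted in length(.var), which lowers the result.
y <- c(0, 4, 16)
gm_with_zero <- geo_mean(y)
gm_positive_only <- exp(mean(log(y[y > 0])))
```

For strictly positive data the two definitions agree. If your data can contain zeros or missing values, decide which denominator you want before relying on the summary.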
Sometimes there’s a need to present multiple variables summarized side by side. `Tplyr` allows you to do this as well.
    tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_desc(vars(AGE, AVGDD), by = "Age and Avg. Daily Dose")
      ) %>%
      build() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | var2_Placebo | var2_Xanomeline High Dose | var2_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|---|---|---|
Age and Avg. Daily Dose | n | 86 | 84 | 84 | 86 | 84 | 84 | 1 | 1 | 1 |
Age and Avg. Daily Dose | Mean (SD) | 75.2 ( 8.59) | 74.4 ( 7.89) | 75.7 ( 8.29) | 0.0 ( 0.00) | 71.6 ( 8.11) | 54.0 ( 0.00) | 1 | 1 | 2 |
Age and Avg. Daily Dose | Median | 76.0 | 76.0 | 77.5 | 0.0 | 75.1 | 54.0 | 1 | 1 | 3 |
Age and Avg. Daily Dose | Q1, Q3 | 69, 82 | 71, 80 | 71, 82 | 0, 0 | 70, 77 | 54, 54 | 1 | 1 | 4 |
Age and Avg. Daily Dose | Min, Max | 52, 89 | 56, 88 | 51, 88 | 0, 0 | 54, 79 | 54, 54 | 1 | 1 | 5 |
Age and Avg. Daily Dose | Missing | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 6 |
`Tplyr` summarizes both variables and merges them together. This makes creating tables where you need to compare BASE, AVAL, and CHG next to each other nice and simple. Note the use of `vars`: in any situation where you’d like to use multiple variable names in a parameter, use `dplyr::vars` to specify the variables. You can use text strings in the calls to `dplyr::vars` as well.
Count layers generally allow you to create n and n (%) count type summaries. There are a few extra features here as well. Let’s say that you want a total row within your counts. This can be done with `add_total_row()`:
    tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_count(AGEGR1, by = "Age categories") %>%
          add_total_row()
      ) %>%
      build() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
Age categories | <65 | 14 ( 16.3%) | 11 ( 13.1%) | 8 ( 9.5%) | 1 | 1 | 1 |
Age categories | >80 | 30 ( 34.9%) | 18 ( 21.4%) | 29 ( 34.5%) | 1 | 1 | 2 |
Age categories | 65-80 | 42 ( 48.8%) | 55 ( 65.5%) | 47 ( 56.0%) | 1 | 1 | 3 |
Age categories | NA | 86 (100.0%) | 84 (100.0%) | 84 (100.0%) | 1 | 1 | 4 |
Sometimes it’s also necessary to count summaries based on distinct values. `Tplyr` allows you to do this as well with `set_distinct_by`:
    tplyr_table(adae, TRTA) %>%
      add_layer(
        group_count('Subjects with at least one adverse event') %>%
          set_distinct_by(USUBJID) %>%
          set_format_strings(f_str('xx', n))
      ) %>%
      build() %>%
      kable()
row_label1 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 |
---|---|---|---|---|---|
Subjects with at least one adverse event | 47 | 77 | 76 | 1 | NA |
There’s another trick going on here: to create a summary with row label text like you see above, text strings can be used as the target variables. Here, we use this in combination with `set_distinct_by` to count distinct subjects.
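The difference between counting rows and counting distinct subjects can be seen with a few lines of base R. The data frame `ae` below is made up for illustration and only borrows the column names `TRTA` and `USUBJID`; it is not the `adae` dataset, and this is not how Tplyr computes its counts internally.

```r
# Made-up event-level data: one row per adverse event, not per subject.
ae <- data.frame(
  TRTA    = c("Placebo", "Placebo", "Placebo", "Active", "Active"),
  USUBJID = c("01", "01", "02", "03", "03"),
  stringsAsFactors = FALSE
)

# Event counts: every row contributes.
n_events <- tapply(ae$USUBJID, ae$TRTA, length)

# Distinct counts: each subject contributes once per treatment group.
n_subjects <- tapply(ae$USUBJID, ae$TRTA, function(id) length(unique(id)))
```

In this toy data, subject "01" has two events in the Placebo group, so the event count for Placebo is 3 while the distinct subject count is 2.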
Adverse event tables often call for counting AEs within something like a body system, and then counting the actual events within that body system. `Tplyr` has means of making this simple for the user as well.
    tplyr_table(adae, TRTA) %>%
      add_layer(
        group_count(vars(AEBODSYS, AEDECOD))
      ) %>%
      build() %>%
      head() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
CARDIAC DISORDERS | CARDIAC DISORDERS | 5 ( 10.6%) | 6 ( 7.8%) | 6 ( 7.9%) | 1 | 1 | Inf |
CARDIAC DISORDERS | ATRIAL FIBRILLATION | 0 ( 0.0%) | 0 ( 0.0%) | 1 ( 1.3%) | 1 | 1 | 1 |
CARDIAC DISORDERS | ATRIAL FLUTTER | 0 ( 0.0%) | 1 ( 1.3%) | 0 ( 0.0%) | 1 | 1 | 2 |
CARDIAC DISORDERS | ATRIAL HYPERTROPHY | 1 ( 2.1%) | 0 ( 0.0%) | 0 ( 0.0%) | 1 | 1 | 3 |
CARDIAC DISORDERS | BUNDLE BRANCH BLOCK RIGHT | 1 ( 2.1%) | 0 ( 0.0%) | 0 ( 0.0%) | 1 | 1 | 4 |
CARDIAC DISORDERS | CARDIAC FAILURE CONGESTIVE | 1 ( 2.1%) | 0 ( 0.0%) | 0 ( 0.0%) | 1 | 1 | 5 |
Lastly, let’s talk about shift layers. A common example of this would be looking at a subject’s lab levels at baseline versus some designated evaluation point. This would tell us, for example, how many subjects were high at baseline for a lab test vs. after an intervention has been introduced. The shift layer in Tplyr is intended for creating shift tables that show these data as a matrix, where one state will be presented in rows and the other in columns. Let’s look at an example.
    # Tplyr can use factor orders to dummy values and order presentation
    adlb$ANRIND <- factor(adlb$ANRIND, c("L", "N", "H"))
    adlb$BNRIND <- factor(adlb$BNRIND, c("L", "N", "H"))
    tplyr_table(adlb, TRTP, where = AVISIT == "Week 2" & PARAMCD == "URATE") %>%
      add_layer(
        group_shift(vars(row=BNRIND, column=ANRIND), by=PARAM) %>%
          set_format_strings(f_str("xx (xxx)", n, pct))
      ) %>%
      build() %>%
      kable()
row_label1 | row_label2 | var1_Placebo_L | var1_Placebo_N | var1_Placebo_H | ord_layer_index | ord_layer_1 |
---|---|---|---|---|---|---|
Urate (umol/L) | L | 0 | 0 | 0 | 1 | 28 |
Urate (umol/L) | N | 0 | 3 | 0 | 1 | 28 |
Urate (umol/L) | H | 0 | 0 | 0 | 1 | 28 |
The underlying process of shift tables is the same as count layers: we’re counting the number of occurrences of something by a set of grouping variables. The difference is that the `group_shift` API keeps the same basic interface as other layers but translates your target variables into the row variable and the column variable. Furthermore, there is some enhanced control over how denominators should behave that is necessary for a shift layer.
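The row-by-column counting behind a shift table, and the role factor levels play in filling in empty cells, can be sketched with base R's `table()`. The two vectors below are made-up values named after the `BNRIND` and `ANRIND` variables; this illustrates the idea only and is not Tplyr's implementation.

```r
# Made-up baseline and post-baseline reference range indicators.
bnrind <- factor(c("N", "N", "H", "N"), levels = c("L", "N", "H"))
anrind <- factor(c("N", "H", "H", "N"), levels = c("L", "N", "H"))

# Cross-tabulate: baseline state in rows, post-baseline state in columns.
# Levels with no observations (here "L") still appear, filled with zeros,
# and rows and columns follow the factor level order.
shift <- table(row = bnrind, column = anrind)
```

Because the levels are declared up front, the result is always a full 3 by 3 matrix in L, N, H order, even though no "L" values occur in the data.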