Three weeks ago, we published the alpha release of Tplyr. Since then, we’ve been hard at work introducing some fresh features to make Tplyr more flexible, more effective, and more useful as a tool for you and your organization. Some of these additions are less alluring on the surface, but they introduce critical functionality needed for Tplyr to accomplish the goals we’ve laid out.
The enhancements that we’re going to cover in this document are as follows:

- General updates:
  - Calculate your header N counts based on the population dataset or the target dataset. The alpha release had an option to set the population data but this wasn’t actually used anywhere in the internals.
  - Use these header N counts as token replacements when using the `add_column_headers` function.
  - Order variables are now added to the built dataset to allow you to sort the output dataset as you wish with numeric variables.
- Count layer updates:
  - Optionally use the population data N counts as denominators for percent calculation.
  - For multi-level count summaries, nest the row label columns together to present both row labels in a single column.
  - You can now present both distinct and non-distinct counts instead of one or the other.
  - Sorting options allow you to order results from the target variable values or from derived counts within a specified column.
  - Risk difference calculations can now be added as additional columns, with flexible options for presentation.
- Descriptive statistics layer updates:
  - The custom summary functionality has been updated to apply to multi-variable summaries, which results in an interface change.
  - Automatic decimal precision has been added to allow you to base the presentation on the precision of the data as collected.
Let’s dig into each of these updates one by one.
The general updates are best presented within the context of a count layer. An important feature that was missing from the alpha release was the capability to control the denominator being used in a summary. For example, if you’re summarizing adverse events, not all subjects will necessarily appear in the adverse events dataset. A subject only exists in the adverse events dataset if they had an adverse event. Therefore, if you count the subjects in that dataset for your denominator, the denominator will be missing subjects. This is exactly why we included bindings for a separate population dataset in the first place, and now the population data are properly utilized.
Consider the following:
    t <- tplyr_table(adae, TRTA, where = SAFFL == "Y") %>%
      set_pop_data(adsl) %>% # Specify the population dataset
      set_pop_treat_var(TRT01A) %>% # Specify the treatment variable within the population
      add_layer(
        group_count(AEDECOD) %>% # Create the count layer
          set_distinct_by(USUBJID) %>% # Specify the variable to determine a distinct count by
          set_format_strings(f_str('xx (xx.x%)', distinct, distinct_pct)) # Set up the presentation
      )

    invisible(build(t))
    header_n(t)
    #> # A tibble: 3 x 2
    #>   TRT01A                   n
    #>   <fct>                <dbl>
    #> 1 Placebo                 86
    #> 2 Xanomeline High Dose    84
    #> 3 Xanomeline Low Dose     84
You can see that these counts are indeed coming from ADSL and not ADAE:
    adae %>%
      filter(SAFFL == "Y") %>%
      distinct(TRTA, USUBJID) %>%
      count(TRTA)
    #> # A tibble: 3 x 2
    #>   TRTA                     n
    #>   <chr>                <int>
    #> 1 Placebo                 21
    #> 2 Xanomeline High Dose    42
    #> 3 Xanomeline Low Dose     42
Note also the use of the new `f_str` variable names `distinct` and `distinct_pct`. We made these names separate from `n` and `pct` to make it explicit when you want to use one type of count or the other. This further allows you to create tables where you mix and match. Consider this common AE table structure:
    t <- tplyr_table(adae, TRTA, where = SAFFL == "Y") %>%
      set_pop_data(adsl) %>% # Specify the population dataset
      set_pop_treat_var(TRT01A) %>% # Specify the treatment variable within the population
      add_layer(
        group_count(AEDECOD) %>% # Create the count layer
          set_distinct_by(USUBJID) %>% # Specify the variable to determine a distinct count by
          set_format_strings(f_str('xx (xx.x%) [x]', distinct, distinct_pct, n)) # Set up the presentation - with event counts
      )

    kable(head(build(t)))
row_label1 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 |
---|---|---|---|---|---|
ACTINIC KERATOSIS | 0 ( 0.0%) [0] | 1 ( 2.4%) [1] | 0 ( 0.0%) [0] | 1 | 1 |
ALOPECIA | 1 ( 4.8%) [1] | 0 ( 0.0%) [0] | 0 ( 0.0%) [0] | 1 | 2 |
BLISTER | 0 ( 0.0%) [0] | 1 ( 2.4%) [2] | 5 (11.9%) [8] | 1 | 3 |
COLD SWEAT | 1 ( 4.8%) [3] | 0 ( 0.0%) [0] | 0 ( 0.0%) [0] | 1 | 4 |
DERMATITIS ATOPIC | 1 ( 4.8%) [1] | 0 ( 0.0%) [0] | 0 ( 0.0%) [0] | 1 | 5 |
DERMATITIS CONTACT | 0 ( 0.0%) [0] | 0 ( 0.0%) [0] | 1 ( 2.4%) [2] | 1 | 6 |
In this table, we present the distinct counts by subject, the distinct percent (i.e. the number of subjects who experienced an adverse event out of the total treatment group, as determined by the population dataset), and then the total number of events. When the population dataset is specified, the N counts from each treatment group will be used as the denominator for `distinct_pct`. Otherwise, the count of subjects within the target dataset (in this case, ADAE) will be used for `distinct_pct`.
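To make the difference between the three numbers concrete, here is a minimal base R sketch on toy data. The data frames `adsl_toy` and `adae_toy` and all values in them are invented for illustration, and the code only mirrors the arithmetic described above; it is not how Tplyr computes these values internally.

```r
# Toy stand-ins for ADSL (one row per subject) and ADAE (one row per event).
adsl_toy <- data.frame(
  USUBJID = paste0("S", 1:6),
  TRT01A  = rep(c("Placebo", "Active"), each = 3),
  stringsAsFactors = FALSE
)
adae_toy <- data.frame(
  USUBJID = c("S1", "S1", "S4", "S5", "S5", "S5"),
  TRTA    = c("Placebo", "Placebo", "Active", "Active", "Active", "Active"),
  AEDECOD = "BLISTER",
  stringsAsFactors = FALSE
)

# Header N: subjects per treatment group in the population data.
pop_n <- c(table(adsl_toy$TRT01A))

# n: number of events per treatment group.
n_events <- c(table(adae_toy$TRTA))

# distinct: number of unique subjects with the event per treatment group.
subjects   <- unique(adae_toy[, c("USUBJID", "TRTA")])
n_distinct <- c(table(subjects$TRTA))

# distinct_pct, using the population N as the denominator.
distinct_pct <- 100 * n_distinct / pop_n[names(n_distinct)]

print(rbind(n_distinct, distinct_pct, n_events))
```

In this toy example every subject in `adae_toy` had the event, so a denominator taken from the AE data alone would give 100% in both groups, which is the situation the population dataset is meant to avoid.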
In addition to population data updates, we’ve also enhanced the capabilities of `add_column_headers` to work with the `header_n` values. It’s extremely common to print the header N counts within your column headers. Now, we’ve made that easy.
    t <- tplyr_table(adae, TRTA, where = SAFFL == "Y") %>%
      set_pop_data(adsl) %>% # Specify the population dataset
      set_pop_treat_var(TRT01A) %>% # Specify the treatment variable within the population
      set_pop_where(SAFFL == "Y") %>%
      add_layer(
        group_count(AEDECOD) %>% # Create the count layer
          set_distinct_by(USUBJID) %>% # Specify the variable to determine a distinct count by
          set_format_strings(f_str('xx (xx.x%)', distinct, distinct_pct)) # Set up the presentation
      )

    # Header labels follow the column order: Placebo, High Dose, Low Dose
    build(t) %>%
      select(-starts_with('ord')) %>%
      add_column_headers(' | Placebo N=(**Placebo**) | Xan.High N=(**Xanomeline High Dose**) | Xan.Low N=(**Xanomeline Low Dose**)', header_n(t)) %>%
      head() %>%
      kable()

row_label1 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose |
---|---|---|---|
| | Placebo N=(86) | Xan.High N=(84) | Xan.Low N=(84) |
ACTINIC KERATOSIS | 0 ( 0.0%) | 1 ( 2.4%) | 0 ( 0.0%) |
ALOPECIA | 1 ( 4.8%) | 0 ( 0.0%) | 0 ( 0.0%) |
BLISTER | 0 ( 0.0%) | 1 ( 2.4%) | 5 (11.9%) |
COLD SWEAT | 1 ( 4.8%) | 0 ( 0.0%) | 0 ( 0.0%) |
DERMATITIS ATOPIC | 1 ( 4.8%) | 0 ( 0.0%) | 0 ( 0.0%) |
After the table is built and your columns are ordered, you can extract the header N counts from your `tplyr_table` object using the `header_n` function. When this is passed as a parameter to `add_column_headers`, the function will use the names and values from the `header_n` output as token replacements. Simply use the names of the `header_n` vector as your token, surrounded by "**" on both sides. Those values will be replaced as you see in the output above. Any named numeric vector will work in the `header_n` parameter of `add_column_headers`, but the `header_n` function in Tplyr makes this simple.
You may have noticed the addition of `ord` columns on the datasets output so far. Tplyr now provides order columns built into the table returned to you via the `build` function. This follows a few basic principles:

- Each layer gets an `ord_layer_index` column that indexes the layer itself, following the order in which the layer was added to the `tplyr_table` object.
- `by` variables will be sorted based on:
  - A `<VAR>N` variable, if one exists.
  - Otherwise, the factor order, if the variable is a factor.
  - Otherwise, the alphanumeric sorting of the `by` variable will be used.

All layers will function similarly up until this point. The sorting of results is where things differ based on the layer type. Descriptive statistic layers are simple - the order in which the format strings are presented controls the order in which the results will be displayed. This means that if you enter the format strings in the order ‘n’, ‘Mean (SD)’, ‘Median’, then the rows will present the summaries in that order.
Count layers are where things get more complicated. There are multiple scenarios for how one might wish to sort a count table. You may want to sort by a companion `<VAR>N` variable, by the order of a factor, or by the derived counts themselves.

Tplyr’s got your back! We support each of these different scenarios. Let’s use count layers to explore how all of this works.
    t <- tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_count(RACE, by = SEX)
      ) %>%
      add_layer(
        group_count(ETHNIC, by = SEX)
      )

    build(t) %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
F | AMERICAN INDIAN OR ALASKA NATIVE | 0 ( 0.0%) | 0 ( 0.0%) | 0 ( 0.0%) | 1 | 1 | 1 |
F | BLACK OR AFRICAN AMERICAN | 5 ( 5.8%) | 6 ( 7.1%) | 6 ( 7.1%) | 1 | 1 | 2 |
F | WHITE | 48 ( 55.8%) | 34 ( 40.5%) | 44 ( 52.4%) | 1 | 1 | 3 |
M | AMERICAN INDIAN OR ALASKA NATIVE | 0 ( 0.0%) | 1 ( 1.2%) | 0 ( 0.0%) | 1 | 2 | 1 |
M | BLACK OR AFRICAN AMERICAN | 3 ( 3.5%) | 3 ( 3.6%) | 0 ( 0.0%) | 1 | 2 | 2 |
M | WHITE | 30 ( 34.9%) | 40 ( 47.6%) | 34 ( 40.5%) | 1 | 2 | 3 |
F | HISPANIC OR LATINO | 2 ( 2.3%) | 1 ( 1.2%) | 4 ( 4.8%) | 2 | 1 | 1 |
F | NOT HISPANIC OR LATINO | 51 ( 59.3%) | 39 ( 46.4%) | 46 ( 54.8%) | 2 | 1 | 2 |
M | HISPANIC OR LATINO | 1 ( 1.2%) | 2 ( 2.4%) | 2 ( 2.4%) | 2 | 2 | 1 |
M | NOT HISPANIC OR LATINO | 32 ( 37.2%) | 42 ( 50.0%) | 32 ( 38.1%) | 2 | 2 | 2 |
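In this very simple example, alongside `ord_layer_index` you see two order columns: `ord_layer_1`, which orders the `by` variable SEX, and `ord_layer_2`, which orders the results of the target variable. The result order can also be taken from a companion numeric variable: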
    t <- tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_count(RACE, by = SEX) %>%
          set_order_count_method('byvarn') # Specify to look for a RACEN for result sorting
      )

    build(t) %>%
      arrange(ord_layer_index, ord_layer_1, ord_layer_2) %>% # Sort using the Tplyr order variables
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
F | WHITE | 48 ( 55.8%) | 34 ( 40.5%) | 44 ( 52.4%) | 1 | 1 | 1 |
F | BLACK OR AFRICAN AMERICAN | 5 ( 5.8%) | 6 ( 7.1%) | 6 ( 7.1%) | 1 | 1 | 2 |
F | AMERICAN INDIAN OR ALASKA NATIVE | 0 ( 0.0%) | 0 ( 0.0%) | 0 ( 0.0%) | 1 | 1 | 6 |
M | WHITE | 30 ( 34.9%) | 40 ( 47.6%) | 34 ( 40.5%) | 1 | 2 | 1 |
M | BLACK OR AFRICAN AMERICAN | 3 ( 3.5%) | 3 ( 3.6%) | 0 ( 0.0%) | 1 | 2 | 2 |
M | AMERICAN INDIAN OR ALASKA NATIVE | 0 ( 0.0%) | 1 ( 1.2%) | 0 ( 0.0%) | 1 | 2 | 6 |
RACEN already exists within ADSL. Using the `byvarn` method of sorting, `ord_layer_2` is built by simply using the RACEN variable values. This allows you to leverage common variable values already built into CDISC. If you’d like to work in a more R-like world, then converting your input variables to factors with an explicit level order works as well:
    # Create a factor for RACE with an explicit level order
    adsl$RACE <- factor(adsl$RACE, c('BLACK OR AFRICAN AMERICAN', 'AMERICAN INDIAN OR ALASKA NATIVE', 'WHITE'))

    t <- tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_count(RACE, by = SEX) %>%
          set_order_count_method('byfactor') # Specify to use RACE's factor order for the Tplyr order variable
      )

    build(t) %>%
      arrange(ord_layer_index, ord_layer_1, ord_layer_2) %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
F | BLACK OR AFRICAN AMERICAN | 5 ( 5.8%) | 6 ( 7.1%) | 6 ( 7.1%) | 1 | 1 | 1 |
F | AMERICAN INDIAN OR ALASKA NATIVE | 0 ( 0.0%) | 0 ( 0.0%) | 0 ( 0.0%) | 1 | 1 | 2 |
F | WHITE | 48 ( 55.8%) | 34 ( 40.5%) | 44 ( 52.4%) | 1 | 1 | 3 |
M | BLACK OR AFRICAN AMERICAN | 3 ( 3.5%) | 3 ( 3.6%) | 0 ( 0.0%) | 1 | 2 | 1 |
M | AMERICAN INDIAN OR ALASKA NATIVE | 0 ( 0.0%) | 1 ( 1.2%) | 0 ( 0.0%) | 1 | 2 | 2 |
M | WHITE | 30 ( 34.9%) | 40 ( 47.6%) | 34 ( 40.5%) | 1 | 2 | 3 |
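As a rough intuition for why the factor approach yields the values 1, 2, 3 above, the integer codes of a factor follow its level order. The base R sketch below shows only that relationship; whether Tplyr derives the order variable exactly this way is an assumption.

```r
# Values as they might appear in the data, in arbitrary order.
race <- c("WHITE", "BLACK OR AFRICAN AMERICAN", "AMERICAN INDIAN OR ALASKA NATIVE")

# Same level order as in the example above.
race_f <- factor(
  race,
  levels = c("BLACK OR AFRICAN AMERICAN", "AMERICAN INDIAN OR ALASKA NATIVE", "WHITE")
)

# The integer codes give a numeric sort key that follows the level order.
ord <- as.integer(race_f)
print(data.frame(race, ord))
```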
And finally, let’s jump to an AE table for the last example - sorting based on the calculated counts. Here, we need some more input from the user to understand what they want: the sorting method, the result column to take the counts from, and which numeric value to sort on.

Let’s take a look.
    t <- tplyr_table(adae, TRTA) %>%
      add_layer(
        group_count(AEDECOD) %>%
          set_distinct_by(USUBJID) %>%
          set_format_strings(f_str('xx (xx%) [x]', distinct, pct, n)) %>%
          set_order_count_method('bycount') %>% # Specify to use the resulting counts for sorting
          set_ordering_cols('Xanomeline High Dose') %>% # Use the counts from Xanomeline High Dose treatment group
          set_result_order_var(n) # Use the raw numeric value of the n counts
      )

    build(t) %>%
      head() %>%
      kable()
row_label1 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 |
---|---|---|---|---|---|
ACTINIC KERATOSIS | 0 ( 0%) [0] | 1 ( 1%) [1] | 0 ( 0%) [0] | 1 | 1 |
ALOPECIA | 1 ( 2%) [1] | 0 ( 0%) [0] | 0 ( 0%) [0] | 1 | 0 |
BLISTER | 0 ( 0%) [0] | 1 ( 2%) [2] | 5 ( 7%) [8] | 1 | 2 |
COLD SWEAT | 1 ( 6%) [3] | 0 ( 0%) [0] | 0 ( 0%) [0] | 1 | 0 |
DERMATITIS ATOPIC | 1 ( 2%) [1] | 0 ( 0%) [0] | 0 ( 0%) [0] | 1 | 0 |
DERMATITIS CONTACT | 0 ( 0%) [0] | 0 ( 0%) [0] | 1 ( 2%) [2] | 1 | 0 |
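Take a look at `ord_layer_1` and compare it to the values within `var1_Xanomeline High Dose` in the third position. Tplyr pulled out the raw numeric values, so you can use them to sort. Looking through each of the methods:

- `set_order_count_method`: specify the `bycount` method of sorting. This tells Tplyr that you’re looking for the raw numeric values.
- `set_ordering_cols`: specify the result column the counts are pulled from, here the Xanomeline High Dose treatment group. If you have a `cols` argument, simply provide multiple values, first by using the treatment group value desired, and then the `cols` arguments in order.
- `set_result_order_var`: specify which numeric value to sort on, here the raw `n` counts.

Let’s look at one last situation - multi-level count summaries.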
    t <- tplyr_table(adae, TRTA) %>%
      add_layer(
        group_count(vars(AEBODSYS, AEDECOD)) %>% # Now doing a multi-level summary
          set_distinct_by(USUBJID) %>%
          set_format_strings(f_str('xx (xx%) [x]', distinct, pct, n)) %>%
          set_order_count_method('bycount') %>% # Specify to use the resulting counts for sorting
          set_ordering_cols('Xanomeline High Dose') %>% # Use the counts from Xanomeline High Dose treatment group
          set_result_order_var(n) # Use the raw numeric value of the n counts
      )

    build(t) %>%
      head() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | SKIN AND SUBCUTANEOUS TISSUE DISORDERS | 21 (100%) [47] | 42 (100%) [111] | 42 (100%) [118] | 1 | 111 | Inf |
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | ACTINIC KERATOSIS | 0 ( 0%) [0] | 1 ( 1%) [1] | 0 ( 0%) [0] | 1 | 111 | 1 |
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | ALOPECIA | 1 ( 2%) [1] | 0 ( 0%) [0] | 0 ( 0%) [0] | 1 | 111 | 0 |
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | BLISTER | 0 ( 0%) [0] | 1 ( 2%) [2] | 5 ( 7%) [8] | 1 | 111 | 2 |
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | COLD SWEAT | 1 ( 6%) [3] | 0 ( 0%) [0] | 0 ( 0%) [0] | 1 | 111 | 0 |
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | DERMATITIS ATOPIC | 1 ( 2%) [1] | 0 ( 0%) [0] | 0 ( 0%) [0] | 1 | 111 | 0 |
Tplyr can handle these as well, but this is an area we’re going to improve in the next release. Currently, Tplyr defaults to using the VARN/factor/alphanumeric sorting method for the outer variable, AEBODSYS (just like the `by` variables: first searching for a `<VAR>N` variable, then a factor, and finally falling back to alphanumeric). For the inside variable, the specified sorting method is used. This currently covers many situations just fine, but we’re going to add the flexibility to allow count-based sorting for both the inside and outside variable.
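Because the order columns are plain numeric variables, the final sort is left to you. As a minimal sketch, assuming a built table with the `ord_layer_index` and `ord_layer_1` columns shown earlier (the data frame `built` below is typed in by hand for illustration, not produced by Tplyr), a descending sort on the counts in base R could look like this:

```r
# Hand-typed stand-in for a built table sorted 'bycount'.
built <- data.frame(
  row_label1      = c("ACTINIC KERATOSIS", "ALOPECIA", "BLISTER"),
  ord_layer_index = c(1, 1, 1),
  ord_layer_1     = c(1, 0, 2),
  stringsAsFactors = FALSE
)

# Keep layers in order, then put the highest counts first within each layer.
sorted <- built[order(built$ord_layer_index, -built$ord_layer_1), ]
print(sorted$row_label1)
```

The same result can be had with `arrange(ord_layer_index, desc(ord_layer_1))` if dplyr is loaded.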
Figuring out how we wanted to approach sorting was itself quite a large task, as we constantly try to balance flexibility with complexity of use. Implementing these order methods was an even larger task. Kudos to Eli for getting this as far as we have!
While we’re on the subject of multi-level counts, we’ve also added some enhanced capabilities for presenting these situations by offering nesting - as you can see in the summary above. The second variable is nested inside the first. You’re able to turn this setting off:
    t <- tplyr_table(adae, TRTA) %>%
      add_layer(
        group_count(vars(AEBODSYS, AEDECOD)) %>%
          set_nest_count(FALSE)
      )

    build(t) %>%
      head()
    #> # A tibble: 6 x 8
    #>   row_label1 row_label2 var1_Placebo `var1_Xanomelin… `var1_Xanomelin…
    #>   <chr>      <chr>      <chr>        <chr>            <chr>
    #> 1 SKIN AND … "SKIN AND… " 47 (100.0… "111 (100.0%)"   "118 (100.0%)"
    #> 2 SKIN AND … " ACTIN…   " 0 ( 0.0…   " 1 ( 0.9%)"     " 0 ( 0.0%)"
    #> 3 SKIN AND … " ALOPE…   " 1 ( 2.1…   " 0 ( 0.0%)"     " 0 ( 0.0%)"
    #> 4 SKIN AND … " BLIST…   " 0 ( 0.0…   " 2 ( 1.8%)"     " 8 ( 6.8%)"
    #> 5 SKIN AND … " COLD …   " 3 ( 6.4…   " 0 ( 0.0%)"     " 0 ( 0.0%)"
    #> 6 SKIN AND … " DERMA…   " 1 ( 2.1…   " 0 ( 0.0%)"     " 0 ( 0.0%)"
    #> # … with 3 more variables: ord_layer_index <int>, ord_layer_1 <int>,
    #> #   ord_layer_2 <dbl>
You can also change the character used to set the indentation, which defaults to `\t` for a tab.
    t <- tplyr_table(adae, TRTA) %>%
      add_layer(
        group_count(vars(AEBODSYS, AEDECOD))
      ) %>%
      set_indentation("--->")

    build(t) %>%
      head() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | SKIN AND SUBCUTANEOUS TISSUE DISORDERS | 47 (100.0%) | 111 (100.0%) | 118 (100.0%) | 1 | 1 | Inf |
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | --->ACTINIC KERATOSIS | 0 ( 0.0%) | 1 ( 0.9%) | 0 ( 0.0%) | 1 | 1 | 1 |
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | --->ALOPECIA | 1 ( 2.1%) | 0 ( 0.0%) | 0 ( 0.0%) | 1 | 1 | 2 |
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | --->BLISTER | 0 ( 0.0%) | 2 ( 1.8%) | 8 ( 6.8%) | 1 | 1 | 3 |
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | --->COLD SWEAT | 3 ( 6.4%) | 0 ( 0.0%) | 0 ( 0.0%) | 1 | 1 | 4 |
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | --->DERMATITIS ATOPIC | 1 ( 2.1%) | 0 ( 0.0%) | 0 ( 0.0%) | 1 | 1 | 5 |
Tplyr does not support, nor do we intend to support, a wide array of statistical methods. Our goal is rather to take your focus as an analyst off the mundane summaries so you can focus on the interesting analysis. That said, there are some things that are common enough that we feel it’s reasonable for us to include them. So let’s take a look at risk difference.
Our current implementation of risk difference is solely built on top of the base R function `prop.test`. For any and all questions about this method, please review the `prop.test` documentation within R.

Risk difference is built on top of count layers, as it’s a comparison of proportions. To add risk difference into a count layer, you simply use the function `add_risk_diff`. We made a large effort to make this flow very naturally with the count layer construction, so let’s walk through it step by step.
    t <- tplyr_table(adae, TRTA) %>%
      add_layer(
        group_count(AEDECOD) %>%
          set_distinct_by('USUBJID') %>%
          add_risk_diff(
            c('Xanomeline High Dose', 'Placebo'),
            c('Xanomeline Low Dose', 'Placebo')
          )
      )

    suppressWarnings(build(t)) %>%
      head() %>%
      kable()
row_label1 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | rdiff_Xanomeline High Dose_Placebo | rdiff_Xanomeline Low Dose_Placebo | ord_layer_1 |
---|---|---|---|---|---|---|---|
ACTINIC KERATOSIS | 0 ( 0.0%) | 1 ( Inf%) | 0 ( 0.0%) | 1 | | | 1 |
ALOPECIA | 1 (100.0%) | 0 ( 0.0%) | 0 ( 0.0%) | 1 | | | 2 |
BLISTER | 0 ( 0.0%) | 1 ( Inf%) | 1 ( Inf%) | 1 | | | 3 |
COLD SWEAT | 1 (100.0%) | 0 ( 0.0%) | 0 ( 0.0%) | 1 | | | 4 |
DERMATITIS ATOPIC | 1 (100.0%) | 0 ( 0.0%) | 0 ( 0.0%) | 1 | | | 5 |
DERMATITIS CONTACT | 0 ( 0.0%) | 0 ( 0.0%) | 1 ( Inf%) | 1 | | | 6 |
Comparisons are specified with two-element character vectors. These are simply your comparison group - the first element, and your reference group - the second. This coincides with how you might see risk difference specified in the header of your mock, where you’ll see something like T1-Placebo. You can provide as many comparisons as you want - the values specified in the comparison just need to be valid treatment groups within your data. This works with any treatment group built using `add_treat_group` or `add_total_group` as well.
The risk difference calculations are displayed in the `rdiff` columns. There will be an `rdiff` column for every comparison that is made, following the convention `rdiff_<comparison>_<reference>`.
Note the use of `suppressWarnings` - if the counts used in `prop.test` are too low, you’ll get a warning that says “Chi-squared approximation may be incorrect” every time `prop.test` is run with counts that are too low. This could happen a lot, but the warning is perfectly valid.
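For readers less familiar with `prop.test`, here is a minimal base R sketch of a single comparison. The counts are made up, and the variable names `comp`, `ref`, `dif`, `low`, and `high` are chosen for illustration only; how Tplyr calls `prop.test` internally is not shown here.

```r
# Hypothetical counts: subjects with the event, and group sizes.
events <- c(comparison = 10, reference = 5)
totals <- c(comparison = 84, reference = 86)

# prop.test() may warn when counts are small; suppressWarnings() mirrors the note above.
pt <- suppressWarnings(prop.test(x = events, n = totals))

comp <- unname(pt$estimate[1])  # proportion in the comparison group
ref  <- unname(pt$estimate[2])  # proportion in the reference group
dif  <- comp - ref              # risk difference
low  <- pt$conf.int[1]          # lower confidence bound for the difference
high <- pt$conf.int[2]          # upper confidence bound for the difference

print(round(c(comp = comp, ref = ref, dif = dif, low = low, high = high), 3))
```

Other `prop.test` arguments such as `conf.level` or `correct` can be supplied in the same call.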
The default values that are displayed will be:
You have a good bit of control over these values though, and they can be controlled in the same way you format the count summaries - using `set_format_strings`.
    t <- tplyr_table(adae, TRTA) %>%
      add_layer(
        group_count(AEDECOD) %>%
          set_distinct_by('USUBJID') %>%
          add_risk_diff(
            c('Xanomeline High Dose', 'Placebo'),
            c('Xanomeline Low Dose', 'Placebo')
          ) %>%
          set_format_strings(
            'n_counts' = f_str('xx (xx.x) [x]', distinct, distinct_pct, n),
            'riskdiff' = f_str('xx.xxx, xx.xxx, xx.xxx, xx.xxx, xx.xxx', comp, ref, dif, low, high)
          )
      )

    suppressWarnings(build(t)) %>%
      head() %>%
      kable()
row_label1 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | rdiff_Xanomeline High Dose_Placebo | rdiff_Xanomeline Low Dose_Placebo | ord_layer_1 |
---|---|---|---|---|---|---|---|
ACTINIC KERATOSIS | 0 ( 0.0) [0] | 1 ( Inf) [1] | 0 ( 0.0) [0] | 1 | | | 1 |
ALOPECIA | 1 (100.0) [1] | 0 ( 0.0) [0] | 0 ( 0.0) [0] | 1 | | | 2 |
BLISTER | 0 ( 0.0) [0] | 1 ( Inf) [2] | 1 ( Inf) [8] | 1 | | | 3 |
COLD SWEAT | 1 (100.0) [3] | 0 ( 0.0) [0] | 0 ( 0.0) [0] | 1 | | | 4 |
DERMATITIS ATOPIC | 1 (100.0) [1] | 0 ( 0.0) [0] | 0 ( 0.0) [0] | 1 | | | 5 |
DERMATITIS CONTACT | 0 ( 0.0) [0] | 0 ( 0.0) [0] | 1 ( Inf) [2] | 1 | | | 6 |
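Take a look at the `rdiff` columns now - you’ll see they have 5 values. These are:

- `comp`: the comparison proportion (i.e. the first estimate from the `prop.test` object)
- `ref`: the reference proportion (i.e. the second estimate from the `prop.test` object)
- `dif`: the difference between the two proportions
- `low`: the lower bound of the confidence interval
- `high`: the upper bound of the confidence interval

You have the same control over the formatting of the display of these values here as you do with the count summaries. Taking things a step further, you can also pass forward arguments to `prop.test` using a named list and the `args` argument in `add_risk_diff`. This wasn’t done using the ellipsis (i.e. `...`) like typical R functions because it’s already used to capture a varying number of comparisons, but it’s not much more difficult to use: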
    t <- tplyr_table(adae, TRTA) %>%
      add_layer(
        group_count(AEDECOD) %>%
          set_distinct_by('USUBJID') %>%
          add_risk_diff(
            c('Xanomeline High Dose', 'Placebo'),
            c('Xanomeline Low Dose', 'Placebo'),
            args = list(conf.level = 0.95, alternative = 'less', correct = FALSE)
          ) %>%
          set_format_strings(
            'n_counts' = f_str('xx (xx.x) [x]', distinct, distinct_pct, n),
            'riskdiff' = f_str('xx.xxx, xx.xxx, xx.xxx, xx.xxx, xx.xxx', comp, ref, dif, low, high)
          )
      )

    suppressWarnings(build(t)) %>%
      head() %>%
      kable()
row_label1 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | rdiff_Xanomeline High Dose_Placebo | rdiff_Xanomeline Low Dose_Placebo | ord_layer_1 |
---|---|---|---|---|---|---|---|
ACTINIC KERATOSIS | 0 ( 0.0) [0] | 1 ( Inf) [1] | 0 ( 0.0) [0] | 1 | | | 1 |
ALOPECIA | 1 (100.0) [1] | 0 ( 0.0) [0] | 0 ( 0.0) [0] | 1 | | | 2 |
BLISTER | 0 ( 0.0) [0] | 1 ( Inf) [2] | 1 ( Inf) [8] | 1 | | | 3 |
COLD SWEAT | 1 (100.0) [3] | 0 ( 0.0) [0] | 0 ( 0.0) [0] | 1 | | | 4 |
DERMATITIS ATOPIC | 1 (100.0) [1] | 0 ( 0.0) [0] | 0 ( 0.0) [0] | 1 | | | 5 |
DERMATITIS CONTACT | 0 ( 0.0) [0] | 0 ( 0.0) [0] | 1 ( Inf) [2] | 1 | | | 6 |
One more note - by default, `add_risk_diff` works on the distinct counts available within the count summary. If for whatever reason you’d like to run risk difference on the non-distinct counts, switch the `distinct` argument to `FALSE`. `add_risk_diff` also functions on multi-level summaries no differently than on single-level ones, so no concerns there either.
You can see that there was a lot of attention given to counting over the past three weeks - but descriptive statistics weren’t neglected either. Let’s start simple - custom summaries now work properly on multi-variable summaries:
    tplyr_table(adsl, TRT01P) %>%
      add_layer(
        group_desc(vars(AGE, HEIGHTBL), by = "Sepal Length") %>%
          set_custom_summaries(
            geometric_mean = exp(sum(log(.var[.var > 0]), na.rm = TRUE) / length(.var))
          ) %>%
          set_format_strings(
            'Geometric Mean (SD)' = f_str('xx.xx (xx.xxx)', geometric_mean, sd)
          )
      ) %>%
      build() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | var2_Placebo | var2_Xanomeline High Dose | var2_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|---|---|---|
Sepal Length | Geometric Mean (SD) | 74.70 ( 8.590) | 73.94 ( 7.886) | 75.18 ( 8.286) | 162.17 (11.522) | 165.51 (10.131) | 163.11 (10.419) | 1 | 1 | 1 |
Not much more to it! Just use `.var` instead of a specific variable name.
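To see what the custom summary expression evaluates to outside of Tplyr, the same formula can be written as a plain base R function, with the argument `x` standing in for `.var`. The function name and the test values below are illustrative only.

```r
# Same expression as the custom summary above, with x in place of .var.
geometric_mean <- function(x) {
  exp(sum(log(x[x > 0]), na.rm = TRUE) / length(x))
}

gm <- geometric_mean(c(1, 10, 100))
print(gm)
```

Note that this expression divides by the full length of the input, so non-positive or missing values are dropped from the sum but still count toward the denominator.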
The more interesting development for descriptive statistics was the addition of auto-precision. Auto-precision allows you to format your numeric summaries based on the precision of the data collected. Particularly when working with lab results, different tests may have different requirements for decimal precision depending on the numeric range of the tests, the units the data are collected in, etc. So it is common practice to vary the precision of the data being presented based on the data collected. Furthermore, depending on the summary being presented, you may wish to increase the precision further. For example, you may want the mean to be at collected precision +1 decimal place, and the standard deviation at +2.
Tplyr now handles these cases, and in Tplyr style, it’s intuitive and easy to control. This has all been built into the format strings, because a natural place to specify your desired format is where you specify how you want your data presented. Now - if you wish to use auto-precision, use `a` instead of `x` when creating your summaries. Note that only one `a` is needed. To use increased precision, use `a+n` where `n` is the number of additional spaces you wish to add.
    tplyr_table(adlb, TRTA) %>%
      add_layer(
        group_desc(AVAL, by = PARAMCD) %>%
          set_format_strings(
            'Mean (SD)' = f_str('a.a+1 (a.a+2)', mean, sd)
          )
      ) %>%
      build() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
BUN | Mean (SD) | 5.3058 ( 2.05463) | 4.6070 ( 1.30148) | 6.7320 ( 2.94018) | 1 | 1 | 1 |
CA | Mean (SD) | 2.180413 (0.0692494) | 2.204333 (0.1372011) | 2.161054 (0.0830867) | 1 | 2 | 1 |
CK | Mean (SD) | 175.8 ( 288.41) | 108.2 ( 93.99) | 83.1 ( 77.91) | 1 | 3 | 1 |
GGT | Mean (SD) | 34.5 ( 34.77) | 36.0 ( 48.69) | 34.9 ( 26.99) | 1 | 4 | 1 |
URATE | Mean (SD) | 260.6499 ( 83.69662) | 289.2606 ( 88.16093) | 253.0604 ( 87.00646) | 1 | 5 | 1 |
As you can see, the decimal precision is now varying depending on the test being performed. Notice that both the integer and the decimal side of each number fluctuate as well. Tplyr collects both the integer and decimal precision, and you can specify both separately. For example, you could use `x`’s to specify a default number of spaces for your integers that are used consistently across by variables, but vary the decimal precision based on collected data. You can also increment the number of spaces for both integer and decimal separately.
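One way to picture "collected precision" is as the widest integer part and the longest decimal part found in the data. The base R sketch below uses a hypothetical helper, `collected_precision`, to illustrate that idea; it is an assumption about the concept and not Tplyr's actual algorithm.

```r
# Hypothetical helper: estimate the integer and decimal widths present in the data.
collected_precision <- function(x) {
  x <- x[!is.na(x)]
  # Format each value on its own so trailing zeros are not padded across the vector.
  chr <- vapply(
    abs(x),
    function(v) format(v, digits = 15, scientific = FALSE),
    character(1)
  )
  int_part <- sub("\\..*$", "", chr)      # text before the decimal point
  dec_part <- sub("^[^.]*\\.?", "", chr)  # text after the decimal point
  c(int = max(nchar(int_part)), dec = max(nchar(dec_part)))
}

values <- c(5.3, 12.25, 100)
prec   <- collected_precision(values)
print(prec)

# A mean shown at collected decimal precision + 1, in the spirit of 'a.a+1'.
mean_txt <- formatC(mean(values), format = "f", digits = prec[["dec"]] + 1)
print(mean_txt)
```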
But - this is kind of ugly, isn’t it? Do we really need all 5 decimal places collected for CA? For this reason, you’re able to set a cap on the precision that’s displayed:
    tplyr_table(adlb, TRTA) %>%
      add_layer(
        group_desc(AVAL, by = PARAMCD) %>%
          set_format_strings(
            'Mean (SD)' = f_str('a.a+1 (a.a+2)', mean, sd),
            cap = c(int = 3, dec = 2)
          )
      ) %>%
      build() %>%
      head() %>%
      kable()
row_label1 | row_label2 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 |
---|---|---|---|---|---|---|---|
BUN | Mean (SD) | 5.306 ( 2.0546) | 4.607 ( 1.3015) | 6.732 ( 2.9402) | 1 | 1 | 1 |
CA | Mean (SD) | 2.180 (0.0692) | 2.204 (0.1372) | 2.161 (0.0831) | 1 | 2 | 1 |
CK | Mean (SD) | 175.8 (288.41) | 108.2 ( 93.99) | 83.1 ( 77.91) | 1 | 3 | 1 |
GGT | Mean (SD) | 34.5 ( 34.77) | 36.0 ( 48.69) | 34.9 ( 26.99) | 1 | 4 | 1 |
URATE | Mean (SD) | 260.650 ( 83.6966) | 289.261 ( 88.1609) | 253.060 ( 87.0065) | 1 | 5 | 1 |
Now that looks better. The `cap` argument is part of `set_format_strings`. You need to specify the integer and decimal caps separately. Note that integer precision might not behave like you expect - it doesn’t make sense to truncate an integer if its value is too high, so if the integer exceeds the allotted space, then the length of the string will increase and the full value will be displayed. But values that are short enough will only pad to the capped number of spaces. We plan to implement a warning in future releases if integers exceed the set display space allocation.
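The capping behaviour described above can be sketched in base R. The helper `format_stat` and the precision values are hypothetical; the sketch only shows that the cap limits the collected precision before any increment is applied, and that a padded width never truncates a large integer.

```r
collected <- c(int = 1, dec = 5)  # e.g. a test collected to five decimal places
cap       <- c(int = 3, dec = 2)

# The cap limits the collected precision element-wise.
capped <- pmin(collected, cap)

# Hypothetical formatter: pad to the requested width, never truncate.
format_stat <- function(value, int, dec) {
  width <- int + if (dec > 0) dec + 1 else 0
  formatC(value, format = "f", digits = dec, width = width)
}

# Mean at capped decimal precision + 1.
mean_capped <- format_stat(2.180413, capped[["int"]], capped[["dec"]] + 1)

# An integer part wider than the allotted space is still shown in full.
too_wide <- format_stat(12345.678, int = 3, dec = 1)

print(c(mean_capped, too_wide))
```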
This was a basic situation, but if you’re paying close attention, you may have some questions. What if you have more by variables, like by visit AND test? Do we then calculate precision by visit and test? What if collected precision is different per visit and we don’t want that? What about multiple summary variables? How do we determine precision then? We have modifier functions for this:
    tplyr_table(adlb, TRTA, where = SAFFL == 'Y' & AVISIT != '') %>%
      add_layer(
        group_desc(vars(AVAL, CHG, BASE), by = vars(AVISIT, PARAMCD)) %>%
          set_format_strings(
            'Mean (SD)' = f_str('a.a+1 (a.a+2)', mean, sd),
            cap = c(int = 3, dec = 2)
          ) %>%
          set_precision_on(AVAL) %>%
          set_precision_by(PARAMCD)
      ) %>%
      build() %>%
      head() %>%
      kable()
row_label1 | row_label2 | row_label3 | var1_Placebo | var1_Xanomeline High Dose | var1_Xanomeline Low Dose | var2_Placebo | var2_Xanomeline High Dose | var2_Xanomeline Low Dose | var3_Placebo | var3_Xanomeline High Dose | var3_Xanomeline Low Dose | ord_layer_index | ord_layer_1 | ord_layer_2 | ord_layer_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Week 12 | BUN | Mean (SD) | 4.845 ( 1.7833) | 4.106 ( 1.2106) | 6.640 ( 2.5319) | -0.714 ( 0.7432) | -1.606 ( 1.4763) | -0.357 ( 1.7731) | 5.559 ( 1.5266) | 5.712 ( 2.3464) | 6.515 ( 1.2150) | 1 | 12 | 1 | 1 |
Week 12 | CA | Mean (SD) | 2.191 (0.0747) | 2.181 (0.0626) | 2.187 (0.0929) | -0.083 (0.0968) | -0.114 (0.0997) | -0.150 (0.0651) | 2.275 (0.1276) | 2.295 (0.1019) | 2.337 (0.0700) | 1 | 12 | 2 | 1 |
Week 12 | CK | Mean (SD) | 140.0 (148.57) | 140.1 (115.50) | 120.0 (104.57) | 48.1 (137.19) | 24.9 ( 97.08) | 25.4 ( 66.39) | 91.9 ( 48.64) | 115.2 ( 98.26) | 94.6 ( 49.49) | 1 | 12 | 3 | 1 |
Week 12 | GGT | Mean (SD) | 29.2 ( 20.69) | 39.5 ( 55.69) | 22.2 ( 14.96) | 0.0 ( 8.12) | 2.3 ( 15.15) | -0.2 ( 9.88) | 29.2 ( 13.50) | 37.2 ( 40.78) | 22.4 ( 6.02) | 1 | 12 | 4 | 1 |
Week 12 | URATE | Mean (SD) | 298.887 (105.3868) | 304.835 ( 85.1772) | 218.589 ( 41.6006) | -5.205 ( 36.4499) | -5.204 ( 44.2188) | -31.227 ( 27.2030) | 304.091 ( 80.8351) | 310.039 ( 62.2564) | 249.816 ( 65.6978) | 1 | 12 | 5 | 1 |
Week 24 | BUN | Mean (SD) | 6.854 ( 2.9482) | 4.284 ( 1.2872) | 6.604 ( 3.5160) | 0.500 ( 2.9178) | -0.119 ( 0.5453) | 0.238 ( 1.4428) | 6.355 ( 1.6243) | 4.403 ( 1.6098) | 7.378 ( 2.6795) | 1 | 24 | 1 | 2 |
Three variables are being summarized here - AVAL, CHG, and BASE. So which should be used for precision? `set_precision_on` allows you to specify this, where the `precision_on` variable must be one of the variables within `target_var`. Similarly, `set_precision_by` changes the `by` variables used to determine collected precision. If no `precision_on` variable is specified, the first variable in `target_var` is used. If no `precision_by` variables are specified, then the default `by` variables are used.
Our next release of Tplyr will be our first full release. This means that:
The next wave of updates is mostly going to be tweaks and sugar, where we will try to make sure that Tplyr is as user-friendly and practical as possible. In the meantime, if you have any comments or feedback, or find some bugs - drop us an issue. We’d love to hear any feedback if you’ve taken Tplyr for a test drive.