This vignette will demonstrate pmxhelpr
functions for
exploratory data analysis.
First, we will load the required packages.
options(scipen = 999, rmarkdown.html_vignette.check_title = FALSE)
library(pmxhelpr)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2, warn.conflicts = FALSE)
library(Hmisc, warn.conflicts = FALSE)
library(patchwork, warn.conflicts = FALSE)
For this vignette, we will peform exploratory data analysis on the
data_sad
dataset internal to pmxhelpr
. We can
take a quick look at the dataset using glimpse()
from the
dplyr package.
glimpse(data_sad)
#> Rows: 720
#> Columns: 23
#> $ LINE <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
#> $ ID <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
#> $ TIME <dbl> 0.00, 0.00, 0.48, 0.81, 1.49, 2.11, 3.05, 4.14, 5.14, 7.81, 12…
#> $ NTIME <dbl> 0.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 8.0, 12.0, 16.0, …
#> $ NDAY <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 1,…
#> $ DOSE <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
#> $ AMT <dbl> NA, 10, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ EVID <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ ODV <dbl> NA, NA, NA, 2.02, 4.02, 3.50, 7.18, 9.31, 12.46, 13.43, 12.11,…
#> $ LDV <dbl> NA, NA, NA, 0.7031, 1.3913, 1.2528, 1.9713, 2.2311, 2.5225, 2.…
#> $ CMT <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
#> $ MDV <dbl> 1, NA, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1…
#> $ BLQ <dbl> -1, NA, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, …
#> $ LLOQ <dbl> 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ FOOD <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ SEXF <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ RACE <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,…
#> $ AGEBL <int> 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25…
#> $ WTBL <dbl> 82.1, 82.1, 82.1, 82.1, 82.1, 82.1, 82.1, 82.1, 82.1, 82.1, 82…
#> $ SCRBL <dbl> 0.87, 0.87, 0.87, 0.87, 0.87, 0.87, 0.87, 0.87, 0.87, 0.87, 0.…
#> $ CRCLBL <dbl> 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 12…
#> $ USUBJID <chr> "STUDYNUM-SITENUM-1", "STUDYNUM-SITENUM-1", "STUDYNUM-SITENUM-…
#> $ PART <chr> "Part 1-SAD", "Part 1-SAD", "Part 1-SAD", "Part 1-SAD", "Part …
We can see that this dataset is already formatted for modeling. It contains NONMEM reserved variables (e.g., ID, TIME, AMT, EVID, MDV), as well as, dependent variables of drug concentration in original units (ODV) and natural logarithm transformed units (LDV). In addition to the numeric variables, there are two character variables: USUBJID and PART.
PART specifies the two study cohorts:
unique(data_sad$PART)
#> [1] "Part 1-SAD" "Part 2-FE"
This dataset also contains an exact binning variable:
This variable represents the nominal time of sample collection relative to first dose per study protocol whereas Actual Time (TIME) represents the actual time the sample was collected.
##Unique values of NTIME
ntimes <- unique(data_sad$NTIME)
ntimes
#> [1] 0.0 0.5 1.0 1.5 2.0 3.0 4.0 5.0 8.0 12.0 16.0 24.0
#> [13] 36.0 48.0 72.0 96.0 120.0 144.0 168.0
##Comparison of number of unique values of NTIME and TIME
length(unique(data_sad$NTIME))
#> [1] 19
length(unique(data_sad$TIME))
#> [1] 449
plot_dvtime
Let’s visualize the data. Let’s visualize the data. For this
visualization, we will leverage the functionality of
plot_dvtime
to visualize our data. First, we will filter to
observation records only and derive some factor variables, which can be
passed to the color aesthetic.
plot_data <- data_sad %>%
filter(EVID == 0) %>%
mutate(`Dose (mg)` = factor(DOSE, levels = c(10, 50, 100, 200, 400)),
`Food Status` = factor(FOOD, levels = c(0, 1), labels = c("Fasted", "Fed")))
Now let’s visualize the concentration-time data.
pmxhelpr
includes a function for common visualizations of
observed concentration-time data in exploratory data analysis:
plot_dvtime
In our visualizations, we will leverage the following dataset variables:
ODV
: the original dependent variable (drug
concentration) in untransformed units (ng/mL)TIME
: actual time since first dose (hours)NTIME
: nominal time since first dose (hours)LLOQ
: lower limit of quantification for drug
concentrationplot_dvtime
requires a dependent variable (specified as
named vector using dv_var
argument) and time variable
(specifed as named vector using the time_vars
). The
dependent variable in data_sad
is named "ODV"
,
so we must specify the name using dv_var
. The default names
for the time_vars
are "TIME"
and
"NTIME"
. The color aesthetic is specified using the
col_var
argument. The cent
argument specifies
which central tendency measure is plotted.
plot_dvtime(data = plot_data, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "mean",
ylab = "Concentration (ng/mL)")
Not a bad plot with minimal arguments! We can see the mean for each
dose as a colored thick line and observed data points as colored open
circles with some alpha added. A caption also prints by default
indicating what the plot elements depict. The caption can be removed by
specifying show_caption = FALSE
.
plot_dvtime(data = plot_data, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "mean",
ylab = "Concentration (ng/mL)", show_caption = FALSE)
plot_dvtime
includes uses a helper function
(breaks_time
) to automatically determine x-axis breaks
based on the units of the time variable! Two arguments in
plot_dvtime
are passed to breaks_time
:
timeu
character string specifying time units.
Options include:
n_breaks
number breaks requested from the algorithm.
Default = 10.
Let’s pass the vector of nominal times we defined earlier into the
breaks_time
function and see what we get!
breaks_time(ntimes, unit = "hours", n = 5)
#> [1] 0 48 96 144
breaks_time(ntimes, unit = "hours", n = 10)
#> [1] 0 24 48 72 96 120 144 168
breaks_time(ntimes, unit = "hours", n = length(ntimes))
#> [1] 0.0 9.6 19.2 28.8 38.4 48.0 57.6 67.2 76.8 86.4 96.0 105.6
#> [13] 115.2 124.8 134.4 144.0 153.6 163.2
We can see that the default of n = 10 gives an optimal number of
breaks in this case whereas reducing the number of breaks gives a less
optimal distribution of values. Requesting breaks equal to the length of
the vector of unique NTIMES
will generally produce too many
breaks. The default axes breaks behavior can always be overwritten by
specifying the axis breaks manually using
scale_x_continuous()
.
The default n_breaks = 10
is a good value for our
dataset, and data_sad
uses the default time units
("hours"
); therefore, explicit specification of the
n_breaks
and timeu
arguments are not
required.
plot_dvtime(data = plot_data, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "mean",
ylab = "Concentration (ng/mL)")
However, perhaps someone on the team would prefer the x-axis breaks
in units of days
. The x-axis breaks will transform to the
new units automatically as long as we specify the new time unit with
timeu = "days"
.
plot_data_days <- plot_data %>%
mutate(TIME = TIME/24,
NTIME = NTIME/24)
plot_dvtime(data = plot_data_days, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "mean",
ylab = "Concentration (ng/mL)", timeu = "days")
Nice! However, someone else on the team would prefer to see the first
24 hours of treatment in greater detail to visualize the absorption
phase. If we can either truncate the x-axis range using
scale_x_continuous()
, or filter the input data and allow
the x-axis breaks to adjust automatically with the new time range in the
data!
plot_data_24 <- plot_data %>%
filter(NTIME <= 24)
plot_dvtime(data = plot_data_24, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "mean",
ylab = "Concentration (ng/mL)")
Much better! However, these data are probably best visualized on a
log-scale y-axis to better visualize the terminal phase profile.
plot_dvtime
includes an argument log_y
which
performs this operation with one added benefit over adding in a layer
with scale_y_log10
when using automatically generated plot
captions with show_captions = TRUE
.
plot_dvtime
uses the stat_summary
function
from ggplot2
to calculate and plot the central tendency
measures and error bars. An often overlooked feature of
stat_summary
, is that it calculates the summary statistics
after any transformations to the data performed by changing the
scales. This means that when scale_y_log10()
is applied to
the plot, the data are log-transformed for plotting and the central
tendency measure returned when requesting "mean"
from
stat_summary
is the geometric mean. If the
log_y
argument is used to generate semi-log plots along
with show_captions = TRUE
, then the caption will clearly
delineate where arithmetic and gemoetric means are being returned.
plot_dvtime(data = plot_data, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "mean",
ylab = "Concentration (ng/mL)", log_y = TRUE)
But wait…this plot is potentially misleading! The food effect portion of the study is being pooled together with the fasted data within the 100 mg dose!
Luckily, plot_dvtime
returns a ggplot
object which we can modify like any other ggplot
!
Therefore, we can facet by PART by simply adding in another layer to our
ggplot
object.
plot_dvtime(data = plot_data, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "mean",
ylab = "Concentration (ng/mL)", log_y = TRUE) +
facet_wrap(~PART)
The clinical team would like a simpler plot that clearly displays the
central tendency. We can use the argument cent = "mean_sdl"
to plot the mean with error bars and remove the observed points by
specifying obs_dv = FALSE
.
plot_dvtime(data = plot_data, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "mean_sdl",
ylab = "Concentration (ng/mL)", log_y = TRUE,
obs_dv = FALSE) +
facet_wrap(~PART)
We could also plot these data as median + interquartile range (IQR) using the same approach, if we do not feel the sample size is sufficient for parametric summary statistics.
plot_dvtime(data = plot_data, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "median_iqr",
ylab = "Concentration (ng/mL)", log_y = TRUE,
obs_dv = FALSE) +
facet_wrap(~PART)
Hmm…there is some noise at the late terminal phase. This is likely artifact introduced by censoring of data at the assay LLOQ; however, let’s confirm there are no weird individual subject profiles by connecting observed data points longitudinally within a subject - in other words, make a spaghetti plot.
We will change the central tendency measure to the median and add the
spaghetti lines. Data points within an individual value of
grp_var
will be connected by a narrow line when
grp_dv = TRUE
. The default is
grp_var = "ID"
.
plot_dvtime(data = plot_data, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "median",
ylab = "Concentration (ng/mL)", log_y = TRUE,
grp_dv = TRUE) +
facet_wrap(~PART)
It does not seem like there are outlier individuals driving the noise in the late terminal phase; therefore, this is almost certainly artifact introduced by data missing due to assay sensitivity and censoring at the lower limit of quantification (LLOQ).
Let’s use imputation to assess the potential impact of the data
missing due to assay sensitivity. plot_dvtime
includes some
functionality to do this imputation for us using the loq
and loq_method
arguments.
The loq_method
argument species how BLQ imputation
should be performed. Options are:
0
: No handling. Plot input dataset DV
vs
TIME
as is. (default)1
: Impute all BLQ data at TIME
<= 0 to
0 and all BLQ data at TIME
> 0 to 1/2 x
loq
. Useful for plotting concentration-time data with some
data BLQ on the linear scale2
: Impute all BLQ data at TIME
<= 0 to
1/2 x loq
and all BLQ data at TIME
> 0 to
1/2 x loq
.The loq
argument species the value of the LLOQ. The
loq
argument must be specified when loq_method
is 1
or 2
, but can be NULL
if the variable LLOQ
is present in the dataset. In
our case, LLOQ
is a variable in plot_data
, so
we do not need to specify the loq
argument (default is
loq = NULL
).
plot_dvtime(plot_data, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "mean",
ylab = "Concentration (ng/mL)", log_y = TRUE,
loq_method = 2) +
facet_wrap(~PART)
The same plot is obtained by specifying loq = 1
plot_dvtime(plot_data, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "mean",
ylab = "Concentration (ng/mL)", log_y = TRUE,
loq_method = 2, loq = 1) +
facet_wrap(~PART)
A reference line is drawn to denote the LLOQ and all observations
with EVID=0
and MDV=1
are imputated as LLOQ/2.
Imputing post-dose concentrations below the lower limit of
quantification as 1/2 x LLOQ normalizes the late terminal phase of the
concentration-time profile. This is confirmatory evidence for our
hypothesis that the noise in the late terminal phase is due to censoring
of observations below the LLOQ.
We can also generate dose-normalized concentration-time plots by
specifying dosenorm = TRUE
.
plot_dvtime(plot_data, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "mean",
ylab = "Dose-normalized Conc. (ng/mL per mg Drug)", log_y = TRUE,
dosenorm = TRUE) +
facet_wrap(~PART)
When dosenorm = TRUE
, the variable DOSE
needs to be present in the input dataset data
. If
DOSE
is not present in data
, the function will
return an Error with an informative error message.
plot_dvtime(select(plot_data, -DOSE),
dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "mean",
ylab = "Dose-normalized Conc. (ng/mL per mg Drug)", log_y = TRUE,
dosenorm = TRUE) +
facet_wrap(~PART)
#> Error in `check_varsindf()`:
#> ! argument `"DOSE"` must be variables in `data`
Dose-normalization is performed AFTER BLQ imputation in the case in which both options are requested. The reference line for the LLOQ will not be plotted when dose-normalized concentration is the dependent variable.
plot_dvtime(plot_data, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "mean",
ylab = "Dose-normalized Conc. (ng/mL per mg Drug)", log_y = TRUE,
loq_method = 2, dosenorm = TRUE) +
facet_wrap(~PART)
Only a single variable can be passed to the col_var
argument of plot_dvtime
. Suppose we want to look at the
interaction between two variables in the color aesthetic. This can be
accomplished in ggplot2
using the interaction
function, which computes an unordered factor representing the
interaction between the two variables. Let’s visualize the interaction
between the factor versions of the variables DOSE
and
FOOD
.
ggplot(plot_data, aes(x = TIME, y = ODV, col = interaction(`Dose (mg)`, `Food Status`))) +
geom_point()+
stat_summary(data = plot_data, aes(x = NTIME, y = ODV, col = interaction(`Dose (mg)`, `Food Status`)),
fun.y = "mean", geom = "line") +
scale_x_continuous(breaks = seq(0,168,24)) +
scale_y_log10()+
theme_bw() +
labs(y = "Concentration (ng/mL)", x = "Time (hours)")
The functionality of interaction()
cannot be used within
plot_dvtime
; however, we can reproduce it by formally
creating a variable for the interaction we want to visualize. This also
affords us the opportunity to define the factor labels, levels, and
order formally, which will affect how the interaction is displayed on
the plot.
plot_data_int <- plot_data %>%
mutate(`Dose (mg) and Food` = ifelse(FOOD == 0, paste(DOSE, "mg", "Fasted"), paste(DOSE, "mg", "Fed")),
`Dose (mg) and Food` = factor(`Dose (mg) and Food`, levels = c("10 mg Fasted",
"50 mg Fasted",
"100 mg Fasted",
"100 mg Fed",
"200 mg Fasted",
"400 mg Fasted")))
plot_dvtime(plot_data_int, dv_var = c(DV = "ODV"), col_var = "Dose (mg) and Food", cent = "mean",
ylab = "Concentration (ng/mL)", log_y = TRUE,
loq_method = 2)
This looks pretty nice! The legend is formatted cleanly and the
colors are assigned to each unique condition of the interaction.
However, we can actually take this one step further, and define our
interaction variable as an ordered factor, which results
ggplot2
applying the viridis color scale from the
viridisLite
package.
plot_data_int_ordered <- plot_data %>%
mutate(`Dose (mg) and Food` = ifelse(FOOD == 0, paste(DOSE, "mg", "Fasted"), paste(DOSE, "mg", "Fed")),
`Dose (mg) and Food` = factor(`Dose (mg) and Food`, levels = c("10 mg Fasted",
"50 mg Fasted",
"100 mg Fasted",
"100 mg Fed",
"200 mg Fasted",
"400 mg Fasted"),
ordered = TRUE))
plot_dvtime(plot_data_int_ordered, dv_var = c(DV = "ODV"), col_var = "Dose (mg) and Food", cent = "mean",
ylab = "Concentration (ng/mL)", log_y = TRUE,
loq_method = 2)
The same approach can be used to define an interaction variable to be
assigned to the group aesthetic using the grp_var
argument
to plot_dvtime
. Such an approach may be used if we wanted
to visualize the data for a cross-over study condition separately for
each period within an individual. In this case, the default
grp_var = "ID"
would connect all data points within an
individual across both periods whereas we want to visualize points
connected within the individual "ID"
separately by
cross-over period.
To explore this, we will modify data_sad
such that the
same subjects are included in "PART1-SAD"
and
"PART2-FE
(e.g., modify from a parallel group to a
crossover design).
plot_data_crossover <- plot_data %>%
mutate(ID = ifelse(FOOD == 1, ID - 6, ID))
plot_data_crossover %>%
select(ID, DOSE, FOOD) %>%
distinct() %>%
group_by(ID) %>%
filter(max(FOOD) == 1) %>%
arrange(ID, FOOD)
#> # A tibble: 12 × 3
#> # Groups: ID [6]
#> ID DOSE FOOD
#> <dbl> <dbl> <dbl>
#> 1 13 100 0
#> 2 13 100 1
#> 3 14 100 0
#> 4 14 100 1
#> 5 15 100 0
#> 6 15 100 1
#> 7 16 100 0
#> 8 16 100 1
#> 9 17 100 0
#> 10 17 100 1
#> 11 18 100 0
#> 12 18 100 1
Now we have a dataset with a cross-over design for the Food Effect
Part. We can define a factor (or food) ID variable that is the
interaction between "ID"
and "FOOD"
. Now when
we visualize the data, the data points will be connected within the
group defined by both variables.
plot_data_crossover_fid <- plot_data_crossover %>%
mutate(FID = interaction(ID, FOOD),
`Dose (mg) and Food` = ifelse(FOOD == 0, paste(DOSE, "mg", "Fasted"), paste(DOSE, "mg", "Fed")),
`Dose (mg) and Food` = factor(`Dose (mg) and Food`, levels = c("10 mg Fasted",
"50 mg Fasted",
"100 mg Fasted",
"100 mg Fed",
"200 mg Fasted",
"400 mg Fasted"),
ordered = TRUE))
plot_dvtime(plot_data_crossover_fid, dv_var = c(DV = "ODV"), col_var = "Dose (mg) and Food", cent = "mean",
grp_var = "FID", grp_dv = TRUE,
ylab = "Concentration (ng/mL)", log_y = TRUE,
loq_method = 2)
The previous section provides an overview of how to generate
population concentration-time profiles by dose using
plot_dvtime
; however, we can also use
plot_dvtime
to generate subject-level visualizations with a
little pre-processing of the input dataset.
We can specify cent = "none"
to remove the central
tendency layer when plotting individual subject data.
plot_dvtime(plot_data, dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "none",
ylab = "Concentration (ng/mL)", log_y = TRUE,
grp_dv = TRUE,
loq_method = 2, loq = 1) +
facet_wrap(~PART)
We can plot an individual subject by filtering the input dataset.
This could be extended generate plots for all individuals using
for
loops, purrr::map()
functions, or other
vectorized methods.
ids <- sort(unique(plot_data$ID))
plist<- list()
for(i in 1:length(ids)){
plist[[i]] <- plot_dvtime(filter(plot_data, ID == ids[i]),
dv_var = c(DV = "ODV"), col_var = "Dose (mg)", cent = "none",
ylab = "Concentration (ng/mL)", log_y = TRUE,
grp_dv = TRUE,
loq_method = 2, loq = 1, show_caption = FALSE) +
facet_wrap(~PART)+
labs(title = paste0("ID = ", ids[i], " | Dose = ", unique(plot_data$DOSE[plot_data$ID==ids[i]]), " mg"))+
theme(legend.position="none")
}
groups <- length(plist)/4
grpplist <- list()
for(grp in 1:groups){
i <- (grp-1)*4+1
j <- grp*4
tmplist <- plist[i:j]
grpplist[[grp]] <- wrap_plots(tmplist)
}
grpplist
#> [[1]]
#>
#> [[2]]
#>
#> [[3]]
#>
#> [[4]]
#>
#> [[5]]
#>
#> [[6]]
#>
#> [[7]]
#>
#> [[8]]
#>
#> [[9]]