This function is a wrapper that calls specify(), hypothesize(), and calculate() consecutively and can be used to calculate observed statistics from data. hypothesize() will only be called if a point null hypothesis parameter is supplied.
Learn more in vignette("infer").
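The control flow described above can be sketched roughly as follows. This is only an illustration for a response-only formula and a hypothesized mean: `observe_sketch()` is a hypothetical helper, not part of infer, and it does not claim to reproduce how observe() is actually implemented.

```r
library(infer)

# Hypothetical helper (an assumption for illustration, not an infer function):
# mimics the described behaviour for a response-only formula and a mean.
observe_sketch <- function(x, formula, stat, mu = NULL) {
  spec <- specify(x, formula)
  if (!is.null(mu)) {
    # a point null parameter was supplied, so hypothesize() is called
    spec <- hypothesize(spec, null = "point", mu = mu)
  }
  calculate(spec, stat = stat)
}

# no null parameter: specify() feeds calculate() directly
observe_sketch(gss, hours ~ NULL, stat = "mean")

# mu supplied: hypothesize() is inserted between the two verbs
observe_sketch(gss, hours ~ NULL, stat = "t", mu = 40)
```

Both calls should give the same statistic as the corresponding observe() calls on the gss data.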
observe(
  x,
  formula,
  response = NULL,
  explanatory = NULL,
  success = NULL,
  null = NULL,
  p = NULL,
  mu = NULL,
  med = NULL,
  sigma = NULL,
  stat = c("mean", "median", "sum", "sd", "prop", "count", "diff in means",
    "diff in medians", "diff in props", "Chisq", "F", "slope", "correlation", "t", "z",
    "ratio of props", "odds ratio"),
  order = NULL,
  ...
)
x: A data frame that can be coerced into a tibble.
formula: A formula with the response variable on the left and the explanatory on the right. Alternatively, a response and explanatory argument can be supplied.
response: The variable name in x that will serve as the response. This is an alternative to using the formula argument.
explanatory: The variable name in x that will serve as the explanatory variable. This is an alternative to using the formula argument.
success: The level of response that will be considered a success, as a string. Needed for inference on one proportion, a difference in proportions, and corresponding z stats.
null: The null hypothesis. Options include "independence", "point", and "paired independence".
independence: Should be used with both a response and explanatory variable. Indicates that the values of the specified response variable are independent of the associated values in explanatory.
point: Should be used with only a response variable. Indicates that a point estimate based on the values in response is associated with a parameter. Sometimes requires supplying one of p, mu, med, or sigma.
paired independence: Should be used with only a response variable giving the pre-computed difference between paired observations. Indicates that the order of subtraction between paired values does not affect the resulting distribution.
p: The true proportion of successes (a number between 0 and 1). To be used with point null hypotheses when the specified response variable is categorical.
mu: The true mean (any numerical value). To be used with point null hypotheses when the specified response variable is continuous.
med: The true median (any numerical value). To be used with point null hypotheses when the specified response variable is continuous.
sigma: The true standard deviation (any numerical value). To be used with point null hypotheses.
stat: A string giving the type of the statistic to calculate. Current options include "mean", "median", "sum", "sd", "prop", "count", "diff in means", "diff in medians", "diff in props", "Chisq" (or "chisq"), "F" (or "f"), "t", "z", "ratio of props", "slope", "odds ratio", "ratio of means", and "correlation". infer only supports theoretical tests on one or two means via the "t" distribution and one or two proportions via the "z" distribution.
order: A string vector specifying the order in which the levels of the explanatory variable should be ordered for subtraction (or division for ratio-based statistics), where order = c("first", "second") means ("first" - "second"), or the analogue for ratios. Needed for inference on difference in means, medians, proportions, ratios, t, and z statistics.
...: Used to pass options like na.rm = TRUE into functions like mean(), sd(), etc. Can also be used to supply hypothesized null values for the "t" statistic or additional arguments to stats::chisq.test().
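As a rough illustration of how the success, p, and order arguments described above fit together, the following sketch uses the gss data from the examples. The hypothesized proportion p = 0.3 is an arbitrary value chosen for illustration, not one taken from this documentation.

```r
library(infer)

# observed proportion of respondents with a college degree;
# `success` names the level of `college` that counts as a success
observe(gss, college ~ NULL, success = "degree", stat = "prop")

# z statistic against a hypothesized proportion
# (p = 0.3 is an arbitrary, assumed value for illustration only)
observe(
  gss,
  college ~ NULL,
  success = "degree",
  stat = "z",
  null = "point",
  p = 0.3
)

# `order` fixes the direction of subtraction: degree minus no degree
d1 <- observe(
  gss,
  age ~ college,
  stat = "diff in means",
  order = c("degree", "no degree")
)

# reversing `order` subtracts the other way round, flipping the sign
d2 <- observe(
  gss,
  age ~ college,
  stat = "diff in means",
  order = c("no degree", "degree")
)
```

Here d1 and d2 should differ only in sign, since the same two group means are subtracted in opposite directions.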
A 1-column tibble containing the calculated statistic stat.
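Because the result is a 1-column tibble, the statistic usually has to be extracted before it can be used as a plain number. A minimal sketch, assuming the gss data used in the examples (the object name `obs_mean` is only illustrative):

```r
library(infer)

obs_mean <- observe(gss, hours ~ NULL, stat = "mean")

# the single column is called `stat`
names(obs_mean)

# pull the statistic out as a length-one numeric vector
obs_mean$stat
```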
Other wrapper functions: chisq_stat(), chisq_test(), prop_test(), t_stat(), t_test()
Other functions for calculating observed statistics: chisq_stat(), t_stat()
# calculating the observed mean number of hours worked per week
gss %>%
  observe(hours ~ NULL, stat = "mean")
#> Response: hours (numeric)
#> # A tibble: 1 × 1
#>    stat
#>   <dbl>
#> 1  41.4
# equivalently, calculating the same statistic with the core verbs
gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")
#> Response: hours (numeric)
#> # A tibble: 1 × 1
#>    stat
#>   <dbl>
#> 1  41.4
# calculating a t statistic for hypothesized mu = 40 hours worked/week
gss %>%
  observe(hours ~ NULL, stat = "t", null = "point", mu = 40)
#> Response: hours (numeric)
#> Null Hypothesis: point
#> # A tibble: 1 × 1
#>    stat
#>   <dbl>
#> 1  2.09
# equivalently, calculating the same statistic with the core verbs
gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")
#> Response: hours (numeric)
#> Null Hypothesis: point
#> # A tibble: 1 × 1
#>    stat
#>   <dbl>
#> 1  2.09
# similarly for a difference in means in age based on whether
# the respondent has a college degree
observe(
  gss,
  age ~ college,
  stat = "diff in means",
  order = c("degree", "no degree")
)
#> Response: age (numeric)
#> Explanatory: college (factor)
#> # A tibble: 1 × 1
#>    stat
#>   <dbl>
#> 1 0.941
# equivalently, calculating the same statistic with the core verbs
gss %>%
  specify(age ~ college) %>%
  calculate("diff in means", order = c("degree", "no degree"))
#> Response: age (numeric)
#> Explanatory: college (factor)
#> # A tibble: 1 × 1
#>    stat
#>   <dbl>
#> 1 0.941
# for a more in-depth explanation of how to use the infer package
if (FALSE) { # \dontrun{
vignette("infer")
} # }