dr.Rd
This is the main function in the dr package. It creates objects of class dr to estimate the central (mean) subspace and perform tests concerning its dimension. Several helper functions that require a dr object can then be applied to the output from this function.
dr(formula, data, subset, group = NULL, na.action = na.fail, weights, ...)
dr.compute(x, y, weights, group = NULL, method = "sir", chi2approx = "bx", ...)
formula: a two-sided formula like y~x1+x2+x3, where the left-hand side variable is a vector or a matrix of the response variable(s), and the right-hand side variables are the predictors. While any legal formula in the Rogers-Wilkinson notation can appear, dimension reduction methods generally expect the predictors to be numeric, not factors, with no nesting. Full rank models are recommended, although rank-deficient models are permitted. The left-hand side of the formula will generally be a single vector, but it can also be a matrix, such as cbind(y1,y2)~x1+x2+x3, if the method is "save" or "sir". Both of these methods are based on slicing, and in the multivariate case the slices are determined by slicing on all the columns of the left-hand side variables.
data: an optional data frame containing the variables in the model. By default the variables are taken from the environment from which dr is called.
subset: an optional vector specifying a subset of observations to be used in the fitting process.
group: if used, this argument specifies a grouping variable, so that dimension reduction is done separately for each distinct level. This is implemented only when method is one of "sir", "save", or "ire". The argument must be a one-sided formula. For example, ~Location would fit separately for each level of the variable Location, and ~A:B would fit separately for each combination of A and B, provided that both have been declared factors.
weights: an optional vector of weights to be used where appropriate. In the context of dimension reduction methods, weights are used to obtain elliptical symmetry, not constant variance.
na.action: a function which indicates what should happen when the data contain NAs. The default is na.fail, which stops calculations. The option na.omit is also permitted, but it may not work correctly when weights are used.
x: the design matrix. This will be computed from the formula by dr and then passed to dr.compute, or you can create it yourself.
y: the response vector or matrix.
method: a character string specifying the method of fitting. The options include "sir", "save", "phdy", "phdres" and "ire". Each method may have its own additional arguments, or its own defaults; see the details below for more information.
chi2approx: several dr methods compute significance levels using statistics that are asymptotically distributed as a linear combination of \(\chi^2(1)\) random variables. This keyword chooses the method for approximating that distribution: either "bx", the default, for a method suggested by Bentler and Xie (2000), or "wood" for a method proposed by Wood (1989).
...: for dr, all additional arguments are passed to dr.compute. For dr.compute, additional arguments may be required for a particular dimension reduction method. For example, nslices is the number of slices used by "sir" and "save", and numdir is the maximum number of directions to compute, with default equal to 4. Other methods may have other defaults.
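As an illustrative sketch of how these arguments fit together (the variables come from the ais data used in the examples below; the nslices, numdir and chi2approx values are arbitrary choices, not defaults):

library(dr)
data(ais)
m <- dr(LBM ~ log(Wt) + log(Ht) + log(RCC) + log(WCC),
        data = ais,
        method = "sir",       # slicing-based method
        nslices = 8,          # method-specific argument passed through ...
        numdir = 3,           # compute at most 3 directions
        chi2approx = "wood")  # Wood (1989) approximation for p-values
summary(m)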
The general regression problem studies \(F(y|x)\), the conditional distribution of a response \(y\) given a set of predictors \(x\). This function provides methods for estimating the dimension and central subspace of a general regression problem. That is, we want to find a \(p \times d\) matrix \(B\) of minimal rank \(d\) such that $$F(y|x)=F(y|B'x).$$ Both the dimension \(d\) and the subspace \(R(B)\) are unknown. These methods make few assumptions. Many methods are based on the inverse distribution, \(F(x|y)\).
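As a minimal sketch of the idea on simulated data (the coefficient vector b below is made up for illustration), the response depends on three predictors only through a single linear combination, so \(d = 1\), and dr should recover that direction:

library(dr)
set.seed(1)
n <- 300
x <- matrix(rnorm(n * 3), n, 3)
colnames(x) <- c("x1", "x2", "x3")
b <- c(1, 1, 0)                         # true basis for the central subspace
y <- as.vector((x %*% b)^3) + rnorm(n)  # y depends on x only through b'x, so d = 1
fit <- dr(y ~ x1 + x2 + x3, data = data.frame(y, x))  # default method is "sir"
fit$evectors[, 1]                       # roughly proportional to (1, 1, 0)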
For the methods "sir", "save", "phdy" and "phdres", a kernel matrix \(M\) is estimated such that the column space of \(M\) should be close to the central subspace \(R(B)\). The eigenvectors corresponding to the \(d\) largest eigenvalues of \(M\) provide an estimate of \(R(B)\). For the method "ire", subspaces are estimated by minimizing an objective function.
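Continuing the simulation sketch above, these quantities are stored on the fitted object (component names as documented in the Value section below):

fit$M                 # estimated kernel matrix
fit$evalues           # its eigenvalues, in decreasing order
fit$evectors[, 1:2]   # eigenvectors whose span estimates R(B) when d = 2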
Categorical predictors can be included using the group argument with the methods "sir", "save" and "ire", using the ideas from Chiaromonte, Cook and Li (2002). The primary output from this method is (1) a set of vectors whose span estimates \(R(B)\), and (2) various tests concerning the dimension \(d\).
Weights can be used, essentially to specify the relative frequency of each case in the data. Empirical weights that make the contours of the weighted sample closer to elliptical can be computed using dr.weights. This will usually result in zero weight for some cases; the function sets zero estimated weights to missing.
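A sketch of this workflow, assuming dr.weights accepts the same formula/data interface as dr (see its own help page for the exact arguments; any weights returned as missing must then be handled via na.action):

wts <- dr.weights(LBM ~ log(SSF) + log(Wt) + log(RCC), data = ais)
m.w <- dr(LBM ~ log(SSF) + log(Wt) + log(RCC), data = ais, weights = wts)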
dr returns an object that inherits from dr (the name of the type is the value of the method argument), with attributes:
x: the design matrix.
y: the response vector.
weights: the weights used, normalized to add to n.
qr: the QR factorization of x.
cases: the number of cases used.
call: the initial call to dr.
M: a matrix that depends on the method of computing. The column space of M should be close to the central subspace.
evalues: the eigenvalues of M (or squared singular values if M is not symmetric).
evectors: the eigenvectors of M (or of M'M if M is not square and symmetric), ordered according to the eigenvalues.
chi2approx: value of the input argument of this name.
numdir: the maximum number of directions to be found. The output value of numdir may be smaller than the input value.
slice.info: output from the slicing function, used by "sir" and "save".
method: the dimension reduction method used.
terms: same as the terms attribute in lm or glm; needed to make update work correctly.
A: if method="save", then A is a three-dimensional array needed to compute test statistics.
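For instance, with the sir fit s0 created in the examples below, individual components can be read directly off the object:

s0$method    # the fitting method, here "sir"
s0$evalues   # eigenvalues of the kernel matrix M
s0$numdir    # maximum number of directions, 4 by default
s0$cases     # number of cases used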
Bentler, P. M. and Xie, J. (2000). Corrections to test statistics in principal Hessian directions. Statistics and Probability Letters, 47, 381-389. Approximate p-values.
Cook, R. D. (1998). Regression Graphics. New York: Wiley. This book provides the basic results for dimension reduction methods, including detailed discussion of the methods "sir", "phdy" and "phdres".
Cook, R. D. (2004). Testing predictor contributions in sufficient dimension reduction. Annals of Statistics, 32, 1062-1092. Introduced marginal coordinate tests.
Cook, R. D. and Nachtsheim, C. (1994). Reweighting to achieve elliptically contoured predictors in regression. Journal of the American Statistical Association, 89, 592-599. Describes the weighting scheme used by dr.weights.
Cook, R. D. and Ni, L. (2005). Sufficient dimension reduction via inverse regression: A minimum discrepancy approach. Journal of the American Statistical Association, 100, 410-428. The "ire" method is described in this paper.
Cook, R. D. and Weisberg, S. (1999). Applied Regression Including Computing and Graphics. New York: Wiley, http://www.stat.umn.edu/arc. The program arc described in this book also computes most of the dimension reduction methods described here.
Chiaromonte, F., Cook, R. D. and Li, B. (2002). Sufficient dimension reduction in regressions with categorical predictors. Annals of Statistics, 30, 475-497. Introduced grouping, or conditioning on factors.
Shao, Y., Cook, R. D. and Weisberg, S. (2007). Marginal tests with sliced average variance estimation. Biometrika. Describes the tests used for "save".
Wen, X. and Cook, R. D. (2007). Optimal sufficient dimension reduction in regressions with categorical predictors. Journal of Statistical Planning and Inference. This paper extends the "ire" method to grouping.
Wood, A. T. A. (1989) An \(F\) approximation to the distribution of a linear combination of chi-squared variables. Communications in Statistics: Simulation and Computation, 18, 1439-1456. Approximations for p-values.
data(ais)
# default fitting method is "sir"
s0 <- dr(LBM~log(SSF)+log(Wt)+log(Hg)+log(Ht)+log(WCC)+log(RCC)+
log(Hc)+log(Ferr),data=ais)
# Refit, using a different function for slicing to agree with arc.
summary(s1 <- update(s0,slice.function=dr.slices.arc))
#>
#> Call:
#> dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
#> log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc)
#>
#> Method:
#> sir with 11 slices, n = 202.
#>
#> Slice Sizes:
#> 19 19 19 19 19 19 19 18 18 18 15
#>
#> Estimated Basis Vectors for Central Subspace:
#> Dir1 Dir2 Dir3 Dir4
#> log(SSF) -0.143177 -0.0476079 -0.02815 0.003785
#> log(Wt) 0.879504 -0.1425841 0.23303 -0.094970
#> log(Hg) 0.195963 0.6318503 0.24483 -0.509424
#> log(Ht) 0.058923 -0.1100757 -0.87893 0.217803
#> log(WCC) 0.007276 -0.0029772 -0.05309 0.043056
#> log(RCC) 0.167736 0.3924936 -0.19711 -0.213689
#> log(Hc) -0.368652 -0.6418658 -0.26373 0.796849
#> log(Ferr) 0.002697 0.0002593 0.03492 0.039116
#>
#> Dir1 Dir2 Dir3 Dir4
#> Eigenvalues 0.9572 0.2275 0.09368 0.07319
#> R^2(OLS|dr) 0.9980 0.9981 0.99839 0.99864
#>
#> Large-sample Marginal Dimension Tests:
#> Stat df p.value
#> 0D vs >= 1D 284.78 80 0.00000
#> 1D vs >= 2D 91.43 63 0.01113
#> 2D vs >= 3D 45.48 48 0.57690
#> 3D vs >= 4D 26.55 35 0.84694
# Refit again, using save, with 10 slices; the default is max(8,ncol+3)
summary(s2<-update(s1,nslices=10,method="save"))
#>
#> Call:
#> dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
#> log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc,
#> nslices = 10, method = "save")
#>
#> Method:
#> save with 10 slices, n = 202.
#>
#> Slice Sizes:
#> 21 21 20 20 20 25 24 22 20 9
#>
#> Estimated Basis Vectors for Central Subspace:
#> Dir1 Dir2 Dir3 Dir4
#> log(SSF) 0.127709 -0.00907 0.01018 -0.06144
#> log(Wt) -0.905004 -0.07107 -0.15734 0.25774
#> log(Hg) -0.056187 0.50674 -0.34064 -0.38087
#> log(Ht) 0.399868 0.36613 0.68439 -0.54216
#> log(WCC) 0.032608 0.02733 0.02277 0.03474
#> log(RCC) -0.008463 0.15137 -0.24136 -0.47219
#> log(Hc) -0.021630 -0.76164 0.57591 0.51526
#> log(Ferr) 0.002116 -0.01670 0.01631 -0.03360
#>
#> Dir1 Dir2 Dir3 Dir4
#> Eigenvalues 0.9389 0.6611 0.5129 0.4653
#> R^2(OLS|dr) 0.9936 0.9950 0.9985 0.9989
#>
#> Large-sample Marginal Dimension Tests:
#> Stat df(Nor) p.value(Nor) p.value(Gen)
#> 0D vs >= 1D 378.3 324 0.02012 0.1071
#> 1D vs >= 2D 279.6 252 0.11214 0.3116
#> 2D vs >= 3D 179.9 189 0.67101 0.5160
#> 3D vs >= 4D 134.3 135 0.50176 0.2786
# Refit using phdres; output is similar for phdy, but the tests for phdy
# are not justifiable.
summary(s3<- update(s1,method="phdres"))
#>
#> Call:
#> dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
#> log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc,
#> method = "phdres")
#>
#> Method:
#> phdres, n = 202.
#>
#> Estimated Basis Vectors for Central Subspace:
#> Dir1 Dir2 Dir3 Dir4
#> log(SSF) -0.03675 -0.23340 0.001928 0.006563
#> log(Wt) 0.59536 0.03252 -0.238599 0.025140
#> log(Hg) -0.36061 -0.47699 -0.014747 -0.596972
#> log(Ht) 0.21613 -0.08133 0.959780 -0.038954
#> log(WCC) 0.02948 -0.07203 0.065847 -0.047897
#> log(RCC) -0.29816 -0.13669 -0.123638 -0.166642
#> log(Hc) 0.61429 0.82846 0.044891 0.781899
#> log(Ferr) -0.01761 0.01068 -0.005824 -0.001390
#>
#> Dir1 Dir2 Dir3 Dir4
#> Eigenvalues 2.8583 -1.4478 0.9612 -0.5621
#> R^2(OLS|dr) 0.8774 0.9444 0.9643 0.9891
#>
#> Large-sample Marginal Dimension Tests:
#> Stat df Normal theory Indep. test General theory
#> 0D vs >= 1D 223.67 36 0.000e+00 0 0.0005181
#> 1D vs >= 2D 69.64 28 2.091e-05 NA 0.0067417
#> 2D vs >= 3D 30.12 21 8.970e-02 NA 0.2343969
#> 3D vs >= 4D 12.70 15 6.257e-01 NA 0.2655593
# fit using ire:
summary(s4 <- update(s1,method="ire"))
#> Error in chol.default(kronecker(t(An), diag(rep(1, p))) %*% (((n - 1)/n) * cov(Gmat)) %*% kronecker(An, diag(rep(1, p)))): the leading minor of order 42 is not positive
# fit using Sex as a grouping variable.
s5 <- update(s4,group=~Sex)
#> Error: object 's4' not found
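# Because the "ire" fit failed, s4 was never created, so the grouped update
# above also errors. A working sketch is to attach the grouping to the sir
# fit s1 instead, since group is also supported for "sir":
s5 <- update(s1, group = ~Sex)
summary(s5)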