Helper functions for detecting linearly dependent columns in a two-dimensional data structure, especially in a (transformed) model matrix. They are typically useful in interactive mode during the model building phase.
detect.lindep(object, ...)
# S3 method for class 'matrix'
detect.lindep(object, suppressPrint = FALSE, ...)
# S3 method for class 'data.frame'
detect.lindep(object, suppressPrint = FALSE, ...)
# S3 method for class 'plm'
detect.lindep(object, suppressPrint = FALSE, ...)
# S3 method for class 'plm'
alias(object, ...)
# S3 method for class 'pdata.frame'
alias(
object,
model = c("pooling", "within", "Between", "between", "mean", "random", "fd"),
effect = c("individual", "time", "twoways"),
...
)
object: for detect.lindep: an object which should be checked
for linear dependence (of class "matrix", "data.frame", or
"plm"); for alias: either an estimated model of class
"plm" or a "pdata.frame". Usually, one wants to input a model
matrix here or check an already estimated plm model.
...: further arguments.
suppressPrint: for detect.lindep only: logical indicating
whether a message shall be printed; defaults to printing the message, i. e.,
to suppressPrint = FALSE.
model: (see plm).
effect: (see plm).
For detect.lindep: A named numeric vector containing the column
numbers of the linearly dependent columns in the object after data
transformation, if any are present. NULL if no linearly dependent
columns are detected.
For alias: the return value of stats::alias.lm() run on the
(quasi-)demeaned model, i. e., the information output applies to
the transformed model matrix, not the original data.
Linear dependence of columns/variables is (usually) readily avoided when
building one's model. However, linear dependence is sometimes not obvious
and is harder to detect for less experienced applied statisticians. The
so-called "dummy variable trap" is a common and probably the best-known
fallacy of this kind (see e. g. Wooldridge (2016), sec. 7-2.). When building
linear models with lm or plm's pooling model, linear
dependence in one's model is easily detected, at times post hoc.
However, linear dependence might also occur after some transformations of
the data, even though it is not present in the untransformed data. The within
transformation (also called fixed effect transformation) used in the
"within" model can result in such linear dependence, which is less likely
to come to mind when building a model. See Examples for two
examples of linearly dependent columns after the within transformation: ex. 1)
the transformed variables have the opposite sign of one another; ex. 2) the
transformed variables are identical.
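The mechanism can be reproduced without plm. The following base R sketch uses a made-up toy panel (the data frame toy, the helper demean() and all values are assumptions for illustration, not part of plm) and demeans each column per individual by hand. Two variables that both grow by one unit per period are not collinear in levels but become identical after demeaning.

```r
# Toy panel: 3 individuals observed over 4 periods (all values are made up)
toy <- data.frame(
  id     = rep(1:3, each = 4),
  x      = c(2.1, 3.4, 1.9, 4.0, 5.2, 4.8, 6.1, 5.5, 0.7, 1.3, 0.9, 2.2),
  tenure = c(1:4, 5:8, 2:5),        # grows by 1 each period
  age    = c(40:43, 51:54, 38:41)   # grows by 1 each period
)

# Within transformation by hand: subtract each individual's mean
# (demean() is a hypothetical helper, not a plm function)
demean <- function(v, g) v - ave(v, g)

X_raw    <- as.matrix(toy[, c("x", "tenure", "age")])
X_within <- apply(X_raw, 2, demean, g = toy$id)

# Column rank in levels (with an intercept column) and after demeaning
rank_raw    <- qr(cbind(1, X_raw))$rank   # 4 columns, full rank
rank_within <- qr(X_within)$rank          # 3 columns, but rank is only 2

# After demeaning, the two trending variables coincide
same_after_demeaning <- isTRUE(all.equal(X_within[, "tenure"],
                                         X_within[, "age"]))
```

A rank below the number of columns after the transformation, while the untransformed matrix has full rank, is the situation that detect.lindep and alias are meant to reveal for plm's own transformed model matrices.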
During plm's model estimation, linearly dependent columns and their
corresponding coefficients in the resulting object are silently dropped,
while the corresponding model frame and model matrix still contain the
affected columns. The plm object contains an element aliased which
indicates any such aliased coefficients by a named logical vector.
Both functions, detect.lindep and alias, help to
detect linear dependence and accomplish almost the same:
detect.lindep is a standalone implementation while
alias is a wrapper around
stats::alias.lm(), extending the alias
generic to classes "plm" and "pdata.frame".
alias hinges on the availability of the package
MASS on the system. Not all arguments of alias.lm
are supported. The output of alias is more informative as it
gives the linear combination of dependent columns (after data
transformations, i. e., after (quasi-)demeaning) while
detect.lindep only gives the columns involved in the linear
dependence in a simple format (thus being more suited for automatic
post-processing of the information).
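The difference between the two kinds of output can be illustrated with a small base R sketch. The helper suspicious_cols() is hypothetical and is not plm's implementation of detect.lindep; it merely flags columns whose removal leaves the matrix rank unchanged. The linear combination itself, comparable in spirit to what alias reports, is recovered here with stats::lm.fit(). The matrix X is made-up data that mimics Example 1 after the within transformation.

```r
# Made-up, already demeaned matrix: f2 is exactly -1 * f1
x  <- c(-1.2, 0.4, -0.3, 1.1, 0.8, -0.8)
f1 <- rep(c(-0.5, 0.5), 3)
X  <- cbind(x = x, f1 = f1, f2 = -f1)

# Hypothetical helper (an assumption for illustration, not plm code):
# a column takes part in a linear dependence if dropping it does not
# reduce the rank of the matrix.
suspicious_cols <- function(X) {
  full_rank <- qr(X)$rank
  if (full_rank == ncol(X)) return(NULL)   # no linear dependence
  keeps_rank <- vapply(
    seq_len(ncol(X)),
    function(j) qr(X[, -j, drop = FALSE])$rank == full_rank,
    logical(1)
  )
  idx <- which(keeps_rank)
  names(idx) <- colnames(X)[idx]
  idx
}

# "Which columns?": a named vector of column numbers, easy to post-process
susp <- suspicious_cols(X)

# "Which combination?": express f2 through the remaining columns
comb <- lm.fit(X[, c("x", "f1"), drop = FALSE], X[, "f2"])$coefficients
comb <- round(comb, 10)   # x is (numerically) 0, f1 is -1
```

The named index vector is convenient for programmatic use, for example to decide which regressor to drop from a formula, whereas the coefficients of the linear combination explain why the columns are dependent.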
The function detect.lindep was initially called detect_lin_dep
but was later renamed for naming consistency.
Wooldridge JM (2013). Introductory Econometrics: a modern approach, 5th edition. South-Western (Cengage Learning).
stats::alias(), stats::model.matrix() and especially
plm's model.matrix() for (transformed) model matrices,
plm's model.frame().
### Example 1 ###
# prepare the data
library(plm)  # provides pdata.frame(), plm() and detect.lindep()
data("Cigar", package = "plm")
Cigar[ , "fact1"] <- c(0,1)
Cigar[ , "fact2"] <- c(1,0)
Cigar.p <- pdata.frame(Cigar)
# setup a formula and a model frame
form <- price ~ 0 + cpi + fact1 + fact2
mf <- model.frame(Cigar.p, form)
# no linear dependence in the pooling model's model matrix
# (with intercept in the formula, there would be linear dependence)
detect.lindep(model.matrix(mf, model = "pooling"))
#> [1] "No linear dependent column(s) detected."
# linear dependence present in the FE transformed model matrix
modmat_FE <- model.matrix(mf, model = "within")
detect.lindep(modmat_FE)
#> [1] "Suspicious column number(s): 2, 3"
#> [1] "Suspicious column name(s): fact1, fact2"
mod_FE <- plm(form, data = Cigar.p, model = "within")
detect.lindep(mod_FE)
#> [1] "Suspicious column number(s): 2, 3"
#> [1] "Suspicious column name(s): fact1, fact2"
alias(mod_FE) # => fact1 == -1*fact2
#> Model :
#> [1] "price ~ 0 + cpi + fact1 + fact2"
#>
#> Complete :
#> cpi fact1
#> fact2 0 -1
#>
plm(form, data = mf, model = "within")$aliased # "fact2" indicated as aliased
#> cpi fact1 fact2
#> FALSE FALSE TRUE
# look at the data: after FE transformation fact1 == -1*fact2
head(modmat_FE)
#> cpi fact1 fact2
#> 1 -42.99667 -0.5 0.5
#> 2 -42.59667 0.5 -0.5
#> 3 -42.09667 -0.5 0.5
#> 4 -41.19667 0.5 -0.5
#> 5 -40.19667 -0.5 0.5
#> 6 -38.79667 0.5 -0.5
all.equal(modmat_FE[ , "fact1"], -1*modmat_FE[ , "fact2"])
#> [1] TRUE
### Example 2 ###
# Setup the data:
# Assume CEOs stay with the firms of the Grunfeld data
# for the firm's entire lifetime and assume some fictional
# data about CEO tenure and age in year 1935 (first observation
# in the data set) to be at 1 to 10 years and 38 to 65 years, respectively.
# => CEO tenure and CEO age increase by same value (+1 year per year).
data("Grunfeld", package = "plm")
set.seed(42)
# add fictional data
# note: the start value s is assigned inside `from` and reused in `to`
Grunfeld$CEOtenure <- c(replicate(10, seq(from = s <- sample(1:10, 1), to = s + 19, by = 1)))
Grunfeld$CEOage <- c(replicate(10, seq(from = s <- sample(38:65, 1), to = s + 19, by = 1)))
# look at the data
head(Grunfeld, 50)
#> firm year inv value capital CEOtenure CEOage
#> 1 1 1935 317.6 3078.5 2.8 1 44
#> 2 1 1936 391.8 4661.7 52.6 2 45
#> 3 1 1937 410.6 5387.1 156.9 3 46
#> 4 1 1938 257.7 2792.2 209.2 4 47
#> 5 1 1939 330.8 4313.2 203.4 5 48
#> 6 1 1940 461.2 4643.9 207.2 6 49
#> 7 1 1941 512.0 4551.2 255.2 7 50
#> 8 1 1942 448.0 3244.1 303.7 8 51
#> 9 1 1943 499.6 4053.7 264.1 9 52
#> 10 1 1944 547.5 4379.3 201.6 10 53
#> 11 1 1945 561.2 4840.9 265.0 11 54
#> 12 1 1946 688.1 4900.9 402.2 12 55
#> 13 1 1947 568.9 3526.5 761.5 13 56
#> 14 1 1948 529.2 3254.7 922.4 14 57
#> 15 1 1949 555.1 3700.2 1020.1 15 58
#> 16 1 1950 642.9 3755.6 1099.0 16 59
#> 17 1 1951 755.9 4833.0 1207.7 17 60
#> 18 1 1952 891.2 4924.9 1430.5 18 61
#> 19 1 1953 1304.4 6241.7 1777.3 19 62
#> 20 1 1954 1486.7 5593.6 2226.3 20 63
#> 21 2 1935 209.9 1362.4 53.8 5 41
#> 22 2 1936 355.3 1807.1 50.5 6 42
#> 23 2 1937 469.9 2676.3 118.1 7 43
#> 24 2 1938 262.3 1801.9 260.2 8 44
#> 25 2 1939 230.4 1957.3 312.7 9 45
#> 26 2 1940 361.6 2202.9 254.2 10 46
#> 27 2 1941 472.8 2380.5 261.4 11 47
#> 28 2 1942 445.6 2168.6 298.7 12 48
#> 29 2 1943 361.6 1985.1 301.8 13 49
#> 30 2 1944 288.2 1813.9 279.1 14 50
#> 31 2 1945 258.7 1850.2 213.8 15 51
#> 32 2 1946 420.3 2067.7 132.6 16 52
#> 33 2 1947 420.5 1796.7 264.8 17 53
#> 34 2 1948 494.5 1625.8 306.9 18 54
#> 35 2 1949 405.1 1667.0 351.1 19 55
#> 36 2 1950 418.8 1677.4 357.8 20 56
#> 37 2 1951 588.2 2289.5 342.1 21 57
#> 38 2 1952 645.5 2159.4 444.2 22 58
#> 39 2 1953 641.0 2031.3 623.6 23 59
#> 40 2 1954 459.3 2115.5 669.7 24 60
#> 41 3 1935 33.1 1170.6 97.8 1 62
#> 42 3 1936 45.0 2015.8 104.4 2 63
#> 43 3 1937 77.2 2803.3 118.0 3 64
#> 44 3 1938 44.6 2039.7 156.2 4 65
#> 45 3 1939 48.1 2256.2 172.6 5 66
#> 46 3 1940 74.4 2132.2 186.6 6 67
#> 47 3 1941 113.0 1834.1 220.9 7 68
#> 48 3 1942 91.9 1588.0 287.8 8 69
#> 49 3 1943 61.3 1749.4 319.9 9 70
#> 50 3 1944 56.8 1687.2 321.3 10 71
form <- inv ~ value + capital + CEOtenure + CEOage
mf <- model.frame(pdata.frame(Grunfeld), form)
# no linear dependent columns in original data/pooling model
modmat_pool <- model.matrix(mf, model="pooling")
detect.lindep(modmat_pool)
#> [1] "No linear dependent column(s) detected."
mod_pool <- plm(form, data = Grunfeld, model = "pooling")
alias(mod_pool)
#> Model :
#> [1] "inv ~ value + capital + CEOtenure + CEOage"
#>
# CEOtenure and CEOage are linear dependent after FE transformation
# (demeaning per individual)
modmat_FE <- model.matrix(mf, model="within")
detect.lindep(modmat_FE)
#> [1] "Suspicious column number(s): 3, 4"
#> [1] "Suspicious column name(s): CEOtenure, CEOage"
mod_FE <- plm(form, data = Grunfeld, model = "within")
detect.lindep(mod_FE)
#> [1] "Suspicious column number(s): 3, 4"
#> [1] "Suspicious column name(s): CEOtenure, CEOage"
alias(mod_FE)
#> Model :
#> [1] "inv ~ value + capital + CEOtenure + CEOage"
#>
#> Complete :
#> value capital CEOtenure
#> CEOage 0 0 1
#>
# look at the transformed data: after FE transformation CEOtenure == 1*CEOage
head(modmat_FE, 50)
#> value capital CEOtenure CEOage
#> 1 -1255.345 -645.635 -9.5 -9.5
#> 2 327.855 -595.835 -8.5 -8.5
#> 3 1053.255 -491.535 -7.5 -7.5
#> 4 -1541.645 -439.235 -6.5 -6.5
#> 5 -20.645 -445.035 -5.5 -5.5
#> 6 310.055 -441.235 -4.5 -4.5
#> 7 217.355 -393.235 -3.5 -3.5
#> 8 -1089.745 -344.735 -2.5 -2.5
#> 9 -280.145 -384.335 -1.5 -1.5
#> 10 45.455 -446.835 -0.5 -0.5
#> 11 507.055 -383.435 0.5 0.5
#> 12 567.055 -246.235 1.5 1.5
#> 13 -807.345 113.065 2.5 2.5
#> 14 -1079.145 273.965 3.5 3.5
#> 15 -633.645 371.665 4.5 4.5
#> 16 -578.245 450.565 5.5 5.5
#> 17 499.155 559.265 6.5 6.5
#> 18 591.055 782.065 7.5 7.5
#> 19 1907.855 1128.865 8.5 8.5
#> 20 1259.755 1577.865 9.5 9.5
#> 21 -609.425 -241.055 -9.5 -9.5
#> 22 -164.725 -244.355 -8.5 -8.5
#> 23 704.475 -176.755 -7.5 -7.5
#> 24 -169.925 -34.655 -6.5 -6.5
#> 25 -14.525 17.845 -5.5 -5.5
#> 26 231.075 -40.655 -4.5 -4.5
#> 27 408.675 -33.455 -3.5 -3.5
#> 28 196.775 3.845 -2.5 -2.5
#> 29 13.275 6.945 -1.5 -1.5
#> 30 -157.925 -15.755 -0.5 -0.5
#> 31 -121.625 -81.055 0.5 0.5
#> 32 95.875 -162.255 1.5 1.5
#> 33 -175.125 -30.055 2.5 2.5
#> 34 -346.025 12.045 3.5 3.5
#> 35 -304.825 56.245 4.5 4.5
#> 36 -294.425 62.945 5.5 5.5
#> 37 317.675 47.245 6.5 6.5
#> 38 187.575 149.345 7.5 7.5
#> 39 59.475 328.745 8.5 8.5
#> 40 143.675 374.845 9.5 9.5
#> 41 -770.725 -302.360 -9.5 -9.5
#> 42 74.475 -295.760 -8.5 -8.5
#> 43 861.975 -282.160 -7.5 -7.5
#> 44 98.375 -243.960 -6.5 -6.5
#> 45 314.875 -227.560 -5.5 -5.5
#> 46 190.875 -213.560 -4.5 -4.5
#> 47 -107.225 -179.260 -3.5 -3.5
#> 48 -353.325 -112.360 -2.5 -2.5
#> 49 -191.925 -80.260 -1.5 -1.5
#> 50 -254.125 -78.860 -0.5 -0.5
all.equal(modmat_FE[ , "CEOtenure"], modmat_FE[ , "CEOage"])
#> [1] TRUE