Functions to detect linear dependence

Small helper functions that help users detect linearly dependent columns in a two-dimensional data structure, especially in a (transformed) model matrix. They are typically useful in interactive mode during the model building phase.

detect.lindep(object, ...)

# S3 method for class 'matrix'
detect.lindep(object, suppressPrint = FALSE, ...)

# S3 method for class 'data.frame'
detect.lindep(object, suppressPrint = FALSE, ...)

# S3 method for class 'plm'
detect.lindep(object, suppressPrint = FALSE, ...)

# S3 method for class 'plm'
alias(object, ...)

# S3 method for class 'pdata.frame'
alias(
  object,
  model = c("pooling", "within", "Between", "between", "mean", "random", "fd"),
  effect = c("individual", "time", "twoways"),
  ...
)

Arguments

object: for detect.lindep: an object to be checked for linear dependence (of class "matrix", "data.frame", or "plm"); for alias: either an estimated model of class "plm" or a "pdata.frame". Usually, one wants to input a model matrix here or check an already estimated plm model.
...: further arguments.
suppressPrint: for detect.lindep only: logical indicating whether printing of the message shall be suppressed; defaults to printing the message, i.e., to suppressPrint = FALSE.
model: (see plm).
effect: (see plm).

Value

For detect.lindep: A named numeric vector containing the column numbers of the linearly dependent columns in the object after data transformation, if any are present. NULL if no linearly dependent columns are detected.

For alias: the return value of stats::alias.lm() run on the (quasi-)demeaned model, i.e., the reported information applies to the transformed model matrix, not to the original data.

Details

Linear dependence of columns/variables is (usually) readily avoided when building one's model. However, linear dependence is sometimes not obvious and is harder to detect for less experienced applied statisticians. The so-called "dummy variable trap" is a common and probably the best-known fallacy of this kind (see, e.g., Wooldridge (2013), sec. 7-2). When building linear models with lm or plm's pooling model, linear dependence in one's model is easily detected, at times post hoc.
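As a minimal base-R sketch of the dummy variable trap with lm (the data frame d and its columns y, male, and female are hypothetical, made up for illustration only), the post hoc detection could look like this:

```r
# Hypothetical toy data: 'male' and 'female' are complementary dummies,
# so together with the intercept they are linearly dependent.
set.seed(1)
d <- data.frame(y = rnorm(6), male = c(1, 0, 1, 0, 1, 0))
d$female <- 1 - d$male

fit <- lm(y ~ male + female, data = d)

# lm() cannot estimate the redundant coefficient and reports NA for it.
coef(fit)

# alias() expresses the aliased term via the remaining columns
# (here: female = (Intercept) - male).
alias(fit)
```

The NA coefficient is the post hoc signal; alias() additionally states which linear combination causes it.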

However, linear dependence might also occur after some transformations of the data, even though it is not present in the untransformed data. The within transformation (also called fixed effect transformation) used in the "within" model can result in such linear dependence, and this is less likely to come to mind when building a model. See Examples for two examples of linearly dependent columns after the within transformation: ex. 1) the transformed variables are the negative of one another; ex. 2) the transformed variables are identical.
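The mechanism can be sketched without plm. The following base-R illustration uses a hypothetical toy panel (the vectors id, tenure, and age and the helper demean are made up for this sketch and are not part of plm); demeaning per individual is done by hand via ave():

```r
# Hypothetical toy panel: 2 individuals observed over 3 periods each.
id     <- rep(1:2, each = 3)
tenure <- c(1, 2, 3, 5, 6, 7)        # increases by 1 per period
age    <- c(40, 41, 42, 55, 56, 57)  # increases by 1 per period, other start

# In levels (with an intercept) the columns are linearly independent:
# age - tenure differs between the individuals, so the rank is 3.
qr(cbind(1, tenure, age))$rank

# Within transformation by hand: subtract each individual's mean.
demean   <- function(x, g) x - ave(x, g)
tenure_w <- demean(tenure, id)
age_w    <- demean(age, id)

# After demeaning both columns coincide, so the rank drops to 1.
all.equal(tenure_w, age_w)
qr(cbind(tenure_w, age_w))$rank
```

Any two variables that differ only by an individual-specific constant collapse onto the same column this way, because demeaning removes exactly that constant.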

During plm's model estimation, linearly dependent columns and their corresponding coefficients in the resulting object are silently dropped, while the corresponding model frame and model matrix still contain the affected columns. The plm object contains an element aliased, a named logical vector that indicates any such aliased coefficients.

Both functions, detect.lindep and alias, help to detect linear dependence and accomplish almost the same: detect.lindep is a stand-alone implementation, while alias is a wrapper around stats::alias.lm() that extends the alias generic to classes "plm" and "pdata.frame". alias hinges on the availability of the package MASS on the system. Not all arguments of alias.lm are supported. The output of alias is more informative, as it gives the linear combination of dependent columns (after data transformations, i.e., after (quasi-)demeaning), while detect.lindep only gives the columns involved in the linear dependence in a simple format (and is thus better suited for automatic post-processing of the information).
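A minimal sketch of such post-processing, assuming the return value is as described under Value (a named numeric vector of column numbers, or NULL). The object names X, susp, and X_reduced are illustrative, and keeping only the first suspicious column is a deliberately simplistic rule that is only adequate when all suspicious columns belong to a single dependence relation:

```r
library(plm)

data("Cigar", package = "plm")
Cigar[ , "fact1"] <- c(0, 1)
Cigar[ , "fact2"] <- c(1, 0)
mf <- model.frame(pdata.frame(Cigar), price ~ 0 + cpi + fact1 + fact2)
X  <- model.matrix(mf, model = "within")

# Collect the suspicious columns without printing the message.
susp <- detect.lindep(X, suppressPrint = TRUE)

if (!is.null(susp)) {
  # Simplistic rule (an assumption of this sketch): keep the first
  # suspicious column and drop the remaining ones.
  X_reduced <- X[ , -susp[-1], drop = FALSE]
  # Re-check the reduced matrix; NULL means no dependence is left.
  detect.lindep(X_reduced, suppressPrint = TRUE)
}
```

For more than one dependence relation, the output of alias is the better guide to which columns to remove.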

Note

The function detect.lindep was initially called detect_lin_dep but was later renamed for naming consistency.

References

Wooldridge JM (2013). Introductory Econometrics: a modern approach, 5th edition. South-Western (Cengage Learning).

Author

Kevin Tappe

Examples


### Example 1 ###
# prepare the data
data("Cigar" , package = "plm")
Cigar[ , "fact1"] <- c(0,1)
Cigar[ , "fact2"] <- c(1,0)
Cigar.p <- pdata.frame(Cigar)

# setup a formula and a model frame
form <- price ~ 0 + cpi + fact1 + fact2
mf <- model.frame(Cigar.p, form)
# no linear dependence in the pooling model's model matrix
# (with intercept in the formula, there would be linear dependence)
detect.lindep(model.matrix(mf, model = "pooling"))
#> [1] "No linear dependent column(s) detected."
# linear dependence present in the FE transformed model matrix
modmat_FE <- model.matrix(mf, model = "within")
detect.lindep(modmat_FE)
#> [1] "Suspicious column number(s): 2, 3"
#> [1] "Suspicious column name(s):   fact1, fact2"
mod_FE <- plm(form, data = Cigar.p, model = "within")
detect.lindep(mod_FE) 
#> [1] "Suspicious column number(s): 2, 3"
#> [1] "Suspicious column name(s):   fact1, fact2"
alias(mod_FE) # => fact1 == -1*fact2
#> Model :
#> [1] "price ~ 0 + cpi + fact1 + fact2"
#> 
#> Complete :
#>       cpi fact1
#> fact2  0  -1   
#> 
plm(form, data = mf, model = "within")$aliased # "fact2" indicated as aliased
#>   cpi fact1 fact2 
#> FALSE FALSE  TRUE 

# look at the data: after FE transformation fact1 == -1*fact2
head(modmat_FE)
#>         cpi fact1 fact2
#> 1 -42.99667  -0.5   0.5
#> 2 -42.59667   0.5  -0.5
#> 3 -42.09667  -0.5   0.5
#> 4 -41.19667   0.5  -0.5
#> 5 -40.19667  -0.5   0.5
#> 6 -38.79667   0.5  -0.5
all.equal(modmat_FE[ , "fact1"], -1*modmat_FE[ , "fact2"])
#> [1] TRUE

### Example 2 ###
# Setup the data:
# Assume CEOs stay with the firms of the Grunfeld data
# for the firm's entire lifetime and assume some fictional
# data about CEO tenure and age in year 1935 (first observation
# in the data set) to be at 1 to 10 years and 38 to 65 years, respectively.
# => CEO tenure and CEO age increase by same value (+1 year per year).
data("Grunfeld", package = "plm")
set.seed(42)
# add fictional data
Grunfeld$CEOtenure <- c(replicate(10, seq(from=s<-sample(1:10,  1), to=s+19, by=1)))
Grunfeld$CEOage    <- c(replicate(10, seq(from=s<-sample(38:65, 1), to=s+19, by=1)))

# look at the data
head(Grunfeld, 50)
#>    firm year    inv  value capital CEOtenure CEOage
#> 1     1 1935  317.6 3078.5     2.8         1     44
#> 2     1 1936  391.8 4661.7    52.6         2     45
#> 3     1 1937  410.6 5387.1   156.9         3     46
#> 4     1 1938  257.7 2792.2   209.2         4     47
#> 5     1 1939  330.8 4313.2   203.4         5     48
#> 6     1 1940  461.2 4643.9   207.2         6     49
#> 7     1 1941  512.0 4551.2   255.2         7     50
#> 8     1 1942  448.0 3244.1   303.7         8     51
#> 9     1 1943  499.6 4053.7   264.1         9     52
#> 10    1 1944  547.5 4379.3   201.6        10     53
#> 11    1 1945  561.2 4840.9   265.0        11     54
#> 12    1 1946  688.1 4900.9   402.2        12     55
#> 13    1 1947  568.9 3526.5   761.5        13     56
#> 14    1 1948  529.2 3254.7   922.4        14     57
#> 15    1 1949  555.1 3700.2  1020.1        15     58
#> 16    1 1950  642.9 3755.6  1099.0        16     59
#> 17    1 1951  755.9 4833.0  1207.7        17     60
#> 18    1 1952  891.2 4924.9  1430.5        18     61
#> 19    1 1953 1304.4 6241.7  1777.3        19     62
#> 20    1 1954 1486.7 5593.6  2226.3        20     63
#> 21    2 1935  209.9 1362.4    53.8         5     41
#> 22    2 1936  355.3 1807.1    50.5         6     42
#> 23    2 1937  469.9 2676.3   118.1         7     43
#> 24    2 1938  262.3 1801.9   260.2         8     44
#> 25    2 1939  230.4 1957.3   312.7         9     45
#> 26    2 1940  361.6 2202.9   254.2        10     46
#> 27    2 1941  472.8 2380.5   261.4        11     47
#> 28    2 1942  445.6 2168.6   298.7        12     48
#> 29    2 1943  361.6 1985.1   301.8        13     49
#> 30    2 1944  288.2 1813.9   279.1        14     50
#> 31    2 1945  258.7 1850.2   213.8        15     51
#> 32    2 1946  420.3 2067.7   132.6        16     52
#> 33    2 1947  420.5 1796.7   264.8        17     53
#> 34    2 1948  494.5 1625.8   306.9        18     54
#> 35    2 1949  405.1 1667.0   351.1        19     55
#> 36    2 1950  418.8 1677.4   357.8        20     56
#> 37    2 1951  588.2 2289.5   342.1        21     57
#> 38    2 1952  645.5 2159.4   444.2        22     58
#> 39    2 1953  641.0 2031.3   623.6        23     59
#> 40    2 1954  459.3 2115.5   669.7        24     60
#> 41    3 1935   33.1 1170.6    97.8         1     62
#> 42    3 1936   45.0 2015.8   104.4         2     63
#> 43    3 1937   77.2 2803.3   118.0         3     64
#> 44    3 1938   44.6 2039.7   156.2         4     65
#> 45    3 1939   48.1 2256.2   172.6         5     66
#> 46    3 1940   74.4 2132.2   186.6         6     67
#> 47    3 1941  113.0 1834.1   220.9         7     68
#> 48    3 1942   91.9 1588.0   287.8         8     69
#> 49    3 1943   61.3 1749.4   319.9         9     70
#> 50    3 1944   56.8 1687.2   321.3        10     71

form <- inv ~ value + capital + CEOtenure + CEOage
mf <- model.frame(pdata.frame(Grunfeld), form)
# no linearly dependent columns in original data/pooling model
modmat_pool <- model.matrix(mf, model="pooling")
detect.lindep(modmat_pool)
#> [1] "No linear dependent column(s) detected."
mod_pool <- plm(form, data = Grunfeld, model = "pooling")
alias(mod_pool)
#> Model :
#> [1] "inv ~ value + capital + CEOtenure + CEOage"
#> 

# CEOtenure and CEOage are linearly dependent after FE transformation
# (demeaning per individual)
modmat_FE <- model.matrix(mf, model="within")
detect.lindep(modmat_FE)
#> [1] "Suspicious column number(s): 3, 4"
#> [1] "Suspicious column name(s):   CEOtenure, CEOage"
mod_FE <- plm(form, data = Grunfeld, model = "within")
detect.lindep(mod_FE)
#> [1] "Suspicious column number(s): 3, 4"
#> [1] "Suspicious column name(s):   CEOtenure, CEOage"
alias(mod_FE)
#> Model :
#> [1] "inv ~ value + capital + CEOtenure + CEOage"
#> 
#> Complete :
#>        value capital CEOtenure
#> CEOage 0     0       1        
#> 

# look at the transformed data: after FE transformation CEOtenure == 1*CEOage
head(modmat_FE, 50)
#>        value  capital CEOtenure CEOage
#> 1  -1255.345 -645.635      -9.5   -9.5
#> 2    327.855 -595.835      -8.5   -8.5
#> 3   1053.255 -491.535      -7.5   -7.5
#> 4  -1541.645 -439.235      -6.5   -6.5
#> 5    -20.645 -445.035      -5.5   -5.5
#> 6    310.055 -441.235      -4.5   -4.5
#> 7    217.355 -393.235      -3.5   -3.5
#> 8  -1089.745 -344.735      -2.5   -2.5
#> 9   -280.145 -384.335      -1.5   -1.5
#> 10    45.455 -446.835      -0.5   -0.5
#> 11   507.055 -383.435       0.5    0.5
#> 12   567.055 -246.235       1.5    1.5
#> 13  -807.345  113.065       2.5    2.5
#> 14 -1079.145  273.965       3.5    3.5
#> 15  -633.645  371.665       4.5    4.5
#> 16  -578.245  450.565       5.5    5.5
#> 17   499.155  559.265       6.5    6.5
#> 18   591.055  782.065       7.5    7.5
#> 19  1907.855 1128.865       8.5    8.5
#> 20  1259.755 1577.865       9.5    9.5
#> 21  -609.425 -241.055      -9.5   -9.5
#> 22  -164.725 -244.355      -8.5   -8.5
#> 23   704.475 -176.755      -7.5   -7.5
#> 24  -169.925  -34.655      -6.5   -6.5
#> 25   -14.525   17.845      -5.5   -5.5
#> 26   231.075  -40.655      -4.5   -4.5
#> 27   408.675  -33.455      -3.5   -3.5
#> 28   196.775    3.845      -2.5   -2.5
#> 29    13.275    6.945      -1.5   -1.5
#> 30  -157.925  -15.755      -0.5   -0.5
#> 31  -121.625  -81.055       0.5    0.5
#> 32    95.875 -162.255       1.5    1.5
#> 33  -175.125  -30.055       2.5    2.5
#> 34  -346.025   12.045       3.5    3.5
#> 35  -304.825   56.245       4.5    4.5
#> 36  -294.425   62.945       5.5    5.5
#> 37   317.675   47.245       6.5    6.5
#> 38   187.575  149.345       7.5    7.5
#> 39    59.475  328.745       8.5    8.5
#> 40   143.675  374.845       9.5    9.5
#> 41  -770.725 -302.360      -9.5   -9.5
#> 42    74.475 -295.760      -8.5   -8.5
#> 43   861.975 -282.160      -7.5   -7.5
#> 44    98.375 -243.960      -6.5   -6.5
#> 45   314.875 -227.560      -5.5   -5.5
#> 46   190.875 -213.560      -4.5   -4.5
#> 47  -107.225 -179.260      -3.5   -3.5
#> 48  -353.325 -112.360      -2.5   -2.5
#> 49  -191.925  -80.260      -1.5   -1.5
#> 50  -254.125  -78.860      -0.5   -0.5
all.equal(modmat_FE[ , "CEOtenure"], modmat_FE[ , "CEOage"])
#> [1] TRUE