Normal scores transformation — blom • rcompanion

Normal scores transformation (Inverse normal transformation) by Elfving, Blom, van der Waerden, Tukey, and rankit methods, as well as z score transformation (standardization) and scaling to a range (normalization).

Usage

blom(
  x,
  method = "general",
  alpha = pi/8,
  complete = FALSE,
  na.last = "keep",
  na.rm = TRUE,
  adjustN = TRUE,
  min = 1,
  max = 10,
  ...
)

Arguments

x: A vector of numeric values.
method: One of "general" (the default), "blom", "vdw", "tukey", "elfving", "rankit", "zscore", or "scale".
alpha: A value used in the "general" method. If alpha=pi/8 (the default), the "general" method reduces to the "elfving" method. If alpha=3/8, the "general" method reduces to the "blom" method. If alpha=1/2, the "general" method reduces to the "rankit" method. If alpha=1/3, the "general" method reduces to the "tukey" method. If alpha=0, the "general" method reduces to the "vdw" method.
complete: If TRUE, NA values are removed before transformation. The default is FALSE.
na.last: Passed to rank in the normal scores methods. See the documentation for the rank function. The default is "keep".
na.rm: Used in the "zscore" and "scale" methods. Passed to mean, min, and max functions in those methods. The default is TRUE.
adjustN: If TRUE, the default, the normal scores methods use only non-NA values to determine the sample size, N. This seems to work well under default conditions where NA values are retained, even if there is a high percentage of NA values.
min: For the "scale" method, the minimum value of the transformed values.
max: For the "scale" method, the maximum value of the transformed values.
...: Additional arguments passed to rank.

Value

A vector of numeric values.

Details

By default, NA values are retained in the output. This behavior can be changed with the na.rm argument for "zscore" and "scale" methods, or with na.last for the normal scores methods. Or NA values can be removed from the input with complete=TRUE.

For normal scores methods, if there are NA values or tied values, it is helpful to look up the documentation for rank.
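
Because the normal scores methods rely on rank, its treatment of ties and missing values carries over to the scores. The following base R illustration shows that behavior directly; the vector x is made up for this example, and the last line only mirrors how adjustN = TRUE is described above.

```r
# Made-up data with one tie and one missing value
x <- c(10, 20, 20, NA, 5)

# na.last = "keep" leaves NA in place; ties get the average rank by default
r_keep <- rank(x, na.last = "keep")

# na.last = TRUE gives the missing value the highest rank instead
r_last <- rank(x, na.last = TRUE)

# A different tie rule, as could be passed to rank through ...
r_min <- rank(x, na.last = "keep", ties.method = "min")

# Sample size counted from non-missing values only
n_non_na <- sum(!is.na(x))

print(r_keep)    # 2.0 3.5 3.5  NA 1.0
print(r_last)    # 2.0 3.5 3.5 5.0 1.0
print(r_min)     # 2 3 3 NA 1
print(n_non_na)  # 4
```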

In general, for normal scores methods, either of the arguments method or alpha can be used. With the current algorithms, there is no need to use both.
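
To make the relationship between method and alpha concrete, here is a minimal sketch of a rank-based normal scores formula that is consistent with the alpha values listed under Arguments. The helper normal_scores_sketch is hypothetical and is not part of rcompanion; the formula qnorm((r - alpha) / (n - 2 * alpha + 1)) is an assumption about the general form and may differ in detail from what blom does internally.

```r
# Hypothetical helper, not part of rcompanion: rank-based normal scores
# using the assumed formula qnorm((r - alpha) / (n - 2 * alpha + 1)).
normal_scores_sketch <- function(x, alpha = pi / 8) {
  r <- rank(x, na.last = "keep")   # ranks, with NA kept in place
  n <- sum(!is.na(x))              # sample size from non-missing values
  qnorm((r - alpha) / (n - 2 * alpha + 1))
}

x <- c(709, 679, 699, 657, 594, 677)

elfving_s <- normal_scores_sketch(x)                 # alpha = pi/8
blom_s    <- normal_scores_sketch(x, alpha = 3 / 8)  # alpha = 3/8
rankit_s  <- normal_scores_sketch(x, alpha = 1 / 2)  # same as qnorm((r - 0.5) / n)
vdw_s     <- normal_scores_sketch(x, alpha = 0)      # same as qnorm(r / (n + 1))

# The choice of alpha changes the scores slightly but never their order
print(round(cbind(elfving_s, blom_s, rankit_s, vdw_s), 3))
```

Under this sketch, choosing a named method and choosing the matching alpha give the same result, which is why only one of the two arguments is needed.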

Normal scores transformation returns values that approximately follow a normal distribution with a mean of 0 and a standard deviation of 1, regardless of the shape of the input distribution.
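
As a quick check of the mean and standard deviation of normal scores, the sketch below transforms skewed data in base R. It uses the assumed general formula with alpha = pi/8 as a stand-in for blom(a), so it does not require the package; the variable names are made up for this example.

```r
set.seed(12345)
a <- rlnorm(100)   # skewed input, as in the Examples

# Assumed general formula with alpha = pi/8 (stand-in for blom(a))
r <- rank(a)
n <- length(a)
scores <- qnorm((r - pi / 8) / (n - 2 * (pi / 8) + 1))

score_mean <- mean(scores)  # essentially 0 when there are no ties
score_sd   <- sd(scores)    # close to 1, though not exactly 1 for finite n

print(c(mean = score_mean, sd = score_sd))
```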

The "scale" method converts values to the range specified in max and min without transforming the distribution of values. By default, the "scale" method converts values to a 1 to 10 range. Using the "scale" method with min = 0 and max = 1 is sometimes called "normalization".
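
The rescaling described here is a linear map from the observed range to the requested range. A minimal base R sketch follows; the helper rescale_sketch and its argument names are hypothetical and are not the package's code.

```r
# Hypothetical helper, not part of rcompanion: linear rescaling
# of x onto the interval [new_min, new_max].
rescale_sketch <- function(x, new_min = 1, new_max = 10) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  (x - lo) * (new_max - new_min) / (hi - lo) + new_min
}

x <- c(2, 4, 6, 10)

scaled_1_10 <- rescale_sketch(x)        # 1.00 3.25 5.50 10.00
scaled_0_1  <- rescale_sketch(x, 0, 1)  # 0.00 0.25 0.50 1.00

print(scaled_1_10)
print(scaled_0_1)
```

Because the map is linear, the relative spacing of the values is unchanged; only the location and spread differ.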

The "zscore" method converts values by the usual method for z scores: (x - mean(x)) / sd(x). The transformed values will have a mean of 0 and a standard deviation of 1 but won't be coerced into a normal distribution. Sometimes this method is called "standardization".
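
The z score formula can be checked directly in base R. The sketch below applies it to skewed data and confirms that the result is a linear function of the input, so the shape of the distribution is unchanged. Variable names are made up for this example.

```r
set.seed(12345)
a <- rlnorm(100)   # skewed input, as in the Examples

# z scores by the formula given above
z <- (a - mean(a)) / sd(a)

z_mean <- mean(z)    # essentially 0
z_sd   <- sd(z)      # 1
z_cor  <- cor(a, z)  # 1: z is a linear function of a, so any skew remains

print(c(mean = z_mean, sd = z_sd, cor = z_cor))
```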

Note

It's possible that Gustav Elfving didn't recommend the formula used in this function for the Elfving method. I would like to thank Terence Cooke at the University of Exeter for their diligence in trying to track down a reference for this formula.

References

Conover, 1995, Practical Nonparametric Statistics, 3rd ed.

Solomon & Sawilowsky, 2009, Impact of rank-based normalizing transformations on the accuracy of test scores.

Beasley and Erickson, 2009, Rank-based inverse normal transformations are increasingly used, but are they merited?

Author

Salvatore Mangiafico, mangiafico@njaes.rutgers.edu

Examples

set.seed(12345)
A = rlnorm(100)
if (FALSE) hist(A) # \dontrun{}
### Convert data to normal scores by Elfving method
B = blom(A)
if (FALSE) hist(B) # \dontrun{}
### Convert data to z scores 
C = blom(A, method="zscore")
if (FALSE) hist(C) # \dontrun{}
### Convert data to a scale of 1 to 10 
D = blom(A, method="scale")
if (FALSE) hist(D) # \dontrun{}

### Data from Sokal and Rohlf, 1995, 
### Biometry: The Principles and Practice of Statistics
### in Biological Research
Value = c(709,679,699,657,594,677,592,538,476,508,505,539)
Sex   = c(rep("Male",3), rep("Female",3), rep("Male",3), rep("Female",3))
Fat   = c(rep("Fresh", 6), rep("Rancid", 6))
ValueBlom = blom(Value)
Sokal = data.frame(ValueBlom, Sex, Fat)
model = lm(ValueBlom ~ Sex * Fat, data=Sokal)
anova(model)
#> Analysis of Variance Table
#> 
#> Response: ValueBlom
#>           Df Sum Sq Mean Sq F value    Pr(>F)    
#> Sex        1 0.5399  0.5399  2.0932 0.1859728    
#> Fat        1 6.7936  6.7936 26.3374 0.0008939 ***
#> Sex:Fat    1 0.5938  0.5938  2.3022 0.1676690    
#> Residuals  8 2.0636  0.2579                      
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
if (FALSE) { # \dontrun{
hist(residuals(model))
plot(predict(model), residuals(model))
} # }