Fast (Weighted) Sample Quantiles and Range

A faster alternative to quantile (written fully in C), that supports sampling weights, and can also quickly compute quantiles from an ordering vector (e.g. order(x)). frange provides a fast alternative to range.

fquantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), w = NULL,
          o = if(length(x) > 1e5L && length(probs) > log(length(x)))
              radixorder(x) else NULL,
          na.rm = .op[["na.rm"]], type = 7L, names = TRUE,
          check.o = is.null(attr(o, "sorted")))

# Programmers version: no names, intelligent defaults, or checks
.quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), w = NULL, o = NULL,
          na.rm = TRUE, type = 7L, names = FALSE, check.o = FALSE)

# Fast range (min and max)
frange(x, na.rm = .op[["na.rm"]], finite = FALSE)
.range(x, na.rm = TRUE, finite = FALSE)

Arguments

x: a numeric or integer vector.
probs: numeric vector of probabilities with values in [0,1].
w: a numeric vector of strictly positive sampling weights. Missing weights are only supported if x is also missing.
o: integer. An vector giving the ordering of the elements in x, such that identical(x[o], sort(x)). If available this considerably speeds up the estimation.
na.rm: logical. Remove missing values, default TRUE.
finite: logical. Omit all non-finite values.
type: integer. Quantile types 4-9. See quantile. Further details are provided in Hyndman and Fan (1996) who recommended type 8. The default method is type 7.
names: logical. Generates names of the form paste0(round(probs * 100, 1), "%") (in C). Set to FALSE for speedup.
check.o: logical. If o is supplied, TRUE runs through o once and checks that it is valid, i.e. that each element is in [1, length(x)]. Set to FALSE for significant speedup if o is known to be valid.

Details

fquantile is implemented using a quickselect algorithm in C, inspired by data.table's gmedian. The algorithm is applied incrementally to different sections of the array to find individual quantiles. If many quantile probabilities are requested, sorting the whole array with the fast radixorder algorithm is more efficient. The default threshold for this (length(x) > 1e5L && length(probs) > log(length(x))) is conservative, given that quickselect is generally more efficient on longitudinal data with similar values repeated by groups. With random data, my investigations yield that a threshold of length(probs) > log10(length(x)) would be more appropriate.

frange is considerably more efficient than range, requiring only one pass through the data instead of two. For probabilities 0 and 1, fquantile internally calls frange.

Following Hyndman and Fan (1996), the quantile type-$i$ quantile function of the sample $X$ can be written as a weighted average of two order statistics:

$$\hat{Q}_{X,i}(p) = (1 - \gamma) X_{(j)} + \gamma X_{(j + 1)}$$

where $j = \lfloor pn + m \rfloor,\ m \in \mathbb{R}$ and $\gamma = pn + m - j,\ 0 \le \gamma \le 1$, with $m$ differing by quantile type ($i$). For example, the default type 7 quantile estimator uses $m = 1 - p$, see quantile.

For weighted data with normalized weights $w = \{w_1, ..., w_n\}$, where $w_k > 0$ and $\sum_k w_k = 1$, let $\{w_{(1)}, ..., w_{(n)}\}$ be the weights for each order statistic and $W_{(k)} = \operatorname{Weight}[X_j \le X_{(k)}] = \sum_{j=1}^k w_{(j)}$ the cumulative weight for each order statistic.

We can then first find the largest value $l$ such that the cumulative normalized weight $W_{(l)} \leq p$, and replace $pn$ with $l + (p - W_{(l)})/w_{(l+1)}$, where $w_{(l+1)}$ is the weight of the next observation. This gives:

$$j = \lfloor l + \frac{p - W_{(l)}}{w_{(l+1)}} + m \rfloor$$ $$\gamma = l + \frac{p - W_{(l)}}{w_{(l+1)}} + m - j$$

For a more detailed exposition see these excellent notes by Matthew Kay. See also the R implementation of weighted quantiles type 7 in the Examples below.

Note

The new weighted quantile algorithm from v2.1.0 does not skip zero weights anymore as this is technically very difficult (it is not clear if $j$ hits a zero weight element whether one should move forward or backward to find an alternative). Thus, all non-missing elements are considered and weights should be strictly positive.

Value

A vector of quantiles. If names = TRUE, fquantile generates names as paste0(round(probs * 100, 1), "%") (in C).

Author

Sebastian Krantz based on notes by Matthew Kay.

References

Hyndman, R. J. and Fan, Y. (1996) Sample quantiles in statistical packages, American Statistician 50, 361–365. doi:10.2307/2684934.

Wicklin, R. (2017) Sample quantiles: A comparison of 9 definitions; SAS Blog. https://blogs.sas.com/content/iml/2017/05/24/definitions-sample-quantiles.html

Wikipedia: https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample

Weighted Quantiles by Matthew Kay: https://htmlpreview.github.io/?https://github.com/mjskay/uncertainty-examples/blob/master/weighted-quantiles.html

Examples

## Basic range and quantiles
frange(mtcars$mpg)
#> [1] 10.4 33.9
fquantile(mtcars$mpg)
#>     0%    25%    50%    75%   100% 
#> 10.400 15.425 19.200 22.800 33.900 

## Checking computational equivalence to stats::quantile()
w = alloc(abs(rnorm(1)), 32)
o = radixorder(mtcars$mpg)
for (i in 5:9) print(all_obj_equal(fquantile(mtcars$mpg, type = i),
                                   fquantile(mtcars$mpg, type = i, w = w),
                                   fquantile(mtcars$mpg, type = i, o = o),
                                   fquantile(mtcars$mpg, type = i, w = w, o = o),
                                    quantile(mtcars$mpg, type = i)))
#> [1] TRUE
#> [1] TRUE
#> [1] TRUE
#> [1] TRUE
#> [1] TRUE

## Demonstaration: weighted quantiles type 7 in R
wquantile7R <- function(x, w, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = TRUE) {
  if(na.rm && anyNA(x)) {             # Removing missing values (only in x)
    cc = whichNA(x, invert = TRUE)    # The C code first calls radixorder(x), which places
    x = x[cc]; w = w[cc]              # missing values last, so removing = early termination
  }
  o = radixorder(x)                   # Ordering
  wo = proportions(w[o])
  Wo = cumsum(wo)                     # Cumulative sum
  res = sapply(probs, function(p) {
    l = which.max(Wo > p) - 1L        # Lower order statistic
    s = l + (p - Wo[l])/wo[l+1L] + 1 - p
    j = floor(s)
    gamma = s - j
    (1 - gamma) * x[o[j]] + gamma * x[o[j+1L]]  # Weighted quantile
  })
  if(names) names(res) = paste0(as.integer(probs * 100), "%")
  res
} # Note: doesn't work for min and max.

wquantile7R(mtcars$mpg, mtcars$wt)
#>      25%      50%      75% 
#> 15.07936 17.89174 21.40000 

all.equal(wquantile7R(mtcars$mpg, mtcars$wt),
          fquantile(mtcars$mpg, c(0.25, 0.5, 0.75), mtcars$wt))
#> [1] TRUE

## Efficient grouped quantile estimation: use .quantile for less call overhead
BY(mtcars$mpg, mtcars$cyl, .quantile, names = TRUE, expand.wide = TRUE)
#>     0%   25%  50%   75% 100%
#> 4 21.4 22.80 26.0 30.40 33.9
#> 6 17.8 18.65 19.7 21.00 21.4
#> 8 10.4 14.40 15.2 16.25 19.2
BY(mtcars, mtcars$cyl, .quantile, names = TRUE)
#>          mpg cyl   disp     hp  drat      wt    qsec vs  am gear carb
#> 4.0%   21.40   4  71.10  52.00 3.690 1.51300 16.7000  0 0.0  3.0 1.00
#> 4.25%  22.80   4  78.85  65.50 3.810 1.88500 18.5600  1 0.5  4.0 1.00
#> 4.50%  26.00   4 108.00  91.00 4.080 2.20000 18.9000  1 1.0  4.0 2.00
#> 4.75%  30.40   4 120.65  96.00 4.165 2.62250 19.9500  1 1.0  4.0 2.00
#> 4.100% 33.90   4 146.70 113.00 4.930 3.19000 22.9000  1 1.0  5.0 2.00
#> 6.0%   17.80   6 145.00 105.00 2.760 2.62000 15.5000  0 0.0  3.0 1.00
#> 6.25%  18.65   6 160.00 110.00 3.350 2.82250 16.7400  0 0.0  3.5 2.50
#> 6.50%  19.70   6 167.60 110.00 3.900 3.21500 18.3000  1 0.0  4.0 4.00
#> 6.75%  21.00   6 196.30 123.00 3.910 3.44000 19.1700  1 1.0  4.0 4.00
#> 6.100% 21.40   6 258.00 175.00 3.920 3.46000 20.2200  1 1.0  5.0 6.00
#> 8.0%   10.40   8 275.80 150.00 2.760 3.17000 14.5000  0 0.0  3.0 2.00
#> 8.25%  14.40   8 301.75 176.25 3.070 3.53250 16.0975  0 0.0  3.0 2.25
#> 8.50%  15.20   8 350.50 192.50 3.115 3.75500 17.1750  0 0.0  3.0 3.50
#> 8.75%  16.25   8 390.00 241.25 3.225 4.01375 17.5550  0 0.0  3.0 4.00
#> 8.100% 19.20   8 472.00 335.00 4.220 5.42400 18.0000  0 1.0  5.0 8.00
mtcars |> fgroup_by(cyl) |> BY(.quantile)
#>    cyl   mpg   disp     hp  drat      wt    qsec vs  am gear carb
#> 1    4 21.40  71.10  52.00 3.690 1.51300 16.7000  0 0.0  3.0 1.00
#> 2    4 22.80  78.85  65.50 3.810 1.88500 18.5600  1 0.5  4.0 1.00
#> 3    4 26.00 108.00  91.00 4.080 2.20000 18.9000  1 1.0  4.0 2.00
#> 4    4 30.40 120.65  96.00 4.165 2.62250 19.9500  1 1.0  4.0 2.00
#> 5    4 33.90 146.70 113.00 4.930 3.19000 22.9000  1 1.0  5.0 2.00
#> 6    6 17.80 145.00 105.00 2.760 2.62000 15.5000  0 0.0  3.0 1.00
#> 7    6 18.65 160.00 110.00 3.350 2.82250 16.7400  0 0.0  3.5 2.50
#> 8    6 19.70 167.60 110.00 3.900 3.21500 18.3000  1 0.0  4.0 4.00
#> 9    6 21.00 196.30 123.00 3.910 3.44000 19.1700  1 1.0  4.0 4.00
#> 10   6 21.40 258.00 175.00 3.920 3.46000 20.2200  1 1.0  5.0 6.00
#> 11   8 10.40 275.80 150.00 2.760 3.17000 14.5000  0 0.0  3.0 2.00
#> 12   8 14.40 301.75 176.25 3.070 3.53250 16.0975  0 0.0  3.0 2.25
#> 13   8 15.20 350.50 192.50 3.115 3.75500 17.1750  0 0.0  3.0 3.50
#> 14   8 16.25 390.00 241.25 3.225 4.01375 17.5550  0 0.0  3.0 4.00
#> 15   8 19.20 472.00 335.00 4.220 5.42400 18.0000  0 1.0  5.0 8.00

## With weights
BY(mtcars$mpg, mtcars$cyl, .quantile, w = mtcars$wt, names = TRUE, expand.wide = TRUE)
#>     0%      25%      50%      75% 100%
#> 4 21.4 22.80000 24.53116 29.75889 33.9
#> 6 17.8 18.46561 19.55289 21.00000 21.4
#> 8 10.4 13.91543 15.15267 16.11036 19.2
BY(mtcars, mtcars$cyl, .quantile, w = mtcars$wt, names = TRUE)
#>             mpg cyl     disp        hp     drat       wt     qsec vs am
#> 4.0%   21.40000   4  71.1000  52.00000 3.690000 1.513000 16.70000  0  0
#> 4.25%  22.80000   4  80.2647  65.55695 3.783351 2.023886 18.60116  1  0
#> 4.50%  24.53116   4 119.7122  91.67897 3.996622 2.330844 19.25457  1  1
#> 4.75%  29.75889   4 126.2910  95.88316 4.109992 2.878872 20.00040  1  1
#> 4.100% 33.90000   4 146.7000 113.00000 4.930000 3.190000 22.90000  1  1
#> 6.0%   17.80000   6 145.0000 105.00000 2.760000 2.620000 15.50000  0  0
#> 6.25%  18.46561   6 160.0000 110.00000 3.280086 2.851124 16.89266  0  0
#> 6.50%  19.55289   6 167.6000 111.40513 3.900000 3.287609 18.46134  1  0
#> 6.75%  21.00000   6 202.1893 123.00000 3.913285 3.440000 19.23991  1  1
#> 6.100% 21.40000   6 258.0000 175.00000 3.920000 3.460000 20.22000  1  1
#> 8.0%   10.40000   8 275.8000 150.00000 2.760000 3.170000 14.50000  0  0
#> 8.25%  13.91543   8 302.2813 178.43433 3.049961 3.563554 16.56843  0  0
#> 8.50%  15.15267   8 352.3188 202.85476 3.085425 3.806500 17.33849  0  0
#> 8.75%  16.11036   8 425.1300 234.39181 3.222213 4.819412 17.67353  0  0
#> 8.100% 19.20000   8 472.0000 335.00000 4.220000 5.424000 18.00000  0  1
#>            gear     carb
#> 4.0%   3.000000 1.000000
#> 4.25%  4.000000 1.000000
#> 4.50%  4.000000 2.000000
#> 4.75%  4.000000 2.000000
#> 4.100% 5.000000 2.000000
#> 6.0%   3.000000 1.000000
#> 6.25%  3.397399 2.192197
#> 6.50%  4.000000 4.000000
#> 6.75%  4.000000 4.000000
#> 6.100% 5.000000 6.000000
#> 8.0%   3.000000 2.000000
#> 8.25%  3.000000 2.686866
#> 8.50%  3.000000 4.000000
#> 8.75%  3.000000 4.000000
#> 8.100% 5.000000 8.000000
mtcars |> fgroup_by(cyl) |> fselect(-wt) |> BY(.quantile, w = mtcars$wt)
#>    cyl      mpg     disp        hp     drat     qsec vs am     gear     carb
#> 1    4 21.40000  71.1000  52.00000 3.690000 16.70000  0  0 3.000000 1.000000
#> 2    4 22.80000  80.2647  65.55695 3.783351 18.60116  1  0 4.000000 1.000000
#> 3    4 24.53116 119.7122  91.67897 3.996622 19.25457  1  1 4.000000 2.000000
#> 4    4 29.75889 126.2910  95.88316 4.109992 20.00040  1  1 4.000000 2.000000
#> 5    4 33.90000 146.7000 113.00000 4.930000 22.90000  1  1 5.000000 2.000000
#> 6    6 17.80000 145.0000 105.00000 2.760000 15.50000  0  0 3.000000 1.000000
#> 7    6 18.46561 160.0000 110.00000 3.280086 16.89266  0  0 3.397399 2.192197
#> 8    6 19.55289 167.6000 111.40513 3.900000 18.46134  1  0 4.000000 4.000000
#> 9    6 21.00000 202.1893 123.00000 3.913285 19.23991  1  1 4.000000 4.000000
#> 10   6 21.40000 258.0000 175.00000 3.920000 20.22000  1  1 5.000000 6.000000
#> 11   8 10.40000 275.8000 150.00000 2.760000 14.50000  0  0 3.000000 2.000000
#> 12   8 13.91543 302.2813 178.43433 3.049961 16.56843  0  0 3.000000 2.686866
#> 13   8 15.15267 352.3188 202.85476 3.085425 17.33849  0  0 3.000000 4.000000
#> 14   8 16.11036 425.1300 234.39181 3.222213 17.67353  0  0 3.000000 4.000000
#> 15   8 19.20000 472.0000 335.00000 4.220000 18.00000  0  1 5.000000 8.000000
mtcars |> fgroup_by(cyl) |> fsummarise(across(-wt, .quantile, w = wt))
#>    cyl      mpg     disp        hp     drat     qsec vs am     gear     carb
#> 1    4 21.40000  71.1000  52.00000 3.690000 16.70000  0  0 3.000000 1.000000
#> 2    4 22.80000  80.2647  65.55695 3.783351 18.60116  1  0 4.000000 1.000000
#> 3    4 24.53116 119.7122  91.67897 3.996622 19.25457  1  1 4.000000 2.000000
#> 4    4 29.75889 126.2910  95.88316 4.109992 20.00040  1  1 4.000000 2.000000
#> 5    4 33.90000 146.7000 113.00000 4.930000 22.90000  1  1 5.000000 2.000000
#> 6    6 17.80000 145.0000 105.00000 2.760000 15.50000  0  0 3.000000 1.000000
#> 7    6 18.46561 160.0000 110.00000 3.280086 16.89266  0  0 3.397399 2.192197
#> 8    6 19.55289 167.6000 111.40513 3.900000 18.46134  1  0 4.000000 4.000000
#> 9    6 21.00000 202.1893 123.00000 3.913285 19.23991  1  1 4.000000 4.000000
#> 10   6 21.40000 258.0000 175.00000 3.920000 20.22000  1  1 5.000000 6.000000
#> 11   8 10.40000 275.8000 150.00000 2.760000 14.50000  0  0 3.000000 2.000000
#> 12   8 13.91543 302.2813 178.43433 3.049961 16.56843  0  0 3.000000 2.686866
#> 13   8 15.15267 352.3188 202.85476 3.085425 17.33849  0  0 3.000000 4.000000
#> 14   8 16.11036 425.1300 234.39181 3.222213 17.67353  0  0 3.000000 4.000000
#> 15   8 19.20000 472.0000 335.00000 4.220000 18.00000  0  1 5.000000 8.000000