combine.levels • Hmisc

Combine Infrequent Levels of a Categorical Variable

combine.levels(
  x,
  minlev = 0.05,
  m,
  ord = is.ordered(x),
  plevels = FALSE,
  sep = ","
)

Arguments

x: a factor, `ordered` factor, or numeric or character variable that will be turned into a `factor`
minlev: the minimum proportion of observations in a cell before that cell is combined with one or more other cells. If more than one cell has fewer than minlev*n observations, all such cells are combined into a new cell labeled `"OTHER"`. If only one cell is that small, the lowest frequency cell is combined with the next lowest frequency cell, and the new level name is the combination of the two old level names. When `ord=TRUE`, combinations happen only for consecutive levels.
m: an alternative to `minlev`; the minimum number of observations in a cell before it will be combined with others
ord: set to `TRUE` to treat `x` as if it were an ordered factor, which allows only consecutive levels to be combined
plevels: by default `combine.levels` pools low-frequency levels into a category named `OTHER` when `x` is not ordered and `ord=FALSE`. To instead name this category the concatenation of all the pooled level names, separated by `sep` (a comma by default), set `plevels=TRUE`.
sep: the separator for concatenating levels when `plevels=TRUE`
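
The two thresholds, `minlev` (a proportion) and `m` (a count), can be compared on the toy data from the Examples section. The following is a minimal base R sketch, not Hmisc code: it assumes there are no missing values, takes `n` to be `length(x)`, and uses an illustrative value of `minlev = 0.25` that does not appear elsewhere on this page.

```r
# Toy data from the Examples section: counts are A=1, B=3, C=4, D=1, E=1
x <- c(rep("A", 1), rep("B", 3), rep("C", 4), rep("D", 1), rep("E", 1))

counts <- table(x)     # observations per level
n      <- length(x)    # total number of observations (10); assumes no NAs

minlev <- 0.25         # illustrative proportion threshold (assumption)
m      <- 3            # count threshold, as used in the Examples

# A level is "small" when it has fewer than minlev*n (or fewer than m) observations
rare_by_prop  <- names(counts)[counts < minlev * n]   # fewer than 2.5 observations
rare_by_count <- names(counts)[counts < m]            # fewer than 3 observations

print(rare_by_prop)    # "A" "D" "E"
print(rare_by_count)   # "A" "D" "E"
```

Here both thresholds flag the same three levels; `B`, with exactly 3 observations, is not below `m = 3` and so is left alone.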

Value

a factor variable, or if `ord=TRUE` an ordered factor variable

Details

After turning `x` into a `factor` if it is not one already, combines levels of `x` whose frequency falls below a specified relative frequency `minlev` or absolute count `m`. When `x` is not treated as ordered, all of the small frequency levels are combined into `"OTHER"`, unless `plevels=TRUE`. When `ord=TRUE` or `x` is an ordered factor, only consecutive levels are combined. New levels are constructed by concatenating the levels with `sep` as a separator. This is useful when comparing ordinal regression with polytomous (multinomial) regression and there are too many categories for polytomous regression. `combine.levels` is also useful when assumptions of ordinal models are being checked empirically by computing exceedance probabilities for various cutoffs of the dependent variable.
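
The unordered pooling rule described above can be illustrated in base R. This is a rough sketch under stated assumptions, not the Hmisc implementation: `pool_rare` is a hypothetical helper, it only takes a count threshold `m`, it ignores missing values, and it covers only the case where two or more levels are small (the single-small-level case and the `ord=TRUE` case are omitted).

```r
# Hypothetical helper: pool all levels with fewer than m observations
pool_rare <- function(x, m, plevels = FALSE, sep = ",") {
  x <- as.factor(x)
  counts <- table(x)
  rare <- names(counts)[counts < m]

  # Sketch only: leave x unchanged unless several levels are small
  if (length(rare) < 2) return(x)

  # Name of the pooled level: "OTHER", or the concatenated old names
  pooled <- if (plevels) paste(rare, collapse = sep) else "OTHER"

  # Assigning duplicated level names merges those levels into one
  lev <- levels(x)
  lev[lev %in% rare] <- pooled
  levels(x) <- lev
  x
}

x <- c(rep("A", 1), rep("B", 3), rep("C", 4), rep("D", 1), rep("E", 1))
levels(pool_rare(x, m = 3))                  # "OTHER" "B" "C"
levels(pool_rare(x, m = 3, plevels = TRUE))  # "A,D,E" "B" "C"
```

On this input the sketch yields the same levels as the first two calls in the Examples section; it relies on the base R behavior that `levels<-` merges levels given the same name.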

Author

Frank Harrell

Examples

x <- c(rep('A', 1), rep('B', 3), rep('C', 4), rep('D',1), rep('E',1))
combine.levels(x, m=3)
#>  [1] OTHER B     B     B     C     C     C     C     OTHER OTHER
#> Levels: OTHER B C
combine.levels(x, m=3, plevels=TRUE)
#>  [1] A,D,E B     B     B     C     C     C     C     A,D,E A,D,E
#> Levels: A,D,E B C
combine.levels(x, ord=TRUE, m=3)
#>  [1] A,B   A,B   A,B   A,B   C,D,E C,D,E C,D,E C,D,E C,D,E C,D,E
#> Levels: A,B < C,D,E
x <- c(rep('A', 1), rep('B', 3), rep('C', 4), rep('D',1), rep('E',1),
       rep('F',1))
combine.levels(x, ord=TRUE, m=3)
#>  [1] A,B   A,B   A,B   A,B   C     C     C     C     D,E,F D,E,F D,E,F
#> Levels: A,B < C < D,E,F