combine.levels.Rd
Combine Infrequent Levels of a Categorical Variable
combine.levels(
x,
minlev = 0.05,
m,
ord = is.ordered(x),
plevels = FALSE,
sep = ","
)
a factor, `ordered` factor, or numeric or character variable that will be turned into a `factor`
the minimum proportion of observations a cell must contain to avoid being combined with one or more other cells. If more than one cell has fewer than `minlev*n` observations, all such cells are combined into a new cell labeled `"OTHER"`. If only one cell falls below that threshold, it is combined with the next lowest frequency cell, and the new level name is the combination of the two old level names. When `ord=TRUE`, combinations happen only for consecutive levels.
an alternative to `minlev`: the minimum number of observations in a cell. Cells with fewer than `m` observations are combined with others.
set to `TRUE` to treat `x` as if it were an ordered factor, which allows only consecutive levels to be combined
by default, `combine.levels` pools low-frequency levels into a category named `"OTHER"` when `x` is not ordered and `ord=FALSE`. To instead name this category by concatenating all the pooled level names, separated by `sep` (a comma by default), set `plevels=TRUE`.
the separator for concatenating levels when `plevels=TRUE`
a factor variable or, if `ord=TRUE`, an ordered factor variable
After turning `x` into a `factor` if it is not one already, combines levels of `x` whose frequency falls below a specified relative frequency `minlev` or absolute count `m`. When `x` is not treated as ordered, all of the low-frequency levels are combined into `"OTHER"`, unless `plevels=TRUE`. When `ord=TRUE` or `x` is an ordered factor, only consecutive levels are combined, and new levels are constructed by concatenating the old level names with `sep` as a separator. This is useful when comparing ordinal regression with polytomous (multinomial) regression and there are too many categories for polytomous regression. `combine.levels` is also useful when assumptions of ordinal models are being checked empirically by computing exceedance probabilities for various cutoffs of the dependent variable.
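The unordered pooling rule described above can be sketched in base R. This is only an illustration of the documented behavior, not the package's implementation: `pool_rare_levels` is a hypothetical helper name, it accepts only an absolute count `m`, and it deliberately leaves `x` unchanged when fewer than two levels fall below `m`.

```r
# Hypothetical helper (not part of any package): pool every level of an
# unordered factor whose count is below m into one new level.
pool_rare_levels <- function(x, m, plevels = FALSE, sep = ",") {
  x <- as.factor(x)
  counts <- table(x)                          # NA values are not counted
  rare <- names(counts)[as.vector(counts) < m]
  # Assumption of this sketch: the single-rare-level case is not handled
  if (length(rare) < 2) return(x)
  pooled_name <- if (plevels) paste(rare, collapse = sep) else "OTHER"
  keep <- setdiff(levels(x), rare)            # levels that stay as they are
  chr <- as.character(x)
  chr[chr %in% rare] <- pooled_name           # relabel the rare observations
  factor(chr, levels = c(pooled_name, keep))  # pooled level listed first
}

x <- c(rep("A", 1), rep("B", 3), rep("C", 4), rep("D", 1), rep("E", 1))
levels(pool_rare_levels(x, m = 3))
levels(pool_rare_levels(x, m = 3, plevels = TRUE))
```

Building the result with `factor(chr, levels = ...)` makes the order of the new levels explicit, with the pooled level first and the retained levels in their original order.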
x <- c(rep('A', 1), rep('B', 3), rep('C', 4), rep('D',1), rep('E',1))
combine.levels(x, m=3)
#> [1] OTHER B B B C C C C OTHER OTHER
#> Levels: OTHER B C
combine.levels(x, m=3, plevels=TRUE)
#> [1] A,D,E B B B C C C C A,D,E A,D,E
#> Levels: A,D,E B C
combine.levels(x, ord=TRUE, m=3)
#> [1] A,B A,B A,B A,B C,D,E C,D,E C,D,E C,D,E C,D,E C,D,E
#> Levels: A,B < C,D,E
x <- c(rep('A', 1), rep('B', 3), rep('C', 4), rep('D',1), rep('E',1),
       rep('F',1))
combine.levels(x, ord=TRUE, m=3)
#> [1] A,B A,B A,B A,B C C C C D,E,F D,E,F D,E,F
#> Levels: A,B < C < D,E,F
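Both `ord=TRUE` results above are consistent with a simple greedy pass: walk the levels in order, accumulate consecutive levels until the group holds at least `m` observations, and fold any undersized trailing group into the group before it. The base R sketch below implements that rule. It is an assumption inferred from these two examples, not taken from the package source, so `combine.levels` may behave differently on other inputs; `merge_consecutive_levels` is a hypothetical name.

```r
# Hypothetical helper (not part of any package): greedily merge consecutive
# levels, in level order, until each group has at least m observations.
merge_consecutive_levels <- function(x, m, sep = ",") {
  x <- as.factor(x)
  lev <- levels(x)
  counts <- as.vector(table(x))
  group <- integer(length(lev))    # group index for each original level
  g <- 1L
  running <- 0L
  for (i in seq_along(lev)) {
    group[i] <- g
    running <- running + counts[i]
    if (running >= m) {            # group is large enough: close it
      g <- g + 1L
      running <- 0L
    }
  }
  # A trailing group that never reached m joins the previous group
  if (running > 0L && g > 1L) group[group == g] <- g - 1L
  # Name each group by concatenating the names of its member levels
  new_names <- unname(vapply(split(lev, group), paste, character(1),
                             collapse = sep))
  per_level <- new_names[group]    # new name for each original level
  factor(per_level[as.integer(x)], levels = new_names, ordered = TRUE)
}

x <- c(rep("A", 1), rep("B", 3), rep("C", 4), rep("D", 1), rep("E", 1))
levels(merge_consecutive_levels(x, m = 3))
levels(merge_consecutive_levels(c(x, "F"), m = 3))
```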