negbinomial.Rd

Maximum likelihood estimation of the two parameters of a negative
binomial distribution.
negbinomial(zero = "size", parallel = FALSE, deviance.arg = FALSE,
type.fitted = c("mean", "quantiles"),
percentiles = c(25, 50, 75), vfl = FALSE,
mds.min = 1e-3, nsimEIM = 500, cutoff.prob = 0.999,
eps.trig = 1e-7, max.support = 4000, max.chunk.MB = 30,
lmu = "loglink", lsize = "loglink",
imethod = 1, imu = NULL, iprobs.y = NULL,
gprobs.y = ppoints(6), isize = NULL,
gsize.mux = exp(c(-30, -20, -15, -10, -6:3)))
polya(zero = "size", type.fitted = c("mean", "prob"),
mds.min = 1e-3, nsimEIM = 500, cutoff.prob = 0.999,
eps.trig = 1e-7, max.support = 4000, max.chunk.MB = 30,
lprob = "logitlink", lsize = "loglink", imethod = 1, iprob = NULL,
iprobs.y = NULL, gprobs.y = ppoints(6), isize = NULL,
gsize.mux = exp(c(-30, -20, -15, -10, -6:3)), imunb = NULL)
polyaR(zero = "size", type.fitted = c("mean", "prob"),
mds.min = 1e-3, nsimEIM = 500, cutoff.prob = 0.999,
eps.trig = 1e-7, max.support = 4000, max.chunk.MB = 30,
lsize = "loglink", lprob = "logitlink", imethod = 1, iprob = NULL,
iprobs.y = NULL, gprobs.y = ppoints(6), isize = NULL,
gsize.mux = exp(c(-30, -20, -15, -10, -6:3)), imunb = NULL)

Can be an integer-valued vector, and if so, then
it is usually assigned \(-2\)
or \(2\). Specifies which of the two
linear/additive predictors are modelled as an intercept
only. By default, the \(k\) parameter (after lsize
is applied) is modelled as a single unknown number that
is estimated. It can be modelled as a function of the
explanatory variables by setting zero = NULL; this
has been called a NB-H model by Hilbe (2011). A negative
value means that the value is recycled, so setting \(-2\)
means all \(k\) are intercept-only.
See CommonVGAMffArguments for more information.
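For illustration, a minimal sketch (the data frame ndata with
response y1 and covariate x2 is hypothetical here, similar to
Example 2 below):

fit0 <- vglm(y1 ~ x2, negbinomial, data = ndata)  # size is intercept-only
fit1 <- vglm(y1 ~ x2, negbinomial(zero = NULL), data = ndata)  # NB-H
coef(fit1, matrix = TRUE)  # x2 row no longer constrained to 0 for loglink(size)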
Link functions applied to the \(\mu\), \(k\)
and \(p\) parameters.
See Links for more choices.
Note that the \(\mu\), \(k\)
and \(p\) parameters are the mu,
size and prob arguments of
rnbinom respectively.
Common alternatives for lsize are
negloglink and
reciprocallink, and
logloglink (if \(k > 1\)).
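For instance, a small sketch of two such alternatives (both are
standard choices in Links):

negbinomial(lsize = "negloglink")      # eta2 = -log(k) = log(alpha), alpha = 1/k
negbinomial(lsize = "reciprocallink")  # eta2 = 1/k = alpha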
Optional initial values for the mean, \(k\) and \(p\).
For \(k\), if failure to converge occurs then try different values
(and/or use imethod).
For an \(S\)-column response, isize can be of length \(S\).
A value NULL means an initial value for each response is
computed internally using a gridsearch based on gsize.mux.
The last argument is ignored if used within cqo; see
the iKvector argument of qrrvglm.control instead.
In the future isize and iprob might be deprecated.
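A minimal sketch of supplying isize directly (the values here are
arbitrary):

negbinomial(isize = 2)           # one common initial k for all responses
negbinomial(isize = c(0.5, 2))   # one value per column of a 2-column response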
This argument is used
for computing the diagonal element of the
expected information matrix (EIM) corresponding to \(k\)
based on the simulated Fisher scoring (SFS) algorithm.
See CommonVGAMffArguments for more information
and the notes below.
SFS is one of two algorithms for computing the EIM elements
(so that both algorithms may be used on a given data set).
SFS is faster than the exact method when Qmax is large.
Fed into the p argument
of qnbinom
in order to obtain an upper limit for the approximate
support of the distribution, called Qmax, say.
Similarly, the value 1-p is
fed into the p argument
of qnbinom
in order to obtain a lower limit for the approximate
support of the distribution, called Qmin, say.
Hence the approximate support is Qmin:Qmax.
This argument should be
a numeric and close to 1 but never exactly 1.
Used to specify how many terms of the infinite series
for computing the second diagonal element of the
EIM are actually used.
The closer this argument is to 1, the more accurate the
standard errors of the regression coefficients will be.
If this argument is too small, convergence will take longer.
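As a rough sketch of the idea (the internal code differs in detail),
the approximate support for an observation with hypothetical values
mu = 5 and k = 0.5 could be obtained as follows:

cutoff.prob <- 0.999
Qmax <- qnbinom(cutoff.prob,     size = 0.5, mu = 5)
Qmin <- qnbinom(1 - cutoff.prob, size = 0.5, mu = 5)
Qmin:Qmax  # approximate support; its length is compared against max.support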
max.support is used to describe the eligibility of
individual observations
to have their EIM computed by the exact method.
Here, we are concerned about
computing the EIM wrt \(k\).
The exact method algorithm operates separately on each response
variable,
and it constructs a large matrix provided that the number of
columns is less than max.support.
If so, then the computations are done in chunks, so
that no more than about max.chunk.MB megabytes
of memory is used at a time (actually, it is proportional to
this amount). Regarding eligibility for this algorithm, each
observation's approximate support, i.e., the vector running from
the 1-cutoff.prob quantile up to the cutoff.prob quantile,
must have length less than max.support.
If you have abundant memory then you might try setting
max.chunk.MB = Inf, but then the computations might take
a very long time.
Setting max.chunk.MB = 0 or max.support = 0
will force the EIM to be computed using the SFS algorithm only
(this used to be the default method for all
the observations). When the fitted values of the model are
large and \(k\) is small, the computation of the EIM will be
costly with respect to time and memory if the exact method is
used. Hence the argument max.support limits the cost in
terms of time. For intercept-only models max.support
is multiplied by a number (such as 10) because only one inner
product needs to be computed. Note: max.support is an
upper bound and limits the number of terms dictated by the
eps.trig argument.
Numeric.
Minimum value of the NBD mean divided by size parameter.
The closer this ratio is to 0, the closer the distribution is
to a Poisson.
Iterations will stop when an estimate of \(k\) is so large,
relative to the mean, that this ratio falls below the threshold
(this is treated as a boundary of the parameter space).
Logical.
Fit the Variance–variance Factorized Loglinear (VFL) model?
If TRUE then the constraint matrix
rbind(0, -1) is assigned to all covariates
which are not parallel.
Hence parallel must be used
in conjunction with this argument
to specify the set of covariates used for
modelling the mean.
Note that the constraint matrix for the
intercept should be parallel too.
The overall
resulting parameterization is the same
as Evans (1953).
Some general information is at
CommonVGAMffArguments.
Numeric.
A small positive value used in the computation of the EIMs.
It focusses on the denominator of the terms of a series.
Each term in the series (that is used to approximate an infinite
series) has a value greater than size / sqrt(eps.trig),
thus very small terms are ignored. It's a good idea to set
a smaller value that will result in more accuracy, but it
will require a greater computing time (when \(k\) is close
to 0). An adjustment to max.support may also be needed.
In particular, the quantity computed by special means is
\(\psi'(k) - E[\psi'(Y+k)]\),
which is the difference between two trigamma
functions. It is part of the calculation of the EIM with respect
to the size parameter.
Similar to gsigma in CommonVGAMffArguments.
However, this grid is multiplied by the initial
estimates of the NBD mean parameter.
That is, it is on a relative scale rather than on an
absolute scale.
If the counts are very large in value then convergence failure might
occur; if so, then try a smaller value such as
gsize.mux = exp(-40).
See CommonVGAMffArguments for more information.
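As a simplified sketch of the relative scaling (the actual
gridsearch is more involved), the candidate values of \(k\) are
roughly the grid times the initial mean estimate:

gsize.mux <- exp(c(-30, -20, -15, -10, -6:3))  # the default grid
mu.init <- 4                 # hypothetical initial estimate of the mean
gsize.mux * mu.init          # candidate k values tried in the gridsearch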
Logical.
If TRUE, the deviance is computed after convergence.
It only works in the NB-2 model.
It is also necessary to set criterion = "coefficients"
or half.step = FALSE
since
one cannot use that criterion properly for the minimization
within the IRLS algorithm.
It should be set TRUE when
used with cqo under the fast algorithm.
An integer with value 1 or 2 etc. which
specifies the initialization method for the \(\mu\)
parameter. If failure to converge occurs try another value
and/or else specify a value for iprobs.y and/or else
specify a value for isize.
Setting parallel = TRUE is useful in order to get
something similar to quasipoisson or
what is known as NB-1.
If parallel = TRUE then the parallelism constraint
does not apply to any intercept term.
You should set zero = NULL too if parallel =
TRUE to avoid a conflict.
See CommonVGAMffArguments for more
information.
Argument vfl requires the use of
parallel to fit the VFL model.
A vector representing a grid;
passed into the probs argument
of quantile
when imethod = 1 to obtain an initial value for
the mean of each response. Is overwritten by any value of
iprobs.y.
Passed into the probs argument
of quantile
when imethod = 1 to obtain an initial value for the
mean of each response. Overwrites any value of gprobs.y.
This argument might be deleted in the future.
The negative binomial distribution (NBD)
can be motivated in several ways,
e.g., as a Poisson distribution with a mean that is gamma
distributed.
There are several common parametrizations of the NBD.
The one used by negbinomial() uses the
mean \(\mu\) and an index parameter
\(k\), both of which are positive.
Specifically, the density of a random variable \(Y\) is
$$f(y;\mu,k) = {y + k - 1 \choose y} \,
\left( \frac{\mu}{\mu+k} \right)^y\,
\left( \frac{k}{k+\mu} \right)^k $$
where \(y=0,1,2,\ldots\),
and \(\mu > 0\) and \(k > 0\).
Note that the dispersion parameter is
\(1/k\), so that as \(k\) approaches infinity the
NBD approaches a Poisson distribution.
The response has variance
\(Var(Y)=\mu+\mu^2/k\).
When fitted, the fitted.values slot of the object
contains the estimated value of the \(\mu\) parameter,
i.e., of the mean \(E(Y)\).
It is common for some to use \(\alpha=1/k\) as the
ancillary or heterogeneity parameter;
so common alternatives for lsize are
negloglink and
reciprocallink.
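As a quick numerical check of this parameterization (a sketch only),
the density above agrees with dnbinom() when its size and mu
arguments are set to \(k\) and \(\mu\):

mu <- 2; k <- 0.8; y <- 0:10
dens <- choose(y + k - 1, y) * (mu / (mu + k))^y * (k / (k + mu))^k
max(abs(dens - dnbinom(y, size = k, mu = mu)))   # effectively zero
mu + mu^2 / k                                    # Var(Y) for these values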
For polya the density is
$$f(y;p,k) = {y + k - 1 \choose y} \,
\left( 1 - p \right)^y\,
p^k $$
where \(y=0,1,2,\ldots\),
and \(k > 0\) and \(0 < p < 1\).
Family function polyaR() is the same as polya()
except that the order of the two parameters is switched. The reason
is that polyaR() tries to match with
rnbinom closely
in terms of the argument order, etc.
Should the probability parameter be of primary interest,
users will probably prefer polya() rather than
polyaR().
Possibly polyaR() will be decommissioned one day.
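The two parameterizations are related by \(p = k/(k+\mu)\); a small
sketch with arbitrary values:

mu <- 2; k <- 0.8
p <- k / (k + mu)                 # prob argument of rnbinom()/polya()
dnbinom(0:5, size = k, prob = p)  # same distribution as size = k, mu = mu
k * (1 - p) / p                   # the mean, which equals mu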
The NBD can be coerced into the
classical GLM framework with one of the parameters being
of interest and the other treated as a nuisance/scale
parameter (this is implemented in the MASS library). The
VGAM family function negbinomial() treats both
parameters on the same footing, and estimates them both
by full maximum likelihood estimation.
The parameters \(\mu\) and \(k\) are independent (diagonal EIM), and the confidence region for \(k\) is extremely skewed so that its standard error is often of no practical use. The parameter \(1/k\) has been used as a measure of aggregation. For the NB-C the EIM is not diagonal.
These VGAM family functions handle
multiple responses, so that a response matrix can be
inputted. The number of columns is the number
of species, say, and setting zero = -2 means that
all species have a \(k\) equalling a (different)
intercept only.
Conlisk et al. (2007) show that fitting the NBD to presence-absence data will result in identifiability problems. However, the model is identifiable if the response values include 0, 1 and 2.
For the NB canonical link (NB-C), its estimation
has a somewhat interesting history.
Some details are at nbcanlink.
Poisson regression corresponds to \(k\) equalling
infinity. If the data is Poisson or close to Poisson,
numerical problems may occur.
Some corrective measures are taken, e.g.,
\(k\) is effectively capped
(relative to the mean) during
estimation to some large value and a warning is issued.
And setting stepsize = 0.5 for
half stepping is probably
a good idea too when the data is extreme.
The NBD is a strictly unimodal distribution. Any data set
that does not exhibit a mode (somewhere in the middle) makes
the estimation problem difficult. Set trace = TRUE
to monitor convergence.
These functions are fragile; the maximum likelihood estimate
of the index parameter is fraught (see Lawless, 1987).
Other alternatives to negbinomial are to fit a NB-1 or
RR-NB (aka NB-P) model; see Yee (2014). Also available are
the NB-C, NB-H and NB-G. Assigning values to the isize
argument may lead to a local solution, and smaller values are
preferred over large values when using this argument.
If one wants to force SFS
to be used on all observations, then
set max.support = 0 or max.chunk.MB = 0.
If one wants to force the exact method
to be used for all observations, then
set max.support = Inf.
If the computer has much memory, then trying
max.chunk.MB = Inf and
max.support = Inf
may provide a small speed increase.
If SFS is used at all, then the working
weights (@weights) slot of the
fitted object will be a matrix;
otherwise that slot will be a 0 x 0 matrix.
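A hypothetical sketch of forcing each algorithm and inspecting the
slot (ndata with response y1 and covariate x2 is as in Example 2
below):

fit.sfs   <- vglm(y1 ~ x2, negbinomial(max.support = 0), data = ndata)
dim(fit.sfs@weights)      # nonzero dimensions: SFS was used
fit.exact <- vglm(y1 ~ x2, negbinomial(max.support = Inf), data = ndata)
dim(fit.exact@weights)    # 0 x 0 if SFS was never needed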
An alternative to the NBD is the generalized Poisson
distribution,
genpoisson1,
genpoisson2 and
genpoisson0,
since that also handles overdispersion wrt Poisson.
It has one advantage in that its EIM can be computed
straightforwardly.
Yet to do: write a family function which uses the method of moments estimator for \(k\).
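In the meantime, a simple method-of-moments sketch (not a VGAM
family function) equates the sample mean and variance to \(\mu\)
and \(\mu+\mu^2/k\):

y <- rnbinom(500, size = 2, mu = 5)      # simulated counts
mu.hat <- mean(y)
(k.mom <- mu.hat^2 / (var(y) - mu.hat))  # sensible only when var(y) > mean(y)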
An object of class "vglmff"
(see vglmff-class).
The object is used by modelling functions such
as vglm,
rrvglm
and vgam.
Bliss, C. and Fisher, R. A. (1953). Fitting the negative binomial distribution to biological data. Biometrics 9, 174–200.
Conlisk, E., Conlisk, J. and Harte, J. (2007). The impossibility of estimating a negative binomial clustering parameter from presence-absence data: A comment on He and Gaston. The American Naturalist 170, 651–654.
Evans, D. A. (1953). Experimental evidence concerning contagious distributions in ecology. Biometrika, 40(1–2), 186–211.
Hilbe, J. M. (2011). Negative Binomial Regression, 2nd Edition. Cambridge: Cambridge University Press.
Lawless, J. F. (1987). Negative binomial and mixed Poisson regression. The Canadian Journal of Statistics 15, 209–225.
Miranda-Soberanis, V. F. and Yee, T. W. (2023). Two-parameter link functions, with applications to negative binomial, Weibull and quantile regression. Computational Statistics, 38, 1463–1485.
Yee, T. W. (2014). Reduced-rank vector generalized linear models with two linear predictors. Computational Statistics and Data Analysis, 71, 889–902.
Yee, T. W. (2020). The VGAM package for negative binomial regression. Australian & New Zealand Journal of Statistics, 62, 116–131.
These 3 functions implement 2 common parameterizations
of the negative binomial (NB). Some people call the
NB with integer \(k\) the Pascal distribution,
whereas if \(k\) is real then this is the Polya
distribution. I don't. The one matching the details of
rnbinom in terms of \(p\)
and \(k\) is polya().
For polya() the code may fail when \(p\) is close
to 0 or 1. It is not yet compatible with cqo
or cao.
Suppose the response is called ymat.
For negbinomial()
the diagonal element of the expected information matrix
(EIM) for parameter \(k\)
involves an infinite series; consequently SFS
(see nsimEIM) is used as the backup algorithm only.
SFS should be better if max(ymat) is large,
e.g., max(ymat) > 1000,
or if there are any outliers in ymat.
The default algorithm involves a finite series approximation
to the support 0:Inf;
the arguments
max.support,
max.chunk.MB and
cutoff.prob are pertinent.
Regardless of the algorithm used,
convergence problems may occur, especially when the response
has large outliers or is large in magnitude.
If convergence failure occurs, try using arguments
(in recommended decreasing order)
max.support,
nsimEIM,
cutoff.prob,
iprobs.y,
imethod,
isize,
zero,
max.chunk.MB.
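A hypothetical call adjusting a few of these (the particular values
are illustrative only; ndata and x2 are as in Example 2 below):

fit <- vglm(y1 ~ x2,
            negbinomial(max.support = 2000, nsimEIM = 1000,
                        cutoff.prob = 0.995, imethod = 2),
            data = ndata, trace = TRUE)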
The function negbinomial can be used by the
fast algorithm in cqo, however, setting
eq.tolerances = TRUE and I.tolerances = FALSE
is recommended.
In the first example below (Bliss and Fisher, 1953), from each of 6 McIntosh apple trees in an orchard that had been sprayed, 25 leaves were randomly selected. On each of the leaves, the number of adult female European red mites was counted.
There are two special uses of negbinomial for handling
count data.
Firstly,
when used by rrvglm this
results in a continuum of models in between and
inclusive of quasi-Poisson and negative binomial regression.
This is known as a reduced-rank negative binomial model
(RR-NB). It fits a negative binomial log-linear
regression with variance function
\(Var(Y)=\mu+\delta_1 \mu^{\delta_2}\)
where \(\delta_1\)
and \(\delta_2\)
are parameters to be estimated by MLE.
Confidence intervals are available for \(\delta_2\),
so one can assess whether the data are closer to
quasi-Poisson or to negative binomial regression.
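A hedged sketch of such a fit (the data frame mydata with covariates
x2 and x3 is hypothetical, similar to Example 4 below):

rrnb <- rrvglm(y3 ~ x2 + x3, negbinomial(zero = NULL),
               data = mydata, trace = TRUE)
coef(rrnb, matrix = TRUE)
Confint.rrnb(rrnb)   # CI for delta2, if available in your VGAM version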
Secondly,
the use of negbinomial with parallel = TRUE
inside vglm
can result in a model similar to quasipoisson.
This is named the NB-1 model.
The dispersion parameter is estimated by MLE whereas
glm uses the method of moments.
In particular, it fits a negative binomial log-linear regression
with variance function
\(Var(Y) = \phi_0 \mu\)
where \(\phi_0\)
is a parameter to be estimated by MLE.
Confidence intervals are available for \(\phi_0\).
quasipoisson,
gaitdnbinomial,
poissonff,
zinegbinomial,
negbinomial.size (e.g., NB-G),
nbcanlink (NB-C),
posnegbinomial,
genpoisson1,
genpoisson2,
genpoisson0,
inv.binomial,
NegBinomial,
rrvglm,
cao,
cqo,
CommonVGAMffArguments,
simulate.vlm,
ppoints,
margeff.
if (FALSE) { # \dontrun{
# Example 1: apple tree data (Bliss and Fisher, 1953)
appletree <- data.frame(y = 0:7, w = c(70, 38, 17, 10, 9, 3, 2, 1))
fit <- vglm(y ~ 1, negbinomial(deviance = TRUE), data = appletree,
weights = w, crit = "coef") # Obtain the deviance
fit <- vglm(y ~ 1, negbinomial(deviance = TRUE), data = appletree,
weights = w, half.step = FALSE) # Alternative method
summary(fit)
coef(fit, matrix = TRUE)
Coef(fit) # For intercept-only models
deviance(fit) # NB2 only; needs 'crit="coef"' & 'deviance=T' above
# Example 2: simulated data with multiple responses
ndata <- data.frame(x2 = runif(nn <- 200))
ndata <- transform(ndata, y1 = rnbinom(nn, exp(1), mu = exp(3+x2)),
y2 = rnbinom(nn, exp(0), mu = exp(2-x2)))
fit1 <- vglm(cbind(y1, y2) ~ x2, negbinomial, ndata, trace = TRUE)
coef(fit1, matrix = TRUE)
# Example 3: large counts implies SFS is used
ndata <- transform(ndata, y3 = rnbinom(nn, exp(1), mu = exp(10+x2)))
with(ndata, range(y3)) # Large counts
fit2 <- vglm(y3 ~ x2, negbinomial, data = ndata, trace = TRUE)
coef(fit2, matrix = TRUE)
head(weights(fit2, type = "working")) # Non-empty; SFS was used
# Example 4: a NB-1 to estimate a NB with Var(Y)=phi0*mu
nn <- 200 # Number of observations
phi0 <- 10 # Specify this; should be greater than unity
delta0 <- 1 / (phi0 - 1)
mydata <- data.frame(x2 = runif(nn), x3 = runif(nn))
mydata <- transform(mydata, mu = exp(2 + 3 * x2 + 0 * x3))
mydata <- transform(mydata, y3 = rnbinom(nn, delta0 * mu, mu = mu))
plot(y3 ~ x2, data = mydata, pch = "+", col = "blue",
main = paste("Var(Y) = ", phi0, " * mu", sep = ""), las = 1)
nb1 <- vglm(y3 ~ x2 + x3, negbinomial(parallel = TRUE, zero = NULL),
data = mydata, trace = TRUE)
# Extracting out some quantities:
cnb1 <- coef(nb1, matrix = TRUE)
mydiff <- (cnb1["(Intercept)", "loglink(size)"] -
cnb1["(Intercept)", "loglink(mu)"])
delta0.hat <- exp(mydiff)
(phi.hat <- 1 + 1 / delta0.hat) # MLE of phi
summary(nb1)
# Obtain a 95 percent confidence interval for phi0:
myvec <- rbind(-1, 1, 0, 0)
(se.mydiff <- sqrt(t(myvec) %*% vcov(nb1) %*% myvec))
ci.mydiff <- mydiff + c(-1.96, 1.96) * c(se.mydiff)
ci.delta0 <- ci.exp.mydiff <- exp(ci.mydiff)
(ci.phi0 <- 1 + 1 / rev(ci.delta0))  # The 95 percent CI for phi0
Confint.nb1(nb1) # Quick way to get it
# cf. moment estimator:
summary(glm(y3 ~ x2 + x3, quasipoisson, mydata))$disper
} # }