Hypergeometric Family Function

Family function for a hypergeometric distribution where either the number of white balls or the total number of white and black balls are unknown.

hyperg(N = NULL, D = NULL, lprob = "logitlink", iprob = NULL)

Arguments

N

Total number of white and black balls in the urn. Must be a vector with positive values, and is recycled, if necessary, to the same length as the response. One of N and D must be specified.

D

Number of white balls in the urn. Must be a vector with positive values, and is recycled, if necessary, to the same length as the response. One of N and D must be specified.

lprob

Link function for the probabilities. See Links for more choices.

iprob

Optional initial value for the probabilities. The default is to choose initial values internally.

Details

Consider the scenario from dhyper where there are \(N=m+n\) balls in an urn, where \(m\) are white and \(n\) are black. A simple random sample (i.e., without replacement) of \(k\) balls is taken. The response here is the sample proportion of white balls. In this document, N is \(N=m+n\), D is \(m\) (for the number of “defectives”, in quality control terminology, or equivalently, the number of marked individuals). The parameter to be estimated is the population proportion of white balls, viz. \(prob = m/(m+n)\).

Depending on which one of N and D is inputted, the estimate of the other parameter can be obtained from the equation \(prob = m/(m+n)\), or equivalently, prob = D/N. However, the log-factorials are computed using lgamma and both \(m\) and \(n\) are not restricted to being integer. Thus if an integer \(N\) is to be estimated, it will be necessary to evaluate the likelihood function at integer values about the estimate, i.e., at trunc(Nhat) and ceiling(Nhat) where Nhat is the (real) estimate of \(N\).

Value

An object of class "vglmff" (see vglmff-class). The object is used by modelling functions such as vglm, vgam, rrvglm, cqo, and cao.

References

Forbes, C., Evans, M., Hastings, N. and Peacock, B. (2011). Statistical Distributions, Hoboken, NJ, USA: John Wiley and Sons, Fourth edition.

Author

Thomas W. Yee

Note

The response can be of one of three formats: a factor (first level taken as success), a vector of proportions of success, or a 2-column matrix (first column = successes) of counts. The argument weights in the modelling function can also be specified. In particular, for a general vector of proportions, you will need to specify weights because the number of trials is needed.

Warning

No checking is done to ensure that certain values are within range, e.g., \(k \leq N\).

Examples

nn <- 100
m <- 5  # Number of white balls in the population
k <- rep(4, len = nn)  # Sample sizes
n <- 4  # Number of black balls in the population
y  <- rhyper(nn = nn, m = m, n = n, k = k)
yprop <- y / k  # Sample proportions

# N is unknown, D is known. Both models are equivalent:
fit <- vglm(cbind(y,k-y) ~ 1, hyperg(D = m), trace = TRUE, crit = "c")
#> Iteration 1: coefficients = 0.26074287
#> Iteration 2: coefficients = 0.20021335
#> Iteration 3: coefficients = 0.19767122
#> Iteration 4: coefficients = 0.19766743
#> Iteration 5: coefficients = 0.19766743
fit <- vglm(yprop ~ 1, hyperg(D = m), weight = k, trace = TRUE, crit = "c")
#> Iteration 1: coefficients = 0.26074287
#> Iteration 2: coefficients = 0.20021335
#> Iteration 3: coefficients = 0.19767122
#> Iteration 4: coefficients = 0.19766743
#> Iteration 5: coefficients = 0.19766743

# N is known, D is unknown. Both models are equivalent:
fit <- vglm(cbind(y, k-y) ~ 1, hyperg(N = m+n), trace = TRUE, crit = "l")
#> Iteration 1: loglikelihood = 533.49063
#> Iteration 2: loglikelihood = 533.614
#> Iteration 3: loglikelihood = 533.614
fit <- vglm(yprop ~ 1, hyperg(N = m+n), weight = k, trace = TRUE, crit = "l")
#> Iteration 1: loglikelihood = 533.49063
#> Iteration 2: loglikelihood = 533.614
#> Iteration 3: loglikelihood = 533.614

coef(fit, matrix = TRUE)
#>             logitlink(prob)
#> (Intercept)       0.1861458
Coef(fit)  # Should be equal to the true population proportion
#>      prob 
#> 0.5464025 
unique(m / (m+n))  # The true population proportion
#> [1] 0.5555556
fit@extra
#> $Nvector
#> [1] 9
#> 
#> $Nunknown
#> [1] FALSE
#> 
head(fitted(fit))
#>        [,1]
#> 1 0.5464025
#> 2 0.5464025
#> 3 0.5464025
#> 4 0.5464025
#> 5 0.5464025
#> 6 0.5464025
summary(fit)
#> 
#> Call:
#> vglm(formula = yprop ~ 1, family = hyperg(N = m + n), weights = k, 
#>     trace = TRUE, crit = "l")
#> 
#> Coefficients: 
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  0.18615    0.04013   4.638 3.52e-06 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Name of linear predictor: logitlink(prob) 
#> 
#> Log-likelihood: 533.614 on 99 degrees of freedom
#> 
#> Number of Fisher scoring iterations: 3 
#> 
#> No Hauck-Donner effect found in any of the estimates
#>