maxSGA, maxAdam: Stochastic Gradient Ascent-based optimizers
maxSGA(fn = NULL, grad = NULL, hess = NULL, start,
nObs,
constraints = NULL, finalHessian = FALSE,
fixed = NULL, control=NULL, ... )
maxAdam(fn = NULL, grad = NULL, hess = NULL, start,
nObs,
constraints = NULL, finalHessian = FALSE,
fixed = NULL, control=NULL, ... )

fn: the function to be maximized. As the objective function
values are not directly used for optimization, this argument is
optional if grad is provided.
It must have the parameter vector as the first argument, and it must
have an argument index to specify the integer index of the selected
observations.
It must return either a single number or a numeric vector (which
is summed internally).
If the parameters are out of range, fn should
return NA. See details for constant parameters.
fn may also return attributes "gradient" and/or "hessian". If these
attributes are set, the algorithm uses the corresponding values as the
gradient and Hessian.
grad: gradient of the objective function.
It must have the parameter vector as the first argument, and it must
have an argument index to specify the integer index of selected
observations.
It must return either a gradient vector of the objective function,
or a matrix, where columns correspond to individual parameters.
The column sums are treated as gradient components.
If NULL, finite-difference gradients are computed using fn. This is
only advisable for small-scale tests, not for production runs, and
obviously requires fn to be correctly defined.
If fn returns an object with the attribute gradient, this argument
is ignored.
hess: Hessian matrix of the function. Mainly for compatibility
reasons; only used for computing the final Hessian if requested by
setting finalHessian to TRUE.
It must have the parameter vector as the first argument and
it must return the Hessian matrix of the objective function.
If missing and the final Hessian is requested, it is computed either
by the finite-difference method based on the gradient, or by the
BHHH approach.
start: initial parameter values. If these have names, the names are also used for results.
nObs: number of observations. This is used to partition the data
into individual batches. The resulting batch
indices are forwarded to the grad function through the
argument index.
constraints: either NULL for unconstrained optimization
or a list with two components. The components may be either
eqA and eqB for equality-constrained optimization
\(A \theta + B = 0\); or ineqA and
ineqB for inequality constraints \(A \theta + B > 0\). More than one
row in ineqA and ineqB corresponds to more than one linear
constraint; in that case, all of them must be zero (for equality
constraints) or positive (for inequality constraints).
The equality-constrained problem is forwarded
to sumt, the inequality-constrained case to
constrOptim2.
finalHessian: how (and whether) to calculate the final Hessian. Either
FALSE (do not calculate), TRUE (use analytic/finite-difference
Hessian) or "bhhh"/"BHHH" for the information equality
approach. The latter approach is only suitable when working with a
log-likelihood function, and it requires the gradient/log-likelihood to
be supplied by individual observations.
The Hessian matrix is not often used for optimization problems where one applies SGA, but even if one is not interested in standard errors, it may provide useful information about the model performance. If computed by the finite-difference method, the Hessian computation may be very slow.
fixed: parameters to be treated as constants at their start values.
If present, it is treated as an index vector of the start parameters.
control: list of control parameters. The ones used by these optimizers are:
SGA_momentum = 0: numeric momentum parameter for SGA. Must lie in the interval \([0,1]\). See details.
Adam_momentum1 = 0.9: numeric in the interval \((0,1)\), the first-moment momentum (Adam only).
Adam_momentum2 = 0.999: numeric in the interval \((0,1)\), the second-moment momentum (Adam only).
SG_learningRate: step size the SGA algorithm takes in the gradient direction. If 1, the step equals the gradient value. A good value is often in the range 0.01–0.3.
SG_batchSize = NULL: SGA batch size, an integer between 1 and nObs.
If NULL (the default), the full-batch gradient is computed.
SG_clip = NULL: gradient clipping threshold. The algorithm ensures
that \(||g(\theta)||_2^2 \le \kappa\), where \(\kappa\) is the
SG_clip value. If the actual norm of the gradient exceeds the square
root of \(\kappa\), the gradient is scaled back accordingly while
preserving its direction. NULL means no clipping; see the sketch
after this list.
gradtol = 0: stopping condition. Stop if the norm of the gradient is
less than gradtol. The default 0 means this condition is not used.
The condition is useful if the aim is to drive the full-batch
gradient to zero on the training data; it is less suitable for
stochastic gradients, or when the aim is to optimize the objective
on validation data.
SG_patience = NULL: integer. Stopping condition: the algorithm
counts how many times the objective function has been worse than its
best value so far, and if this count exceeds SG_patience, the
algorithm stops.
SG_patienceStep = 1L: integer. After how many epochs to check the
patience value. 1 means to check at each epoch, and hence to compute
the objective function each time; this may be undesirable if the
objective function is costly to compute.
iterlim: stopping condition. Stop if more than iterlim epochs have
been performed; return code 4. An epoch is a set of iterations that
cycles through all observations; in the full-batch case, iterations
and epochs are equivalent. If iterlim = 0, the algorithm does not
perform any learning and returns the initial values unchanged.
printLevel: determines the level of printing done during the optimization process. The default value 0 means no printing; 1 prints the initial and final details, 2 prints the main tracing information for every epoch. Higher values produce even more output.
storeParameters = FALSE: logical, whether to store and return the
parameter values at each epoch. If TRUE, the stored values can be
retrieved with the storedParameters method. The parameters are
stored as a matrix with rows corresponding to epochs and columns to
parameter components. There are iterlim + 1 rows, the first of which
corresponds to the initial parameters.
storeValues = FALSE: logical, whether to store and return the
objective function values at each epoch. If TRUE, the stored values
can be retrieved with the storedValues method. There are iterlim + 1
values, the first of which corresponds to the value at the initial
parameters.
See maxControl for more information.
...: further arguments to fn, grad and hess. To maintain
compatibility with earlier versions, ... also passes certain control
options to the optimizers.
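
To illustrate the scaling rule behind SG_clip, here is a minimal
sketch (the function clipGradient and its arguments are hypothetical,
not part of the maxLik API):

clipGradient <- function(g, kappa) {
    normSq <- sum(g^2)                # squared L2 norm of the gradient
    if (!is.null(kappa) && normSq > kappa) {
        g <- g*sqrt(kappa/normSq)     # scale back, preserving direction
    }
    g
}
clipGradient(c(3, 4), kappa=1)        # norm 5 is scaled down to norm 1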
Gradient Ascent (GA) is an optimization method where the algorithm repeatedly takes small steps in the gradient's direction: the parameter vector \(\theta\) is updated as \(\theta \leftarrow \theta + \mathrm{learning\ rate}\cdot \nabla f(\theta)\). In the case of Stochastic GA (SGA), the gradient is not computed on the full set of observations but on a small subset, a batch, potentially a single observation only. In certain circumstances this converges much faster than when using all observations (see Bottou et al., 2018).
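
As a toy illustration of this update rule (the objective, gradFn, and
learningRate below are made up for this sketch, not part of the API),
consider maximizing \(f(\theta) = -(\theta - 3)^2\):

gradFn <- function(theta) -2*(theta - 3)  # gradient of -(theta - 3)^2
learningRate <- 0.1
theta <- 0                                # starting value
for (epoch in 1:100) {
    theta <- theta + learningRate*gradFn(theta)
}
theta                                     # converges to the maximizer, 3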
If SGA_momentum is positive, the SGA algorithm updates the parameters
\(\theta\) in two steps. First, the momentum is used to update
the “velocity” \(v\) as
\(v \leftarrow \mathrm{momentum}\cdot v + \mathrm{learning\ rate}\cdot \nabla f(\theta)\),
and thereafter the parameter \(\theta\) is updated as
\(\theta \leftarrow \theta + v\). The initial velocity is set to 0.
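
Continuing the toy sketch above (reusing the hypothetical gradFn and
learningRate), the momentum update reads:

momentum <- 0.9
v <- 0                                    # initial velocity
theta <- 0
for (epoch in 1:100) {
    v <- momentum*v + learningRate*gradFn(theta)
    theta <- theta + v
}
theta                                     # oscillates toward the maximizer, 3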
The Adam algorithm is more complex and uses the first and second moments of stochastic gradients to automatically adjust the learning rate. See Goodfellow et al. (2016), page 301.
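
For reference, a sketch of the ascent version of the textbook Adam
update on the same toy problem (reusing gradFn and learningRate from
the sketches above; this follows Goodfellow et al. 2016, not
necessarily the maxAdam internals):

beta1 <- 0.9; beta2 <- 0.999; eps <- 1e-8
m <- 0; v <- 0; theta <- 0
for (i in 1:200) {
    g <- gradFn(theta)                    # stochastic gradient in general
    m <- beta1*m + (1 - beta1)*g          # first-moment estimate
    v <- beta2*v + (1 - beta2)*g^2        # second-moment estimate
    mhat <- m/(1 - beta1^i)               # bias corrections
    vhat <- v/(1 - beta2^i)
    theta <- theta + learningRate*mhat/(sqrt(vhat) + eps)
}
theta                                     # approaches the maximizer, 3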
The function fn is not directly used for optimization, only
for printing or as a stopping condition. In this sense
it is up to the user to decide what the function
returns, if anything. For instance, it may be useful for fn to compute the
objective function on either full training data, or on validation data,
and just ignore the index argument. The latter is useful if
using patience-based stopping.
However, one may also choose to use the observations selected by
index and compute the objective function on the current data batch.
object of class "maxim". Data can be extracted through the following methods:
maxValue: fn value at maximum (the last calculated value if not converged).
coef: estimated parameter values.
gradient: vector, the last calculated gradient value. Should be close to 0 in case of normal convergence.
gradientObs: matrix of gradients at the parameter estimate, evaluated at each observation (only if grad returns a matrix, or grad is not specified and fn returns a vector).
hessian: Hessian at the maximum (the last calculated value if not converged).
storedValues: return values stored at each epoch.
storedParameters: return parameters stored at each epoch.
returnCode: a numeric code describing the convergence or error.
returnMessage: a short message describing the return code.
activePar: logical vector showing which parameters are optimized over. Contains only TRUEs if no parameters are fixed.
nIter: number of iterations.
maximType: character string, the type of maximization.
maxControl: the optimization control parameters in the form of a MaxControl object.
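
For example, assuming a fitted object a as created in the examples below:

coef(a)             # estimated parameters
maxValue(a)         # objective function value at the optimum
returnMessage(a)    # why the algorithm stopped
nIter(a)            # number of epochs performed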
Bottou, L.; Curtis, F. & Nocedal, J. (2018): Optimization Methods for Large-Scale Machine Learning. SIAM Review 60, 223–311.
Goodfellow, I.; Bengio, Y. & Courville, A. (2016): Deep Learning. MIT Press.
Henningsen, A. & Toomet, O. (2011): maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26, 443–458.
A good starting point to learn about the usage of stochastic gradient ascent in the maxLik package is the vignette “Stochastic Gradient Ascent in maxLik”.
The other related functions are
maxNR for Newton-Raphson, a popular Hessian-based maximization;
maxBFGS for maximization using the BFGS, Nelder-Mead (NM),
and Simulated Annealing (SANN) methods (based on optim),
also supporting inequality constraints;
maxLik for a general framework for maximum likelihood
estimation (MLE);
optim for different gradient-based optimization
methods.
## estimate the exponential distribution parameter by ML
set.seed(1)
t <- rexp(100, 2)
loglik <- function(theta, index) sum(log(theta) - theta*t[index])
## Note the log-likelihood and gradient are summed over observations
gradlik <- function(theta, index) sum(1/theta - t[index])
## Estimate with full-batch
a <- maxSGA(loglik, gradlik, start=1, control=list(iterlim=1000,
SG_batchSize=10), nObs=100)
# note that loglik is not really needed, and is not used
# here, unless higher print verbosity is requested
summary(a)
#> --------------------------------------------
#> Stochastic Gradient Ascent
#> Number of iterations: 1000
#> Return code: 4
#> Iteration limit exceeded (iterlim)
#> Function value:
#> Estimates:
#> estimate gradient
#> [1,] 2.099567 -0.1279617
#> --------------------------------------------
##
## demonstrate the usage of index, and using
## fn for computing the objective function on validation data.
## Create a linear model where variables are very unequally scaled
##
## OLS loglik function: compute the function value on validation data only
loglik <- function(beta, index) {
e <- yValid - XValid %*% beta
-crossprod(e)/length(y)
}
## OLS gradient: compute it on training data only
## Use 'index' to select the subset corresponding to the minibatch
gradlik <- function(beta, index) {
e <- yTrain[index] - XTrain[index,,drop=FALSE] %*% beta
g <- t(-2*t(XTrain[index,,drop=FALSE]) %*% e)
-g/length(index)
}
N <- 1000
## two random variables: one with scale 1, the other with 100
X <- cbind(rnorm(N), rnorm(N, sd=100))
beta <- c(1, 1) # true parameter values
y <- X %*% beta + rnorm(N, sd=0.2)
## training-validation split
iTrain <- sample(N, 0.8*N)
XTrain <- X[iTrain,,drop=FALSE]
XValid <- X[-iTrain,,drop=FALSE]
yTrain <- y[iTrain]
yValid <- y[-iTrain]
##
## do this without momentum: learning rate must stay small for the gradient not to explode
cat(" No momentum:\n")
#> No momentum:
a <- maxSGA(loglik, gradlik, start=c(10,10),
control=list(printLevel=1, iterlim=50,
SG_batchSize=30, SG_learningRate=0.0001, SGA_momentum=0
), nObs=length(yTrain))
#> Initial function value: -199008.4
#> --------------
#> Iteration limit exceeded (iterlim)
#> 50 iterations
#> estimate: 8.05622774286993, 1.56881067557959
#> Function value: -816.6898
print(summary(a)) # the first component is off, the second one is close to the true value
#> --------------------------------------------
#> Stochastic Gradient Ascent
#> Number of iterations: 50
#> Return code: 4
#> Iteration limit exceeded (iterlim)
#> Function value: -816.6898
#> Estimates:
#> estimate gradient
#> [1,] 8.056228 -22.18443
#> [2,] 1.568811 -10806.06327
#> --------------------------------------------
## do the same with momentum 0.99
cat(" Momentum 0.99:\n")
#> Momentum 0.99:
a <- maxSGA(loglik, gradlik, start=c(10,10),
control=list(printLevel=1, iterlim=50,
SG_batchSize=30, SG_learningRate=0.0001, SGA_momentum=0.99
# strong momentum this time
), nObs=length(yTrain))
#> Initial function value: -199008.4
#> --------------
#> Iteration limit exceeded (iterlim)
#> 50 iterations
#> estimate: 8.02519802176179e+23, 8.64902046641093e+25
#> Function value: -1.835145e+55
print(summary(a)) # the estimates have exploded: with momentum 0.99, this learning rate is too large
#> --------------------------------------------
#> Stochastic Gradient Ascent
#> Number of iterations: 50
#> Return code: 4
#> Iteration limit exceeded (iterlim)
#> Function value: -1.835145e+55
#> Estimates:
#> estimate gradient
#> [1,] 8.025198e+23 2.210386e+27
#> [2,] 8.649020e+25 -1.822211e+30
#> --------------------------------------------
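## For comparison, the same model can be fitted with maxAdam. A sketch
## (the learning rate here is an illustrative guess; output not shown):
a <- maxAdam(loglik, gradlik, start=c(10,10),
             control=list(iterlim=50,
                          SG_batchSize=30, SG_learningRate=0.1),
             nObs=length(yTrain))
summary(a)
## As Adam adapts the step size using the gradient moments, a larger
## learning rate than for plain SGA is usually workable.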