maxSGA, maxAdam: Stochastic Gradient Ascent-based optimizers
maxSGA(fn = NULL, grad = NULL, hess = NULL, start,
nObs,
constraints = NULL, finalHessian = FALSE,
fixed = NULL, control=NULL, ... )
maxAdam(fn = NULL, grad = NULL, hess = NULL, start,
nObs,
constraints = NULL, finalHessian = FALSE,
fixed = NULL, control=NULL, ... )

fn: the function to be maximized. As the objective function
values are not directly used for optimization, this argument is
optional if grad is provided.
It must have the parameter vector as the first argument, and it must
have an argument index to specify the integer index of the selected
observations.
It must return either a single number or a numeric vector (which
is summed internally).
If the parameters are out of range, fn should
return NA. See details for constant parameters.
fn may also return attributes "gradient" and/or "hessian". If these
attributes are set, the algorithm uses the corresponding values as the
gradient and Hessian.
grad: gradient of the objective function.
It must have the parameter vector as the first argument, and it must
have an argument index to specify the integer index of selected
observations.
It must return either a gradient vector of the objective function,
or a matrix, where columns correspond to individual parameters.
The column sums are treated as gradient components.
If NULL, finite-difference gradients are computed using fn. This is
only advisable for small-scale tests, not for production runs, and
obviously requires fn to be correctly defined.
If fn returns an object with the attribute gradient, this argument
is ignored.
hess: Hessian matrix of the function. Mainly for compatibility
reasons; only used for computing the final Hessian if requested by
setting finalHessian to TRUE.
It must have the parameter vector as the first argument and
it must return the Hessian matrix of the objective function.
If missing and the final Hessian is requested, it is computed either
by the finite-difference method based on the gradient, or by the
BHHH approach.
start: initial parameter values. If these have names, the names are also used for results.
nObs: number of observations. This is used to partition the data
into individual batches. The resulting batch
indices are forwarded to the grad function through the
argument index.
constraints: either NULL for unconstrained optimization
or a list with two components. The components may be either
eqA and eqB for equality-constrained optimization
\(A \theta + B = 0\); or ineqA and
ineqB for inequality constraints \(A \theta + B > 0\). More than one
row in ineqA and ineqB corresponds to more than one linear
constraint; in that case, all of them must be zero (for equality
constraints) or positive (for inequality constraints).
The equality-constrained problem is forwarded
to sumt, the inequality-constrained case to
constrOptim2.
finalHessian: how (and whether) to calculate the final Hessian. Either
FALSE (do not calculate), TRUE (use analytic/finite-difference
Hessian) or "bhhh"/"BHHH" for the information equality
approach. The latter approach is only suitable when working with a
log-likelihood function, and it requires the gradient/log-likelihood to
be supplied by individual observations.
The Hessian matrix is not often used for optimization problems where one applies SGA, but even if one is not interested in standard errors, it may provide useful information about the model performance. If computed by the finite-difference method, the Hessian computation may be very slow.
fixed: parameters to be treated as constants at their start values.
If present, it is treated as an index vector of the start parameters.
control: list of control parameters. The ones used by these optimizers are:
SGA_momentum = 0: numeric momentum parameter for SGA. Must lie in the interval \([0,1]\). See details.
Adam_momentum1 = 0.9: numeric in the interval \((0,1)\), the first-moment momentum (Adam only).
Adam_momentum2 = 0.999: numeric in the interval \((0,1)\), the second-moment momentum (Adam only).
SG_learningRate: step size the SGA algorithm takes in the gradient direction. If 1, the step equals the gradient value. A good value is often in the range 0.01–0.3.
SG_batchSize = NULL: SGA batch size, an integer between 1 and nObs.
If NULL (the default), the full-batch gradient is computed.
SG_clip = NULL: gradient clipping threshold. The algorithm ensures
that \(||g(\theta)||_2^2 \le \kappa\), where \(\kappa\) is the
SG_clip value. If the actual norm of the gradient exceeds the square
root of \(\kappa\), the gradient is scaled back accordingly while
preserving its direction. NULL means no clipping; see the sketch
after this list.
gradtol = 0: stopping condition. Stop if the norm of the gradient is
less than gradtol. The default 0 means this condition is not used.
The condition is useful if the aim is to drive the full-batch
gradient to zero on the training data; it is less suitable for
stochastic gradients, or when the aim is to optimize the objective
on validation data.
SG_patience = NULL: integer. Stopping condition: the algorithm
counts how many times the objective function has been worse than its
best value so far, and if this count exceeds SG_patience, the
algorithm stops.
SG_patienceStep = 1L: integer. After how many epochs to check the
patience value. 1 means to check at each epoch, and hence to compute
the objective function each time; this may be undesirable if the
objective function is costly to compute.
iterlim: stopping condition. Stop if more than iterlim epochs have
been performed; return code 4. An epoch is a set of iterations that
cycles through all observations; in the full-batch case, iterations
and epochs are equivalent. If iterlim = 0, the algorithm does not
perform any learning and returns the initial values unchanged.
printLevel: determines the level of printing done during the optimization process. The default value 0 means no printing; 1 prints the initial and final details, 2 prints the main tracing information for every epoch. Higher values produce even more output.
storeParameters = FALSE: logical, whether to store and return the
parameter values at each epoch. If TRUE, the stored values can be
retrieved with the storedParameters method. The parameters are
stored as a matrix with rows corresponding to epochs and columns to
parameter components. There are iterlim + 1 rows, the first of which
corresponds to the initial parameters.
storeValues = FALSE: logical, whether to store and return the
objective function values at each epoch. If TRUE, the stored values
can be retrieved with the storedValues method. There are iterlim + 1
values, the first of which corresponds to the value at the initial
parameters.
See maxControl for more information.
...: further arguments to fn, grad and hess. To maintain
compatibility with earlier versions, ... also passes certain control
options to the optimizers.
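
To illustrate the scaling rule behind SG_clip, here is a minimal
sketch (the function clipGradient and its arguments are hypothetical,
not part of the maxLik API):

clipGradient <- function(g, kappa) {
    normSq <- sum(g^2)                # squared L2 norm of the gradient
    if (!is.null(kappa) && normSq > kappa) {
        g <- g*sqrt(kappa/normSq)     # scale back, preserving direction
    }
    g
}
clipGradient(c(3, 4), kappa=1)        # norm 5 is scaled down to norm 1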
Gradient Ascent (GA) is an optimization method where the algorithm repeatedly takes small steps in the gradient's direction: the parameter vector \(\theta\) is updated as \(\theta \leftarrow \theta + \mathrm{learning\ rate}\cdot \nabla f(\theta)\). In the case of Stochastic GA (SGA), the gradient is not computed on the full set of observations but on a small subset, a batch, potentially a single observation only. In certain circumstances this converges much faster than when using all observations (see Bottou et al., 2018).
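
As a toy illustration of this update rule (the objective, gradFn, and
learningRate below are made up for this sketch, not part of the API),
consider maximizing \(f(\theta) = -(\theta - 3)^2\):

gradFn <- function(theta) -2*(theta - 3)  # gradient of -(theta - 3)^2
learningRate <- 0.1
theta <- 0                                # starting value
for (epoch in 1:100) {
    theta <- theta + learningRate*gradFn(theta)
}
theta                                     # converges to the maximizer, 3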
If SGA_momentum is positive, the SGA algorithm updates the parameters
\(\theta\) in two steps. First, the momentum is used to update
the “velocity” \(v\) as
\(v \leftarrow \mathrm{momentum}\cdot v + \mathrm{learning\ rate}\cdot \nabla f(\theta)\),
and thereafter the parameter \(\theta\) is updated as
\(\theta \leftarrow \theta + v\). The initial velocity is set to 0.
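
Continuing the toy sketch above (reusing the hypothetical gradFn and
learningRate), the momentum update reads:

momentum <- 0.9
v <- 0                                    # initial velocity
theta <- 0
for (epoch in 1:100) {
    v <- momentum*v + learningRate*gradFn(theta)
    theta <- theta + v
}
theta                                     # oscillates toward the maximizer, 3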
The Adam algorithm is more complex and uses the first and second moments of stochastic gradients to automatically adjust the learning rate. See Goodfellow et al. (2016), page 301.
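
For reference, a sketch of the ascent version of the textbook Adam
update on the same toy problem (reusing gradFn and learningRate from
the sketches above; this follows Goodfellow et al. 2016, not
necessarily the maxAdam internals):

beta1 <- 0.9; beta2 <- 0.999; eps <- 1e-8
m <- 0; v <- 0; theta <- 0
for (i in 1:200) {
    g <- gradFn(theta)                    # stochastic gradient in general
    m <- beta1*m + (1 - beta1)*g          # first-moment estimate
    v <- beta2*v + (1 - beta2)*g^2        # second-moment estimate
    mhat <- m/(1 - beta1^i)               # bias corrections
    vhat <- v/(1 - beta2^i)
    theta <- theta + learningRate*mhat/(sqrt(vhat) + eps)
}
theta                                     # approaches the maximizer, 3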
The function fn is not directly used for optimization, only
for printing or as a stopping condition. In this sense
it is up to the user to decide what the function
returns, if anything. For instance, it may be useful for fn to compute the
objective function on either full training data, or on validation data,
and just ignore the index argument. The latter is useful if
using patience-based stopping.
However, one may also choose to use the observations selected by
index and compute the objective function on the current data batch.
object of class "maxim". Data can be extracted through the following methods:
maxValue: fn value at maximum (the last calculated value if not converged).
coef: estimated parameter values.
gradient: vector, the last calculated gradient value. Should be close to 0 in case of normal convergence.
gradientObs: matrix of gradients at the parameter estimate, evaluated at each observation (only if grad returns a matrix, or grad is not specified and fn returns a vector).
hessian: Hessian at the maximum (the last calculated value if not converged).
storedValues: return values stored at each epoch.
storedParameters: return parameters stored at each epoch.
returnCode: a numeric code describing the convergence or error.
returnMessage: a short message describing the return code.
activePar: logical vector showing which parameters are optimized over. Contains only TRUEs if no parameters are fixed.
nIter: number of iterations.
maximType: character string, the type of maximization.
maxControl: the optimization control parameters in the form of a MaxControl object.
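
For example, assuming a fitted object a as created in the examples below:

coef(a)             # estimated parameters
maxValue(a)         # objective function value at the optimum
returnMessage(a)    # why the algorithm stopped
nIter(a)            # number of epochs performed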
Bottou, L.; Curtis, F. & Nocedal, J. (2018): Optimization Methods for Large-Scale Machine Learning. SIAM Review 60, 223–311.
Goodfellow, I.; Bengio, Y. & Courville, A. (2016): Deep Learning. MIT Press.
Henningsen, A. & Toomet, O. (2011): maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26, 443–458.
A good starting point to learn about the usage of stochastic gradient ascent in the maxLik package is the vignette “Stochastic Gradient Ascent in maxLik”.
The other related functions are
maxNR for Newton-Raphson, a popular Hessian-based maximization;
maxBFGS for maximization using the BFGS, Nelder-Mead (NM),
and Simulated Annealing (SANN) methods (based on optim),
also supporting inequality constraints;
maxLik for a general framework for maximum likelihood
estimation (MLE);
optim for different gradient-based optimization
methods.
## estimate the exponential distribution parameter by ML
set.seed(1)
t <- rexp(100, 2)
loglik <- function(theta, index) sum(log(theta) - theta*t[index])
## Note the log-likelihood and gradient are summed over observations
gradlik <- function(theta, index) sum(1/theta - t[index])
## Estimate with full-batch
a <- maxSGA(loglik, gradlik, start=1, control=list(iterlim=1000,
SG_batchSize=10), nObs=100)
# note that loglik is not really needed, and is not used
# here, unless higher print verbosity is requested
summary(a)
#> --------------------------------------------
#> Stochastic Gradient Ascent
#> Number of iterations: 1000
#> Return code: 4
#> Iteration limit exceeded (iterlim)
#> Function value:
#> Estimates:
#> estimate gradient
#> [1,] 2.099567 -0.1279617
#> --------------------------------------------
##
## demonstrate the usage of index, and using
## fn for computing the objective function on validation data.
## Create a linear model where variables are very unequally scaled
##
## OLS loglik function: compute the function value on validation data only
loglik <- function(beta, index) {
e <- yValid - XValid %*% beta
-crossprod(e)/length(y)
}
## OLS gradient: compute it on training data only
## Use 'index' to select the subset corresponding to the minibatch
gradlik <- function(beta, index) {
e <- yTrain[index] - XTrain[index,,drop=FALSE] %*% beta
g <- t(-2*t(XTrain[index,,drop=FALSE]) %*% e)
-g/length(index)
}
N <- 1000
## two random variables: one with scale 1, the other with 100
X <- cbind(rnorm(N), rnorm(N, sd=100))
beta <- c(1, 1) # true parameter values
y <- X %*% beta + rnorm(N, sd=0.2)
## training-validation split
iTrain <- sample(N, 0.8*N)
XTrain <- X[iTrain,,drop=FALSE]
XValid <- X[-iTrain,,drop=FALSE]
yTrain <- y[iTrain]
yValid <- y[-iTrain]
##
## do this without momentum: learning rate must stay small for the gradient not to explode
cat(" No momentum:\n")
#> No momentum:
a <- maxSGA(loglik, gradlik, start=c(10,10),
control=list(printLevel=1, iterlim=50,
SG_batchSize=30, SG_learningRate=0.0001, SGA_momentum=0
), nObs=length(yTrain))
#> Initial function value: -199008.4
#> --------------
#> Iteration limit exceeded (iterlim)
#> 50 iterations
#> estimate: 8.05622774286993, 1.56881067557959
#> Function value: -816.6898
print(summary(a)) # the first component is off, the second one is close to the true value
#> --------------------------------------------
#> Stochastic Gradient Ascent
#> Number of iterations: 50
#> Return code: 4
#> Iteration limit exceeded (iterlim)
#> Function value: -816.6898
#> Estimates:
#> estimate gradient
#> [1,] 8.056228 -22.18443
#> [2,] 1.568811 -10806.06327
#> --------------------------------------------
## do the same with momentum 0.99
cat(" Momentum 0.99:\n")
#> Momentum 0.99:
a <- maxSGA(loglik, gradlik, start=c(10,10),
control=list(printLevel=1, iterlim=50,
SG_batchSize=30, SG_learningRate=0.0001, SGA_momentum=0.99
# strong momentum this time
), nObs=length(yTrain))
#> Initial function value: -199008.4
#> --------------
#> Iteration limit exceeded (iterlim)
#> 50 iterations
#> estimate: 8.02519802176179e+23, 8.64902046641093e+25
#> Function value: -1.835145e+55
print(summary(a)) # the estimates have exploded: with momentum 0.99, this learning rate is too large
#> --------------------------------------------
#> Stochastic Gradient Ascent
#> Number of iterations: 50
#> Return code: 4
#> Iteration limit exceeded (iterlim)
#> Function value: -1.835145e+55
#> Estimates:
#> estimate gradient
#> [1,] 8.025198e+23 2.210386e+27
#> [2,] 8.649020e+25 -1.822211e+30
#> --------------------------------------------
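## For comparison, the same model can be fitted with maxAdam. A sketch
## (the learning rate here is an illustrative guess; output not shown):
a <- maxAdam(loglik, gradlik, start=c(10,10),
             control=list(iterlim=50,
                          SG_batchSize=30, SG_learningRate=0.1),
             nObs=length(yTrain))
summary(a)
## As Adam adapts the step size using the gradient moments, a larger
## learning rate than for plain SGA is usually workable.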