Dandelion's update module is mostly inherited from Lasagne; please refer to the Lasagne.updates documentation for the following optimizers & helper functions (a usage sketch combining the gradient-norm helpers with an optimizer is given after the list):
- apply_momentum
- momentum
- apply_nesterov_momentum
- nesterov_momentum
- adagrad
- rmsprop
- adamax
- norm_constraint
- total_norm_constraint
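Since every optimizer accepts either a scalar loss or a list of gradient expressions (see the parameter docs below), the norm helpers above can be chained with an optimizer for gradient clipping. The sketch below follows the Lasagne.updates usage pattern; the import path `dandelion.update` and the toy model are assumptions for illustration only.

```python
import numpy as np
import theano
import theano.tensor as T
# Import path assumed from the section title ("Dandelion's update module");
# the same function names are documented in lasagne.updates.
from dandelion.update import total_norm_constraint, nesterov_momentum

# Hypothetical toy model: a single weight vector
w = theano.shared(np.zeros((5,), dtype=theano.config.floatX), name='w')
x, t = T.vector('x'), T.scalar('t')
loss = (T.dot(x, w) - t) ** 2

# Rescale the gradients so their joint L2 norm does not exceed max_norm,
# then hand the clipped gradients to the optimizer (Lasagne-style pattern).
grads = T.grad(loss, [w])
grads = total_norm_constraint(grads, max_norm=5.0)
updates = nesterov_momentum(grads, [w], learning_rate=0.01, momentum=0.9)
train_fn = theano.function([x, t], loss, updates=updates)
```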
sgd
Stochastic gradient descent optimizer.
```python
sgd(loss_or_grads, params, learning_rate=1e-4, clear_nan=False)
```
- loss_or_grads: a scalar loss expression, or a list of gradient expressions
- params: list of shared variables to generate update expressions for
- learning_rate: float or symbolic scalar, learning rate controlling the size of update steps
- clear_nan: boolean flag, if `True`, `nan` in gradients will be replaced with 0
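A minimal usage sketch: as in Lasagne, the returned update dictionary is passed to `theano.function` to compile a training step. The import path `dandelion.update` and the toy model are assumptions, not part of the documented API above.

```python
import numpy as np
import theano
import theano.tensor as T
from dandelion.update import sgd   # import path assumed from the section title

# Hypothetical toy linear model: y = x.W + b
W = theano.shared(np.zeros((4, 3), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros((3,), dtype=theano.config.floatX), name='b')
x = T.matrix('x')
t = T.matrix('t')

y = T.dot(x, W) + b
loss = T.mean((y - t) ** 2)          # scalar loss expression

# Build update expressions for the shared variables and compile a training step
updates = sgd(loss, [W, b], learning_rate=1e-4, clear_nan=True)
train_fn = theano.function([x, t], loss, updates=updates)
```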
adam
Adam optimizer implemented as described in "Kingma, Diederik, and Jimmy Ba (2014): Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980."
```python
adam(loss_or_grads, params, learning_rate=0.001, beta1=0.9,
     beta2=0.999, epsilon=1e-8, clear_nan=False)
```
- loss_or_grads: a scalar loss expression, or a list of gradient expressions
- params: list of shared variables to generate update expressions for
- learning_rate: float or symbolic scalar, learning rate controlling the size of update steps
- clear_nan: boolean flag, if `True`, `nan` in gradients will be replaced with 0
- beta1: float or symbolic scalar, exponential decay rate for the first moment estimates
- beta2: float or symbolic scalar, exponential decay rate for the second moment estimates
- epsilon: float or symbolic scalar, constant for numerical stability
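A sketch of the gradient-list form of `loss_or_grads`, where the gradients are computed explicitly before being handed to `adam`. The import path and the toy model are assumptions; the usage pattern follows Lasagne.updates.

```python
import numpy as np
import theano
import theano.tensor as T
from dandelion.update import adam   # import path assumed

# Hypothetical toy model: a single weight vector
w = theano.shared(np.zeros((5,), dtype=theano.config.floatX), name='w')
x, t = T.vector('x'), T.scalar('t')
loss = (T.dot(x, w) - t) ** 2        # scalar loss

# `loss_or_grads` also accepts a list of precomputed gradient expressions
grads = T.grad(loss, [w])
updates = adam(grads, [w], learning_rate=0.001,
               beta1=0.9, beta2=0.999, epsilon=1e-8, clear_nan=False)
train_fn = theano.function([x, t], loss, updates=updates)
```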
adadelta
Adadelta optimizer implemented as described in "Zeiler, M. D. (2012): ADADELTA: An Adaptive Learning Rate Method. arXiv Preprint arXiv:1212.5701."
```python
adadelta(loss_or_grads, params, learning_rate=1.0,
         rho=0.95, epsilon=1e-6, clear_nan=False)
```
- loss_or_grads: a scalar loss expression, or a list of gradient expressions
- params: list of shared variables to generate update expressions for
- learning_rate: float or symbolic scalar, learning rate controlling the size of update steps
- clear_nan: boolean flag, if `True`, `nan` in gradients will be replaced with 0
- rho: float or symbolic scalar, squared gradient moving average decay factor
- epsilon: float or symbolic scalar, constant for numerical stability
`rho` should be between 0 and 1. A value of `rho` close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.
`rho` = 0.95 and `epsilon` = 1e-6 are suggested in the paper and reported to work for multiple datasets (MNIST, speech).
In the paper, no learning rate is considered (so `learning_rate` = 1.0). Probably best to keep it at this value. `epsilon` is important for the very first update (so the numerator does not become 0).
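Following the note above, a minimal sketch that keeps the paper defaults (`learning_rate` = 1.0, `rho` = 0.95, `epsilon` = 1e-6). The import path and the toy model are assumptions for illustration.

```python
import numpy as np
import theano
import theano.tensor as T
from dandelion.update import adadelta   # import path assumed

# Hypothetical toy model: a single weight vector
w = theano.shared(np.zeros((5,), dtype=theano.config.floatX), name='w')
x, t = T.vector('x'), T.scalar('t')
loss = (T.dot(x, w) - t) ** 2

# Keep the paper defaults noted above: learning_rate=1.0, rho=0.95, epsilon=1e-6
updates = adadelta(loss, [w], learning_rate=1.0, rho=0.95, epsilon=1e-6)
train_fn = theano.function([x, t], loss, updates=updates)
```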