AdaBelief

class lib.model.optimizers.keras_legacy.AdaBelief(*args, **kwargs)

Bases: Optimizer

Implementation of the AdaBelief Optimizer

Inherits from: keras.optimizers.Optimizer.

AdaBelief is not a replacement for heuristic warmup: if warmup has already been employed and tuned in the baseline method, keep those settings. You can enable warmup by setting total_steps and warmup_proportion (see Examples).

Lookahead (see References) can be combined with AdaBelief. The combination was announced by Less Wright, and the resulting optimizer is also known as “Ranger”. Enable it by wrapping the optimizer in the Lookahead wrapper (see Examples).

Parameters:

learning_rate (float) – The learning rate.
beta_1 (float) – The exponential decay rate for the 1st moment estimates.
beta_2 (float) – The exponential decay rate for the 2nd moment estimates.
epsilon (float) – A small constant for numerical stability.
amsgrad (bool) – Whether to apply the AMSGrad variant of this algorithm from the paper “On the Convergence of Adam and Beyond”.
rectify (bool) – Whether to enable rectification as in RectifiedAdam.
sma_threshold (float) – The threshold for the simple moving average (SMA).
total_steps (int) – Total number of training steps. Enable warmup by setting a positive value.
warmup_proportion (float) – The proportion of increasing steps.
min_lr – Minimum learning rate after warmup.
name – Name for the operations created when applying gradients. Default: "AdaBeliefOptimizer".
**kwargs – Standard Keras Optimizer keyword arguments. Allowed to be (weight_decay, clipnorm, clipvalue, global_clipnorm, use_ema, ema_momentum, ema_overwrite_frequency, loss_scale_factor, gradient_accumulation_steps)
min_learning_rate (float)
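
The rectify and sma_threshold parameters follow the Rectified Adam scheme. As a rough sketch, assuming the formulas from the RAdam paper rather than this class's actual internals (the helper name `rectification_term` and its defaults are hypothetical), the approximated SMA length and the resulting rectification factor could be computed like this:

```python
import math


def rectification_term(step, beta_2=0.999, sma_threshold=5.0):
    """Return (sma_length, factor) for a training step counted from 1.

    Hypothetical helper following the RAdam formulas. The factor is None
    while the SMA length is at or below the threshold, meaning that the
    adaptive step size would not be rectified yet.
    """
    sma_inf = 2.0 / (1.0 - beta_2) - 1.0
    beta_2_power = beta_2 ** step
    sma_t = sma_inf - 2.0 * step * beta_2_power / (1.0 - beta_2_power)
    if sma_t <= sma_threshold:
        return sma_t, None
    factor = math.sqrt(
        (sma_t - 4.0) * (sma_t - 2.0) * sma_inf
        / ((sma_inf - 4.0) * (sma_inf - 2.0) * sma_t)
    )
    return sma_t, factor
```

Early in training the SMA length is short and no factor is returned; as the step count grows, the factor approaches 1.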

Examples

>>> from optimizers import AdaBelief
>>> opt = AdaBelief(learning_rate=1e-3)

Example of serialization:

>>> optimizer = AdaBelief(learning_rate=lr_scheduler, weight_decay=wd_scheduler)
>>> config = keras.optimizers.serialize(optimizer)
>>> new_optimizer = keras.optimizers.deserialize(config,
...                                                 custom_objects=dict(AdaBelief=AdaBelief))

Example of warm up:

>>> opt = AdaBelief(learning_rate=1e-3, total_steps=10000, warmup_proportion=0.1, min_lr=1e-5)

In the above example, the learning rate increases linearly from 0 to learning_rate over the first 1000 steps, then decreases linearly from learning_rate to min_lr over the remaining 9000 steps.
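
The schedule described above can be sketched in plain Python. This is only an illustration of the piecewise-linear shape, not the optimizer's own code; the helper name `warmup_learning_rate` is hypothetical and its defaults simply mirror the example values.

```python
def warmup_learning_rate(step, learning_rate=1e-3, total_steps=10000,
                         warmup_proportion=0.1, min_lr=1e-5):
    """Hypothetical helper: linear warmup followed by linear decay."""
    warmup_steps = int(total_steps * warmup_proportion)
    if warmup_steps > 0 and step <= warmup_steps:
        # Warmup phase: ramp linearly from 0 up to learning_rate.
        return learning_rate * step / warmup_steps
    # Decay phase: move linearly from learning_rate down to min_lr.
    decay_steps = total_steps - warmup_steps
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    return learning_rate + (min_lr - learning_rate) * progress
```

With the defaults, the value peaks at step 1000 and reaches min_lr at step 10000, staying there for any later step.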

Example of enabling Lookahead:

>>> adabelief = AdaBelief()
>>> ranger = tfa.optimizers.Lookahead(adabelief, sync_period=6, slow_step_size=0.5)
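
For intuition about what the wrapper does with sync_period and slow_step_size, here is a minimal sketch of the synchronization step from the Lookahead paper, written on plain lists of floats. The function `lookahead_sync` is hypothetical and is not part of the wrapper's API.

```python
def lookahead_sync(slow, fast, step, sync_period=6, slow_step_size=0.5):
    """Hypothetical helper: return (slow, fast) weights after an inner step.

    Every sync_period steps the slow weights move a fraction
    slow_step_size towards the fast weights, and the fast weights are
    then reset to the new slow weights.
    """
    if step % sync_period != 0:
        return slow, fast
    new_slow = [s + slow_step_size * (f - s) for s, f in zip(slow, fast)]
    return new_slow, list(new_slow)
```

Between synchronization points the fast weights are updated by the inner optimizer alone, which here would be AdaBelief.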

Notes

amsgrad is not described in the original paper. Use it with caution.

References

Juntang Zhuang et al. - AdaBelief Optimizer: Adapting step sizes by the belief in observed gradients - https://arxiv.org/abs/2010.07468.

Original implementation - https://github.com/juntang-zhuang/Adabelief-Optimizer

Michael R. Zhang et al. - Lookahead Optimizer: k steps forward, 1 step back - https://arxiv.org/abs/1907.08610v1

Adapted from https://github.com/juntang-zhuang/Adabelief-Optimizer

BSD 2-Clause License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Attributes Summary

`iterations`
`learning_rate`
`variables`

Methods Summary

`add_optimizer_variables`(trainable_variables, ...)	Add optimizer variables from the list of trainable model variables.
`add_variable`(shape[, initializer, dtype, ...])	Add a variable to the optimizer.
`add_variable_from_reference`(reference_variable)	Add an optimizer variable from the model variable.
`apply`(grads[, trainable_variables])	Update trainable variables according to provided gradient values.
`apply_gradients`(grads_and_vars)
`assign`(variable, value)	Assign a value to a variable.
`assign_add`(variable, value)	Add a value to a variable.
`assign_sub`(variable, value)	Subtract a value from a variable.
`build`(variables)	Initialize optimizer variables.
`exclude_from_weight_decay`([var_list, var_names])	Exclude variables from weight decay.
`finalize_variable_values`(var_list)	Set the final value of model's trainable variables.
`from_config`(config[, custom_objects])	Creates an optimizer from its config.
`get_config`()	Returns the config of the optimizer.
`load_own_variables`(store)	Set the state of this optimizer object.
`save_own_variables`(store)	Get the state of this optimizer object.
`scale_loss`(loss)	Scale the loss before computing gradients.
`set_weights`(weights)	Set the weights of the optimizer.
`stateless_apply`(optimizer_variables, grads, ...)	Stateless version of apply that returns modified variables.
`update_step`(gradient, variable, learning_rate)	Update step given gradient and the associated model variable for AdaBelief.

Attributes Documentation

iterations

learning_rate

variables

Methods Documentation

add_optimizer_variables(trainable_variables, name, initializer='zeros')

Add optimizer variables from the list of trainable model variables.

Create an optimizer variable based on the information of the supplied model variables. For example, for SGD with momentum, a corresponding momentum variable of the same shape and dtype is created for each model variable.

Note that for trainable variables with v.overwrite_with_gradient == True, None is inserted into the output list, since the optimizer variable would never be used and creating it would be wasteful.

Parameters:

trainable_variables – keras.Variable, the corresponding model variable to the optimizer variable to be created.
name – The name prefix(es) of the optimizer variable(s) to be created. Can be a single string or a list of strings. If a list of strings, an optimizer variable is created for each prefix. The variable name will follow the pattern {variable_name}_{trainable_variable.name}, e.g., momentum/dense_1.
initializer – Initializer object(s) to use to populate the initial variable value(s), or string name of a built-in initializer (e.g. “random_normal”). If unspecified, defaults to “zeros”.

Returns:

A list of optimizer variables, in the format of `keras.Variable`s. If multiple names are provided, returns a tuple of lists.

add_variable(shape, initializer='zeros', dtype=None, aggregation='none', layout=None, name=None)

Add a variable to the optimizer.

Parameters:

shape – Shape tuple for the variable. Must be fully-defined (no None entries).
initializer – Initializer object to use to populate the initial variable value, or string name of a built-in initializer (e.g. “random_normal”). Defaults to “zeros”.
dtype – Dtype of the variable to create, e.g. “float32”. If unspecified, defaults to the keras.backend.floatx().
aggregation – Optional string, one of None, “none”, “mean”, “sum” or “only_first_replica”. Annotates the variable with the type of multi-replica aggregation to be used for this variable when writing custom data parallel training loops. Defaults to “none”.
layout – Optional tensor layout. Defaults to None.
name – String name of the variable. Useful for debugging purposes.

Returns:

An optimizer variable, in the format of keras.Variable.

add_variable_from_reference(reference_variable, name=None, initializer='zeros')

Add an optimizer variable from the model variable.

Create an optimizer variable based on the information of the model variable. For example, for SGD with momentum, a corresponding momentum variable of the same shape and dtype is created for each model variable.

Parameters:

reference_variable – keras.Variable. The corresponding model variable to the optimizer variable to be created.
name – Optional string. The name prefix of the optimizer variable to be created. If not provided, it will be set to “var”. The variable name will follow the pattern {variable_name}_{reference_variable.name}, e.g., momentum/dense_1. Defaults to None.
initializer – Initializer object to use to populate the initial variable value, or string name of a built-in initializer (e.g. “random_normal”). If unspecified, defaults to “zeros”.

Returns:

An optimizer variable, in the format of keras.Variable.

apply(grads, trainable_variables=None)

Update trainable variables according to provided gradient values.

grads should be a list of gradient tensors with 1:1 mapping to the list of variables the optimizer was built with.

trainable_variables can be provided on the first call to build the optimizer.

apply_gradients(grads_and_vars)

assign(variable, value)

Assign a value to a variable.

This should be used in optimizers instead of variable.assign(value) to support backend specific optimizations. Note that the variable can be a model variable or an optimizer variable; it can be a backend native variable or a Keras variable.

Parameters:

variable – The variable to update.
value – The value to assign to the variable.

assign_add(variable, value)

Add a value to a variable.

This should be used in optimizers instead of variable.assign_add(value) to support backend specific optimizations. Note that the variable can be a model variable or an optimizer variable; it can be a backend native variable or a Keras variable.

Parameters:

variable – The variable to update.
value – The value to add to the variable.

assign_sub(variable, value)

Subtract a value from a variable.

This should be used in optimizers instead of variable.assign_sub(value) to support backend specific optimizations. Note that the variable can be a model variable or an optimizer variable; it can be a backend native variable or a Keras variable.

Parameters:

variable – The variable to update.
value – The value to subtract from the variable.

build(variables: list[Variable]) → None

Initialize optimizer variables.

The AdaBelief optimizer has 3 types of variables: momentums, velocities and velocity_hat (only set when amsgrad is applied).

Parameters:

variables (list[Variable]) – List of model variables to build AdaBelief variables on.

Return type:

None

exclude_from_weight_decay(var_list=None, var_names=None)

Exclude variables from weight decay.

This method must be called before the optimizer’s build method is called. You can either pass specific variables to exclude, or pass a list of strings to use as anchor words; any variable whose name contains one of them is excluded.

Parameters:

var_list – A list of `Variable`s to exclude from weight decay.
var_names – A list of strings. If any string in var_names appears in the model variable’s name, then this model variable is excluded from weight decay. For example, var_names=[‘bias’] excludes all bias variables from weight decay.
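
The name-based rule can be illustrated with a few lines of plain Python. This sketch only mirrors the substring matching described above; the helper `is_excluded_from_weight_decay` and the variable names are hypothetical.

```python
def is_excluded_from_weight_decay(variable_name, var_names):
    """Hypothetical helper: True if any anchor word occurs in the name."""
    return any(anchor in variable_name for anchor in var_names)


# Hypothetical model variable names.
names = ["dense_1/kernel", "dense_1/bias", "conv_2/bias"]
decayed = [n for n in names if not is_excluded_from_weight_decay(n, ["bias"])]
```

Here only the kernel variable would still receive weight decay.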

finalize_variable_values(var_list)

Set the final value of model’s trainable variables.

Sometimes extra steps are needed before the variable updates end, such as overwriting the model variables with their average values.

Parameters:

var_list – list of model variables.
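
As an illustration of the "average value" case, the sketch below tracks an exponential moving average of a single scalar weight and overwrites the weight with it at the end. It assumes the standard EMA update used with the use_ema and ema_momentum options; `update_ema` is a hypothetical helper, not a method of this class.

```python
def update_ema(average, value, ema_momentum=0.99):
    """Hypothetical helper: one exponential-moving-average update."""
    return ema_momentum * average + (1.0 - ema_momentum) * value


# Values taken by one scalar weight over three training steps.
weight_history = [1.0, 2.0, 3.0]
average = weight_history[0]
for weight in weight_history[1:]:
    average = update_ema(average, weight)

# Finalizing: the weight is overwritten with its running average.
final_weight = average
```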

classmethod from_config(config, custom_objects=None)

Creates an optimizer from its config.

This method is the reverse of get_config, capable of instantiating the same optimizer from the config dictionary.

Parameters:

config – A Python dictionary, typically the output of get_config.
custom_objects – A Python dictionary mapping names to additional user-defined Python objects needed to recreate this optimizer.

Returns:

An optimizer instance.
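
The round trip between get_config and from_config can be sketched with a toy class. `ToyOptimizer` is a hypothetical stand-in used only to show the pattern; it is not a Keras class and stores just two hyperparameters.

```python
class ToyOptimizer:
    """Hypothetical stand-in for an optimizer; not a Keras class."""

    def __init__(self, learning_rate=1e-3, beta_1=0.9):
        self.learning_rate = learning_rate
        self.beta_1 = beta_1

    def get_config(self):
        # Everything needed to rebuild the object goes into a plain dict.
        return {"learning_rate": self.learning_rate, "beta_1": self.beta_1}

    @classmethod
    def from_config(cls, config):
        # The reverse of get_config: pass the dict back to the constructor.
        return cls(**config)


original = ToyOptimizer(learning_rate=5e-4)
restored = ToyOptimizer.from_config(original.get_config())
```

The restored object is a new instance with the same hyperparameters; optimizer state such as moment estimates is not part of the config.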

get_config() → dict[str, Any]

Returns the config of the optimizer.

Optimizer configuration for AdaBelief.

Returns:

The optimizer configuration.

Return type:

dict[str, Any]

load_own_variables(store): Set the state of this optimizer object.

save_own_variables(store): Get the state of this optimizer object.

scale_loss(loss)

Scale the loss before computing gradients.

Scales the loss before gradients are computed in a train_step. This is primarily useful during mixed precision training to prevent numeric underflow.

set_weights(weights): Set the weights of the optimizer.

stateless_apply(optimizer_variables, grads, trainable_variables)

Stateless version of apply that returns modified variables.

Parameters:

optimizer_variables – list of tensors containing the current values for the optimizer variables. These are native tensors and not `keras.Variable`s.
grads – list of gradients to apply.
trainable_variables – list of tensors containing the current values for the model variables. These are native tensors and not `keras.Variable`s.

Returns:

A tuple containing two lists of tensors: the updated trainable_variables and the updated optimizer_variables.
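
To show the stateless calling convention in isolation, here is a toy analogue using plain SGD on Python floats. It assumes the optimizer state is a single iteration counter; `stateless_sgd_apply` is hypothetical and much simpler than the real method.

```python
def stateless_sgd_apply(optimizer_variables, grads, trainable_variables,
                        learning_rate=0.1):
    """Hypothetical plain-SGD analogue that never mutates its inputs."""
    (iterations,) = optimizer_variables
    new_trainable = [
        w - learning_rate * g for w, g in zip(trainable_variables, grads)
    ]
    # Same return order as described above: model values first, then state.
    return new_trainable, [iterations + 1]


weights = [1.0, -2.0]
state = [0]
new_weights, new_state = stateless_sgd_apply(state, [0.5, -1.0], weights)
```

The inputs are left untouched and the caller decides what to do with the returned values, which is the property that makes this style usable inside functional, jit-compiled training loops.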

update_step(gradient: Tensor, variable: Variable, learning_rate: Tensor) → None

Update step given gradient and the associated model variable for AdaBelief.

Parameters:

gradient (Tensor) – The gradient to update
variable (Variable) – The variable to update
learning_rate (Tensor) – The learning rate

Return type:

None
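
For reference, a single AdaBelief update on one scalar parameter can be sketched as below. This follows the update rule in the AdaBelief paper without AMSGrad, rectification or weight decay, so it is not a transcription of this method. The function name `adabelief_step` and the default values are assumptions made for illustration.

```python
import math


def adabelief_step(param, grad, m, s, step, learning_rate=1e-3,
                   beta_1=0.9, beta_2=0.999, epsilon=1e-14):
    """Hypothetical helper: one scalar AdaBelief update, step counted from 1.

    Returns the updated (param, m, s).
    """
    # First moment: exponential moving average of the gradient.
    m = beta_1 * m + (1.0 - beta_1) * grad
    # Second moment: EMA of the squared deviation of the gradient from m.
    s = beta_2 * s + (1.0 - beta_2) * (grad - m) ** 2 + epsilon
    # Bias correction, as in Adam.
    m_hat = m / (1.0 - beta_1 ** step)
    s_hat = s / (1.0 - beta_2 ** step)
    param = param - learning_rate * m_hat / (math.sqrt(s_hat) + epsilon)
    return param, m, s
```

The only difference from Adam in this sketch is the second moment, which tracks (grad - m) squared instead of grad squared.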
