# parametrizations
Classes:

| Name | Description |
| --- | --- |
| `Parametrization` | Abstract base class for all neural network parametrizations. |
| `Standard` | Standard Parametrization (SP). |
| `NeuralTangent` | Neural Tangent Parametrization (NTP). |
| `MaximalUpdate` | Maximal update parametrization (\(\mu P\)). |
## Parametrization
Bases: ABC
Abstract base class for all neural network parametrizations.
Methods:

| Name | Description |
| --- | --- |
| `bias_init_scale` | Compute the bias initialization scale for the given layer. |
| `bias_lr_scale` | Compute the learning rate scale for the bias parameters. |
| `weight_init_scale` | Compute the weight initialization scale for the given layer. |
| `weight_lr_scale` | Compute the learning rate scale for the weight parameters. |
### bias_init_scale
Compute the bias initialization scale for the given layer.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `fan_in` | `int` | Number of inputs to the layer. | *required* |
| `fan_out` | `int` | Number of outputs from the layer. | *required* |
| `layer_type` | `LayerType` | Type of the layer. Can be one of `"input"`, `"hidden"`, or `"output"`. | `'hidden'` |
### bias_lr_scale
```python
bias_lr_scale(
    fan_in: int,
    fan_out: int,
    optimizer: str,
    layer_type: LayerType = "hidden",
) -> float
```
Compute the learning rate scale for the bias parameters.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `fan_in` | `int` | Number of inputs to the layer. | *required* |
| `fan_out` | `int` | Number of outputs from the layer. | *required* |
| `optimizer` | `str` | Optimizer being used. | *required* |
| `layer_type` | `LayerType` | Type of the layer. Can be one of `"input"`, `"hidden"`, or `"output"`. | `'hidden'` |
### weight_init_scale
Compute the weight initialization scale for the given layer.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `fan_in` | `int` | Number of inputs to the layer. | *required* |
| `fan_out` | `int` | Number of outputs from the layer. | *required* |
| `layer_type` | `LayerType` | Type of the layer. Can be one of `"input"`, `"hidden"`, or `"output"`. | `'hidden'` |
### weight_lr_scale
```python
weight_lr_scale(
    fan_in: int,
    fan_out: int,
    optimizer: str,
    layer_type: LayerType = "hidden",
) -> float
```
Compute the learning rate scale for the weight parameters.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `fan_in` | `int` | Number of inputs to the layer. | *required* |
| `fan_out` | `int` | Number of outputs from the layer. | *required* |
| `optimizer` | `str` | Optimizer being used. | *required* |
| `layer_type` | `LayerType` | Type of the layer. Can be one of `"input"`, `"hidden"`, or `"output"`. | `'hidden'` |
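For orientation, the sketch below wires the four scale factors into a single `torch.nn.Linear` layer: the init scales multiply a base standard deviation, and the lr scales multiply a base learning rate through per-parameter optimizer groups. The import path, the no-argument constructor, the `"adam"` optimizer string, and the convention that the scales are used as multiplicative factors are assumptions made for this illustration, not part of the reference.

```python
import torch

from parametrizations import Standard  # import path assumed

parametrization = Standard()  # no-argument constructor assumed
fan_in, fan_out = 256, 512
base_std, base_lr = 1.0, 1e-3

layer = torch.nn.Linear(fan_in, fan_out)

# Initialization: multiply a base std by the parametrization's init scales (assumed usage).
torch.nn.init.normal_(
    layer.weight,
    std=base_std * parametrization.weight_init_scale(fan_in, fan_out, layer_type="hidden"),
)
torch.nn.init.normal_(
    layer.bias,
    std=base_std * parametrization.bias_init_scale(fan_in, fan_out, layer_type="hidden"),
)

# Learning rates: multiply a base lr by the per-parameter lr scales (assumed usage).
optimizer = torch.optim.Adam(
    [
        {
            "params": [layer.weight],
            "lr": base_lr * parametrization.weight_lr_scale(fan_in, fan_out, "adam", layer_type="hidden"),
        },
        {
            "params": [layer.bias],
            "lr": base_lr * parametrization.bias_lr_scale(fan_in, fan_out, "adam", layer_type="hidden"),
        },
    ],
    lr=base_lr,
)
```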
## Standard
Bases: Parametrization
Standard Parametrization (SP).
The default parametrization for neural networks in PyTorch. Also known as the 'fan_in' or 'LeCun' initialization. 'Kaiming' initialization is the same up to multiplicative constants.
Methods:

| Name | Description |
| --- | --- |
| `bias_init_scale` | Compute the bias initialization scale for the given layer. |
| `bias_lr_scale` | Compute the learning rate scale for the bias parameters. |
| `weight_init_scale` | Compute the weight initialization scale for the given layer. |
| `weight_lr_scale` | Compute the learning rate scale for the weight parameters. |
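For intuition, 'fan_in'/LeCun initialization draws weights with standard deviation on the order of \(1/\sqrt{\text{fan\_in}}\), which keeps preactivation variance roughly constant regardless of width at initialization. The snippet below checks this numerically; that `Standard.weight_init_scale` returns exactly this factor is an assumption of the sketch, not a statement from this reference.

```python
import torch

fan_in, fan_out = 1024, 1024
lecun_scale = fan_in ** -0.5  # the 'fan_in' / LeCun factor, 1/sqrt(fan_in)

x = torch.randn(4096, fan_in)                   # unit-variance inputs
w = torch.randn(fan_out, fan_in) * lecun_scale  # SP-style initialization
preactivations = x @ w.T

# Preactivation variance stays close to 1 no matter how large fan_in is.
print(preactivations.var().item())  # ~1.0
```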
## NeuralTangent
Bases: Parametrization
Neural Tangent Parametrization (NTP).
The neural tangent parametrization enables the theoretical analysis of the training dynamics of infinite-width neural networks ([Jacot et al., 2018](http://arxiv.org/abs/1806.07572)) via the neural tangent kernel. However, NTP does not admit feature learning, as features are effectively fixed at initialization.
Methods:

| Name | Description |
| --- | --- |
| `bias_init_scale` | Compute the bias initialization scale for the given layer. |
| `bias_lr_scale` | Compute the learning rate scale for the bias parameters. |
| `weight_init_scale` | Compute the weight initialization scale for the given layer. |
| `weight_lr_scale` | Compute the learning rate scale for the weight parameters. |
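As a generic sketch of how NTP differs from SP (this is the standard NTK convention, not necessarily the exact internals of this class): weights are drawn with unit variance and the layer output is multiplied by \(1/\sqrt{\text{fan\_in}}\). Both conventions give the same preactivation statistics at initialization, but they scale gradient updates differently with width, which is why features remain effectively frozen in the infinite-width limit.

```python
import torch

fan_in, fan_out = 1024, 1024
x = torch.randn(4096, fan_in)

# SP / LeCun: the 1/sqrt(fan_in) factor lives in the initialization.
w_sp = torch.randn(fan_out, fan_in) / fan_in ** 0.5
pre_sp = x @ w_sp.T

# NTP: unit-variance initialization, the factor lives in the forward pass.
w_ntp = torch.randn(fan_out, fan_in)
pre_ntp = (x @ w_ntp.T) / fan_in ** 0.5

# Same statistics at initialization; the training dynamics differ.
print(pre_sp.var().item(), pre_ntp.var().item())  # both ~1.0
```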
## MaximalUpdate
Bases: Parametrization
Maximal update parametrization (\(\mu P\)).
The maximal update parametrization ([Yang et al., 2021](http://arxiv.org/abs/2011.14522), [Yang et al., 2021b](http://arxiv.org/abs/2203.03466)) enforces:

- stable training dynamics, meaning (pre)activations and logits are independent of width at initialization, and features and logits do not explode during training;
- feature learning, meaning the model can learn representations from the data at any width.

This parametrization is particularly useful when training large models, since it maintains stable training dynamics at any width and enables hyperparameter transfer: the learning rate can be tuned on a small model and reused on a much larger one while retaining good performance.
Methods:

| Name | Description |
| --- | --- |
| `bias_init_scale` | Compute the bias initialization scale for the given layer. |
| `bias_lr_scale` | Compute the learning rate scale for the bias parameters. |
| `weight_init_scale` | Compute the weight initialization scale for the given layer. |
| `weight_lr_scale` | Compute the learning rate scale for the weight parameters. |
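The hyperparameter-transfer workflow described above can be sketched as follows: sweep the base learning rate on a narrow proxy model, then build the wide model with the same base rate and let the parametrization rescale it per layer. The import path, the no-argument constructor, the multiplicative use of the scales, and the helper `build_param_groups` are all assumptions for illustration.

```python
import torch

from parametrizations import MaximalUpdate  # import path assumed

mup = MaximalUpdate()  # no-argument constructor assumed


def build_param_groups(widths, base_lr, optimizer="adam"):
    """Hypothetical helper: an MLP (activations omitted) with per-layer muP learning rates."""
    layer_types = ["input"] + ["hidden"] * (len(widths) - 3) + ["output"]
    layers, groups = [], []
    for (fan_in, fan_out), kind in zip(zip(widths[:-1], widths[1:]), layer_types):
        layer = torch.nn.Linear(fan_in, fan_out)
        groups.append({
            "params": [layer.weight],
            "lr": base_lr * mup.weight_lr_scale(fan_in, fan_out, optimizer, layer_type=kind),
        })
        groups.append({
            "params": [layer.bias],
            "lr": base_lr * mup.bias_lr_scale(fan_in, fan_out, optimizer, layer_type=kind),
        })
        layers.append(layer)
    return torch.nn.Sequential(*layers), groups


# Tune base_lr on a narrow proxy model ...
small_model, small_groups = build_param_groups([32, 64, 64, 10], base_lr=3e-3)
# ... then reuse the tuned base_lr at a much larger width.
large_model, large_groups = build_param_groups([32, 4096, 4096, 10], base_lr=3e-3)
optimizer = torch.optim.Adam(large_groups, lr=3e-3)
```

Weight and bias initialization would be rescaled analogously via `weight_init_scale` and `bias_init_scale`.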