# parametrizations
Classes:

| Name | Description |
| --- | --- |
| `Parametrization` | Abstract base class for all neural network parametrizations. |
| `Standard` | Standard Parametrization (SP). |
| `NeuralTangent` | Neural Tangent Parametrization (NTP). |
| `MaximalUpdate` | Maximal update parametrization (\(\mu P\)). |
## Parametrization
Bases: ABC
Abstract base class for all neural network parametrizations.
Methods:

| Name | Description |
| --- | --- |
| `bias_init_scale` | Compute the bias initialization scale for the given layer. |
| `bias_lr_scale` | Compute the learning rate scale for the bias parameters. |
| `weight_init_scale` | Compute the weight initialization scale for the given layer. |
| `weight_lr_scale` | Compute the learning rate scale for the weight parameters. |
### bias_init_scale
Compute the bias initialization scale for the given layer.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `fan_in` | `int` | Number of inputs to the layer. | *required* |
| `fan_out` | `int` | Number of outputs from the layer. | *required* |
| `layer_type` | `LayerType` | Type of the layer. Can be one of `"input"`, `"hidden"`, or `"output"`. | `'hidden'` |
### bias_lr_scale
```python
bias_lr_scale(
    fan_in: int,
    fan_out: int,
    optimizer: str,
    layer_type: LayerType = "hidden",
) -> float
```
Compute the learning rate scale for the bias parameters.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `fan_in` | `int` | Number of inputs to the layer. | *required* |
| `fan_out` | `int` | Number of outputs from the layer. | *required* |
| `optimizer` | `str` | Optimizer being used. | *required* |
| `layer_type` | `LayerType` | Type of the layer. Can be one of `"input"`, `"hidden"`, or `"output"`. | `'hidden'` |
### weight_init_scale
Compute the weight initialization scale for the given layer.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `fan_in` | `int` | Number of inputs to the layer. | *required* |
| `fan_out` | `int` | Number of outputs from the layer. | *required* |
| `layer_type` | `LayerType` | Type of the layer. Can be one of `"input"`, `"hidden"`, or `"output"`. | `'hidden'` |
### weight_lr_scale
```python
weight_lr_scale(
    fan_in: int,
    fan_out: int,
    optimizer: str,
    layer_type: LayerType = "hidden",
) -> float
```
Compute the learning rate scale for the weight parameters.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `fan_in` | `int` | Number of inputs to the layer. | *required* |
| `fan_out` | `int` | Number of outputs from the layer. | *required* |
| `optimizer` | `str` | Optimizer being used. | *required* |
| `layer_type` | `LayerType` | Type of the layer. Can be one of `"input"`, `"hidden"`, or `"output"`. | `'hidden'` |
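For orientation, the sketch below wires the four scale factors into a single `torch.nn.Linear` layer: the init scales multiply a base standard deviation, and the lr scales multiply a base learning rate through per-parameter optimizer groups. The import path, the no-argument constructor, the `"adam"` optimizer string, and the convention that the scales are used as multiplicative factors are assumptions made for this illustration, not part of the reference.

```python
import torch

from parametrizations import Standard  # import path assumed

parametrization = Standard()  # no-argument constructor assumed
fan_in, fan_out = 256, 512
base_std, base_lr = 1.0, 1e-3

layer = torch.nn.Linear(fan_in, fan_out)

# Initialization: multiply a base std by the parametrization's init scales (assumed usage).
torch.nn.init.normal_(
    layer.weight,
    std=base_std * parametrization.weight_init_scale(fan_in, fan_out, layer_type="hidden"),
)
torch.nn.init.normal_(
    layer.bias,
    std=base_std * parametrization.bias_init_scale(fan_in, fan_out, layer_type="hidden"),
)

# Learning rates: multiply a base lr by the per-parameter lr scales (assumed usage).
optimizer = torch.optim.Adam(
    [
        {
            "params": [layer.weight],
            "lr": base_lr * parametrization.weight_lr_scale(fan_in, fan_out, "adam", layer_type="hidden"),
        },
        {
            "params": [layer.bias],
            "lr": base_lr * parametrization.bias_lr_scale(fan_in, fan_out, "adam", layer_type="hidden"),
        },
    ],
    lr=base_lr,
)
```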
## Standard
Bases: Parametrization
Standard Parametrization (SP).
The default parametrization for neural networks in PyTorch. Also known as the 'fan_in' or 'LeCun' initialization. 'Kaiming' initialization is the same up to multiplicative constants.
Methods:

| Name | Description |
| --- | --- |
| `bias_init_scale` | Compute the bias initialization scale for the given layer. |
| `bias_lr_scale` | Compute the learning rate scale for the bias parameters. |
| `weight_init_scale` | Compute the weight initialization scale for the given layer. |
| `weight_lr_scale` | Compute the learning rate scale for the weight parameters. |
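For intuition, 'fan_in'/LeCun initialization draws weights with standard deviation on the order of \(1/\sqrt{\text{fan\_in}}\), which keeps preactivation variance roughly constant regardless of width at initialization. The snippet below checks this numerically; that `Standard.weight_init_scale` returns exactly this factor is an assumption of the sketch, not a statement from this reference.

```python
import torch

fan_in, fan_out = 1024, 1024
lecun_scale = fan_in ** -0.5  # the 'fan_in' / LeCun factor, 1/sqrt(fan_in)

x = torch.randn(4096, fan_in)                   # unit-variance inputs
w = torch.randn(fan_out, fan_in) * lecun_scale  # SP-style initialization
preactivations = x @ w.T

# Preactivation variance stays close to 1 no matter how large fan_in is.
print(preactivations.var().item())  # ~1.0
```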
## NeuralTangent
Bases: Parametrization
Neural Tangent Parametrization (NTP).
The neural tangent parametrization enables the theoretical analysis of the training dynamics of infinite-width neural networks ([Jacot et al., 2018](http://arxiv.org/abs/1806.07572)) via the neural tangent kernel. However, NTP does not admit feature learning, as features are effectively fixed at initialization.
Methods:

| Name | Description |
| --- | --- |
| `bias_init_scale` | Compute the bias initialization scale for the given layer. |
| `bias_lr_scale` | Compute the learning rate scale for the bias parameters. |
| `weight_init_scale` | Compute the weight initialization scale for the given layer. |
| `weight_lr_scale` | Compute the learning rate scale for the weight parameters. |
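As a generic sketch of how NTP differs from SP (this is the standard NTK convention, not necessarily the exact internals of this class): weights are drawn with unit variance and the layer output is multiplied by \(1/\sqrt{\text{fan\_in}}\). Both conventions give the same preactivation statistics at initialization, but they scale gradient updates differently with width, which is why features remain effectively frozen in the infinite-width limit.

```python
import torch

fan_in, fan_out = 1024, 1024
x = torch.randn(4096, fan_in)

# SP / LeCun: the 1/sqrt(fan_in) factor lives in the initialization.
w_sp = torch.randn(fan_out, fan_in) / fan_in ** 0.5
pre_sp = x @ w_sp.T

# NTP: unit-variance initialization, the factor lives in the forward pass.
w_ntp = torch.randn(fan_out, fan_in)
pre_ntp = (x @ w_ntp.T) / fan_in ** 0.5

# Same statistics at initialization; the training dynamics differ.
print(pre_sp.var().item(), pre_ntp.var().item())  # both ~1.0
```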
## MaximalUpdate
Bases: Parametrization
Maximal update parametrization (\(\mu P\)).
The maximal update parametrization ([Yang et al., 2021](http://arxiv.org/abs/2011.14522), [Yang et al., 2021b](http://arxiv.org/abs/2203.03466)) enforces:

- stable training dynamics, meaning (pre)activations and logits are independent of width at initialization, and features and logits do not explode during training;
- feature learning, meaning the model can learn representations from the data at any width.

This parametrization is particularly useful when training large models, since it maintains stable training dynamics at any width and enables hyperparameter transfer: the learning rate can be tuned on a small model and reused on a much larger one while retaining good performance.
Methods:

| Name | Description |
| --- | --- |
| `bias_init_scale` | Compute the bias initialization scale for the given layer. |
| `bias_lr_scale` | Compute the learning rate scale for the bias parameters. |
| `weight_init_scale` | Compute the weight initialization scale for the given layer. |
| `weight_lr_scale` | Compute the learning rate scale for the weight parameters. |
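The hyperparameter-transfer workflow described above can be sketched as follows: sweep the base learning rate on a narrow proxy model, then build the wide model with the same base rate and let the parametrization rescale it per layer. The import path, the no-argument constructor, the multiplicative use of the scales, and the helper `build_param_groups` are all assumptions for illustration.

```python
import torch

from parametrizations import MaximalUpdate  # import path assumed

mup = MaximalUpdate()  # no-argument constructor assumed


def build_param_groups(widths, base_lr, optimizer="adam"):
    """Hypothetical helper: an MLP (activations omitted) with per-layer muP learning rates."""
    layer_types = ["input"] + ["hidden"] * (len(widths) - 3) + ["output"]
    layers, groups = [], []
    for (fan_in, fan_out), kind in zip(zip(widths[:-1], widths[1:]), layer_types):
        layer = torch.nn.Linear(fan_in, fan_out)
        groups.append({
            "params": [layer.weight],
            "lr": base_lr * mup.weight_lr_scale(fan_in, fan_out, optimizer, layer_type=kind),
        })
        groups.append({
            "params": [layer.bias],
            "lr": base_lr * mup.bias_lr_scale(fan_in, fan_out, optimizer, layer_type=kind),
        })
        layers.append(layer)
    return torch.nn.Sequential(*layers), groups


# Tune base_lr on a narrow proxy model ...
small_model, small_groups = build_param_groups([32, 64, 64, 10], base_lr=3e-3)
# ... then reuse the tuned base_lr at a much larger width.
large_model, large_groups = build_param_groups([32, 4096, 4096, 10], base_lr=3e-3)
optimizer = torch.optim.Adam(large_groups, lr=3e-3)
```

Weight and bias initialization would be rescaled analogously via `weight_init_scale` and `bias_init_scale`.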