parametrizations #

Classes:

| Name | Description |
| --- | --- |
| Parametrization | Abstract base class for all neural network parametrizations. |
| Standard | Standard Parametrization (SP). |
| NeuralTangent | Neural Tangent Parametrization (NTP). |
| MaximalUpdate | Maximal update parametrization (\(\mu P\)). |

Parametrization #

Bases: ABC

Abstract base class for all neural network parametrizations.

Methods:

| Name | Description |
| --- | --- |
| bias_init_scale | Compute the bias initialization scale for the given layer. |
| bias_lr_scale | Compute the learning rate scale for the bias parameters. |
| weight_init_scale | Compute the weight initialization scale for the given layer. |
| weight_lr_scale | Compute the learning rate scale for the weight parameters. |

bias_init_scale #

bias_init_scale(
    fan_in: int,
    fan_out: int,
    layer_type: LayerType = "hidden",
) -> float

Compute the bias initialization scale for the given layer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| fan_in | int | Number of inputs to the layer. | required |
| fan_out | int | Number of outputs from the layer. | required |
| layer_type | LayerType | Type of the layer. Can be one of "input", "hidden", or "output". | 'hidden' |

bias_lr_scale #

bias_lr_scale(
    fan_in: int,
    fan_out: int,
    optimizer: str,
    layer_type: LayerType = "hidden",
) -> float

Compute the learning rate scale for the bias parameters.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| fan_in | int | Number of inputs to the layer. | required |
| fan_out | int | Number of outputs from the layer. | required |
| optimizer | str | Optimizer being used. | required |
| layer_type | LayerType | Type of the layer. Can be one of "input", "hidden", or "output". | 'hidden' |

weight_init_scale #

weight_init_scale(
    fan_in: int,
    fan_out: int,
    layer_type: LayerType = "hidden",
) -> float

Compute the weight initialization scale for the given layer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| fan_in | int | Number of inputs to the layer. | required |
| fan_out | int | Number of outputs from the layer. | required |
| layer_type | LayerType | Type of the layer. Can be one of "input", "hidden", or "output". | 'hidden' |

weight_lr_scale #

weight_lr_scale(
    fan_in: int,
    fan_out: int,
    optimizer: str,
    layer_type: LayerType = "hidden",
) -> float

Compute the learning rate scale for the weight parameters.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| fan_in | int | Number of inputs to the layer. | required |
| fan_out | int | Number of outputs from the layer. | required |
| optimizer | str | Optimizer being used. | required |
| layer_type | LayerType | Type of the layer. Can be one of "input", "hidden", or "output". | 'hidden' |
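
A minimal sketch of how these four scale factors could be wired into layer setup. Treating the returned floats as multiplicative factors on the initialization standard deviation and on a base learning rate, the import path `parametrizations`, and the optimizer name "sgd" are assumptions for illustration, not guaranteed by this reference:

```python
import torch
from torch import nn

from parametrizations import Parametrization  # assumed import path

def init_layer(layer: nn.Linear, p: Parametrization, layer_type: str = "hidden") -> None:
    """Initialize a linear layer with scales from a parametrization (sketch)."""
    fan_in, fan_out = layer.in_features, layer.out_features
    with torch.no_grad():
        # Treat the scales as standard deviations for a Gaussian init (assumption).
        layer.weight.normal_(mean=0.0, std=p.weight_init_scale(fan_in, fan_out, layer_type))
        layer.bias.normal_(mean=0.0, std=p.bias_init_scale(fan_in, fan_out, layer_type))

def lr_groups(layer: nn.Linear, p: Parametrization, base_lr: float,
              optimizer: str = "sgd", layer_type: str = "hidden") -> list[dict]:
    """Per-parameter learning-rate groups for torch.optim (sketch)."""
    fan_in, fan_out = layer.in_features, layer.out_features
    return [
        {"params": [layer.weight],
         "lr": base_lr * p.weight_lr_scale(fan_in, fan_out, optimizer, layer_type)},
        {"params": [layer.bias],
         "lr": base_lr * p.bias_lr_scale(fan_in, fan_out, optimizer, layer_type)},
    ]
```

The concrete subclasses below fill in the actual scaling rules; the sketch only shows where each of the four methods would be consumed.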

Standard #

Bases: Parametrization

Standard Parametrization (SP).

The default parametrization for neural networks in PyTorch. Also known as the 'fan_in' or 'LeCun' initialization. 'Kaiming' initialization is the same up to multiplicative constants.

Methods:

- bias_init_scale
- bias_lr_scale
- weight_init_scale
- weight_lr_scale

bias_init_scale #

bias_init_scale(
    fan_in: int,
    fan_out: int,
    layer_type: LayerType = "hidden",
) -> float

bias_lr_scale #

bias_lr_scale(
    fan_in: int,
    fan_out: int,
    optimizer: str,
    layer_type: LayerType = "hidden",
) -> float

weight_init_scale #

weight_init_scale(
    fan_in: int,
    fan_out: int,
    layer_type: LayerType = "hidden",
) -> float

weight_lr_scale #

weight_lr_scale(
    fan_in: int,
    fan_out: int,
    optimizer: str,
    layer_type: LayerType = "hidden",
) -> float

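As a rough illustration only: under SP the weight initialization scale is expected to shrink like \(1/\sqrt{\text{fan-in}}\) (LeCun / fan_in scaling). Whether weight_init_scale returns exactly this value or only a value proportional to it is an assumption here, as are the no-argument constructor and the import path:

```python
import math

from parametrizations import Standard  # assumed import path

sp = Standard()  # assumed no-argument constructor
for fan_in in (64, 256, 1024):
    scale = sp.weight_init_scale(fan_in=fan_in, fan_out=128, layer_type="hidden")
    # Expected to track 1 / sqrt(fan_in) up to a multiplicative constant.
    print(fan_in, scale, 1 / math.sqrt(fan_in))
```
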
NeuralTangent #

Bases: Parametrization

Neural Tangent Parametrization (NTP).

The neural tangent parametrization enables the theoretical analysis of the training dynamics of infinite-width neural networks ([Jacot et al., 2018](http://arxiv.org/abs/1806.07572)) via the neural tangent kernel. However, NTP does not admit feature learning, as features are effectively fixed at initialization.

Methods:

- bias_init_scale
- bias_lr_scale
- weight_init_scale
- weight_lr_scale

bias_init_scale #

bias_init_scale(
    fan_in: int,
    fan_out: int,
    layer_type: LayerType = "hidden",
) -> float

bias_lr_scale #

bias_lr_scale(
    fan_in: int,
    fan_out: int,
    optimizer: str,
    layer_type: LayerType = "hidden",
) -> float

weight_init_scale #

weight_init_scale(
    fan_in: int,
    fan_out: int,
    layer_type: LayerType = "hidden",
) -> float

weight_lr_scale #

weight_lr_scale(
    fan_in: int,
    fan_out: int,
    optimizer: str,
    layer_type: LayerType = "hidden",
) -> float

MaximalUpdate #

Bases: Parametrization

Maximal update parametrization (\(\mu P\)).

The maximal update parametrization ([Yang et al., 2021](http://arxiv.org/abs/2011.14522), [Yang et al., 2021b](http://arxiv.org/abs/2203.03466)) enforces:

- stable training dynamics, meaning (pre-)activations and logits are independent of width at initialization, and features and logits do not explode during training;
- feature learning, meaning the model can learn representations from the data at any width.

This parametrization is particularly useful when training large models, since it maintains stable training dynamics at any width and enables hyperparameter transfer: one can tune the learning rate on a small model and reuse the tuned learning rate on a large model while retaining good generalization.
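
A minimal sketch of hyperparameter transfer with \(\mu P\): tune the base learning rate once on a narrow proxy model, then reuse it for a wider model and let weight_lr_scale / bias_lr_scale supply the per-layer corrections. The no-argument constructor, the import path, the optimizer name "sgd", and treating the scales as multiplicative factors on the base learning rate are assumptions:

```python
import torch
from torch import nn

from parametrizations import MaximalUpdate  # assumed import path

mup = MaximalUpdate()  # assumed no-argument constructor
base_lr = 0.1          # tuned once on a small-width proxy model

def mup_param_groups(model: nn.Sequential, base_lr: float, optimizer: str = "sgd") -> list[dict]:
    """Per-layer learning-rate groups under muP (sketch)."""
    linears = [m for m in model if isinstance(m, nn.Linear)]
    groups = []
    for i, m in enumerate(linears):
        # First layer reads the data, last layer produces the logits.
        layer_type = "input" if i == 0 else "output" if i == len(linears) - 1 else "hidden"
        fan_in, fan_out = m.in_features, m.out_features
        groups.append({"params": [m.weight],
                       "lr": base_lr * mup.weight_lr_scale(fan_in, fan_out, optimizer, layer_type)})
        groups.append({"params": [m.bias],
                       "lr": base_lr * mup.bias_lr_scale(fan_in, fan_out, optimizer, layer_type)})
    return groups

# The same base_lr tuned on the proxy is reused at a much larger width.
wide_model = nn.Sequential(
    nn.Linear(32, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 10),
)
opt = torch.optim.SGD(mup_param_groups(wide_model, base_lr), lr=base_lr)
```
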

Methods:

- bias_init_scale
- bias_lr_scale
- weight_init_scale
- weight_lr_scale

bias_init_scale #

bias_init_scale(
    fan_in: int,
    fan_out: int,
    layer_type: LayerType = "hidden",
) -> float

bias_lr_scale #

bias_lr_scale(
    fan_in: int,
    fan_out: int,
    optimizer: str,
    layer_type: LayerType = "hidden",
) -> float

weight_init_scale #

weight_init_scale(
    fan_in: int,
    fan_out: int,
    layer_type: LayerType = "hidden",
) -> float

weight_lr_scale #

weight_lr_scale(
    fan_in: int,
    fan_out: int,
    optimizer: str,
    layer_type: LayerType = "hidden",
) -> float