Mean and Variance
In the context of computing statistics over activations (especially in neural networks), mean and variance describe the statistical properties of the activation values across a set of neurons or data samples.
1. Mean
- Definition: The average value of the activations.
- Formula (for values $x_1, x_2, \dots, x_n$): $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Interpretation in activations:
- Measures the central tendency of the activation values.
- A mean far from zero may cause bias shifts in the network.
- For example, if your ReLU outputs mostly large positive values, the mean might drift upward, which can impact learning stability.
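To make this concrete, here is a minimal NumPy sketch (the activation values are made up for illustration) that computes each neuron's mean output over a batch:

```python
import numpy as np

# Made-up activations for one layer: a batch of 4 samples, 3 neurons each
activations = np.array([
    [0.2, 1.5, 0.0],
    [0.8, 2.1, 0.1],
    [0.5, 1.9, 0.0],
    [0.3, 1.7, 0.2],
])

# Per-neuron mean across the batch (axis 0): how far each neuron's
# typical output sits from zero
per_neuron_mean = activations.mean(axis=0)
print(per_neuron_mean)  # per-neuron means: 0.45, 1.8, 0.075
```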
2. Variance
- Definition: A measure of how much the activation values spread out from the mean.
- Formula: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
- Interpretation in activations:
- Measures the diversity or dispersion of activation values.
- High variance: activations are very spread out, which can cause exploding gradients.
- Low variance: activations are too similar, which can cause vanishing gradients.
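Extending the same made-up example, the per-neuron variance can be computed the same way; standardizing (subtract the mean, divide by the standard deviation) is a preview of what the normalization layers discussed below do to keep both statistics in a healthy range:

```python
import numpy as np

# Same made-up layer activations as in the mean example above
activations = np.array([
    [0.2, 1.5, 0.0],
    [0.8, 2.1, 0.1],
    [0.5, 1.9, 0.0],
    [0.3, 1.7, 0.2],
])

mean = activations.mean(axis=0)   # per-neuron mean
var = activations.var(axis=0)     # per-neuron variance: spread around that mean

# Standardize each neuron: zero mean, unit variance (epsilon avoids division by zero)
standardized = (activations - mean) / np.sqrt(var + 1e-5)

print(var)                                 # e.g. [0.0525, 0.05, 0.006875]
print(standardized.mean(axis=0).round(3))  # close to [0, 0, 0]
print(standardized.var(axis=0).round(3))   # close to [1, 1, 1]
```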
Exploding Gradients and Vanishing Gradients
1. Exploding Gradients
- What it is: Gradients grow excessively large during backpropagation.
- Cause:
- Often happens in deep networks or RNNs when weights and activations multiply repeatedly, compounding values.
- High variance in activations can amplify this effect.
- Effects:
- Weight updates become huge.
- Model parameters diverge instead of converging.
- Loss can oscillate wildly or become NaN.
- Analogy: Like multiplying by a number > 1 many times — values blow up quickly.
- Fixes:
- Gradient clipping (limit max gradient value).
- Proper weight initialization (e.g., Xavier/He init).
- Using normalization layers (BatchNorm, LayerNorm).
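As one concrete illustration of these fixes, here is a minimal PyTorch training-step sketch (the toy model, data, and max_norm value are placeholders, not from the text) that applies gradient clipping before the optimizer update:

```python
import torch
import torch.nn as nn

# Toy model and data, just to show where clipping fits in a training step
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 10), torch.randn(16, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Gradient clipping: rescale gradients so their global L2 norm is at most 1.0,
# preventing a single huge update when gradients explode
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```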
2. Vanishing Gradients
- What it is: Gradients shrink toward zero during backpropagation.
- Cause:
- Often happens in deep networks or RNNs when weights and activations multiply repeatedly by numbers between 0 and 1.
- Low variance in activations means most neurons output similar values, producing small derivatives.
- Effects:
- Early layers stop learning because their weight updates become tiny.
- Training becomes extremely slow or stalls entirely.
- Analogy: Like multiplying by a fraction repeatedly — values fade toward zero.
- Fixes:
- Use activation functions that preserve gradients better (ReLU, GELU instead of sigmoid/tanh in deep layers).
- Proper initialization.
- Residual connections (ResNets).
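To illustrate the residual-connection fix, here is a minimal PyTorch sketch (names such as ResidualBlock are mine, not from the text) where the identity skip path gives gradients a direct route past the nonlinearities:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: output = x + F(x).

    The identity path gives gradients a direct route back to earlier layers,
    so they do not have to pass through every nonlinearity, which mitigates
    vanishing gradients in deep stacks.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),            # gradient-friendly activation instead of sigmoid/tanh
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # skip connection

# Stack many blocks; the skip connections keep early layers trainable
deep_net = nn.Sequential(*[ResidualBlock(64) for _ in range(20)])
out = deep_net(torch.randn(8, 64))
print(out.shape)  # torch.Size([8, 64])
```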
Why They Matter in Neural Networks
- Mean & variance control is essential for stable training.
- Methods like Batch Normalization, Layer Normalization, and Weight Initialization (e.g., Xavier, He) explicitly compute and adjust these values to:
- Keep activations centered near zero.
- Maintain variance at a level that preserves gradient signal strength.
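As a small illustration of this point, the PyTorch sketch below (with made-up input statistics) applies LayerNorm to off-center, high-variance activations and shows them being pulled toward zero mean and unit variance:

```python
import torch
import torch.nn as nn

# Made-up activations with an off-center mean (~3) and large spread (~2.5)
activations = 3.0 + 2.5 * torch.randn(32, 64)   # batch of 32 samples, 64 features

layer_norm = nn.LayerNorm(64)                   # normalizes over the feature dimension
normalized = layer_norm(activations)

# LayerNorm recenters and rescales each sample's features
# (up to its learnable scale/shift parameters, which start at 1 and 0)
print(activations.mean().item(), activations.std().item())  # roughly 3.0 and 2.5
print(normalized.mean().item(), normalized.std().item())    # roughly 0.0 and 1.0
```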
Example in Activations:
Suppose after a layer you have activations such as $[0.1, -0.2, 0.4, 0.3, -0.1]$ (illustrative values):
- Mean: $\mu = \frac{0.1 - 0.2 + 0.4 + 0.3 - 0.1}{5} = 0.1$
- Variance: $\sigma^2 = \frac{0^2 + (-0.3)^2 + 0.3^2 + 0.2^2 + (-0.2)^2}{5} = 0.052$
These values tell you:
- Activations are slightly biased positive (mean $= 0.1 > 0$).
- There's moderate spread in the values (variance $\approx 0.05$).
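As a quick check, the same numbers can be reproduced with NumPy, using the illustrative activations from the example above (np.var computes the population variance used in the formula):

```python
import numpy as np

# Illustrative activations from the worked example above (not real network outputs)
activations = np.array([0.1, -0.2, 0.4, 0.3, -0.1])

mean = activations.mean()       # central tendency
variance = activations.var()    # population variance (matches the formula above)

print(f"mean = {mean:.3f}")          # mean = 0.100
print(f"variance = {variance:.3f}")  # variance = 0.052
```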