Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization method. However, as Defazio & Bottou (2019) highlight, its effectiveness in deep learning is yet to be proven. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks. Our empirical analysis finds that, for deeper neural networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses. Inspired by this, we introduce a multiplicative coefficient \(\alpha\) to control the strength and adjust it through a linear decay schedule. We name our method \(\alpha\)-SVRG. Our results show \(\alpha\)-SVRG better optimizes models, consistently reducing training loss compared to the baseline and standard SVRG across various model architectures and multiple image classification datasets. We hope our findings encourage further exploration into variance reduction techniques in deep learning.
Proposed by Johnson & Zhang (2013), SVRG is a simple approach for reducing gradient variance in SGD, and it works well for simple machine learning models such as logistic regression. At each step, it corrects the mini-batch gradient \(\nabla f_i(\theta^t)\) at the current weights \(\theta^t\) using the same mini-batch's gradient at a past snapshot of the weights \(\theta^{\text{snapshot}}\), centered by the full gradient \(\nabla f(\theta^{\text{snapshot}})\): $$g_i^t = \nabla f_i(\theta^t)-(\nabla f_i(\theta^{\text{snapshot}})-\nabla f(\theta^{\text{snapshot}}))$$
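Below is a minimal NumPy sketch of this estimate, assuming a logistic regression objective and plain SGD updates; the function names and synthetic data are illustrative, not from the paper.

```python
import numpy as np

def logistic_grad(theta, x, y):
    """Gradient of the logistic loss for a single example (x, y), y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-x @ theta))
    return (p - y) * x

def svrg_gradient(theta, theta_snapshot, full_grad_snapshot, x, y):
    """SVRG estimate g_i^t: current gradient minus the (mean-centered)
    gradient of the same example at the snapshot weights."""
    return (logistic_grad(theta, x, y)
            - (logistic_grad(theta_snapshot, x, y) - full_grad_snapshot))

# Illustrative usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = (X @ rng.normal(size=5) > 0).astype(float)

theta, lr = np.zeros(5), 0.1
for epoch in range(5):
    # Take a snapshot and compute the full gradient at it once per epoch.
    theta_snapshot = theta.copy()
    full_grad = np.mean([logistic_grad(theta_snapshot, x, y)
                         for x, y in zip(X, Y)], axis=0)
    for i in rng.permutation(len(X)):
        g = svrg_gradient(theta, theta_snapshot, full_grad, X[i], Y[i])
        theta -= lr * g
```

Because the correction term has zero mean, \(g_i^t\) remains an unbiased estimate of the full gradient.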
We use the following metrics to measure the gradient variance during training, where \(g_1^t,\dots,g_N^t\in\mathbb{R}^d\) are the mini-batch gradients at step \(t\) and \(g^t=\frac{1}{N}\sum_{i=1}^N g_i^t\) is their mean (a code sketch of all three metrics follows the table):
| name | formula | description |
| --- | --- | --- |
| metric 1 | \(\frac{2}{N(N-1)}\sum_{i<j}\frac{1}{2}\left(1-\frac{\langle g_i^t,g_j^t\rangle}{\|g_i^t\|_2\|g_j^t\|_2}\right)\) | the directional variance of the gradients (average pairwise cosine dissimilarity) |
| metric 2 | \(\sum_{k=1}^d\text{Var}_i(g_{i,k}^t)\) | the variance of the gradients summed across components (trace of the covariance matrix) |
| metric 3 | \(\lambda_{\max}\left(\frac{1}{N}\sum_{i=1}^N(g_i^t-g^t)(g_i^t-g^t)^{\top}\right)\) | the magnitude of the most significant direction of variation (largest eigenvalue of the covariance matrix) |
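A minimal NumPy sketch of the three metrics, assuming the per-mini-batch gradients have been flattened and stacked into an \(N \times d\) array; the function name is illustrative.

```python
import numpy as np

def variance_metrics(G):
    """G: array of shape (N, d); row i is the flattened gradient g_i^t."""
    N, d = G.shape
    # Metric 1: average pairwise cosine dissimilarity over unordered pairs.
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
    cos = Gn @ Gn.T
    iu = np.triu_indices(N, k=1)          # indices of pairs i < j
    metric1 = np.mean(0.5 * (1.0 - cos[iu]))
    # Metric 2: sum of per-component variances = trace of the covariance matrix.
    metric2 = G.var(axis=0).sum()
    # Metric 3: largest eigenvalue of the empirical covariance matrix.
    C = np.cov(G, rowvar=False, bias=True)  # (1/N) sum (g_i - mean)(g_i - mean)^T
    metric3 = np.linalg.eigvalsh(np.atleast_2d(C)).max()
    return metric1, metric2, metric3
```

Note that metric 2 is the trace of the same covariance matrix used in metric 3, so the two differ only in whether variance is aggregated over all directions or measured along the dominant one.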
We observe that SVRG can even increase gradient variance on MLP-4, leading to slower convergence.
Why does SVRG increase gradient variance on deeper models?
The control variates method reduces the variance of an estimate \(\textnormal{X}\) using another correlated random variable \(\textnormal{Y}\) whose mean is known, by forming the corrected estimator \(\textnormal{X}-\alpha(\textnormal{Y}-\mathbb{E}[\textnormal{Y}])\). Its variance is $$\text{Var}(\textnormal{X})-2\alpha\,\text{Cov}(\textnormal{X},\textnormal{Y})+\alpha^2\,\text{Var}(\textnormal{Y}),$$ and minimizing this quadratic in \(\alpha\) gives the optimal coefficient $$\alpha^{*}=\frac{\text{Cov}(\textnormal{X},\textnormal{Y})}{\text{Var}(\textnormal{Y})}.$$
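A quick Monte Carlo check of this identity in NumPy; the distributions and numbers are illustrative, not from the paper.

```python
import numpy as np

# Estimate E[X] for X correlated with Y, where E[Y] is known exactly.
rng = np.random.default_rng(0)
Y = rng.normal(size=100_000)            # E[Y] = 0 by construction
X = 2.0 * Y + rng.normal(size=100_000)  # X strongly correlated with Y

# Optimal coefficient alpha* = Cov(X, Y) / Var(Y).
alpha_star = np.cov(X, Y, bias=True)[0, 1] / Y.var()
Z = X - alpha_star * (Y - 0.0)          # control-variate-corrected estimate

print(X.var(), Z.var())  # the corrected estimate has far smaller variance
```

Here \(\alpha^{*}\approx 2\), and the variance drops from about 5 to about 1.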
We introduce a coefficient vector \(\alpha^t\) to SVRG and apply control variates to each component: $$g_i^t = \nabla f_i(\theta^t)-\alpha^t\odot(\nabla f_i(\theta^{\text{snapshot}})-\nabla f(\theta^{\text{snapshot}}))$$ Taking \(\textnormal{X}\) to be a component of \(\nabla f_i(\theta^t)\) and \(\textnormal{Y}\) the same component of \(\nabla f_i(\theta^{\text{snapshot}})\), whose mean \(\nabla f(\theta^{\text{snapshot}})\) is known, the derivation above gives the component-wise optimal coefficient $$\alpha_k^{t*}=\frac{\text{Cov}\left(\nabla f_i(\theta^t)_k,\,\nabla f_i(\theta^{\text{snapshot}})_k\right)}{\text{Var}\left(\nabla f_i(\theta^{\text{snapshot}})_k\right)}.$$
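As an illustration, this component-wise coefficient could be estimated from a set of per-mini-batch gradients at the current and snapshot weights; the sketch below assumes such gradients are available and is not the paper's exact procedure.

```python
import numpy as np

def optimal_coefficient(G_cur, G_snap, eps=1e-12):
    """Estimate alpha*_k = Cov(X_k, Y_k) / Var(Y_k) per component.
    G_cur:  (N, d) mini-batch gradients at the current weights (the X's).
    G_snap: (N, d) mini-batch gradients at the snapshot weights (the Y's)."""
    Xc = G_cur - G_cur.mean(axis=0)
    Yc = G_snap - G_snap.mean(axis=0)
    cov_xy = (Xc * Yc).mean(axis=0)   # per-component covariance
    var_y = (Yc ** 2).mean(axis=0)    # per-component variance
    return cov_xy / (var_y + eps)     # eps guards against zero variance
```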
How do we approximate the optimal coefficient?
Computing the optimal coefficient at every step requires per-component gradient statistics that are expensive to gather. Instead, we propose to apply a linearly decreasing coefficient \(\alpha\) to control the variance reduction strength, decaying it from an initial value \(\alpha_0\) toward 0 over training. This achieves a gradient variance reduction effect similar to that of SVRG with the optimal coefficient.
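Putting the pieces together, here is a minimal sketch of the \(\alpha\)-SVRG estimate with a linear decay schedule; the initial value \(\alpha_0 = 0.5\) and the function names are illustrative choices, not the paper's tuned settings.

```python
def alpha_schedule(t, total_steps, alpha0=0.5):
    """Linearly decay the coefficient from alpha0 to 0 over training.
    alpha0 = 0.5 is an illustrative default, not the paper's tuned value."""
    return alpha0 * (1.0 - t / total_steps)

def alpha_svrg_gradient(grad_cur, grad_snap, full_grad_snap, alpha):
    """alpha-SVRG estimate: SVRG with its variance reduction term scaled by alpha.
    alpha = 1 recovers standard SVRG; alpha = 0 recovers plain SGD."""
    return grad_cur - alpha * (grad_snap - full_grad_snap)
```

These functions drop into the SVRG loop sketched earlier: compute `alpha = alpha_schedule(t, total_steps)` at each step and use `alpha_svrg_gradient` in place of the standard SVRG estimate.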
@inproceedings{yin2023coefficient,
title={A Coefficient Makes SVRG Effective},
author={Yida Yin and Zhiqiu Xu and Zhiyuan Li and Trevor Darrell and Zhuang Liu},
year={2025},
booktitle={ICLR},
}