Gradient descent

General gradient descent algorithm

Iteratively update the parameters $\mathbf{x}$ by making small adjustments that decrease $f(\mathbf{x})$.

In particular, update $\mathbf{x} \leftarrow \mathbf{x} + \eta\mathbf{v}$, where $\eta > 0$ is the step size and $\mathbf{v}$ is a direction along which $f$ decreases.
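As an illustration only (not from the source), a minimal Python sketch of this generic scheme, where the caller supplies the update direction $\mathbf{v}$ at each iterate; the function name and the quadratic example are hypothetical choices:

```python
import numpy as np

def iterative_descent(direction, x0, eta=0.1, num_iters=100):
    """Generic scheme: repeatedly apply x <- x + eta * v, where v = direction(x)
    is chosen so that f decreases (hypothetical helper, for illustration)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x + eta * direction(x)
    return x

# Example: for f(x) = ||x||_2^2, the direction v = -grad f(x) = -2x decreases f;
# the iterates shrink toward the minimizer x = 0.
x_min = iterative_descent(lambda x: -2.0 * x, x0=[3.0, -4.0], eta=0.1)
```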

Gradient descent method

Iteratively update the current estimate in the direction opposite to the gradient: $\mathbf{x}^{(l+1)} = \mathbf{x}^{(l)} - \alpha \left.\frac{\partial J}{\partial \mathbf{x}}\right|_{\mathbf{x}^{(l)}}$. The solution depends on the initial condition: if the step size is chosen properly, the method reaches the local minimum closest to the initial condition.

Yields the global optimum if $J$ is convex, regardless of the initial solution.
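A minimal runnable sketch of this update in Python (my own illustration, not from the source); the least-squares objective $J(x) = \|Ax - b\|_2^2$ and all names are hypothetical choices, picked because $J$ is convex, so the iterates approach the global minimum from any starting point:

```python
import numpy as np

def gradient_descent(grad_J, x0, alpha=0.05, num_iters=500):
    """Implements x^(l+1) = x^(l) - alpha * dJ/dx evaluated at x^(l)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad_J(x)
    return x

# Convex example: J(x) = ||A x - b||_2^2 with gradient 2 A^T (A x - b).
A = np.array([[2.0, 0.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])
grad_J = lambda x: 2.0 * A.T @ (A @ x - b)

x_hat = gradient_descent(grad_J, x0=np.random.randn(2))
print(x_hat, np.linalg.solve(A, b))  # the two should approximately agree
```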

Gradient descent analysis

Assume: $f$ is convex, its gradient is bounded, $\|\nabla f(x)\|_2 \leq G$ for all $x$, and the initial distance to a minimizer is bounded, $\|x_0 - x^*\|_2 \leq D$.

Gradient descent: starting from $x_0$, iterate $x_{t+1} = x_t - \eta\,\nabla f(x_t)$ for $t = 0, \ldots, T-1$ with fixed step size $\eta = \frac{D}{G\sqrt{T}}$, and return the average iterate $\hat{x} = \frac{1}{T}\sum_{t=0}^{T-1} x_t$.

Gradient descent convergence bound
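The standard guarantee under the assumptions and step size above (as stated, e.g., in references [1] and [4]) applies to the average iterate $\hat{x}$:

$$f(\hat{x}) - f(x^*) \;\leq\; \frac{DG}{\sqrt{T}},$$

so $T \geq \frac{D^2 G^2}{\epsilon^2}$ iterations suffice to guarantee error at most $\epsilon$. A sketch of the argument: by convexity, $f(x_t) - f(x^*) \leq \nabla f(x_t)^\top (x_t - x^*)$, and expanding the update rule,

$$\|x_{t+1} - x^*\|_2^2 = \|x_t - x^*\|_2^2 - 2\eta\,\nabla f(x_t)^\top (x_t - x^*) + \eta^2\|\nabla f(x_t)\|_2^2,$$

so

$$\nabla f(x_t)^\top (x_t - x^*) = \frac{\|x_t - x^*\|_2^2 - \|x_{t+1} - x^*\|_2^2}{2\eta} + \frac{\eta}{2}\|\nabla f(x_t)\|_2^2.$$

Summing over $t = 0, \ldots, T-1$, the first term telescopes to at most $\frac{D^2}{2\eta}$ and the second is at most $\frac{\eta T G^2}{2}$. Dividing by $T$ and plugging in $\eta = \frac{D}{G\sqrt{T}}$ gives $\frac{1}{T}\sum_t \big(f(x_t) - f(x^*)\big) \leq \frac{DG}{\sqrt{T}}$, and by Jensen's inequality this average upper-bounds $f(\hat{x}) - f(x^*)$.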


Gradient descent for constrained optimization

$D$ is the diameter of $\mathcal{K}$, or, if unconstrained, simply an upper bound on $\|x_0 - x^*\|$. $G$ is an upper bound on the size of $f$'s gradient, i.e. $\|\nabla f(x)\|_2 \leq G,\ \forall x$.
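Reference [1] handles the constrained case with projected gradient descent: take a gradient step, then project back onto $\mathcal{K}$; the same $\frac{DG}{\sqrt{T}}$ bound goes through because Euclidean projection is non-expansive. A minimal Python sketch under the assumption that $\mathcal{K}$ is a Euclidean ball; the radius, objective, and function names are my own illustrative choices:

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Euclidean projection onto K = {z : ||z||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def projected_gradient_descent(grad_f, x0, eta, num_iters, project):
    """Gradient step followed by projection back onto the feasible set K;
    returns the average iterate, matching the bound stated above."""
    x = project(np.asarray(x0, dtype=float))
    iterates = [x]
    for _ in range(num_iters):
        x = project(x - eta * grad_f(x))
        iterates.append(x)
    return np.mean(iterates, axis=0)

# Example: minimize f(x) = ||x - c||_2^2 over the unit ball, with c outside it;
# the constrained minimizer is c / ||c||_2.
c = np.array([2.0, 2.0])
grad_f = lambda x: 2.0 * (x - c)
x_hat = projected_gradient_descent(grad_f, x0=np.zeros(2), eta=0.05,
                                   num_iters=1000, project=project_onto_ball)
print(x_hat)  # approximately c / ||c||_2 = [0.707, 0.707]
```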


See: Stochastic gradient descent, Gradient


References:

  1. https://www.chrismusco.com/amlds2023/notes/lecture06.html#Projected_Gradient_Descent
  2. https://www.cs.princeton.edu/courses/archive/fall18/cos521/Lectures/lec16.pdf
  3. https://en.wikipedia.org/wiki/Gradient_descent
  4. https://www.stat.cmu.edu/~ryantibs/convexopt-F13/scribes/lec6.pdf