Gradient descent

General gradient descent algorithm

Iteratively update the parameters $\mathbf{x}$ by making small adjustments that decrease $f(\mathbf{x})$.

In particular, update $\mathbf{x} \leftarrow \mathbf{x} + \eta\mathbf{v}$, where $\eta > 0$ is the step size and $\mathbf{v}$ is a direction along which $f$ decreases.
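As an illustration only (not from the source), a minimal Python sketch of this generic scheme, where the caller supplies the update direction $\mathbf{v}$ at each iterate; the function name and the quadratic example are hypothetical choices:

```python
import numpy as np

def iterative_descent(direction, x0, eta=0.1, num_iters=100):
    """Generic scheme: repeatedly apply x <- x + eta * v, where v = direction(x)
    is chosen so that f decreases (hypothetical helper, for illustration)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x + eta * direction(x)
    return x

# Example: for f(x) = ||x||_2^2, the direction v = -grad f(x) = -2x decreases f;
# the iterates shrink toward the minimizer x = 0.
x_min = iterative_descent(lambda x: -2.0 * x, x0=[3.0, -4.0], eta=0.1)
```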

Gradient descent method

Iteratively update the current estimate in the direction opposite to the gradient: $\mathbf{x}^{(l+1)} = \mathbf{x}^{(l)} - \alpha \left.\frac{\partial J}{\partial \mathbf{x}}\right|_{\mathbf{x}^{(l)}}$. The solution depends on the initial condition: if the step size is chosen properly, the method reaches the local minimum closest to the initial condition.

Yields the global optimum if $J$ is convex, regardless of the initial solution.
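A minimal runnable sketch of this update in Python (my own illustration, not from the source); the least-squares objective $J(x) = \|Ax - b\|_2^2$ and all names are hypothetical choices, picked because $J$ is convex, so the iterates approach the global minimum from any starting point:

```python
import numpy as np

def gradient_descent(grad_J, x0, alpha=0.05, num_iters=500):
    """Implements x^(l+1) = x^(l) - alpha * dJ/dx evaluated at x^(l)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad_J(x)
    return x

# Convex example: J(x) = ||A x - b||_2^2 with gradient 2 A^T (A x - b).
A = np.array([[2.0, 0.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])
grad_J = lambda x: 2.0 * A.T @ (A @ x - b)

x_hat = gradient_descent(grad_J, x0=np.random.randn(2))
print(x_hat, np.linalg.solve(A, b))  # the two should approximately agree
```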

Gradient descent analysis

Assume: $f$ is convex, its gradient is bounded, $\|\nabla f(x)\|_2 \leq G$ for all $x$, and the initial distance to a minimizer is bounded, $\|x_0 - x^*\|_2 \leq D$.

Gradient descent: starting from $x_0$, iterate $x_{t+1} = x_t - \eta\,\nabla f(x_t)$ for $t = 0, \ldots, T-1$ with fixed step size $\eta = \frac{D}{G\sqrt{T}}$, and return the average iterate $\hat{x} = \frac{1}{T}\sum_{t=0}^{T-1} x_t$.

Gradient descent convergence bound
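The standard guarantee under the assumptions and step size above (as stated, e.g., in references [1] and [4]) applies to the average iterate $\hat{x}$:

$$f(\hat{x}) - f(x^*) \;\leq\; \frac{DG}{\sqrt{T}},$$

so $T \geq \frac{D^2 G^2}{\epsilon^2}$ iterations suffice to guarantee error at most $\epsilon$. A sketch of the argument: by convexity, $f(x_t) - f(x^*) \leq \nabla f(x_t)^\top (x_t - x^*)$, and expanding the update rule,

$$\|x_{t+1} - x^*\|_2^2 = \|x_t - x^*\|_2^2 - 2\eta\,\nabla f(x_t)^\top (x_t - x^*) + \eta^2\|\nabla f(x_t)\|_2^2,$$

so

$$\nabla f(x_t)^\top (x_t - x^*) = \frac{\|x_t - x^*\|_2^2 - \|x_{t+1} - x^*\|_2^2}{2\eta} + \frac{\eta}{2}\|\nabla f(x_t)\|_2^2.$$

Summing over $t = 0, \ldots, T-1$, the first term telescopes to at most $\frac{D^2}{2\eta}$ and the second is at most $\frac{\eta T G^2}{2}$. Dividing by $T$ and plugging in $\eta = \frac{D}{G\sqrt{T}}$ gives $\frac{1}{T}\sum_t \big(f(x_t) - f(x^*)\big) \leq \frac{DG}{\sqrt{T}}$, and by Jensen's inequality this average upper-bounds $f(\hat{x}) - f(x^*)$.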


Gradient descent for constrained optimization

$D$ is the diameter of $\mathcal{K}$, or, if unconstrained, simply an upper bound on $\|x_0 - x^*\|$. $G$ is an upper bound on the size of $f$'s gradient, i.e. $\|\nabla f(x)\|_2 \leq G,\ \forall x$.
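Reference [1] handles the constrained case with projected gradient descent: take a gradient step, then project back onto $\mathcal{K}$; the same $\frac{DG}{\sqrt{T}}$ bound goes through because Euclidean projection is non-expansive. A minimal Python sketch under the assumption that $\mathcal{K}$ is a Euclidean ball; the radius, objective, and function names are my own illustrative choices:

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Euclidean projection onto K = {z : ||z||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def projected_gradient_descent(grad_f, x0, eta, num_iters, project):
    """Gradient step followed by projection back onto the feasible set K;
    returns the average iterate, matching the bound stated above."""
    x = project(np.asarray(x0, dtype=float))
    iterates = [x]
    for _ in range(num_iters):
        x = project(x - eta * grad_f(x))
        iterates.append(x)
    return np.mean(iterates, axis=0)

# Example: minimize f(x) = ||x - c||_2^2 over the unit ball, with c outside it;
# the constrained minimizer is c / ||c||_2.
c = np.array([2.0, 2.0])
grad_f = lambda x: 2.0 * (x - c)
x_hat = projected_gradient_descent(grad_f, x0=np.zeros(2), eta=0.05,
                                   num_iters=1000, project=project_onto_ball)
print(x_hat)  # approximately c / ||c||_2 = [0.707, 0.707]
```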


See: Stochastic gradient descent, Gradient


References:

  1. https://www.chrismusco.com/amlds2023/notes/lecture06.html#Projected_Gradient_Descent
  2. https://www.cs.princeton.edu/courses/archive/fall18/cos521/Lectures/lec16.pdf
  3. https://en.wikipedia.org/wiki/Gradient_descent
  4. https://www.stat.cmu.edu/~ryantibs/convexopt-F13/scribes/lec6.pdf