FISTA

“Fast iterative shrinkage-thresholding algorithm”(FISTA) is a proximal gradient method that aims to solve convex optimization problems of the form:

\[\min_x f(x) = g(x) + h(x)\]

where $g$ is a smooth convex function with a Lipschitz continuous gradient, and $h$ is a convex function that is possibly non-smooth but has a simple proximal operator.

Proximal Gradient & ISTA

Let us start with the classical proximal gradient method (also known as ISTA), based on which the FISTA algorithm is built.

For a given convex optimization problem \(\min_x f(x) = g(x) + h(x)\) where $g$ is differentiable, $\nabla g$ is L-Lipschitz and $h$ is not necessarily differentiable, proximal gradient method applies gradient descent on $g$. Its update goes in the following fashion:

\[\begin{aligned} x^{+} & =\underset{z}{\operatorname{argmin}}~\bar{g}_t(z)+h(z) \\ & =\underset{z}{\operatorname{argmin}}~g(x)+\nabla g(x)^T(z-x)+\frac{1}{2 t}\|z-x\|_2^2+h(z) \\ & =\underset{z}{\operatorname{argmin}}~\frac{1}{2 t}\|z-(x-t \nabla g(x))\|_2^2+h(z)\end{aligned}\]

where the step size $t\leq\frac{1}{L}$. This can be written as a proximal mapping:

\[\operatorname{prox}_{h, t}(x)=\underset{z}{\operatorname{argmin}}~ \frac{1}{2 t}\|x-z\|_2^2+h(z)\]

Then the proximal gradient update can be written as \(x^{(k)}=\operatorname{prox}_{h, t_k}\left(x^{(k-1)}-t_k \nabla g\left(x^{(k-1)}\right)\right)\)

or similar to gradient descent:

\(x^{(k)}=x^{(k-1)}-t_k \cdot G_{t_k}\left(x^{(k-1)}\right)\) where $G_t(x)=\frac{x-\operatorname{prox}_{h, t}(x-t \nabla g(x))}{t}$ is the generalized gradient of $f$.

For many important functions $h$, for example, $l_1$-norm for a vector, $l_{2,1}$ or nuclear norm for a matrix, there are closed-form proximal mapping $\operatorname{prox}_{h, t}$.

FISTA

For a convex optimization problem:

\[\min_x f(x) = g(x) + h(x)\]

where $\nabla g$ is L-Lipschitz, FISTA can be described as follows:

  • Step 0:
    • Take $y_1=x_0 \in \mathbb{R}^n, t_1=1$.
  • Step $k(k \geq 1)$:
    • Compute
      • $x_k=p_L\left(y_k\right)$
      • $t_{k+1} =\frac{1+\sqrt{1+4 t_k^2}}{2}$
      • $y_{k+1} =x_k+\left(\frac{t_k-1}{t_{k+1}}\right)\left(x_k-x_{k-1}\right)$

The step size can also be determined using a backtracking rule:

  • Step 0:
    • Take $y_1=x_0 \in \mathbb{R}^n, t_1=1$, $\eta >1$, $L_0>0$.
  • Step $k(k \geq 1)$:
    • Find the smallest nonnegative integers $i_k$ such that with $\bar{L}=\eta^{i_k} L_{k-1}$,
      • $F\left(p_{\bar{L}}\left(y_k\right)\right) \leq Q_{\bar{L}}\left(p_{\bar{L}}\left(y_k\right), y_k\right) $
    • Set $L_k=\eta^{i_k} L_{k-1}$ and compute
      • $x_k =p_{L_k}\left(y_k\right)$
      • $t_{k+1} =\frac{1+\sqrt{1+4 t_k^2}}{2}$
      • $y_{k+1} =x_k+\left(\frac{t_k-1}{t_{k+1}}\right)\left(x_k-x_{k-1}\right)$

FISTA with fixed step size $t \leq 1 / L$ satisfies \(f\left(x^{(k)}\right)-f^{\star} \leq \frac{2\left\|x^{(0)}-x^{\star}\right\|_2^2}{t(k+1)^2}\) and same result holds for backtracking, with $t$ replaced by $\beta / L$. This means that FISTA achieves an optimal rate of $O(\frac{1}{k^2})$ or $O(\frac{1}{\sqrt \epsilon})$.

The figures below show the comparison between ISTA and FISTA on lasso regression and lasso logistic regression. In both cases, $n=100, p=500$.


Figure 1: lasso regression

Figure 2: lasso logistic regression

References

  • Beck, Amir, and Marc Teboulle. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM Journal on Imaging Sciences, vol. 2, no. 1, Mar. 2009, pp. 183–202, https://doi.org/10.1137/080716542.
  • Ryan Tibshirani, “Proximal gradient descent”, Convex Optimization: Fall 2018, https://www.stat.cmu.edu/~ryantibs/convexopt-F18/
Subscribe to our RSS feed.

Comments