FISTA

“Fast iterative shrinkage-thresholding algorithm”(FISTA) is a proximal gradient method that aims to solve convex optimization problems of the form:

\[\min_x f(x) = g(x) + h(x)\]

where $g$ is a smooth convex function with a Lipschitz continuous gradient, and $h$ is a convex function that is possibly non-smooth but has a simple proximal operator.

Proximal Gradient & ISTA

Let us start with the classical proximal gradient method (also known as ISTA), based on which the FISTA algorithm is built.

For a given convex optimization problem $\min_x f(x) = g(x) + h(x)$ where $g$ is differentiable, $\nabla g$ is L-Lipschitz and $h$ is not necessarily differentiable, proximal gradient method applies gradient descent on $g$. Its update goes in the following fashion:

\[\begin{aligned} x^{+} & =\underset{z}{\operatorname{argmin}}~\bar{g}_t(z)+h(z) \\ & =\underset{z}{\operatorname{argmin}}~g(x)+\nabla g(x)^T(z-x)+\frac{1}{2 t}\|z-x\|_2^2+h(z) \\ & =\underset{z}{\operatorname{argmin}}~\frac{1}{2 t}\|z-(x-t \nabla g(x))\|_2^2+h(z)\end{aligned}\]

where the step size $t\leq\frac{1}{L}$. This can be written as a proximal mapping:

\[\operatorname{prox}_{h, t}(x)=\underset{z}{\operatorname{argmin}}~ \frac{1}{2 t}\|x-z\|_2^2+h(z)\]

Then the proximal gradient update can be written as $x^{(k)}=\operatorname{prox}_{h, t_k}\left(x^{(k-1)}-t_k \nabla g\left(x^{(k-1)}\right)\right)$

or similar to gradient descent:

$x^{(k)}=x^{(k-1)}-t_k \cdot G_{t_k}\left(x^{(k-1)}\right)$ where $G_t(x)=\frac{x-\operatorname{prox}_{h, t}(x-t \nabla g(x))}{t}$ is the generalized gradient of $f$.

For many important functions $h$, for example, $l_1$-norm for a vector, $l_{2,1}$ or nuclear norm for a matrix, there are closed-form proximal mapping $\operatorname{prox}_{h, t}$.

FISTA

For a convex optimization problem:

\[\min_x f(x) = g(x) + h(x)\]

where $\nabla g$ is L-Lipschitz, FISTA can be described as follows:

Step 0:

Take $y_1=x_0 \in \mathbb{R}^n, t_1=1$.

Step $k(k \geq 1)$:

Compute

$x_k=p_L\left(y_k\right)$

$t_{k+1} =\frac{1+\sqrt{1+4 t_k^2}}{2}$

$y_{k+1} =x_k+\left(\frac{t_k-1}{t_{k+1}}\right)\left(x_k-x_{k-1}\right)$

The step size can also be determined using a backtracking rule:

Step 0:

Take $y_1=x_0 \in \mathbb{R}^n, t_1=1$, $\eta >1$, $L_0>0$.

Step $k(k \geq 1)$:

Find the smallest nonnegative integers $i_k$ such that with $\bar{L}=\eta^{i_k} L_{k-1}$,

$F\left(p_{\bar{L}}\left(y_k\right)\right) \leq Q_{\bar{L}}\left(p_{\bar{L}}\left(y_k\right), y_k\right) $

Set $L_k=\eta^{i_k} L_{k-1}$ and compute

$x_k =p_{L_k}\left(y_k\right)$

$t_{k+1} =\frac{1+\sqrt{1+4 t_k^2}}{2}$

$y_{k+1} =x_k+\left(\frac{t_k-1}{t_{k+1}}\right)\left(x_k-x_{k-1}\right)$

FISTA with fixed step size $t \leq 1 / L$ satisfies $f\left(x^{(k)}\right)-f^{\star} \leq \frac{2\left\|x^{(0)}-x^{\star}\right\|_2^2}{t(k+1)^2}$ and same result holds for backtracking, with $t$ replaced by $\beta / L$. This means that FISTA achieves an optimal rate of $O(\frac{1}{k^2})$ or $O(\frac{1}{\sqrt \epsilon})$.

The figures below show the comparison between ISTA and FISTA on lasso regression and lasso logistic regression. In both cases, $n=100, p=500$.

Figure 1: lasso regression

Figure 2: lasso logistic regression

References

Beck, Amir, and Marc Teboulle. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM Journal on Imaging Sciences, vol. 2, no. 1, Mar. 2009, pp. 183–202, https://doi.org/10.1137/080716542.
Ryan Tibshirani, “Proximal gradient descent”, Convex Optimization: Fall 2018, https://www.stat.cmu.edu/~ryantibs/convexopt-F18/

FISTA

Proximal Gradient & ISTA

FISTA

References

Comments