<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>no title</title>
    <description>-</description>
    <link>https://clair-clemson.github.io/blog/</link>
    <atom:link href="https://clair-clemson.github.io/blog/feed.xml" rel="self" type="application/rss+xml" />
    
     
      <item>
        <title>Sparse Orthogonality-Constrained Optimization via ADPMM</title>
        <description>&lt;p&gt;Optimization problems with orthogonality constraints arise throughout machine learning, including dimensionality reduction, clustering, and representation learning. These problems require the variable to lie on the &lt;em&gt;Stiefel manifold&lt;/em&gt;, the set of matrices with orthonormal columns. While orthogonality improves interpretability and numerical stability, it also makes the feasible set nonconvex, and things get harder still when the objective contains a non-smooth sparsity-inducing term.&lt;/p&gt;

\[\newcommand{\xm}{\mathbf{X}}
\newcommand{\am}{\mathbf{A}}
\newcommand{\bm}{\mathbf{B}}
\newcommand{\lm}{\mathbf{L}}
\newcommand{\hm}{\mathbf{H}}
\newcommand{\mm}{\mathbf{M}}
\newcommand{\gm}{\mathbf{G}}
\newcommand{\qm}{\mathbf{Q}}
\newcommand{\pm}{\mathbf{P}}
\newcommand{\um}{\mathbf{U}}
\newcommand{\vm}{\mathbf{V}}
\newcommand{\ym}{\mathbf{Y}}
\newcommand{\zm}{\mathbf{Z}}
\newcommand{\im}{\mathbf{I}}
\newcommand{\fm}{\mathbf{F}}
\newcommand{\vx}{\mathbf{x}}
\newcommand{\vy}{\mathbf{y}}
\newcommand{\vz}{\mathbf{z}}
\newcommand{\vc}{\mathbf{c}}
\newcommand{\vb}{\mathbf{b}}
\newcommand{\vu}{\mathbf{u}}
\newcommand{\vt}{\mathbf{t}}
\newcommand{\vg}{\mathbf{g}}
\newcommand{\vw}{\mathbf{w}}
\newcommand{\trace}{\operatorname{tr}}
\newcommand{\diag}{\operatorname{diag}}
\newcommand{\sign}{\operatorname{sign}}
\newcommand{\norm}[1]{\left\|#1\right\|}
\newcommand{\inner}[2]{\left\langle #1,#2 \right\rangle}
\newcommand{\pos}[1]{\left[#1\right]_+}\]

&lt;p&gt;A prominent example is &lt;em&gt;sparse principal component analysis&lt;/em&gt; (sparse PCA), which enhances interpretability by enforcing sparsity on the principal components while keeping them orthogonal:&lt;/p&gt;

\[\min_{\xm}\; -\frac{1}{2}\trace(\xm^\top \am^\top\am \xm)+\lambda\norm{\xm}_1
\quad\text{s.t.}\quad \xm^\top\xm=\im,\]

&lt;p&gt;where $\am$ is the data matrix (each row a sample), $\xm$ collects the principal components, and $\lambda$ controls the sparsity level. Similarly, &lt;em&gt;Sparse Spectral Clustering&lt;/em&gt; (SSC) injects sparsity into spectral embeddings to improve robustness and interpretability in graph-based learning:&lt;/p&gt;

\[\min_{\xm}\; \frac{1}{2}\trace(\xm^\top \lm \xm)+\lambda\norm{\xm}_1
\quad\text{s.t.}\quad \xm^\top\xm=\im,\]

&lt;p&gt;where $\lm$ is a graph Laplacian. The same template covers many more models: &lt;strong&gt;unsupervised feature selection&lt;/strong&gt; replaces the $\ell_1$ norm with the row-sparsity-promoting $\ell_{2,1}$ norm, and &lt;strong&gt;compressed modes in physics&lt;/strong&gt; seeks spatially localized solutions of the independent-particle Schrödinger equation by taking the quadratic term to be a discretized Schrödinger operator $\hm$. In every case the difficulty is the same: a &lt;strong&gt;non-smooth sparsity term sitting on top of a nonconvex orthogonality constraint&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Existing solvers largely fall into two camps. &lt;em&gt;Riemannian&lt;/em&gt; methods (ManPG, ManPG-Ada, RADMM, ARADMM, OADMM) operate directly on the Stiefel manifold and preserve feasibility via retractions or projections, but they typically need repeated manifold operations or per-iteration line searches, which become expensive at scale. &lt;em&gt;Relaxation or splitting&lt;/em&gt; methods decouple sparsity from orthogonality, but may sacrifice feasibility or introduce approximation error (e.g., Moreau-envelope smoothing of the regularizer). In this post we discuss an algorithm that, like the first camp, keeps $\xm$ exactly on the manifold, but does so without retractions or line searches: an &lt;strong&gt;Alternating Direction Proximal Method of Multipliers (ADPMM)&lt;/strong&gt; that handles the orthogonality constraint &lt;em&gt;without relaxation&lt;/em&gt;, and whose per-iteration work reduces to one Stiefel projection plus one element-wise soft-thresholding.&lt;/p&gt;

&lt;h2 id=&quot;from-admm-to-adpmm&quot;&gt;From ADMM to ADPMM&lt;/h2&gt;

&lt;p&gt;Recall the classical Alternating Direction Method of Multipliers (ADMM). For a separable problem&lt;/p&gt;

\[\min_{\vx,\vz}\; H(\vx,\vz)=h_1(\vx)+h_2(\vz)
\quad\text{s.t.}\quad \am\vx+\bm\vz=\vc,\]

&lt;p&gt;ADMM works on the augmented Lagrangian with dual variable $\vy$ and penalty $\rho&amp;gt;0$:&lt;/p&gt;

\[L_\rho(\vx,\vz,\vy)=h_1(\vx)+h_2(\vz)+\vy^\top(\am\vx+\bm\vz-\vc)+\frac{\rho}{2}\norm{\am\vx+\bm\vz-\vc}^2,\]

&lt;p&gt;alternating a minimization over $\vx$, a minimization over $\vz$, and a dual ascent step on $\vy$. ADMM is simple and effective, but its performance can deteriorate when the subproblems are ill-conditioned or lack strong convexity; as we will see, the plain $\vx$-subproblem for sparse PCA is &lt;em&gt;not&lt;/em&gt; something we can solve in closed form.&lt;/p&gt;

&lt;p&gt;ADPMM fixes this by adding a quadratic &lt;strong&gt;proximal term&lt;/strong&gt; to each primal update. For two positive semidefinite matrices $\gm$ and $\qm$, with $\norm{\vx}_\gm^2=\vx^\top\gm\vx$:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Algorithm 1 (ADPMM).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; initial $\vx,\vy,\vz$; penalty $\rho$; proximal matrices $\gm,\qm$.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;For $k=0,1,2,\ldots$:&lt;/p&gt;

    &lt;ol&gt;
      &lt;li&gt;
\[\vx^{k+1}\in\arg\min_\vx\Big\{h_1(\vx)+\frac{\rho}{2}\norm{\am\vx+\bm\vz^k-\vc+\tfrac{1}{\rho}\vy^k}^2+\frac{1}{2}\norm{\vx-\vx^k}_\gm^2\Big\}\]
      &lt;/li&gt;
      &lt;li&gt;
\[\vz^{k+1}\in\arg\min_\vz\Big\{h_2(\vz)+\frac{\rho}{2}\norm{\am\vx^{k+1}+\bm\vz-\vc+\tfrac{1}{\rho}\vy^k}^2+\frac{1}{2}\norm{\vz-\vz^k}_\qm^2\Big\}\]
      &lt;/li&gt;
      &lt;li&gt;
\[\vy^{k+1}=\vy^k+\rho(\am\vx^{k+1}+\bm\vz^{k+1}-\vc)\]
      &lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When $\gm=\qm=\mathbf{0}$, ADPMM degenerates to vanilla ADMM. The proximal terms play two roles: they &lt;strong&gt;stabilize&lt;/strong&gt; the updates by penalizing large deviations from the previous iterate, and they can be chosen to &lt;strong&gt;cancel inconvenient quadratic terms&lt;/strong&gt; in the objective, turning an otherwise hard subproblem into a closed-form one.&lt;/p&gt;

&lt;h2 id=&quot;solving-sparse-pca-and-ssc-via-adpmm&quot;&gt;Solving Sparse PCA and SSC via ADPMM&lt;/h2&gt;

&lt;p&gt;To put sparse PCA into the ADPMM template, introduce an auxiliary variable $\zm$ that carries the sparsity term, leaving the orthogonality constraint on $\xm$:&lt;/p&gt;

\[\begin{aligned}
\min_{\xm,\zm}\quad &amp;amp; -\frac{1}{2}\trace(\xm^\top\am^\top\am\xm)+\lambda\norm{\zm}_1\\
\text{s.t.}\quad &amp;amp; \xm^\top\xm=\im,\qquad \xm=\zm.
\end{aligned}\]

&lt;p&gt;Now design the proximal matrices. In the $\xm$-update, the objective contributes the concave quadratic $-\frac{1}{2}\trace(\xm^\top\am^\top\am\xm)$. Choosing&lt;/p&gt;

\[\gm=\am^\top\am\]

&lt;p&gt;&lt;strong&gt;exactly cancels this quadratic&lt;/strong&gt;: the proximal term $\frac{1}{2}\norm{\xm-\xm^k}_\gm^2$ expands to $\frac{1}{2}\trace(\xm^\top\am^\top\am\xm)$ plus terms linear in $\xm$. What remains of the $\xm$-subproblem is a linear function of $\xm$ plus $\frac{\rho}{2}\norm{\xm-\zm^k+\tfrac{1}{\rho}\ym^k}_F^2$, restricted to the Stiefel manifold; this is the &lt;strong&gt;Orthogonal Procrustes problem&lt;/strong&gt;, solved in closed form by the SVD:&lt;/p&gt;

\[\xm^{k+1}=\um\vm^\top,
\qquad
\um\Sigma\vm^\top=\operatorname{svd}\big(\rho\zm^k-\ym^k+\gm\xm^k\big).\]
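
&lt;p&gt;For readers who want the intermediate step, the reduction can be written out as follows. It uses only $\gm=\am^\top\am$ and the fact that $\norm{\xm}_F^2=\trace(\xm^\top\xm)$ is constant on the Stiefel manifold; "const" collects every term that does not depend on $\xm$:&lt;/p&gt;

\[\begin{aligned}
&amp;amp;-\tfrac{1}{2}\trace(\xm^\top\gm\xm)+\tfrac{\rho}{2}\norm{\xm-\zm^k+\tfrac{1}{\rho}\ym^k}_F^2+\tfrac{1}{2}\norm{\xm-\xm^k}_\gm^2\\
&amp;amp;\quad=-\trace(\xm^\top\gm\xm^k)-\trace\big(\xm^\top(\rho\zm^k-\ym^k)\big)+\text{const}\\
&amp;amp;\quad=-\inner{\xm}{\rho\zm^k-\ym^k+\gm\xm^k}+\text{const}.
\end{aligned}\]

&lt;p&gt;Minimizing over the manifold is therefore the same as maximizing $\inner{\xm}{\mm}$ with $\mm=\rho\zm^k-\ym^k+\gm\xm^k$, and that maximum is attained at the polar factor $\um\vm^\top$ of $\mm$.&lt;/p&gt;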

&lt;p&gt;The $\zm$-subproblem contains no quadratic from the objective, only the $\ell_1$ term and the augmented-Lagrangian penalty, so we simply take $\qm=\mathbf{0}$, and the update reduces to element-wise &lt;strong&gt;soft-thresholding&lt;/strong&gt;:&lt;/p&gt;

\[\zm^{k+1}=\mathcal{S}_{\lambda/\rho}\Big(\xm^{k+1}+\tfrac{1}{\rho}\ym^k\Big),
\qquad
\mathcal{S}_{\tau}(\xm)=\sign(\xm)\odot\max(|\xm|-\tau,0).\]

&lt;p&gt;&lt;strong&gt;Algorithm 2 (ADPMM for Sparse PCA).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; initial $\xm,\ym,\zm$; penalty $\rho$; data matrix $\am$. Set $\gm=\am^\top\am$.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;For $k=0,1,2,\ldots$:&lt;/p&gt;

    &lt;ol&gt;
      &lt;li&gt;
        &lt;p&gt;Compute $\operatorname{svd}(\rho\zm^k-\ym^k+\gm\xm^k)=\um\Sigma\vm^\top$.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
\[\xm^{k+1}=\um\vm^\top \quad (\text{or via the Newton–Schulz iteration of Algorithm 3})\]
      &lt;/li&gt;
      &lt;li&gt;
\[\zm^{k+1}=\mathcal{S}_{\lambda/\rho}\big(\xm^{k+1}+\tfrac{1}{\rho}\ym^k\big)\]
      &lt;/li&gt;
      &lt;li&gt;
\[\ym^{k+1}=\ym^k+\rho(\xm^{k+1}-\zm^{k+1})\]
      &lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;
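
&lt;p&gt;Algorithm 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the random orthonormal initialization with $\zm^0=\xm^0$ and $\ym^0=\mathbf{0}$, and the fixed iteration count are hypothetical choices, and $\rho$ defaults to $\lambda_{\max}(\am^\top\am)$, the value quoted in the experiments.&lt;/p&gt;

```python
import numpy as np


def soft_threshold(V, tau):
    """Element-wise soft-thresholding: sign(V) * max(|V| - tau, 0)."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)


def stiefel_projection(M):
    """Polar factor U V^T of M, computed from the thin SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt


def adpmm_sparse_pca(A, p, lam, rho=None, num_iters=200, seed=0):
    """Sketch of Algorithm 2 (SVD variant) with G = A^T A and Q = 0."""
    rng = np.random.default_rng(seed)
    G = A.T @ A
    if rho is None:
        # Assumption: rho = lambda_max(A^T A); eigvalsh sorts ascending.
        rho = np.linalg.eigvalsh(G)[-1]
    # Hypothetical initialization: random orthonormal X, Z = X, Y = 0.
    X = stiefel_projection(rng.standard_normal((A.shape[1], p)))
    Z = X.copy()
    Y = np.zeros_like(X)
    for _ in range(num_iters):
        X = stiefel_projection(rho * Z - Y + G @ X)  # Procrustes step
        Z = soft_threshold(X + Y / rho, lam / rho)   # sparsity step
        Y = Y + rho * (X - Z)                        # dual ascent step
    return X, Z, Y
```

&lt;p&gt;Calling &lt;code&gt;adpmm_sparse_pca(A, p=3, lam=0.1)&lt;/code&gt; on a data matrix with samples in rows returns the orthonormal $\xm$, its sparse copy $\zm$, and the multiplier $\ym$. A practical version would likely stop on the residual $\norm{\xm-\zm}_F$ instead of a fixed iteration count.&lt;/p&gt;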

&lt;p&gt;The Sparse Spectral Clustering case is identical except that the quadratic is $+\frac{1}{2}\trace(\xm^\top\lm\xm)$, a &lt;em&gt;convex&lt;/em&gt; term we cancel with&lt;/p&gt;

\[\gm=\lambda_{\max}(\lm)\,\im-\lm,\]

&lt;p&gt;where the shift by $\lambda_{\max}(\lm)$ ensures $\gm\succeq 0$ so the proximal term is a valid Bregman-like penalty; the leftover $\tfrac{1}{2}\lambda_{\max}(\lm)\trace(\xm^\top\xm)$ is constant on the Stiefel manifold, so it drops out of the subproblem. Every other line of the algorithm is unchanged. Thus, &lt;strong&gt;one framework, two problems&lt;/strong&gt;; with the obvious substitutions, it also covers the unsupervised feature selection and compressed-modes models (for the $\ell_{2,1}$ norm, soft-thresholding is replaced by its row-wise group analogue).&lt;/p&gt;

&lt;p&gt;Each iteration therefore costs one matrix multiplication, one orthogonalization, and one entrywise shrinkage. &lt;strong&gt;No retractions, no line searches, no smoothing of the regularizer&lt;/strong&gt;.&lt;/p&gt;
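
&lt;p&gt;The two substitutions mentioned above are small enough to sketch directly. The helper names are hypothetical, $\lm$ is assumed to be a dense symmetric Laplacian, and the row-wise shrinkage shown is the standard proximal map of the $\ell_{2,1}$ norm, offered here as an illustration of the "group analogue" and not as the exact routine used in the experiments.&lt;/p&gt;

```python
import numpy as np


def ssc_prox_matrix(L):
    """G = lambda_max(L) * I - L, positive semidefinite for symmetric L."""
    lam_max = np.linalg.eigvalsh(L)[-1]  # eigvalsh sorts ascending
    return lam_max * np.eye(L.shape[0]) - L


def group_soft_threshold(V, tau, eps=1e-12):
    """Row-wise shrinkage: scale each row v by max(1 - tau / ||v||_2, 0)."""
    row_norms = np.linalg.norm(V, axis=1, keepdims=True)
    # eps guards the division for all-zero rows, which stay zero.
    scale = np.maximum(1.0 - tau / np.maximum(row_norms, eps), 0.0)
    return scale * V
```

&lt;p&gt;With these in place, the SSC variant of the loop uses &lt;code&gt;ssc_prox_matrix(L)&lt;/code&gt; as $\gm$, and the $\ell_{2,1}$ variant swaps the element-wise shrinkage for &lt;code&gt;group_soft_threshold&lt;/code&gt; with threshold $\lambda/\rho$.&lt;/p&gt;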

&lt;h2 id=&quot;convergence-analysis&quot;&gt;Convergence Analysis&lt;/h2&gt;

&lt;p&gt;When $h_1$ and $h_2$ are proper, closed and convex and the proximal matrices make the subproblems strongly convex (our Hypothesis 1), ADPMM enjoys an ergodic $\mathcal{O}(1/n)$ rate in both objective gap and feasibility violation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Theorem 1.&lt;/strong&gt;
Let $\{(\vx^k,\vz^k)\}$ be generated by Algorithm 2 (or its SSC variant), let $(\vx^*,\vz^*)$ be an optimal primal solution and $\vy^*$ an optimal dual solution. Then, under Hypothesis 1, for any $\gamma&amp;gt;2\norm{\vy^*}$ and $n\ge 0$:&lt;/p&gt;

\[H(\vx^{(n)},\vz^{(n)})-H(\vx^*,\vz^*)
\le
\frac{\norm{\vx^*-\vx^0}_\gm^2+\norm{\vz^*-\vz^0}_{\rho\im}^2+\frac{1}{\rho}(\gamma+\norm{\vy^0})^2}{2(n+1)},\]

\[\norm{\vx^{(n)}-\vz^{(n)}}
\le
\frac{\norm{\vx^*-\vx^0}_\gm^2+\norm{\vz^*-\vz^0}_{\rho\im}^2+\frac{1}{\rho}(\gamma+\norm{\vy^0})^2}{\gamma(n+1)},\]

&lt;p&gt;where $\vx^{(n)}=\frac{1}{n+1}\sum_{k=0}^n\vx^k$ and $\vz^{(n)}=\frac{1}{n+1}\sum_{k=0}^n\vz^k$ are the ergodic averages.&lt;/p&gt;

&lt;p&gt;The proof leans on the following classical “certificate” result [1], which converts an approximate saddle-point bound into separate bounds on suboptimality and infeasibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Theorem 2.&lt;/strong&gt;
Let $f_{\mathrm{opt}}$ be the optimal value of a convex problem $\min_\vx\{f(\vx):\vg(\vx)\le 0,\;\am\vx+\vb=0\}$ for which strong duality holds with optimal dual solution $(\vy^*,\vz^*)$. If for some $\delta&amp;gt;0$, $\rho_1\ge 2\norm{\vy^*}_2$, $\rho_2\ge 2\norm{\vz^*}_2$ a point $\tilde\vx$ satisfies&lt;/p&gt;

\[f(\tilde\vx)-f_{\mathrm{opt}}+\rho_1\norm{\pos{\vg(\tilde\vx)}}_2+\rho_2\norm{\am\tilde\vx+\vb}_2\le\delta,\]

&lt;p&gt;then&lt;/p&gt;

\[f(\tilde\vx)-f_{\mathrm{opt}}\le\delta,
\qquad
\norm{\pos{\vg(\tilde\vx)}}_2\le\frac{2}{\rho_1}\delta,
\qquad
\norm{\am\tilde\vx+\vb}_2\le\frac{2}{\rho_2}\delta.\]

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Proof.&lt;/strong&gt;
The first bound is immediate from the non-negativity of the last two terms on the left-hand side. For the others, define the perturbation function&lt;/p&gt;

\[v(\vu,\vt)=\min_{\vx}\{f(\vx):\vg(\vx)\le\vu,\;\am\vx+\vb=\vt\}.\]

&lt;p&gt;Optimality of the dual pair gives $(-\vy^*,-\vz^*)\in\partial v(0,0)$, hence&lt;/p&gt;

\[v(\vu,\vt)-v(0,0)\ge\inner{-\vy^*}{\vu}+\inner{-\vz^*}{\vt}.\]

&lt;p&gt;Define $\tilde\vu=\pos{\vg(\tilde\vx)}$ and $\tilde\vt=\am\tilde\vx+\vb$. Then&lt;/p&gt;

\[\begin{aligned}
(\rho_1-\norm{\vy^*}_2)\norm{\tilde\vu}_2+(\rho_2-\norm{\vz^*}_2)\norm{\tilde\vt}_2
&amp;amp;\le \inner{-\vy^*}{\tilde\vu}+\inner{-\vz^*}{\tilde\vt}+\rho_1\norm{\tilde\vu}_2+\rho_2\norm{\tilde\vt}_2\\
&amp;amp;\le v(\tilde\vu,\tilde\vt)-v(0,0)+\rho_1\norm{\tilde\vu}_2+\rho_2\norm{\tilde\vt}_2\\
&amp;amp;\le f(\tilde\vx)-f_{\mathrm{opt}}+\rho_1\norm{\tilde\vu}_2+\rho_2\norm{\tilde\vt}_2\le\delta.
\end{aligned}\]

&lt;p&gt;Both summands on the left are non-negative, so each is at most $\delta$, and since $\rho_1-\norm{\vy^*}_2\ge\rho_1/2$ and $\rho_2-\norm{\vz^*}_2\ge\rho_2/2$ we conclude $\norm{\tilde\vu}_2\le\frac{2}{\rho_1}\delta$ and $\norm{\tilde\vt}_2\le\frac{2}{\rho_2}\delta$.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;With Theorem 2 in hand, the proof of Theorem 1 proceeds in four steps. &lt;strong&gt;(i)&lt;/strong&gt; Fermat’s optimality conditions for the two subproblems yield subgradient inequalities at $\vx^{k+1}$ and $\vz^{k+1}$, the first carrying the extra proximal term $\gm(\vx^{k+1}-\vx^k)$. &lt;strong&gt;(ii)&lt;/strong&gt; Adding them and applying, to each inner product, the three-point identity&lt;/p&gt;

\[(\mathbf{a}-\mathbf{b})^\top\mathbf{P}(\mathbf{c}-\mathbf{d})
=\frac{1}{2}\big(\norm{\mathbf{a}-\mathbf{d}}_{\mathbf{P}}^2-\norm{\mathbf{a}-\mathbf{c}}_{\mathbf{P}}^2+\norm{\mathbf{b}-\mathbf{c}}_{\mathbf{P}}^2-\norm{\mathbf{b}-\mathbf{d}}_{\mathbf{P}}^2\big)\]

&lt;p&gt;for $\mathbf{P}\in\{\gm,\rho\im,\frac{1}{\rho}\im\}$ produces a single inequality of the form&lt;/p&gt;

\[H(\vx,\vz)-H(\vx^{k+1},\vz^{k+1})+\inner{\vw-\tilde\vw^k}{\fm\vw}
\ge \frac{1}{2}\norm{\vw-\vw^{k+1}}_{\hm}^2-\frac{1}{2}\norm{\vw-\vw^k}_{\hm}^2,\]

&lt;p&gt;where $\vw=(\vx,\vz,\vy)$, $\hm=\diag(\gm,\rho\im,\frac{1}{\rho}\im)$, and $\fm$ is a &lt;strong&gt;skew-symmetric&lt;/strong&gt; matrix. Skew-symmetry is what lets us swap $\fm\tilde\vw^k$ for $\fm\vw$ at no cost. &lt;strong&gt;(iii)&lt;/strong&gt; Telescoping over $k=0,\ldots,n$ and invoking convexity of $H$ at the ergodic averages gives a bound that shrinks like $\frac{1}{2(n+1)}\norm{\vw-\vw^0}_{\hm}^2$. &lt;strong&gt;(iv)&lt;/strong&gt; Plugging in $(\vx^*,\vz^*)$, maximizing the dual variable over a ball of radius $\gamma&amp;gt;2\norm{\vy^*}$, and applying Theorem 2 splits the resulting bound into the two inequalities of Theorem 1.&lt;/p&gt;

&lt;p&gt;A remark is in order: the &lt;em&gt;original&lt;/em&gt; problems are nonconvex because of the Stiefel constraint, so Theorem 1 should be read as the guarantee inherited by the splitting under Hypothesis 1; recent global-convergence results for nonconvex ADMM provide partial justification for the full nonconvex setting, and empirically the iterates converge stably in all our experiments.&lt;/p&gt;

&lt;h2 id=&quot;alternatives-to-svd-a-newtonschulz-projection&quot;&gt;Alternatives to SVD: a Newton–Schulz Projection&lt;/h2&gt;

&lt;p&gt;The SVD in the $\xm$-update is exact but costs $\mathcal{O}(n^3)$ for an $n\times n$ input, which dominates the per-iteration time at scale. What the update needs is the orthogonal polar factor, $\pm\mapsto\um\vm^\top$, and several cheaper orthogonalizations can stand in for the SVD: QR, Modified Gram–Schmidt, iterative polar decomposition, and the &lt;strong&gt;Newton–Schulz iteration&lt;/strong&gt;, which uses only matrix multiplications and therefore maps well onto modern hardware. (QR and Gram–Schmidt return an orthonormal basis of the same column space, which in general differs from $\um\vm^\top$ itself.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Algorithm 3 (Newton–Schulz orthogonalization).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; $\pm=\rho\zm^k-\ym^k+\gm\xm^k\in\mathbb{R}^{n\times p}$; small $\varepsilon$; coefficients $a=1.9$, $b=-1.3$, $c=0.4$.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;If $n&amp;gt;p$, transpose: $\pm\leftarrow\pm^\top$ (iterate on the wider orientation).&lt;/li&gt;
  &lt;li&gt;Normalize: $\qm\leftarrow\pm/(\norm{\pm}_F+\varepsilon)$.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For $t=1,\ldots,5$:&lt;/p&gt;

\[\am\leftarrow\qm\qm^\top,\qquad \bm\leftarrow b\am+c\am^2,\qquad \qm\leftarrow a\qm+\bm\qm.\]
  &lt;/li&gt;
  &lt;li&gt;Transpose back if needed. &lt;strong&gt;Output:&lt;/strong&gt; $\xm^{k+1}=\qm$.&lt;/li&gt;
&lt;/ol&gt;
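
&lt;p&gt;Algorithm 3 translates almost line for line into NumPy. The sketch below is illustrative only: the function name and the &lt;code&gt;num_iters&lt;/code&gt; argument are hypothetical, the defaults mirror the five steps and the coefficients $(1.9,-1.3,0.4)$ stated above, and how many steps are actually needed depends on how small the singular values are after normalization.&lt;/p&gt;

```python
import numpy as np


def newton_schulz_orthogonalize(P, num_iters=5, a=1.9, b=-1.3, c=0.4, eps=1e-7):
    """Approximate the polar factor U V^T of P using only matrix products."""
    Q = np.asarray(P, dtype=float)
    # Iterate on the wider orientation so that Q Q^T is the small Gram matrix.
    transposed = Q.shape[0] != min(Q.shape)
    if transposed:
        Q = Q.T
    Q = Q / (np.linalg.norm(Q) + eps)  # Frobenius normalization
    for _ in range(num_iters):
        A = Q @ Q.T
        B = b * A + c * (A @ A)
        Q = a * Q + B @ Q              # applies a*s + b*s^3 + c*s^5 to each singular value
    return Q.T if transposed else Q
```

&lt;p&gt;Swapping this function in for the SVD-based projection gives the NS variant of the $\xm$-update; comparing its output against $\um\vm^\top$ from an SVD on a small problem is a cheap sanity check before trusting a given iteration budget.&lt;/p&gt;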

&lt;p&gt;Why does this work, and where do $(a,b,c)$ come from? Each inner step applies the odd quintic&lt;/p&gt;

\[\qm_{t+1}=a\qm_t+b\qm_t(\qm_t^\top\qm_t)+c\qm_t(\qm_t^\top\qm_t)^2,\]

&lt;p&gt;which, writing $\qm_t=\um\Sigma_t\vm^\top$, acts &lt;strong&gt;only on the singular values&lt;/strong&gt;:&lt;/p&gt;

\[\Sigma\mapsto p(\Sigma),\qquad p(\sigma)=a\sigma+b\sigma^3+c\sigma^5,\]

&lt;p&gt;while $\um,\vm$ stay fixed. Driving $\qm\to\um\vm^\top$ is exactly driving every singular value to $1$, i.e., making $\sigma=1$ an &lt;strong&gt;attracting fixed point&lt;/strong&gt; of $p$. We therefore impose&lt;/p&gt;

\[p(1)=a+b+c=1,
\qquad
p&apos;(1)=a+3b+5c=0,\]

&lt;p&gt;where the first condition makes $\sigma=1$ a fixed point and the second makes it &lt;em&gt;superattracting&lt;/em&gt;. Two linear conditions leave one degree of freedom; solving for $a,b$ in terms of $c$:&lt;/p&gt;

\[a=\tfrac{3}{2}+c,\qquad b=-\tfrac{1}{2}-2c.\]

&lt;p&gt;Since $p&apos;(1)=0$, a Taylor expansion gives $p(1+\delta)=1+\tfrac{1}{2}p&apos;&apos;(1)\delta^2+\mathcal{O}(\delta^3)$, so singular values near $1$ converge &lt;strong&gt;quadratically&lt;/strong&gt;. The Frobenius normalization in Step 2 places all singular values in $[0,1]$, where $\sigma=0$ is repelling because $p&apos;(0)=a&amp;gt;1$; the iteration amplifies small singular values and locks large ones onto $1$. We use $c=0.4$, giving $(a,b,c)=(1.9,-1.3,0.4)$; the remaining freedom in $c$ trades off how aggressively small singular values are amplified against overshoot near $\sigma=1$.&lt;/p&gt;
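
&lt;p&gt;As a quick arithmetic check of these numbers, with $c=0.4$ the two fixed-point conditions and the second derivative of $p$ at $\sigma=1$ evaluate to&lt;/p&gt;

\[a+b+c=1.9-1.3+0.4=1,\qquad a+3b+5c=1.9-3.9+2.0=0,\qquad 6b+20c=-7.8+8.0=0.2,\]

&lt;p&gt;where $6b+20c$ is the second derivative of $p$ at $1$. Near the fixed point this gives $p(1+\delta)\approx 1+0.1\,\delta^2$, so an error of $10^{-2}$ in a singular value shrinks to roughly $10^{-5}$ after one step.&lt;/p&gt;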

&lt;p&gt;It is worth contrasting this with the widely used &lt;strong&gt;Muon&lt;/strong&gt; coefficients from the deep-learning optimizer literature: those are tuned only to push singular values into a &lt;em&gt;band&lt;/em&gt; around $1$ within a fixed iteration budget and deliberately violate $p&apos;(1)=0$. Our solver must satisfy $\xm^\top\xm=\im$ &lt;em&gt;exactly&lt;/em&gt;, so we instead enforce an exact superattracting fixed point at $\sigma=1$. With this choice, $\norm{\xm^\top\xm-\im}_F$ reached machine precision within the five inner iterations in all experiments. &lt;strong&gt;NS-ADPMM matches the SVD variant’s constraint accuracy at a fraction of the cost&lt;/strong&gt;.&lt;/p&gt;

&lt;figure style=&quot;text-align:center;&quot;&gt;
  &lt;img src=&quot;https://clair-clemson.github.io/blog/assets/adpmm/polar_orthogonal_methods.png&quot; style=&quot;width:100%; height:350px; max-width:600px;&quot; alt=&quot;Comparison of SVD vs alternative orthogonalization methods inside ADPMM&quot; /&gt;
  &lt;figcaption&gt;
    &lt;em&gt;
      &lt;strong&gt;Figure 1:&lt;/strong&gt;
      Comparison of SVD vs alternative orthogonalization methods (QR, MGS, polar, Newton&amp;ndash;Schulz) inside ADPMM. Vanilla SVD is accurate but takes more time.
    &lt;/em&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;experimental-results&quot;&gt;Experimental Results&lt;/h2&gt;

&lt;p&gt;We evaluate two instances of the framework: &lt;strong&gt;SVD-ADPMM&lt;/strong&gt; (exact Stiefel projection) and &lt;strong&gt;NS-ADPMM&lt;/strong&gt; (Newton–Schulz projection) against five state-of-the-art manifold baselines: ManPG, ManPG-Ada, RADMM, ARADMM, and OADMM. Datasets span text (News20, RCV1), images (MNIST, USPS), citation/collaboration/social graphs (Cora, ca-GrQc, Facebook), and synthetic problems with planted structure. All methods share the same fixed random initialization; we set $\rho=\lambda_{\max}(\am^\top\am)$ for SPCA and $\rho=\tfrac{1}{2}\lambda_{\max}(\lm)$ for SSC. Both formulations are minimizations, so &lt;strong&gt;lower curves are better&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;On News20, OADMM matches our methods per &lt;em&gt;iteration&lt;/em&gt;, but it pays for each iteration with a backtracking line search on the manifold, requiring repeated retractions and augmented-Lagrangian evaluations. The wall-clock plots tell the real story: NS-ADPMM and SVD-ADPMM converge significantly faster in time than every baseline.&lt;/p&gt;

&lt;figure style=&quot;text-align:center;&quot;&gt;
  &lt;img src=&quot;https://clair-clemson.github.io/blog/assets/adpmm/spca_news20_time.png&quot; style=&quot;width:100%; height:350px; max-width:600px;&quot; alt=&quot;Sparse PCA on News20: objective versus wall-clock time&quot; /&gt;
  &lt;figcaption&gt;
    &lt;em&gt;
      &lt;strong&gt;Figure 2:&lt;/strong&gt;
      Sparse PCA on News20 (n=15,935, p=k=50): objective versus wall-clock time.
    &lt;/em&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure style=&quot;text-align:center;&quot;&gt;
  &lt;img src=&quot;https://clair-clemson.github.io/blog/assets/adpmm/spca_rcv1_time.png&quot; style=&quot;width:100%; height:350px; max-width:600px;&quot; alt=&quot;Sparse PCA on RCV1: objective versus wall-clock time&quot; /&gt;
  &lt;figcaption&gt;
    &lt;em&gt;
      &lt;strong&gt;Figure 3:&lt;/strong&gt;
      Sparse PCA on RCV1 (n=47,236, p=k=50), stressing scalability in the feature dimension.
    &lt;/em&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;For Sparse Spectral Clustering we build Gaussian similarity graphs with bandwidth set to the median squared pairwise distance. On MNIST and other datasets, ADPMM with either projection variant converges to a lower objective in fewer iterations &lt;em&gt;and&lt;/em&gt; less time than the baselines, with NS-ADPMM the most efficient.&lt;/p&gt;

&lt;figure style=&quot;text-align:center;&quot;&gt;
  &lt;img src=&quot;https://clair-clemson.github.io/blog/assets/adpmm/ssc_mnist_time.png&quot; style=&quot;width:100%; height:350px; max-width:600px;&quot; alt=&quot;Sparse spectral clustering on MNIST: objective versus wall-clock time&quot; /&gt;
  &lt;figcaption&gt;
    &lt;em&gt;
      &lt;strong&gt;Figure 4:&lt;/strong&gt;
      Sparse spectral clustering on MNIST (60,000 samples, 10 clusters): objective versus time.
    &lt;/em&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] Amir Beck. “First-order methods in optimization.” SIAM, 2017.&lt;/p&gt;

&lt;p&gt;[2] Stephen Boyd et al. “Distributed optimization and statistical learning via the alternating direction method of multipliers.” Foundations and Trends in Machine Learning, 2011.&lt;/p&gt;

&lt;p&gt;[3] Shixiang Chen et al. “Proximal gradient method for nonsmooth optimization over the Stiefel manifold.” SIAM Journal on Optimization, 2020.&lt;/p&gt;

&lt;p&gt;[4] Nicholas J. Higham. “Functions of matrices: theory and computation.” SIAM, 2008.&lt;/p&gt;

&lt;p&gt;[5] Keller Jordan et al. “Muon: An optimizer for hidden layers in neural networks.” 2024.&lt;/p&gt;
</description>
        <pubDate>Fri, 12 Jun 2026 02:20:45 -0400</pubDate>
        <link>https://clair-clemson.github.io/blog/2026/06/12/ADPMM/</link>
        <guid isPermaLink="true">https://clair-clemson.github.io/blog/2026/06/12/ADPMM/</guid>
      </item>
     
    
     
      <item>
        <title>Simplex-Constrained Incoherent Matrix Factorization with Hybrid Mirror Descent</title>
        <description>&lt;p&gt;Matrix factorization is a fundamental tool in data mining and machine learning, with applications in clustering, topic modeling, recommender systems, and interpretable representation learning. Given a data matrix $\mathbf{X}\in\mathbb{R}^{d\times n}$, a typical factorization model seeks a low-rank approximation&lt;/p&gt;

\[\newcommand{\xm}{\mathbf{X}}
\newcommand{\fm}{\mathbf{F}}
\newcommand{\gm}{\mathbf{G}}
\newcommand{\sm}{\mathbf{S}}
\newcommand{\um}{\mathbf{U}}
\newcommand{\am}{\mathbf{A}}
\newcommand{\ym}{\mathbf{Y}}
\newcommand{\zm}{\mathbf{Z}}
\newcommand{\vx}{\mathbf{x}}
\newcommand{\vf}{\mathbf{f}}
\newcommand{\vg}{\mathbf{g}}
\newcommand{\vq}{\mathbf{q}}
\newcommand{\vu}{\mathbf{u}}
\newcommand{\vv}{\mathbf{v}}
\newcommand{\va}{\mathbf{a}}
\newcommand{\ve}{\mathbf{e}}
\newcommand{\rank}{\operatorname{rank}}
\newcommand{\Tr}{\operatorname{Tr}}
\newcommand{\relint}{\operatorname{relint}}
\newcommand{\prox}{\operatorname{prox}}
\newcommand{\SVT}{\operatorname{SVT}}
\newcommand{\Phiobj}{\Phi}
\newcommand{\Domega}{D_{\omega}}
\newcommand{\norm}[1]{\left\|#1\right\|}
\newcommand{\inner}[2]{\left\langle #1,#2 \right\rangle}
\newcommand{\pos}[1]{\left[#1\right]_+}\]

\[\xm\approx \fm\gm,\]

&lt;p&gt;where $\fm\in\mathbb{R}^{d\times k}$ contains basis components and $\gm\in\mathbb{R}^{k\times n}$ contains low-dimensional representations of the samples. Classical nonnegative matrix factorization (NMF) improves interpretability by imposing nonnegativity on both factors, while many variants further introduce sparsity, orthogonality, or graph regularization.&lt;/p&gt;

&lt;p&gt;Existing factorization methods still face two limitations. First, the learned basis components can be highly redundant. In NMF and related models, different columns of $\fm$ may become strongly correlated, making the learned representation less interpretable and less discriminative for downstream tasks such as clustering. Second, many classical algorithms, especially multiplicative-update methods, are tightly coupled with nonnegativity assumptions and are therefore less flexible for real-valued data or more general constrained factorization settings. These observations motivate a matrix factorization framework that learns diverse basis components, produces interpretable sample representations, and remains applicable beyond standard NMF.&lt;/p&gt;

&lt;h2 id=&quot;simplex-constrained-incoherent-matrix-factorization&quot;&gt;Simplex-Constrained Incoherent Matrix Factorization&lt;/h2&gt;

&lt;p&gt;In this blog, we will discuss &lt;em&gt;Simplex-Constrained Incoherent Matrix Factorization&lt;/em&gt;. The key idea is to represent each sample as a convex combination of basis vectors by constraining every column of $\gm$ to lie on the probability simplex:&lt;/p&gt;

\[\vg_j\in\Delta_k,\qquad
\Delta_k=\{\vg\in\mathbb{R}^k:\vg\ge 0,\ \mathbf{1}^{\top}\vg=1\}.\]

&lt;p&gt;Thus,
$
\vx_j\approx \fm\vg_j=\sum_{r=1}^k g_{rj}\vf_r,
$
where $g_{rj}$ measures the contribution of the $r$-th basis component
to the $j$-th sample. This simplex constraint gives $\gm$ a natural
probabilistic interpretation as a soft cluster-assignment matrix and
makes the representation useful for clustering and interpretable data
analysis. In contrast, K-means assigns each sample to a hard one-hot
indicator vector. Thus, our formulation can be viewed as a simplex-based
soft relaxation of the hard assignment used in K-means.&lt;/p&gt;

&lt;p&gt;In addition, to encourage diverse basis components, we introduce an incoherence regularizer based on the positive off-diagonal entries of the Gram matrix:&lt;/p&gt;

\[\left\|
\left[
\fm^\top \fm-\operatorname{diag}(\fm^\top \fm)
\right]_+
\right\|_F^2=
\sum_{r\neq s}
[\langle \vf_r,\vf_s\rangle]_+^2.\]

&lt;p&gt;Therefore, the regularizer &lt;strong&gt;discourages positive correlation among different basis vectors while not penalizing negative inner products&lt;/strong&gt;. Compared with a linear inner-product penalty, the squared positive-part formulation places stronger emphasis on highly positively correlated basis pairs and remains continuously differentiable, including at the hinge point where an inner product crosses zero.&lt;/p&gt;

&lt;p&gt;We further impose a data-adaptive column norm constraint on $\fm$:
$
\norm{\vf_r}_2\le R, \ r=1,\dots,k,
$
where $R$ can be chosen according to the scale of the data, such as $R=\max_j\norm{\vx_j}_2$. This constraint controls the scale of the basis matrix without requiring data normalization. When nonnegative bases are desired, we additionally impose $\fm\ge 0$; for real-valued data, we simply remove this nonnegativity constraint. Therefore, the same model naturally covers &lt;strong&gt;both NMF-style and real-valued&lt;/strong&gt; matrix factorization settings.&lt;/p&gt;

&lt;figure style=&quot;text-align:center;&quot;&gt;
  &lt;img src=&quot;https://clair-clemson.github.io/blog/assets/MF/f_constraint.svg&quot; style=&quot;width:100%; height:250px; max-width:600px;&quot; alt=&quot;Illustration of the data-adaptive basis norm constraint&quot; /&gt;
  &lt;figcaption&gt;
    &lt;em&gt;
      &lt;strong&gt;Figure 0:&lt;/strong&gt;
      Illustration of the data-adaptive basis norm constraint.
    &lt;/em&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The proposed optimization problem is&lt;/p&gt;

\[\begin{aligned}
\min_{\fm,\gm} \Phi(\fm,\gm)
		:= \quad
&amp;amp;
\frac{1}{2}\|\xm-\fm\gm\|_F^2
+
\lambda
\left\|
\left[
\fm^\top \fm-\operatorname{diag}(\fm^\top \fm)
\right]_+
\right\|_F^2
\end{aligned}\]

&lt;p&gt;subject to&lt;/p&gt;

\[\fm\in\mathcal{C}_F,\qquad
\vg_j\in\Delta_k,\quad j=1,\dots,n,\]

&lt;p&gt;where $\mathcal{C}_F$ denotes either the nonnegative scale-constrained set or the real-valued scale-constrained set. To optimize this model, we develop a hybrid projected-gradient and entropy mirror-descent algorithm, described below.&lt;/p&gt;

&lt;h2 id=&quot;algorithm-hybrid-projectedentropic-mirror-descent&quot;&gt;Algorithm: Hybrid Projected–Entropic Mirror Descent&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Data matrix $\xm$, rank $k$, parameters $\lambda, R, \epsilon$&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; Factor matrices $\fm, \gm$&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Initialize $\fm^0 \in \mathcal{C}_F$ and $\vg_j^0 \in \operatorname{relint}(\Delta_k)$ for $j = 1,\ldots,n$.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For $t = 0,1,2,\ldots$:&lt;/p&gt;

    &lt;ol&gt;
      &lt;li&gt;
        &lt;p&gt;Compute&lt;/p&gt;

\[\sm^t =
\left[
(\fm^t)^\top \fm^t
-
\operatorname{diag}\big((\fm^t)^\top \fm^t\big)
\right]_+ .\]
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Compute&lt;/p&gt;

\[\nabla_{\fm}\Phi(\fm^t,\gm^t)
=
(\fm^t\gm^t-\xm)(\gm^t)^\top
+
4\lambda \fm^t\sm^t .\]
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Set $\alpha_t = 1/\overline{L}_F$ and update&lt;/p&gt;

\[\fm^{t+1}
=
\Pi_{\mathcal{C}_F}
\left(
\fm^t-\alpha_t\nabla_{\fm}\Phi(\fm^t,\gm^t)
\right).\]
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Set&lt;/p&gt;

\[\eta_t = \frac{1}{\|\fm^{t+1}\|_2^2+\epsilon}.\]
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;For $j = 1,\ldots,n$:&lt;/p&gt;

        &lt;ol&gt;
          &lt;li&gt;
            &lt;p&gt;Compute&lt;/p&gt;

\[\vq_j^{t+1}
=
(\fm^{t+1})^\top
(\fm^{t+1}\vg_j^t-\vx_j).\]
          &lt;/li&gt;
          &lt;li&gt;
            &lt;p&gt;For $r = 1,\ldots,k$, update&lt;/p&gt;

\[g_{rj}^{t+1}
=
\frac{
g_{rj}^{t}\exp(-\eta_t q_{rj}^{t+1})
}{
\sum_{\ell=1}^{k}
g_{\ell j}^{t}\exp(-\eta_t q_{\ell j}^{t+1})
}.\]
          &lt;/li&gt;
        &lt;/ol&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;If a stopping criterion is satisfied, &lt;strong&gt;break&lt;/strong&gt;.&lt;/p&gt;
      &lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Return $\fm^{t+1}, \gm^{t+1}$.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
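
&lt;p&gt;The first two steps of each iteration, forming $\sm^t$ and the gradient in $\fm$, can be sketched as below. The function names are hypothetical; the objective is included only so the gradient formula can be checked against a finite difference on a small random problem.&lt;/p&gt;

```python
import numpy as np


def incoherence_matrix(F):
    """S = [F^T F - diag(F^T F)]_+ : positive off-diagonal Gram entries."""
    gram = F.T @ F
    return np.maximum(gram - np.diag(np.diag(gram)), 0.0)


def objective(F, G, X, lam):
    """Phi(F, G) = 0.5 * ||X - F G||_F^2 + lam * ||S||_F^2."""
    fit = 0.5 * np.linalg.norm(X - F @ G) ** 2
    return fit + lam * np.linalg.norm(incoherence_matrix(F)) ** 2


def grad_F(F, G, X, lam):
    """Gradient of Phi in F: (F G - X) G^T + 4 * lam * F S."""
    return (F @ G - X) @ G.T + 4.0 * lam * (F @ incoherence_matrix(F))
```

&lt;p&gt;The factor $4$ appears because each pair $(r,s)$ is counted twice in the sum over $r\neq s$ and the square contributes another factor of $2$.&lt;/p&gt;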

&lt;p&gt;The proposed algorithm &lt;strong&gt;alternates between a projected-gradient step for
$\fm$ and an entropy mirror-descent step for $\gm$&lt;/strong&gt;. The $\fm$-update is
&lt;strong&gt;Euclidean&lt;/strong&gt; because $\mathcal{C}_F$ is a simple column-wise norm
constraint. The $\gm$-update is &lt;strong&gt;entropic&lt;/strong&gt; because each column $\vg_j$
lies on the probability simplex.&lt;/p&gt;

&lt;p&gt;For the entropy mirror step, we use the negative entropy function&lt;/p&gt;

\[\omega(\vg)
=\sum_{r=1}^{k} g_r\log g_r ,\]

&lt;p&gt;defined on the simplex. Its associated Bregman divergence is the
Kullback–Leibler divergence&lt;/p&gt;

\[D_{\omega}(\vu,\vv)=
\sum_{r=1}^{k}
u_r\log\frac{u_r}{v_r},
\qquad \vu,\vv\in\Delta_k.\]

&lt;p&gt;For the $\fm$-update, we use the Euclidean mirror map&lt;/p&gt;

\[\omega(\fm)=
\frac{1}{2}\|\fm\|_F^2,\]

&lt;p&gt;whose associated Bregman divergence is&lt;/p&gt;

\[D_{\omega}(\fm,\widetilde{\fm})=
\frac{1}{2}\|\fm-\widetilde{\fm}\|_F^2.\]

&lt;p&gt;At iteration $t$, given $(\fm^t,\gm^t)$, we first update $\fm$ by&lt;/p&gt;

\[\fm^{t+1}=
\Pi_{\mathcal{C}_F}
\left(
\fm^t-\alpha_t\nabla_{\fm}\Phi(\fm^t,\gm^t)
\right),\]

&lt;p&gt;where $\Pi_{\mathcal{C}_F}$ is the Euclidean projection onto
$\mathcal{C}_F$, which is separable across columns.&lt;/p&gt;

&lt;p&gt;For the real-valued scale-constrained case,&lt;/p&gt;

\[\mathcal{C}_F=
\left\{
\fm:
\|\vf_r\|_2\leq R,\; r=1,\ldots,k
\right\},\]

&lt;p&gt;the projection acts on each column $\va_r$ of its argument as&lt;/p&gt;

\[\Pi_{\mathcal{C}_F}(\va_r)=
\frac{\va_r}{\max\{1,\|\va_r\|_2/R\}},
\qquad r=1,\ldots,k .\]

&lt;p&gt;For the nonnegative case,&lt;/p&gt;

\[\mathcal{C}_F=
\left\{
\fm:
\fm\geq 0,\;
\|\vf_r\|_2\leq R,\; r=1,\ldots,k
\right\},\]

&lt;p&gt;we first threshold negative entries and then project onto the
$\ell_2$ ball:&lt;/p&gt;

\[\Pi_{\mathcal{C}_F}(\va_r)=
\frac{[\va_r]_+}{\max\{1,\|[\va_r]_+\|_2/R\}} .\]
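
&lt;p&gt;As a quick check of both formulas (the numbers are illustrative and not taken from the experiments), take $R=1$ and a column $\va_r=(3,-4)^\top$ with $\|\va_r\|_2=5$. In the real-valued case the column is simply rescaled onto the ball, while in the nonnegative case the negative entry is zeroed first, leaving $[\va_r]_+=(3,0)^\top$ with norm $3$:&lt;/p&gt;

\[\frac{(3,-4)^\top}{\max\{1,5\}}=(0.6,-0.8)^\top,
\qquad
\frac{(3,0)^\top}{\max\{1,3\}}=(1,0)^\top .\]

&lt;p&gt;Both results have norm exactly $R=1$, as expected for points that start outside the ball.&lt;/p&gt;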

&lt;p&gt;For each column $\vg_j$, we first compute&lt;/p&gt;

\[\vq_j^{t+1}=
\nabla_{\vg_j}
\frac{1}{2}\|\vx_j-\fm^{t+1}\vg_j^t\|_2^2=
(\fm^{t+1})^\top(\fm^{t+1}\vg_j^t-\vx_j),\]

&lt;p&gt;and update $\vg_j$ by entropy mirror descent:&lt;/p&gt;

\[\vg_j^{t+1}=
\arg\min_{\vg\in\Delta_k}
\left\{
\langle \vq_j^{t+1},\vg-\vg_j^t\rangle
+
\frac{1}{\eta_t}
D_{\omega}(\vg,\vg_j^t)
\right\}.\]

&lt;p&gt;This subproblem has the closed-form multiplicative-normalization update [1]:&lt;/p&gt;

\[g_{rj}^{t+1}=
\frac{
    g_{rj}^{t}\exp(-\eta_t q_{rj}^{t+1})
}{
    \sum_{\ell=1}^{k}
    g_{\ell j}^{t}\exp(-\eta_t q_{\ell j}^{t+1})
},
\qquad r=1,\ldots,k .\]

&lt;p&gt;Thus, &lt;strong&gt;simplex feasibility is preserved automatically&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id=&quot;convergence-analysis&quot;&gt;Convergence Analysis&lt;/h2&gt;

&lt;p&gt;We now introduce Pinsker’s inequality [2], which will be used in the convergence analysis. For any $\vu,\vv\in\Delta_k$,&lt;/p&gt;

\[D_{\omega}(\vu,\vv)=
\sum_{r=1}^{k}u_r\log\frac{u_r}{v_r}
\geq
\frac{1}{2}\|\vu-\vv\|_1^2
\geq
\frac{1}{2}\|\vu-\vv\|_2^2 .\]

&lt;p&gt;&lt;strong&gt;Lemma 1.&lt;/strong&gt; Fix $\fm$ and let $\vg^+$ be generated from $\vg \in \Delta_k$ by&lt;/p&gt;

\[\vg^+=
\arg\min_{\vu \in \Delta_k}
\left\{
\langle \nabla \phi_{\fm}(\vg), \vu-\vg\rangle
+
\frac{1}{\eta}D_{\omega}(\vu,\vg)
\right\}.\]

&lt;p&gt;If $0 &amp;lt; \eta &amp;lt; 1/L_G(\fm)$, then&lt;/p&gt;

\[\phi_{\fm}(\vg^+)
\leq
\phi_{\fm}(\vg)-
\left(
\frac{1}{\eta}-
L_G(\fm)
\right)
D_{\omega}(\vg^+,\vg).\]

&lt;p&gt;Consequently,&lt;/p&gt;

\[\phi_{\fm}(\vg^+)
\leq
\phi_{\fm}(\vg)-
\frac{1}{2}
\left(
\frac{1}{\eta}-
L_G(\fm)
\right)
\|\vg^+-\vg\|_2^2.\]

&lt;hr /&gt;
&lt;p&gt;&lt;strong&gt;Proof.&lt;/strong&gt;&lt;br /&gt;
By the optimality of the mirror-descent subproblem, comparing the objective value at $\vg^+$ and $\vg$ gives&lt;/p&gt;

\[\langle \nabla\phi_{\fm}(\vg), \vg^+-\vg\rangle
+
\frac{1}{\eta}D_{\omega}(\vg^+,\vg)
\leq 0 .\]

&lt;p&gt;Hence,&lt;/p&gt;

\[\langle \nabla\phi_{\fm}(\vg), \vg^+-\vg\rangle
\leq-
\frac{1}{\eta}D_{\omega}(\vg^+,\vg).\]

&lt;p&gt;Using the $L_G(\fm)$-smoothness of $\phi_{\fm}$ (the descent lemma) with $\vu=\vg^+$ and $\vv=\vg$, we obtain&lt;/p&gt;

\[\begin{aligned}
\phi_{\fm}(\vg^+)
&amp;amp;\leq
\phi_{\fm}(\vg)+
\langle \nabla\phi_{\fm}(\vg), \vg^+-\vg\rangle+
\frac{L_G(\fm)}{2}\|\vg^+-\vg\|_2^2  \\
&amp;amp;\leq
\phi_{\fm}(\vg)-
\frac{1}{\eta}D_{\omega}(\vg^+,\vg)
+
\frac{L_G(\fm)}{2}\|\vg^+-\vg\|_2^2 .
\end{aligned}\]

&lt;p&gt;By Pinsker’s inequality,&lt;/p&gt;

\[\|\vg^+-\vg\|_2^2
\leq
2D_{\omega}(\vg^+,\vg).\]

&lt;p&gt;Therefore,&lt;/p&gt;

\[\phi_{\fm}(\vg^+)
\leq
\phi_{\fm}(\vg)-
\left(
\frac{1}{\eta}-
L_G(\fm)
\right)
D_{\omega}(\vg^+,\vg),\]

&lt;p&gt;which proves the first part. Applying Pinsker’s inequality once more gives the second part.&lt;/p&gt;
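
&lt;p&gt;Lemma 1 can be checked numerically on a small random instance. The sketch below is illustrative only: the dimensions, the random data, the seed, and the helper names &lt;code&gt;kl&lt;/code&gt; and &lt;code&gt;mirror_step&lt;/code&gt; are assumptions, with $\phi_{\fm}(\vg)=\frac{1}{2}\|\vx-\fm\vg\|_2^2$ and $L_G(\fm)=\|\fm\|_2^2$ as in Corollary 1. It takes one entropy mirror-descent step with $\eta=0.5/L_G(\fm)$ and prints the new objective value, the right-hand side of the first bound in Lemma 1, and the slack in Pinsker’s inequality.&lt;/p&gt;

```python
import numpy as np

def kl(u, v):
    """Kullback-Leibler divergence between two strictly positive simplex vectors."""
    return float(np.sum(u * np.log(u / v)))

def mirror_step(g, q, eta):
    """Closed-form entropy mirror-descent step: g_r * exp(-eta * q_r), renormalized."""
    w = g * np.exp(-eta * (q - q.min()))   # shifting q does not change the normalized result
    return w / w.sum()

rng = np.random.default_rng(0)
m, k = 8, 5
F = rng.standard_normal((m, k))
x = rng.standard_normal(m)
g = rng.random(k) + 0.1                    # strictly positive starting point
g /= g.sum()

phi = lambda v: 0.5 * np.sum((x - F @ v) ** 2)
L_G = np.linalg.norm(F, 2) ** 2            # squared spectral norm of F
eta = 0.5 / L_G                            # any value in (0, 1/L_G) is admissible

g_plus = mirror_step(g, F.T @ (F @ g - x), eta)
D = kl(g_plus, g)

decrease_bound = phi(g) - (1.0 / eta - L_G) * D             # right-hand side of Lemma 1
pinsker_slack = D - 0.5 * np.sum(np.abs(g_plus - g)) ** 2   # KL minus half the squared l1 distance
print(phi(g_plus), decrease_bound, pinsker_slack)
```

&lt;p&gt;On any such instance the first printed value should not exceed the second, and the third should be nonnegative.&lt;/p&gt;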

&lt;p&gt;&lt;strong&gt;Corollary 1:&lt;/strong&gt;
Fix $\fm$ and update all columns of $\gm$.
If $0&amp;lt;\eta&amp;lt;1/\|\fm\|_2^2$, then&lt;/p&gt;

\[\frac{1}{2}\|\xm-\fm\gm^+\|_F^2
\leq
\frac{1}{2}\|\xm-\fm\gm\|_F^2-
\left(
\frac{1}{\eta}-
\|\fm\|_2^2
\right)
\sum_{j=1}^{n}
D_{\omega}(\vg_j^+,\vg_j).\]

&lt;p&gt;Consequently,&lt;/p&gt;

\[\frac{1}{2}\|\xm-\fm\gm^+\|_F^2
\leq
\frac{1}{2}\|\xm-\fm\gm\|_F^2-
\frac{1}{2}
\left(
\frac{1}{\eta}-
\|\fm\|_2^2
\right)
\|\gm^+-\gm\|_F^2 .\]

&lt;p&gt;&lt;strong&gt;Lemma 2:&lt;/strong&gt;
Let&lt;/p&gt;

\[\fm^+=
\Pi_{\mathcal{C}_F}
\left(
\fm-\alpha\nabla_{\fm}\Phi(\fm,\gm)
\right).\]

&lt;p&gt;If $0&amp;lt;\alpha&amp;lt;2/L_F(\gm)$, then [3]&lt;/p&gt;

\[\Phi(\fm^+,\gm)
\leq
\Phi(\fm,\gm)-
\left(
\frac{1}{\alpha}-
\frac{L_F(\gm)}{2}
\right)
\|\fm^+-\fm\|_F^2 .\]

&lt;p&gt;&lt;strong&gt;Theorem 1:&lt;/strong&gt;
Suppose that the stepsizes satisfy&lt;/p&gt;

\[0&amp;lt;\alpha_t&amp;lt;\frac{2}{L_F(\gm^t)}
\quad\text{and}\quad
0&amp;lt;\eta_t&amp;lt;\frac{1}{\|\fm^{t+1}\|_2^2}.\]

&lt;p&gt;Then the sequence generated by Algorithm 1 satisfies&lt;/p&gt;

\[\Phi(\fm^{t+1},\gm^{t+1})
\leq
\Phi(\fm^t,\gm^t).\]

&lt;p&gt;More precisely,&lt;/p&gt;

\[\Phi(\fm^{t+1},\gm^{t+1})
\leq
\Phi(\fm^t,\gm^t)-
c_F^t\|\fm^{t+1}-\fm^t\|_F^2-
c_G^t\|\gm^{t+1}-\gm^t\|_F^2 ,\]

&lt;p&gt;where&lt;/p&gt;

\[c_F^t=
\frac{1}{\alpha_t}-
\frac{L_F(\gm^t)}{2}
&amp;gt;0,
\qquad
c_G^t
=
\frac{1}{2}
\left(
\frac{1}{\eta_t}
-
\|\fm^{t+1}\|_2^2
\right)
&amp;gt;0 .\]

&lt;p&gt;&lt;strong&gt;Lemma 3:&lt;/strong&gt; A uniform upper bound on the Lipschitz constant $L_F(\gm)$ used in the $\fm$-update is&lt;/p&gt;

\[L_F(\gm)
\le
\overline{L}_F
:=
n+12\lambda kR^2 .\]

&lt;p&gt;&lt;strong&gt;Theorem 2:&lt;/strong&gt;
	Assume that the stepsizes are chosen such that there exist constants
	$c_F&amp;gt;0$ and $c_G&amp;gt;0$ satisfying&lt;/p&gt;

\[c_F^t\geq c_F&amp;gt;0,
\qquad
c_G^t\geq c_G&amp;gt;0\]

&lt;p&gt;for all $t$. Then the sequence
$\{(\fm^t,\gm^t)\}$ generated by Algorithm 1 is
bounded, the objective values $\{\Phi(\fm^t,\gm^t)\}$ converge, and&lt;/p&gt;

\[\|\fm^{t+1}-\fm^t\|_F\rightarrow 0,
\qquad
\|\gm^{t+1}-\gm^t\|_F\rightarrow 0 .\]

&lt;p&gt;Moreover, every accumulation point $(\fm^\star,\gm^\star)$ is a
stationary point in the sense that&lt;/p&gt;

\[\left\langle
\nabla_{\fm}\Phi(\fm^\star,\gm^\star),
\fm-\fm^\star
\right\rangle
\geq 0,
\qquad
\forall \fm\in\mathcal{C}_F,\]

&lt;p&gt;and&lt;/p&gt;

\[\left\langle
\nabla_{\vg_j}
\frac{1}{2}\|\vx_j-\fm^\star\vg_j^\star\|_2^2,
\vg-\vg_j^\star
\right\rangle
\geq 0,
\ 
\forall \vg\in\Delta_k,\  j=1,\ldots,n .\]

&lt;h2 id=&quot;experimental-results&quot;&gt;Experimental Results&lt;/h2&gt;

&lt;figure style=&quot;text-align:center;&quot;&gt;
  &lt;img src=&quot;https://clair-clemson.github.io/blog/assets/MF/lambda.svg&quot; style=&quot;width:100%; height:250px; max-width:600px;&quot; alt=&quot;Effect of lambda on basis diversity and reconstruction&quot; /&gt;
  &lt;figcaption&gt;
    &lt;em&gt;
      &lt;strong&gt;Figure 1:&lt;/strong&gt;
      Effect of &amp;lambda; on basis diversity on the Zoo dataset.
       Increasing &amp;lambda; reduces positive basis correlations.
    &lt;/em&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure style=&quot;text-align:center;&quot;&gt;
  &lt;img src=&quot;https://clair-clemson.github.io/blog/assets/MF/wine_G_heatmap.svg&quot; style=&quot;width:100%; height:250px; max-width:600px;&quot; alt=&quot;Heatmap of the learned coefficient matrix G on the Wine data set&quot; /&gt;
  &lt;figcaption&gt;
    &lt;em&gt;
      &lt;strong&gt;Figure 2:&lt;/strong&gt;
      Heatmap of the learned coefficient matrix &lt;strong&gt;G&lt;/strong&gt; on the Wine dataset.
    &lt;/em&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure style=&quot;text-align:center;&quot;&gt;
  &lt;img src=&quot;https://clair-clemson.github.io/blog/assets/MF/zoo_Figure_1.svg&quot; style=&quot;width:100%; height:250px; max-width:600px;&quot; alt=&quot;Visualization of learned basis vectors on the Zoo data set&quot; /&gt;
  &lt;figcaption&gt;
    &lt;em&gt;
      &lt;strong&gt;Figure 3:&lt;/strong&gt;
      Visualization of learned basis vectors on the Zoo data set.
    &lt;/em&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure style=&quot;text-align:center;&quot;&gt;
  &lt;img src=&quot;https://clair-clemson.github.io/blog/assets/MF/seeds_Figure_1.svg&quot; style=&quot;width:100%; height:250px; max-width:600px;&quot; alt=&quot;Visualization of learned basis vectors on the Seeds data set&quot; /&gt;
  &lt;figcaption&gt;
    &lt;em&gt;
      &lt;strong&gt;Figure 4:&lt;/strong&gt;
      Visualization of learned basis vectors on the Seeds data set.
    &lt;/em&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure style=&quot;text-align:center;&quot;&gt;
  &lt;img src=&quot;https://clair-clemson.github.io/blog/assets/MF/MD_vs_PG.svg&quot; style=&quot;width:100%; height:250px; max-width:800px;&quot; alt=&quot;Comparison between entropy mirror descent and Euclidean projected gradient&quot; /&gt;
  &lt;figcaption&gt;
    &lt;em&gt;
      &lt;strong&gt;Figure 5:&lt;/strong&gt;
      Comparison between entropy mirror descent and Euclidean projected gradient
      for updating &lt;strong&gt;G&lt;/strong&gt;.
    &lt;/em&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure style=&quot;text-align:center;&quot;&gt;
  &lt;img src=&quot;https://clair-clemson.github.io/blog/assets/MF/table.svg&quot; style=&quot;width:100%; height:250px; max-width:800px;&quot; alt=&quot;Clustering results on various datasets&quot; /&gt;
  &lt;figcaption&gt;
    &lt;em&gt;
      &lt;strong&gt;Figure 6:&lt;/strong&gt;
      Clustering results on various datasets.
    &lt;/em&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Nisheeth Vishnoi. “Algorithms for Convex Optimization.” Cambridge University Press, 2021.&lt;/p&gt;

&lt;p&gt;[2] Clément L. Canonne. “A short note on an inequality between KL and TV.” arXiv preprint, 2022.&lt;/p&gt;

&lt;p&gt;[3] Amir Beck. “First-Order Methods in Optimization.” SIAM, 2017.&lt;/p&gt;
</description>
        <pubDate>Tue, 09 Jun 2026 06:30:45 -0400</pubDate>
        <link>https://clair-clemson.github.io/blog/2026/06/09/MF/</link>
        <guid isPermaLink="true">https://clair-clemson.github.io/blog/2026/06/09/MF/</guid>
      </item>
     
    
     
    
     
      <item>
        <title>FISTA</title>
        <description>&lt;p&gt;The &lt;em&gt;fast iterative shrinkage-thresholding algorithm&lt;/em&gt; (FISTA) is an accelerated proximal gradient method for convex optimization problems of the form:&lt;/p&gt;

\[\min_x f(x) = g(x) + h(x)\]

&lt;p&gt;where $g$ is a smooth convex function with a Lipschitz continuous gradient, and $h$ is a convex function that is possibly non-smooth but has a simple proximal operator.&lt;/p&gt;

&lt;h2 id=&quot;proximal-gradient--ista&quot;&gt;Proximal Gradient &amp;amp; ISTA&lt;/h2&gt;

&lt;p&gt;Let us start with the classical &lt;em&gt;proximal gradient&lt;/em&gt; method (also known as &lt;em&gt;ISTA&lt;/em&gt;), on which FISTA is built.&lt;/p&gt;

&lt;p&gt;For a given convex optimization problem
\(\min_x f(x) = g(x) + h(x)\)
where $g$ is differentiable, $\nabla g$ is $L$-Lipschitz, and $h$ is not necessarily differentiable, the proximal gradient method replaces $g$ with a quadratic model $\bar{g}_t$ around the current point $x$ and leaves $h$ untouched. Its update is:&lt;/p&gt;

\[\begin{aligned} x^{+} &amp;amp; =\underset{z}{\operatorname{argmin}}~\bar{g}_t(z)+h(z) \\ &amp;amp; =\underset{z}{\operatorname{argmin}}~g(x)+\nabla g(x)^T(z-x)+\frac{1}{2 t}\|z-x\|_2^2+h(z) \\ &amp;amp; =\underset{z}{\operatorname{argmin}}~\frac{1}{2 t}\|z-(x-t \nabla g(x))\|_2^2+h(z)\end{aligned}\]

&lt;p&gt;where the step size $t\leq\frac{1}{L}$. This can be written as a &lt;em&gt;proximal mapping&lt;/em&gt;:&lt;/p&gt;

\[\operatorname{prox}_{h, t}(x)=\underset{z}{\operatorname{argmin}}~ \frac{1}{2 t}\|x-z\|_2^2+h(z)\]

&lt;p&gt;Then the proximal gradient update can be written as
\(x^{(k)}=\operatorname{prox}_{h, t_k}\left(x^{(k-1)}-t_k \nabla g\left(x^{(k-1)}\right)\right)\)&lt;/p&gt;

&lt;p&gt;or similar to gradient descent:&lt;/p&gt;

&lt;p&gt;\(x^{(k)}=x^{(k-1)}-t_k \cdot G_{t_k}\left(x^{(k-1)}\right)\)
where $G_t(x)=\frac{x-\operatorname{prox}_{h, t}(x-t \nabla g(x))}{t}$ is the generalized gradient of $f$.&lt;/p&gt;

&lt;p&gt;For many important choices of $h$, such as the $\ell_1$ norm of a vector or the $\ell_{2,1}$ or nuclear norm of a matrix, the proximal mapping $\operatorname{prox}_{h, t}$ has a closed form.&lt;/p&gt;
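
&lt;p&gt;To make the closed-form case concrete, here is a minimal sketch for $h(x)=\lambda\|x\|_1$, whose proximal mapping is entrywise soft-thresholding, together with the resulting proximal gradient loop. The smooth part $g(x)=\frac{1}{2}\|Ax-b\|_2^2$ (the lasso), the function names &lt;code&gt;soft_threshold&lt;/code&gt; and &lt;code&gt;ista&lt;/code&gt;, and the toy data are assumptions chosen for illustration.&lt;/p&gt;

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1: shrink every entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, t, iters=100):
    """Proximal gradient (ISTA) for 0.5 * ||A x - b||^2 + lam * ||x||_1 with fixed step t."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                    # gradient of the smooth part g
        x = soft_threshold(x - t * grad, t * lam)   # prox_{h,t} applied to the gradient step
    return x

# With A = I the lasso minimizer is soft_threshold(b, lam), and L = 1 allows t = 1.
b = np.array([3.0, -0.5, 1.5])
x_hat = ista(np.eye(3), b, lam=1.0, t=1.0, iters=5)
print(x_hat)
```

&lt;p&gt;In this toy case the iterate reaches the minimizer after the first step and stays there, because the minimizer is a fixed point of the proximal gradient map.&lt;/p&gt;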

&lt;h2 id=&quot;fista&quot;&gt;FISTA&lt;/h2&gt;

&lt;p&gt;For a convex optimization problem:&lt;/p&gt;

\[\min_x f(x) = g(x) + h(x)\]

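&lt;p&gt;where $\nabla g$ is $L$-Lipschitz, let $p_L(y)=\operatorname{prox}_{h, 1/L}\left(y-\frac{1}{L}\nabla g(y)\right)$ denote one proximal gradient step from $y$ with step size $1/L$. FISTA can then be described as follows:&lt;/p&gt;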

&lt;blockquote&gt;
  &lt;ul&gt;
    &lt;li&gt;Step 0:
      &lt;ul&gt;
        &lt;li&gt;Take $y_1=x_0 \in \mathbb{R}^n, t_1=1$.&lt;/li&gt;
      &lt;/ul&gt;
    &lt;/li&gt;
    &lt;li&gt;Step $k(k \geq 1)$:
      &lt;ul&gt;
        &lt;li&gt;Compute
          &lt;ul&gt;
            &lt;li&gt;$x_k=p_L\left(y_k\right)$&lt;/li&gt;
            &lt;li&gt;$t_{k+1}  =\frac{1+\sqrt{1+4 t_k^2}}{2}$&lt;/li&gt;
            &lt;li&gt;$y_{k+1}  =x_k+\left(\frac{t_k-1}{t_{k+1}}\right)\left(x_k-x_{k-1}\right)$&lt;/li&gt;
          &lt;/ul&gt;
        &lt;/li&gt;
      &lt;/ul&gt;
    &lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;
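
&lt;p&gt;For concreteness, the fixed-step scheme above can be sketched for the lasso problem $g(x)=\frac{1}{2}\|Ax-b\|_2^2$, $h(x)=\lambda\|x\|_1$. The problem choice, the function names, the zero initialization $x_0=0$, and the random data are assumptions made for illustration. Here $L=\|A\|_2^2$ is the Lipschitz constant of $\nabla g$ for this particular $g$, and $p_L$ is a gradient step followed by soft-thresholding.&lt;/p&gt;

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista(A, b, lam, iters=200):
    """FISTA with constant step 1/L for 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2                  # Lipschitz constant of the gradient of g
    x_prev = np.zeros(A.shape[1])                  # x_0
    y, t = x_prev.copy(), 1.0                      # y_1 = x_0, t_1 = 1
    for _ in range(iters):
        x = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)   # x_k = p_L(y_k)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0        # t_{k+1}
        y = x + ((t - 1.0) / t_next) * (x - x_prev)              # y_{k+1}, momentum step
        x_prev, t = x, t_next
    return x_prev

def lasso_objective(A, b, lam, x):
    """Full objective f = g + h for the lasso."""
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
x_k = fista(A, b, lam=0.5, iters=200)
print(lasso_objective(A, b, 0.5, x_k))
```

&lt;p&gt;Only the extrapolation line distinguishes this loop from plain ISTA: setting the momentum coefficient to zero recovers the proximal gradient method.&lt;/p&gt;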

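&lt;p&gt;When $L$ is unknown, the step size can instead be determined by a backtracking rule. Here $Q_L(x, y)=g(y)+\nabla g(y)^T(x-y)+\frac{L}{2}\|x-y\|_2^2+h(x)$ is the quadratic upper model of the objective around $y$:&lt;/p&gt;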

&lt;blockquote&gt;
  &lt;ul&gt;
    &lt;li&gt;Step 0:
      &lt;ul&gt;
        &lt;li&gt;Take $y_1=x_0 \in \mathbb{R}^n, t_1=1$, $\eta &amp;gt;1$, $L_0&amp;gt;0$.&lt;/li&gt;
      &lt;/ul&gt;
    &lt;/li&gt;
    &lt;li&gt;Step $k(k \geq 1)$:
      &lt;ul&gt;
        &lt;li&gt;Find the smallest nonnegative integer $i_k$ such that with $\bar{L}=\eta^{i_k} L_{k-1}$,
          &lt;ul&gt;
            &lt;li&gt;$f\left(p_{\bar{L}}\left(y_k\right)\right) \leq Q_{\bar{L}}\left(p_{\bar{L}}\left(y_k\right), y_k\right)$&lt;/li&gt;
          &lt;/ul&gt;
        &lt;/li&gt;
        &lt;li&gt;Set $L_k=\eta^{i_k} L_{k-1}$ and compute
          &lt;ul&gt;
            &lt;li&gt;$x_k  =p_{L_k}\left(y_k\right)$&lt;/li&gt;
            &lt;li&gt;$t_{k+1}  =\frac{1+\sqrt{1+4 t_k^2}}{2}$&lt;/li&gt;
            &lt;li&gt;$y_{k+1}  =x_k+\left(\frac{t_k-1}{t_{k+1}}\right)\left(x_k-x_{k-1}\right)$&lt;/li&gt;
          &lt;/ul&gt;
        &lt;/li&gt;
      &lt;/ul&gt;
    &lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;
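
&lt;p&gt;The backtracking variant can be sketched in the same way. As before, the lasso problem, the function name &lt;code&gt;fista_backtracking&lt;/code&gt;, the defaults $L_0=1$ and $\eta=2$, the small numerical tolerance in the acceptance test, and the random data are illustrative assumptions. Because $h$ appears on both sides of the acceptance test, only the smooth parts need to be compared.&lt;/p&gt;

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_backtracking(A, b, lam, L0=1.0, eta=2.0, iters=200):
    """FISTA with backtracking for 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    g = lambda z: 0.5 * np.sum((A @ z - b) ** 2)
    x_prev = np.zeros(A.shape[1])                  # x_0
    y, t, L = x_prev.copy(), 1.0, L0               # y_1 = x_0, t_1 = 1
    for _ in range(iters):
        g_y, grad_y = g(y), A.T @ (A @ y - b)
        while True:
            x = soft_threshold(y - grad_y / L, lam / L)     # candidate p_L(y_k)
            d = x - y
            # Accept L once g(x) is below the quadratic model of g around y_k.
            if g_y + grad_y @ d + 0.5 * L * (d @ d) + 1e-12 >= g(x):
                break
            L *= eta                                         # try the next power of eta
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev, L

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
x_bt, L_final = fista_backtracking(A, b, lam=0.5)
print(L_final)
```

&lt;p&gt;When $L_0$ starts below the true Lipschitz constant, the returned estimate stays below $\eta$ times that constant, since any trial value at or above the true constant passes the test.&lt;/p&gt;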

&lt;p&gt;FISTA with fixed step size $t \leq 1 / L$ satisfies
\(f\left(x^{(k)}\right)-f^{\star} \leq \frac{2\left\|x^{(0)}-x^{\star}\right\|_2^2}{t(k+1)^2},\)
and the same bound holds for backtracking with $t$ replaced by $1/(\eta L)$, where $\eta$ is the backtracking factor above. FISTA therefore converges at rate $O(\frac{1}{k^2})$, which is optimal for first-order methods on this problem class. Equivalently, it needs $O(\frac{1}{\sqrt \epsilon})$ iterations to reach accuracy $\epsilon$.&lt;/p&gt;

&lt;p&gt;The figures below compare ISTA and FISTA on lasso regression and lasso logistic regression. In both cases, $n=100$ and $p=500$.&lt;/p&gt;

&lt;div style=&quot;text-align:center;display: flex;flex-wrap:wrap;justify-content:center&quot;&gt;
&lt;div&gt;

&lt;img src=&quot;https://clair-clemson.github.io/blog/assets/fista/prox-grad1.svg&quot; /&gt;
&lt;br /&gt;
&lt;i&gt;&lt;b&gt;Figure 1:&lt;/b&gt; lasso regression&lt;/i&gt;
&lt;/div&gt;
&lt;div&gt;

&lt;img src=&quot;https://clair-clemson.github.io/blog/assets/fista/prox-grad2.svg&quot; /&gt;
&lt;br /&gt;
&lt;i&gt;&lt;b&gt;Figure 2:&lt;/b&gt; lasso logistic regression&lt;/i&gt;

&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;Beck, Amir, and Marc Teboulle. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM Journal on Imaging Sciences, vol. 2, no. 1, Mar. 2009, pp. 183–202, https://doi.org/10.1137/080716542.&lt;/li&gt;
  &lt;li&gt;Ryan Tibshirani, “Proximal gradient descent”, Convex Optimization: Fall 2018, https://www.stat.cmu.edu/~ryantibs/convexopt-F18/&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Sun, 07 May 2023 08:31:47 -0400</pubDate>
        <link>https://clair-clemson.github.io/blog/2023/05/07/FISTA/</link>
        <guid isPermaLink="true">https://clair-clemson.github.io/blog/2023/05/07/FISTA/</guid>
      </item>
     
    
     
    
  </channel>
</rss>
