
Sparse Orthogonality-Constrained Optimization via ADPMM

Fri, 12 Jun 2026 02:20:45 -0400

Optimization problems with orthogonality constraints arise throughout machine learning, including dimensionality reduction, clustering, and representation learning. These problems require the variable to lie on the Stiefel manifold, the set of matrices with orthonormal columns. While orthogonality improves interpretability and numerical stability, it also makes the feasible set nonconvex, and things get harder still when the objective contains a non-smooth sparsity-inducing term.

\[\newcommand{\xm}{\mathbf{X}} \newcommand{\am}{\mathbf{A}} \newcommand{\bm}{\mathbf{B}} \newcommand{\lm}{\mathbf{L}} \newcommand{\hm}{\mathbf{H}} \newcommand{\mm}{\mathbf{M}} \newcommand{\gm}{\mathbf{G}} \newcommand{\qm}{\mathbf{Q}} \newcommand{\pm}{\mathbf{P}} \newcommand{\um}{\mathbf{U}} \newcommand{\vm}{\mathbf{V}} \newcommand{\ym}{\mathbf{Y}} \newcommand{\zm}{\mathbf{Z}} \newcommand{\im}{\mathbf{I}} \newcommand{\fm}{\mathbf{F}} \newcommand{\vx}{\mathbf{x}} \newcommand{\vy}{\mathbf{y}} \newcommand{\vz}{\mathbf{z}} \newcommand{\vc}{\mathbf{c}} \newcommand{\vb}{\mathbf{b}} \newcommand{\vu}{\mathbf{u}} \newcommand{\vt}{\mathbf{t}} \newcommand{\vg}{\mathbf{g}} \newcommand{\vw}{\mathbf{w}} \newcommand{\trace}{\operatorname{tr}} \newcommand{\diag}{\operatorname{diag}} \newcommand{\sign}{\operatorname{sign}} \newcommand{\norm}[1]{\left\|#1\right\|} \newcommand{\inner}[2]{\left\langle #1,#2 \right\rangle} \newcommand{\pos}[1]{\left[#1\right]_+}\]

A prominent example is sparse principal component analysis (sparse PCA), which enhances interpretability by enforcing sparsity on the principal components while keeping them orthogonal:

\[\min_{\xm}\; -\frac{1}{2}\trace(\xm^\top \am^\top\am \xm)+\lambda\norm{\xm}_1 \quad\text{s.t.}\quad \xm^\top\xm=\im,\]

where $\am$ is the data matrix (each row a sample), $\xm$ collects the principal components, and $\lambda$ controls the sparsity level. Similarly, Sparse Spectral Clustering (SSC) injects sparsity into spectral embeddings to improve robustness and interpretability in graph-based learning:

\[\min_{\xm}\; \frac{1}{2}\trace(\xm^\top \lm \xm)+\lambda\norm{\xm}_1 \quad\text{s.t.}\quad \xm^\top\xm=\im,\]

where $\lm$ is a graph Laplacian. The same template covers many more models: unsupervised feature selection replaces the $\ell_1$ norm with the row-sparsity-promoting $\ell_{2,1}$ norm, and compressed modes in physics seeks spatially localized solutions of the independent-particle Schrödinger equation by taking the quadratic term to be a discretized Schrödinger operator $\hm$. In every case the difficulty is the same: a non-smooth sparsity term sitting on top of a nonconvex orthogonality constraint.

Existing solvers largely fall into two camps. Riemannian methods (ManPG, ManPG-Ada, RADMM, ARADMM, OADMM) operate directly on the Stiefel manifold and preserve feasibility via retractions or projections, but they typically need repeated manifold operations or per-iteration line searches, which become expensive at scale. Relaxation or splitting methods decouple sparsity from orthogonality, but may sacrifice feasibility or introduce approximation error (e.g., Moreau-envelope smoothing of the regularizer). In this post we discuss an algorithm that takes the splitting route without giving up feasibility: an Alternating Direction Proximal Method of Multipliers (ADPMM) that handles the orthogonality constraint without relaxation, and whose per-iteration work reduces to one Stiefel projection plus one element-wise soft-thresholding.

From ADMM to ADPMM

Recall the classical Alternating Direction Method of Multipliers (ADMM). For a separable problem

\[\min_{\vx,\vz}\; H(\vx,\vz)=h_1(\vx)+h_2(\vz) \quad\text{s.t.}\quad \am\vx+\bm\vz=\vc,\]

ADMM works on the augmented Lagrangian with dual variable $\vy$ and penalty $\rho>0$:

\[L_\rho(\vx,\vz,\vy)=h_1(\vx)+h_2(\vz)+\vy^\top(\am\vx+\bm\vz-\vc)+\frac{\rho}{2}\norm{\am\vx+\bm\vz-\vc}^2,\]

alternating a minimization over $\vx$, a minimization over $\vz$, and a dual ascent step on $\vy$. ADMM is simple and effective, but its performance can deteriorate when the subproblems are ill-conditioned or lack strong convexity; as we will see, the plain $\vx$-subproblem for sparse PCA is not something we can solve in closed form.

ADPMM fixes this by adding a quadratic proximal term to each primal update. For two positive semidefinite matrices $\gm$ and $\qm$, with $\norm{\vx}_\gm^2=\vx^\top\gm\vx$:

Algorithm 1 (ADPMM).

Input: initial $\vx,\vy,\vz$; penalty $\rho$; proximal matrices $\gm,\qm$.

For $k=0,1,2,\ldots$:
1. \[\vx^{k+1}\in\arg\min_\vx\Big\{h_1(\vx)+\frac{\rho}{2}\norm{\am\vx+\bm\vz^k-\vc+\tfrac{1}{\rho}\vy^k}^2+\frac{1}{2}\norm{\vx-\vx^k}_\gm^2\Big\}\]
2. \[\vz^{k+1}\in\arg\min_\vz\Big\{h_2(\vz)+\frac{\rho}{2}\norm{\am\vx^{k+1}+\bm\vz-\vc+\tfrac{1}{\rho}\vy^k}^2+\frac{1}{2}\norm{\vz-\vz^k}_\qm^2\Big\}\]
3. \[\vy^{k+1}=\vy^k+\rho(\am\vx^{k+1}+\bm\vz^{k+1}-\vc)\]

When $\gm=\qm=\mathbf{0}$, ADPMM degenerates to vanilla ADMM. The proximal terms play two roles: they stabilize the updates by penalizing large deviations from the previous iterate, and they can be chosen to cancel inconvenient quadratic terms in the objective, turning an otherwise hard subproblem into a closed-form one.

Solving Sparse PCA and SSC via ADPMM

To put sparse PCA into the ADPMM template, introduce an auxiliary variable $\zm$ that carries the sparsity term, leaving the orthogonality constraint on $\xm$:

\[\begin{aligned} \min_{\xm,\zm}\quad & -\frac{1}{2}\trace(\xm^\top\am^\top\am\xm)+\lambda\norm{\zm}_1\\ \text{s.t.}\quad & \xm^\top\xm=\im,\qquad \xm=\zm. \end{aligned}\]

Now design the proximal matrices. In the $\xm$-update, the objective contributes the concave quadratic $-\frac{1}{2}\trace(\xm^\top\am^\top\am\xm)$. Choosing

\[\gm=\am^\top\am\]

exactly cancels this quadratic: the proximal term $\frac{1}{2}\norm{\xm-\xm^k}_\gm^2$ expands to $\frac{1}{2}\trace(\xm^\top\am^\top\am\xm)$ plus terms linear in $\xm$. What remains of the $\xm$-subproblem is a linear function of $\xm$ plus $\frac{\rho}{2}\norm{\xm-\zm^k+\tfrac{1}{\rho}\ym^k}_F^2$, restricted to the Stiefel manifold; this is the Orthogonal Procrustes problem, solved in closed form by the SVD:

\[\xm^{k+1}=\um\vm^\top, \qquad \um\Sigma\vm^\top=\operatorname{svd}\big(\rho\zm^k-\ym^k+\gm\xm^k\big).\]

Since the objective contributes no quadratic term in $\zm$ (only the augmented-Lagrangian penalty does), we simply take $\qm=\mathbf{0}$, and the update reduces to element-wise soft-thresholding:

\[\zm^{k+1}=\mathcal{S}_{\lambda/\rho}\Big(\xm^{k+1}+\tfrac{1}{\rho}\ym^k\Big), \qquad \mathcal{S}_{\tau}(\xm)=\sign(\xm)\odot\max(|\xm|-\tau,0).\]

Algorithm 2 (ADPMM for Sparse PCA).

Input: initial $\xm,\ym,\zm$; penalty $\rho$; data matrix $\am$. Set $\gm=\am^\top\am$.

For $k=0,1,2,\ldots$:
1. Compute $\operatorname{svd}(\rho\zm^k-\ym^k+\gm\xm^k)=\um\Sigma\vm^\top$.
2. \[\xm^{k+1}=\um\vm^\top \quad (\text{or via the Newton–Schulz iteration of Algorithm 3})\]
3. \[\zm^{k+1}=\mathcal{S}_{\lambda/\rho}\big(\xm^{k+1}+\tfrac{1}{\rho}\ym^k\big)\]
4. \[\ym^{k+1}=\ym^k+\rho(\xm^{k+1}-\zm^{k+1})\]
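A minimal NumPy sketch of Algorithm 2 with the exact SVD projection is given below. The function names, the QR-based random initialization, the default iteration count, the default value of `rho`, and the toy data are assumptions made for illustration; only the three update formulas are taken from the algorithm above.

```python
import numpy as np

def soft_threshold(M, tau):
    """Element-wise soft-thresholding S_tau(M)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def adpmm_sparse_pca(A, p, lam, rho=None, iters=200, seed=0):
    """Sketch of ADPMM for sparse PCA with G = A^T A and Q = 0."""
    rng = np.random.default_rng(seed)
    G = A.T @ A                                   # proximal matrix G = A^T A
    if rho is None:
        rho = np.linalg.eigvalsh(G)[-1]           # assumed default: lambda_max(A^T A)
    # Assumed initialization: a random point on the Stiefel manifold.
    X, _ = np.linalg.qr(rng.standard_normal((A.shape[1], p)))
    Z, Y = X.copy(), np.zeros_like(X)
    for _ in range(iters):
        U, _, Vt = np.linalg.svd(rho * Z - Y + G @ X, full_matrices=False)
        X = U @ Vt                                # Procrustes / Stiefel projection
        Z = soft_threshold(X + Y / rho, lam / rho)
        Y = Y + rho * (X - Z)                     # dual ascent step
    return X, Z, Y

# Toy data (hypothetical): 40 samples, 12 features, 3 components.
A = np.random.default_rng(1).standard_normal((40, 12))
X, Z, Y = adpmm_sparse_pca(A, p=3, lam=0.5)
```

In this sketch `X` is orthonormal by construction at every iteration, while `Z` carries the sparsity; the two agree only in the limit, so which of them to report is a design choice left to the user.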

The Sparse Spectral Clustering case is identical except that the quadratic is $+\frac{1}{2}\trace(\xm^\top\lm\xm)$, a convex term we cancel with

\[\gm=\lambda_{\max}(\lm)\,\im-\lm,\]

where the shift by $\lambda_{\max}(\lm)$ ensures $\gm\succeq 0$ so the proximal term is a valid Bregman-like penalty. Every other line of the algorithm is unchanged. Thus, one framework, two problems; with the obvious substitutions, the unsupervised feature selection and compressed-modes models as well (for the $\ell_{2,1}$ norm, soft-thresholding is replaced by its row-wise group analogue).

Each iteration therefore costs one matrix multiplication, one orthogonalization, and one entrywise shrinkage. No retractions, no line searches, no smoothing of the regularizer.
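The effect of the shifted proximal matrix for SSC can be checked numerically. The sketch below builds $\gm=\lambda_{\max}(\lm)\,\im-\lm$ for a small path-graph Laplacian (a hypothetical toy example, as is the helper name) and confirms that $\gm\succeq 0$ and that the two quadratics sum to a constant on the Stiefel manifold.

```python
import numpy as np

def ssc_proximal_matrix(L):
    """G = lambda_max(L) * I - L for a symmetric Laplacian L."""
    lam_max = np.linalg.eigvalsh(L)[-1]           # eigvalsh returns ascending eigenvalues
    return lam_max * np.eye(L.shape[0]) - L

# Laplacian of a 4-node path graph (toy example).
W = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
L = np.diag(W.sum(axis=1)) - W
G = ssc_proximal_matrix(L)

# For any X with X^T X = I_p: tr(X^T L X) + tr(X^T G X) = lambda_max(L) * p.
X, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((4, 2)))
quadratic_sum = np.trace(X.T @ L @ X) + np.trace(X.T @ G @ X)
```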

Convergence Analysis

When $h_1$ and $h_2$ are proper, closed and convex and the proximal matrices make the subproblems strongly convex (our Hypothesis 1), ADPMM enjoys an ergodic $\mathcal{O}(1/n)$ rate in both objective gap and feasibility violation.

Theorem 1. Let $\{(\vx^k,\vz^k)\}$ be generated by Algorithm 2 (or its SSC variant), let $(\vx^*,\vz^*)$ be an optimal primal solution and $\vy^*$ an optimal dual solution. Then, under Hypothesis 1, for any $\gamma>2\norm{\vy^*}$ and $n\ge 0$:

\[H(\vx^{(n)},\vz^{(n)})-H(\vx^*,\vz^*) \le \frac{\norm{\vx^*-\vx^0}_\gm^2+\norm{\vz^*-\vz^0}_{\rho\im}^2+\frac{1}{\rho}(\gamma+\norm{\vy^0})^2}{2(n+1)},\] \[\norm{\vx^{(n)}-\vz^{(n)}} \le \frac{\norm{\vx^*-\vx^0}_\gm^2+\norm{\vz^*-\vz^0}_{\rho\im}^2+\frac{1}{\rho}(\gamma+\norm{\vy^0})^2}{\gamma(n+1)},\]

where $\vx^{(n)}=\frac{1}{n+1}\sum_{k=0}^n\vx^{k+1}$ and $\vz^{(n)}=\frac{1}{n+1}\sum_{k=0}^n\vz^{k+1}$ are the ergodic averages of the iterates produced by the first $n+1$ iterations.

The proof leans on the following classical “certificate” result [1], which converts an approximate saddle-point bound into separate bounds on suboptimality and infeasibility.

Theorem 2. Let $f_{\mathrm{opt}}$ be the optimal value of a convex problem $\min_\vx\{f(\vx):\vg(\vx)\le 0,\;\am\vx+\vb=0\}$ for which strong duality holds with optimal dual solution $(\vy^*,\vz^*)$. If for some $\delta>0$, $\rho_1\ge 2\norm{\vy^*}_2$, $\rho_2\ge 2\norm{\vz^*}_2$ a point $\tilde\vx$ satisfies

\[f(\tilde\vx)-f_{\mathrm{opt}}+\rho_1\norm{\pos{\vg(\tilde\vx)}}_2+\rho_2\norm{\am\tilde\vx+\vb}_2\le\delta,\]

then

\[f(\tilde\vx)-f_{\mathrm{opt}}\le\delta, \qquad \norm{\pos{\vg(\tilde\vx)}}_2\le\frac{2}{\rho_1}\delta, \qquad \norm{\am\tilde\vx+\vb}_2\le\frac{2}{\rho_2}\delta.\]

Proof. The first bound is immediate from the non-negativity of the last two terms on the left-hand side. For the others, define the perturbation function

\[v(\vu,\vt)=\min_{\vx}\{f(\vx):\vg(\vx)\le\vu,\;\am\vx+\vb=\vt\}.\]

Optimality of the dual pair gives $(-\vy^*,-\vz^*)\in\partial v(0,0)$, hence

\[v(\vu,\vt)-v(0,0)\ge\inner{-\vy^*}{\vu}+\inner{-\vz^*}{\vt}.\]

Define $\tilde\vu=\pos{\vg(\tilde\vx)}$ and $\tilde\vt=\am\tilde\vx+\vb$. Then

\[\begin{aligned} (\rho_1-\norm{\vy^*}_2)\norm{\tilde\vu}_2+(\rho_2-\norm{\vz^*}_2)\norm{\tilde\vt}_2 &\le \inner{-\vy^*}{\tilde\vu}+\inner{-\vz^*}{\tilde\vt}+\rho_1\norm{\tilde\vu}_2+\rho_2\norm{\tilde\vt}_2\\ &\le v(\tilde\vu,\tilde\vt)-v(0,0)+\rho_1\norm{\tilde\vu}_2+\rho_2\norm{\tilde\vt}_2\\ &\le f(\tilde\vx)-f_{\mathrm{opt}}+\rho_1\norm{\tilde\vu}_2+\rho_2\norm{\tilde\vt}_2\le\delta. \end{aligned}\]

Both summands on the left are non-negative, so each is at most $\delta$, and since $\rho_1-\norm{\vy^*}_2\ge\rho_1/2$ and $\rho_2-\norm{\vz^*}_2\ge\rho_2/2$ we conclude $\norm{\tilde\vu}_2\le\frac{2}{\rho_1}\delta$ and $\norm{\tilde\vt}_2\le\frac{2}{\rho_2}\delta$.

With Theorem 2 in hand, the proof of Theorem 1 proceeds in four steps. (i) Fermat’s optimality conditions for the two subproblems yield subgradient inequalities at $\vx^{k+1}$ and $\vz^{k+1}$, the first carrying the extra proximal term $\gm(\vx^{k+1}-\vx^k)$. (ii) Adding them and applying, to each inner product, the four-point identity

\[(\mathbf{a}-\mathbf{b})^\top\mathbf{P}(\mathbf{c}-\mathbf{d}) =\frac{1}{2}\big(\norm{\mathbf{a}-\mathbf{d}}_{\mathbf{P}}^2-\norm{\mathbf{a}-\mathbf{c}}_{\mathbf{P}}^2+\norm{\mathbf{b}-\mathbf{c}}_{\mathbf{P}}^2-\norm{\mathbf{b}-\mathbf{d}}_{\mathbf{P}}^2\big)\]

for $\mathbf{P}\in\{\gm,\rho\im,\frac{1}{\rho}\im\}$ produces a single inequality of the form

\[H(\vx,\vz)-H(\vx^{k+1},\vz^{k+1})+\inner{\vw-\tilde\vw^k}{\fm\vw} \ge \frac{1}{2}\norm{\vw-\vw^{k+1}}_{\hm}^2-\frac{1}{2}\norm{\vw-\vw^k}_{\hm}^2,\]

where $\vw=(\vx,\vz,\vy)$, $\hm=\diag(\gm,\rho\im,\frac{1}{\rho}\im)$, and $\fm$ is a skew-symmetric matrix. Skew-symmetry is what lets us swap $\fm\tilde\vw^k$ for $\fm\vw$ at no cost. (iii) Telescoping over $k=0,\ldots,n$ and invoking convexity of $H$ at the ergodic averages gives a bound that shrinks like $\frac{1}{2(n+1)}\norm{\vw-\vw^0}_{\hm}^2$. (iv) Plugging in $(\vx^*,\vz^*)$, maximizing the dual variable over a ball of radius $\gamma>2\norm{\vy^*}$, and applying Theorem 2 splits the resulting bound into the two inequalities of Theorem 1.

A remark is in order: the original problems are nonconvex because of the Stiefel constraint, so Theorem 1 should be read as the guarantee inherited by the splitting under Hypothesis 1; recent global-convergence results for nonconvex ADMM provide partial justification for the full nonconvex setting, and empirically the iterates converge stably in all our experiments.

Alternatives to SVD: a Newton–Schulz Projection

The SVD in the $\xm$-update is exact but costs $\mathcal{O}(n^3)$ for an $n\times n$ input, which dominates the per-iteration time at scale. Any orthogonalization that maps $\pm\mapsto\um\vm^\top$ will do, and cheaper iterative candidates abound: QR, Modified Gram–Schmidt, polar decomposition, and the Newton–Schulz iteration, which uses only matrix multiplications and is therefore extremely fast on modern hardware.

Algorithm 3 (Newton–Schulz orthogonalization).

Input: $\pm=\rho\zm^k-\ym^k+\gm\xm^k\in\mathbb{R}^{n\times p}$; small $\varepsilon$; coefficients $a=1.9$, $b=-1.3$, $c=0.4$.

If $n>p$, transpose: $\pm\leftarrow\pm^\top$ (iterate on the wider orientation).
Normalize: $\qm\leftarrow\pm/(\norm{\pm}_F+\varepsilon)$.
For $t=1,\ldots,5$:
\[\am\leftarrow\qm\qm^\top,\qquad \bm\leftarrow b\am+c\am^2,\qquad \qm\leftarrow a\qm+\bm\qm.\]
Transpose back if needed. Output: $\xm^{k+1}=\qm$.
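Algorithm 3 translates almost line by line into NumPy. The sketch below is an illustration under stated assumptions: the function name, the default `eps`, and the random test matrix are hypothetical, and the demo deliberately runs 30 inner iterations instead of five so that the comparison with the SVD polar factor does not depend on the conditioning of this particular input.

```python
import numpy as np

def newton_schulz(P, iters=5, eps=1e-7, a=1.9, b=-1.3, c=0.4):
    """Approximate the polar factor U V^T of P using only matrix products."""
    transposed = P.shape[0] > P.shape[1]
    Q = P.T if transposed else P                  # iterate on the wide orientation
    Q = Q / (np.linalg.norm(Q) + eps)             # Frobenius normalization
    for _ in range(iters):
        A = Q @ Q.T                               # small Gram matrix
        B = b * A + c * (A @ A)
        Q = a * Q + B @ Q                         # odd quintic in Q
    return Q.T if transposed else Q

rng = np.random.default_rng(0)
P = rng.standard_normal((30, 4))
X_ns = newton_schulz(P, iters=30)                 # 30 iterations: assumption for this demo
U, _, Vt = np.linalg.svd(P, full_matrices=False)
polar_error = np.linalg.norm(X_ns - U @ Vt)
```

How many inner iterations are needed in practice depends on how small the normalized singular values of the input are, so `iters` is best treated as a tunable parameter.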

Why does this work, and where do $(a,b,c)$ come from? Each inner step applies the odd quintic

\[\qm_{t+1}=a\qm_t+b\qm_t(\qm_t^\top\qm_t)+c\qm_t(\qm_t^\top\qm_t)^2,\]

which, writing $\qm_t=\um\Sigma_t\vm^\top$, acts only on the singular values:

\[\Sigma\mapsto p(\Sigma),\qquad p(\sigma)=a\sigma+b\sigma^3+c\sigma^5,\]

while $\um,\vm$ stay fixed. Driving $\qm\to\um\vm^\top$ is exactly driving every singular value to $1$, i.e., making $\sigma=1$ an attracting fixed point of $p$. We therefore impose

\[p(1)=a+b+c=1, \qquad p'(1)=a+3b+5c=0,\]

where the first condition makes $\sigma=1$ a fixed point and the second makes it superattracting. Two linear conditions leave one degree of freedom; solving for $a,b$ in terms of $c$:

\[a=\tfrac{3}{2}+c,\qquad b=-\tfrac{1}{2}-2c.\]

Since $p'(1)=0$, a Taylor expansion gives $p(1+\delta)=1+\tfrac{1}{2}p''(1)\delta^2+\mathcal{O}(\delta^3)$, so singular values near $1$ converge quadratically. The Frobenius normalization in Step 2 places all singular values in $[0,1]$, where $\sigma=0$ is repelling because $p'(0)=a>1$; the iteration amplifies small singular values and locks large ones onto $1$. We use $c=0.4$, giving $(a,b,c)=(1.9,-1.3,0.4)$; the remaining freedom in $c$ trades off how aggressively small singular values are amplified against overshoot near $\sigma=1$.
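The two design conditions and the resulting scalar dynamics are easy to verify directly. The snippet below is a small check, not part of the solver: the starting value $0.05$ and the 12 iterations are arbitrary choices made for this illustration.

```python
a, b, c = 1.9, -1.3, 0.4

def p(s):
    """Scalar map applied to each singular value by one Newton-Schulz step."""
    return a * s + b * s**3 + c * s**5

def dp(s):
    """Derivative of p."""
    return a + 3 * b * s**2 + 5 * c * s**4

# Follow one small singular value under repeated application of p.
sigma = 0.05
trajectory = [sigma]
for _ in range(12):
    sigma = p(sigma)
    trajectory.append(sigma)
```

The early entries of `trajectory` grow by a factor close to $a=1.9$ per step, and the late entries settle at $1$, which is the behavior the fixed-point analysis predicts.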

It is worth contrasting this with the widely used Muon coefficients from the deep-learning optimizer literature: those are tuned only to push singular values into a band around $1$ within a fixed iteration budget and deliberately violate $p'(1)=0$. Our solver must satisfy $\xm^\top\xm=\im$ exactly, so we instead enforce an exact superattracting fixed point at $\sigma=1$. With this choice, $\norm{\xm^\top\xm-\im}_F$ reached machine precision within the five inner iterations in all experiments. NS-ADPMM matches the SVD variant’s constraint accuracy at a fraction of the cost.

Figure 1: Comparison of SVD vs alternative orthogonalization methods (QR, MGS, polar, Newton–Schulz) inside ADPMM. Vanilla SVD is accurate but takes more time.

Experimental Results

We evaluate two instances of the framework: SVD-ADPMM (exact Stiefel projection) and NS-ADPMM (Newton–Schulz projection) against five state-of-the-art manifold baselines: ManPG, ManPG-Ada, RADMM, ARADMM, and OADMM. Datasets span text (News20, RCV1), images (MNIST, USPS), citation/collaboration/social graphs (Cora, ca-GrQc, Facebook), and synthetic problems with planted structure. All methods share the same fixed random initialization; we set $\rho=\lambda_{\max}(\am^\top\am)$ for SPCA and $\rho=\tfrac{1}{2}\lambda_{\max}(\lm)$ for SSC. Both formulations are minimizations, so lower curves are better.

On News20, OADMM matches our methods per iteration, but it pays for each iteration with a backtracking line search on the manifold, requiring repeated retractions and augmented-Lagrangian evaluations. The wall-clock plots tell the real story: NS-ADPMM and SVD-ADPMM converge significantly faster in time than every baseline.

Figure 2: Sparse PCA on News20 (n=15,935, p=k=50): objective versus wall-clock time.

Figure 3: Sparse PCA on RCV1 (n=47,236, p=k=50), stressing scalability in the feature dimension.

For Sparse Spectral Clustering we build Gaussian similarity graphs with bandwidth set to the median squared pairwise distance. On MNIST and other datasets, ADPMM with either projection variant converges to a lower objective in fewer iterations and less time than the baselines, with NS-ADPMM the most efficient.

Figure 4: Sparse spectral clustering on MNIST (60,000 samples, 10 clusters): objective versus time.

References

[1] Amir Beck. “First-order methods in optimization.” SIAM, 2017.

[2] Stephen Boyd et al. “Distributed optimization and statistical learning via the alternating direction method of multipliers.” Foundations and Trends in Machine Learning, 2011.

[3] Shixiang Chen et al. “Proximal gradient method for nonsmooth optimization over the Stiefel manifold.” SIAM Journal on Optimization, 2020.

[4] Nicholas J. Higham. “Functions of matrices: theory and computation.” SIAM, 2008.

[5] Keller Jordan et al. “Muon: An optimizer for hidden layers in neural networks.” 2024.

Simplex-Constrained Incoherent Matrix Factorization with Hybrid Mirror Descent

Tue, 09 Jun 2026 06:30:45 -0400

Matrix factorization is a fundamental tool in data mining and machine learning, with applications in clustering, topic modeling, recommender systems, and interpretable representation learning. Given a data matrix $\mathbf{X}\in\mathbb{R}^{d\times n}$, a typical factorization model seeks a low-rank approximation

\[\newcommand{\xm}{\mathbf{X}} \newcommand{\fm}{\mathbf{F}} \newcommand{\gm}{\mathbf{G}} \newcommand{\sm}{\mathbf{S}} \newcommand{\um}{\mathbf{U}} \newcommand{\am}{\mathbf{A}} \newcommand{\ym}{\mathbf{Y}} \newcommand{\zm}{\mathbf{Z}} \newcommand{\vx}{\mathbf{x}} \newcommand{\vf}{\mathbf{f}} \newcommand{\vg}{\mathbf{g}} \newcommand{\vq}{\mathbf{q}} \newcommand{\vu}{\mathbf{u}} \newcommand{\vv}{\mathbf{v}} \newcommand{\va}{\mathbf{a}} \newcommand{\ve}{\mathbf{e}} \newcommand{\rank}{\operatorname{rank}} \newcommand{\Tr}{\operatorname{Tr}} \newcommand{\relint}{\operatorname{relint}} \newcommand{\prox}{\operatorname{prox}} \newcommand{\SVT}{\operatorname{SVT}} \newcommand{\Phiobj}{\Phi} \newcommand{\Domega}{D_{\omega}} \newcommand{\norm}[1]{\left\|#1\right\|} \newcommand{\inner}[2]{\left\langle #1,#2 \right\rangle} \newcommand{\pos}[1]{\left[#1\right]_+}\] \[\xm\approx \fm\gm,\]

where $\fm\in\mathbb{R}^{d\times k}$ contains basis components and $\gm\in\mathbb{R}^{k\times n}$ contains low-dimensional representations of the samples. Classical nonnegative matrix factorization (NMF) improves interpretability by imposing nonnegativity on both factors, while many variants further introduce sparsity, orthogonality, or graph regularization.

Existing factorization methods still face two limitations. First, the learned basis components can be highly redundant. In NMF and related models, different columns of $\fm$ may become strongly correlated, making the learned representation less interpretable and less discriminative for downstream tasks such as clustering. Second, many classical algorithms, especially multiplicative-update methods, are tightly coupled with nonnegativity assumptions and are therefore less flexible for real-valued data or more general constrained factorization settings. These observations motivate a matrix factorization framework that learns diverse basis components, produces interpretable sample representations, and remains applicable beyond standard NMF.

Simplex-Constrained Incoherent Matrix Factorization

In this blog, we will discuss Simplex-Constrained Incoherent Matrix Factorization. The key idea is to represent each sample as a convex combination of basis vectors by constraining every column of $\gm$ to lie on the probability simplex:

\[\vg_j\in\Delta_k,\qquad \Delta_k=\{\vg\in\mathbb{R}^k:\vg\ge 0,\ \mathbf{1}^{\top}\vg=1\}.\]

Thus, $ \vx_j\approx \fm\vg_j=\sum_{r=1}^k g_{rj}\vf_r, $ where $g_{rj}$ measures the contribution of the $r$-th basis component to the $j$-th sample. This simplex constraint gives $\gm$ a natural probabilistic interpretation as a soft cluster-assignment matrix and makes the representation useful for clustering and interpretable data analysis. In contrast, K-means assigns each sample to a hard one-hot indicator vector. Thus, our formulation can be viewed as a simplex-based soft relaxation of the hard assignment used in K-means.

In addition, to encourage diverse basis components, we introduce an incoherence regularizer based on the positive off-diagonal entries of the Gram matrix:

\[\left\| \left[ \fm^\top \fm-\operatorname{diag}(\fm^\top \fm) \right]_+ \right\|_F^2= \sum_{r\neq s} [\langle \vf_r,\vf_s\rangle]_+^2.\]

Therefore, the regularizer discourages positive correlation among different basis vectors while not penalizing negative inner products. Compared with a linear inner-product penalty, the squared positive-part formulation places stronger emphasis on highly positively correlated basis pairs and provides a simple smooth penalty away from the hinge point.
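A two-vector example makes the asymmetry of the penalty concrete. The function name and the toy vectors below are illustrative assumptions.

```python
import numpy as np

def incoherence_penalty(F):
    """|| [F^T F - diag(F^T F)]_+ ||_F^2: squared positive off-diagonal Gram entries."""
    gram = F.T @ F
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.sum(np.maximum(off_diag, 0.0) ** 2))

f1 = np.array([1.0, 0.0])
aligned = np.column_stack([f1, np.array([1.0, 1.0])])    # <f1, f2> = 1
opposed = np.column_stack([f1, np.array([-1.0, 1.0])])   # <f1, f2> = -1

penalty_aligned = incoherence_penalty(aligned)   # both (1,2) and (2,1) entries count
penalty_opposed = incoherence_penalty(opposed)   # negative inner product is not penalized
```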

We further impose a data-adaptive column norm constraint on $\fm$: $ \|\vf_i\|_2\le R, \ i=1,\dots,k, $ where $R$ can be chosen according to the scale of the data, such as $R=\max_j\|\vx_j\|_2$. This constraint controls the scale of the basis matrix without requiring data normalization. When nonnegative bases are desired, we additionally impose $\fm\ge 0$; for real-valued data, we simply remove this nonnegativity constraint. Therefore, the same model naturally covers both NMF-style and real-valued matrix factorization settings.

Figure 0: Illustration of the data-adaptive basis norm constraint.

The proposed optimization problem is

\[\begin{aligned} \min_{\fm,\gm} \Phi(\fm,\gm) := \quad & \frac{1}{2}\|\xm-\fm\gm\|_F^2 + \lambda \left\| \left[ \fm^\top \fm-\operatorname{diag}(\fm^\top \fm) \right]_+ \right\|_F^2 \end{aligned}\]

subject to

\[\fm\in\mathcal{C}_F,\qquad \vg_j\in\Delta_k,\quad j=1,\dots,n,\]

where $\mathcal{C}_F$ denotes either the nonnegative scale-constrained set or the real-valued scale-constrained set. To optimize this model, we develop a hybrid projected-gradient and entropy mirror-descent algorithm as follows.

Algorithm 1: Hybrid Projected–Entropic Mirror Descent

Input: Data matrix $\xm$, rank $k$, parameters $\lambda, R, \epsilon$

Output: Factor matrices $\fm, \gm$

Initialize $\fm^0 \in \mathcal{C}_F$ and $\vg_j^0 \in \operatorname{relint}(\Delta_k)$ for $j = 1,\ldots,n$.
For $t = 0,1,2,\ldots$:
1. Compute
  \[\sm^t = \left[ (\fm^t)^\top \fm^t - \operatorname{diag}\big((\fm^t)^\top \fm^t\big) \right]_+ .\]
2. Compute
  \[\nabla_{\fm}\Phi(\fm^t,\gm^t) = (\fm^t\gm^t-\xm)(\gm^t)^\top + 4\lambda \fm^t\sm^t .\]
3. Set $\alpha_t = 1/\overline{L}_F$ and update
  \[\fm^{t+1} = \Pi_{\mathcal{C}_F} \left( \fm^t-\alpha_t\nabla_{\fm}\Phi(\fm^t,\gm^t) \right).\]
4. Set
  \[\eta_t = \frac{1}{\|\fm^{t+1}\|_2^2+\epsilon}.\]
5. For $j = 1,\ldots,n$:
  1. Compute
    \[\vq_j^{t+1} = (\fm^{t+1})^\top (\fm^{t+1}\vg_j^t-\vx_j).\]
  2. For $r = 1,\ldots,k$, update
    \[g_{rj}^{t+1} = \frac{ g_{rj}^{t}\exp(-\eta_t q_{rj}^{t+1}) }{ \sum_{\ell=1}^{k} g_{\ell j}^{t}\exp(-\eta_t q_{\ell j}^{t+1}) }.\]
6. If a stopping criterion is satisfied, break.
Return $\fm^{t+1}, \gm^{t+1}$.
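The steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names, the random initialization of $\fm$, the uniform initialization of $\gm$, the fixed iteration count used in place of a stopping criterion, and the toy data are all hypothetical. The step sizes use $R=\max_j\|\vx_j\|_2$ and $\overline{L}_F=n+12\lambda kR^2$.

```python
import numpy as np

def project_columns(F, R, nonneg=False):
    """Projection onto C_F: optional clipping at zero, then column-wise l2 ball."""
    if nonneg:
        F = np.maximum(F, 0.0)
    norms = np.linalg.norm(F, axis=0)
    return F / np.maximum(1.0, norms / R)

def positive_offdiag_gram(F):
    """S = [F^T F - diag(F^T F)]_+ ."""
    gram = F.T @ F
    return np.maximum(gram - np.diag(np.diag(gram)), 0.0)

def hybrid_mirror_descent(X, k, lam, iters=100, eps=1e-8, nonneg=False, seed=0):
    d, n = X.shape
    R = np.linalg.norm(X, axis=0).max()           # R = max_j ||x_j||_2
    L_bar = n + 12.0 * lam * k * R**2             # uniform bound on L_F
    rng = np.random.default_rng(seed)
    F = project_columns(rng.standard_normal((d, k)), R, nonneg)
    G = np.full((k, n), 1.0 / k)                  # relative interior of the simplex
    history = []
    for _ in range(iters):
        S = positive_offdiag_gram(F)
        grad_F = (F @ G - X) @ G.T + 4.0 * lam * F @ S
        F = project_columns(F - grad_F / L_bar, R, nonneg)
        eta = 1.0 / (np.linalg.norm(F, 2) ** 2 + eps)
        Q = F.T @ (F @ G - X)                     # column j is q_j
        # Subtracting the column minimum leaves the normalized update unchanged
        # and keeps every exponent non-positive.
        W = G * np.exp(-eta * (Q - Q.min(axis=0, keepdims=True)))
        G = W / W.sum(axis=0, keepdims=True)
        penalty = np.sum(positive_offdiag_gram(F) ** 2)
        history.append(0.5 * np.linalg.norm(X - F @ G) ** 2 + lam * penalty)
    return F, G, history

# Toy nonnegative data (hypothetical): 6 features, 40 samples, rank 3.
X = np.abs(np.random.default_rng(1).standard_normal((6, 40)))
F, G, history = hybrid_mirror_descent(X, k=3, lam=0.1, iters=50, nonneg=True)
```

Recording the objective in `history` is optional; it is included here because it gives a cheap way to watch the per-iteration decrease.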

The proposed algorithm alternates between a projected-gradient step for $\fm$ and an entropy mirror-descent step for $\gm$. The $\fm$-update is Euclidean because $\mathcal{C}_F$ is a simple column-wise norm constraint. The $\gm$-update is entropic because each column $\vg_j$ lies on the probability simplex.

For the entropy mirror step, we use the negative entropy function

\[\omega(\vg) =\sum_{r=1}^{k} g_r\log g_r ,\]

defined on the simplex. Its associated Bregman divergence is the Kullback–Leibler divergence

\[D_{\omega}(\vu,\vv)= \sum_{r=1}^{k} u_r\log\frac{u_r}{v_r}, \qquad \vu,\vv\in\Delta_k.\]

For the $\fm$-update, we use the Euclidean mirror map

\[\omega(\fm)= \frac{1}{2}\|\fm\|_F^2,\]

whose associated Bregman divergence is

\[D_{\omega}(\fm,\widetilde{\fm})= \frac{1}{2}\|\fm-\widetilde{\fm}\|_F^2.\]

At iteration $t$, given $(\fm^t,\gm^t)$, we first update $\fm$ by

\[\fm^{t+1}= \Pi_{\mathcal{C}_F} \left( \fm^t-\alpha_t\nabla_{\fm}\Phi(\fm^t,\gm^t) \right),\]

where $\Pi_{\mathcal{C}_F}$ is the Euclidean projection onto $\mathcal{C}_F$ which is separable across columns.

For the real-valued scale-constrained case,

\[\mathcal{C}_F= \left\{ \fm: \|\vf_r\|_2\leq R,\; r=1,\ldots,k \right\},\]

the projection is

\[\Pi_{\mathcal{C}_F}(\va_r)= \frac{\va_r}{\max\{1,\|\va_r\|_2/R\}}, \qquad r=1,\ldots,k .\]

For the nonnegative case,

\[\mathcal{C}_F= \left\{ \fm: \fm\geq 0,\; \|\vf_r\|_2\leq R,\; r=1,\ldots,k \right\},\]

we first threshold negative entries and then project onto the $\ell_2$ ball:

\[\Pi_{\mathcal{C}_F}(\va_r)= \frac{[\va_r]_+}{\max\{1,\|[\va_r]_+\|_2/R\}} .\]

For each column $\vg_j$, we first compute

\[\vq_j^{t+1}= \nabla_{\vg_j} \frac{1}{2}\|\vx_j-\fm^{t+1}\vg_j^t\|_2^2= (\fm^{t+1})^\top(\fm^{t+1}\vg_j^t-\vx_j),\]

and update $\vg_j$ by entropy mirror descent:

\[\vg_j^{t+1}= \arg\min_{\vg\in\Delta_k} \left\{ \langle \vq_j^{t+1},\vg-\vg_j^t\rangle + \frac{1}{\eta_t} D_{\omega}(\vg,\vg_j^t) \right\}.\]

This subproblem has the closed-form multiplicative-normalization update [1]:

\[g_{rj}^{t+1}= \frac{ g_{rj}^{t}\exp(-\eta_t q_{rj}^{t+1}) }{ \sum_{\ell=1}^{k} g_{\ell j}^{t}\exp(-\eta_t q_{\ell j}^{t+1}) }, \qquad r=1,\ldots,k .\]

Thus, simplex feasibility is preserved automatically.
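For completeness, the closed form can be recovered from the Lagrangian of the mirror-descent subproblem. The multiplier $\mu$ is notation introduced only for this sketch, terms that do not depend on $\vg$ are dropped, and the nonnegativity constraints are omitted because the resulting solution is automatically positive:

\[\begin{aligned} \mathcal{L}(\vg,\mu) &= \langle \vq_j^{t+1},\vg\rangle + \frac{1}{\eta_t}\sum_{r=1}^{k} g_r\log\frac{g_r}{g_{rj}^{t}} + \mu\Big(\sum_{r=1}^{k} g_r-1\Big),\\ 0=\frac{\partial \mathcal{L}}{\partial g_r} &= q_{rj}^{t+1} + \frac{1}{\eta_t}\Big(\log\frac{g_r}{g_{rj}^{t}}+1\Big)+\mu \;\Longrightarrow\; g_r = g_{rj}^{t}\exp(-\eta_t q_{rj}^{t+1})\,\exp(-1-\eta_t\mu). \end{aligned}\]

The factor $\exp(-1-\eta_t\mu)$ is the same for every $r$, so imposing $\sum_r g_r=1$ turns it into the normalizing denominator of the multiplicative update.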

Convergence Analysis

We now introduce Pinsker’s inequality [2], which will be used in the convergence analysis. For any $\vu,\vv\in\Delta_k$,

\[D_{\omega}(\vu,\vv)= \sum_{r=1}^{k}u_r\log\frac{u_r}{v_r} \geq \frac{1}{2}\|\vu-\vv\|_1^2 \geq \frac{1}{2}\|\vu-\vv\|_2^2 .\]
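Both inequalities in this chain can be spot-checked on random points of the simplex. The sketch below is only a numerical illustration; the dimension, the sample count, and the use of Dirichlet samples are arbitrary choices made here.

```python
import numpy as np

def kl_divergence(u, v):
    """KL divergence between two strictly positive points of the simplex."""
    return float(np.sum(u * np.log(u / v)))

rng = np.random.default_rng(0)
checks = []
for _ in range(1000):
    u, v = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    kl = kl_divergence(u, v)
    l1_term = 0.5 * np.linalg.norm(u - v, 1) ** 2
    l2_term = 0.5 * np.linalg.norm(u - v, 2) ** 2
    # Small slack guards against floating-point rounding only.
    checks.append(kl >= l1_term - 1e-12 and l1_term >= l2_term - 1e-12)
```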

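Lemma 1. Fix $\fm$, let $\phi_{\fm}(\vg)=\frac{1}{2}\|\vx-\fm\vg\|_2^2$ denote the fitting term of a single sample $\vx$, whose gradient is Lipschitz with constant $L_G(\fm)=\|\fm\|_2^2$, and let $\vg^+$ be generated from $\vg \in \Delta_k$ by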

\[\vg^+= \arg\min_{\vu \in \Delta_k} \left\{ \langle \nabla \phi_{\fm}(\vg), \vu-\vg\rangle + \frac{1}{\eta}D_{\omega}(\vu,\vg) \right\}.\]

If $0 < \eta < 1/L_G(\fm)$, then

\[\phi_{\fm}(\vg^+) \leq \phi_{\fm}(\vg)- \left( \frac{1}{\eta}- L_G(\fm) \right) D_{\omega}(\vg^+,\vg).\]

Consequently,

\[\phi_{\fm}(\vg^+) \leq \phi_{\fm}(\vg)- \frac{1}{2} \left( \frac{1}{\eta}- L_G(\fm) \right) \|\vg^+-\vg\|_2^2.\]

Proof.
By the optimality of the mirror-descent subproblem, comparing the objective value at $\vg^+$ and $\vg$ gives

\[\langle \nabla\phi_{\fm}(\vg), \vg^+-\vg\rangle + \frac{1}{\eta}D_{\omega}(\vg^+,\vg) \leq 0 .\]

Hence,

\[\langle \nabla\phi_{\fm}(\vg), \vg^+-\vg\rangle \leq- \frac{1}{\eta}D_{\omega}(\vg^+,\vg).\]

Using the smoothness inequality with $\vu=\vg^+$ and $\vv=\vg$, we obtain

\[\begin{aligned} \phi_{\fm}(\vg^+) &\leq \phi_{\fm}(\vg)+ \langle \nabla\phi_{\fm}(\vg), \vg^+-\vg\rangle+ \frac{L_G(\fm)}{2}\|\vg^+-\vg\|_2^2 \\ &\leq \phi_{\fm}(\vg)- \frac{1}{\eta}D_{\omega}(\vg^+,\vg) + \frac{L_G(\fm)}{2}\|\vg^+-\vg\|_2^2 . \end{aligned}\]

By Pinsker’s inequality,

\[\|\vg^+-\vg\|_2^2 \leq 2D_{\omega}(\vg^+,\vg).\]

Therefore,

\[\phi_{\fm}(\vg^+) \leq \phi_{\fm}(\vg)- \left( \frac{1}{\eta}- L_G(\fm) \right) D_{\omega}(\vg^+,\vg),\]

which proves the first part. Applying Pinsker’s inequality once more gives the second part.

Corollary 1: Fix $\fm$ and update all columns of $\gm$ by the mirror step of Lemma 1. If $0<\eta<1/\|\fm\|_2^2$, then

\[\frac{1}{2}\|\xm-\fm\gm^+\|_F^2 \leq \frac{1}{2}\|\xm-\fm\gm\|_F^2- \left( \frac{1}{\eta}- \|\fm\|_2^2 \right) \sum_{j=1}^{n} D_{\omega}(\vg_j^+,\vg_j).\]

Consequently,

\[\frac{1}{2}\|\xm-\fm\gm^+\|_F^2 \leq \frac{1}{2}\|\xm-\fm\gm\|_F^2- \frac{1}{2} \left( \frac{1}{\eta}- \|\fm\|_2^2 \right) \|\gm^+-\gm\|_F^2 .\]

Lemma 2: Let

\[\fm^+= \Pi_{\mathcal{C}_F} \left( \fm-\alpha\nabla_{\fm}\Phi(\fm,\gm) \right).\]

If $0<\alpha<2/L_F(\gm)$, then [3]

\[\Phi(\fm^+,\gm) \leq \Phi(\fm,\gm)- \left( \frac{1}{\alpha}- \frac{L_F(\gm)}{2} \right) \|\fm^+-\fm\|_F^2 .\]

Theorem 1: Suppose that the stepsizes satisfy

\[0<\alpha_t<\frac{2}{L_F(\gm^t)} \quad\text{and}\quad 0<\eta_t<\frac{1}{\|\fm^{t+1}\|_2^2}.\]

Then the sequence generated by Algorithm 1 satisfies

\[\Phi(\fm^{t+1},\gm^{t+1}) \leq \Phi(\fm^t,\gm^t).\]

More precisely,

\[\Phi(\fm^{t+1},\gm^{t+1}) \leq \Phi(\fm^t,\gm^t)- c_F^t\|\fm^{t+1}-\fm^t\|_F^2- c_G^t\|\gm^{t+1}-\gm^t\|_F^2 ,\]

where

\[c_F^t= \frac{1}{\alpha_t}- \frac{L_F(\gm^t)}{2} >0, \qquad c_G^t = \frac{1}{2} \left( \frac{1}{\eta_t} - \|\fm^{t+1}\|_2^2 \right) >0 .\]

Lemma 3: The Lipschitz constant $L_F(\gm)$ used in the $\fm$-update admits the uniform upper bound

\[L_F(\gm) \le \overline{L}_F := n+12\lambda kR^2 .\]

Theorem 2: Assume that the stepsizes are chosen such that there exist constants $c_F>0$ and $c_G>0$ satisfying

\[c_F^t\geq c_F>0, \qquad c_G^t\geq c_G>0\]

for all $t$. Then the sequence $\{(\fm^t,\gm^t)\}$ generated by Algorithm 1 is bounded, the objective values $\{\Phi(\fm^t,\gm^t)\}$ converge, and

\[\|\fm^{t+1}-\fm^t\|_F\rightarrow 0, \qquad \|\gm^{t+1}-\gm^t\|_F\rightarrow 0 .\]

Moreover, every accumulation point $(\fm^\star,\gm^\star)$ is a stationary point in the sense that

\[\left\langle \nabla_{\fm}\Phi(\fm^\star,\gm^\star), \fm-\fm^\star \right\rangle \geq 0, \qquad \forall \fm\in\mathcal{C}_F,\]

and

\[\left\langle \nabla_{\vg_j} \frac{1}{2}\|\vx_j-\fm^\star\vg_j^\star\|_2^2, \vg-\vg_j^\star \right\rangle \geq 0, \ \forall \vg\in\Delta_k,\ j=1,\ldots,n .\]

Experimental Results

Figure 1: Effect of λ on basis diversity on the Zoo dataset. Increasing λ reduces positive basis correlations.

Figure 2: Heatmap of the learned coefficient matrix G on the Wine dataset.

Figure 3: Visualization of learned basis vectors on the Zoo dataset.

Figure 4: Visualization of learned basis vectors on the Seeds dataset.

Figure 5: Comparison between entropy mirror descent and Euclidean projected gradient for updating G.

Figure 6: Clustering results on various datasets.

References

[1] Nisheeth Vishnoi. “Algorithms for convex optimization.” Cambridge University Press, 2021.

[2] Clément L. Canonne. “A short note on an inequality between KL and TV.”

[3] Amir Beck. “First-order methods in optimization.” SIAM, 2017.

FISTA

Sun, 07 May 2023 08:31:47 -0400

The “fast iterative shrinkage-thresholding algorithm” (FISTA) is an accelerated proximal gradient method for convex optimization problems of the form:

\[\min_x f(x) = g(x) + h(x)\]

where $g$ is a smooth convex function with a Lipschitz continuous gradient, and $h$ is a convex function that is possibly non-smooth but has a simple proximal operator.

Proximal Gradient & ISTA

Let us start with the classical proximal gradient method (also known as ISTA), based on which the FISTA algorithm is built.

For a given convex optimization problem $\min_x f(x) = g(x) + h(x)$, where $g$ is differentiable, $\nabla g$ is $L$-Lipschitz, and $h$ is not necessarily differentiable, the proximal gradient method replaces $g$ by its quadratic approximation $\bar{g}_t$ around the current iterate $x$ and keeps $h$ intact. Its update takes the following form:

\[\begin{aligned} x^{+} & =\underset{z}{\operatorname{argmin}}~\bar{g}_t(z)+h(z) \\ & =\underset{z}{\operatorname{argmin}}~g(x)+\nabla g(x)^T(z-x)+\frac{1}{2 t}\|z-x\|_2^2+h(z) \\ & =\underset{z}{\operatorname{argmin}}~\frac{1}{2 t}\|z-(x-t \nabla g(x))\|_2^2+h(z)\end{aligned}\]

where the step size $t\leq\frac{1}{L}$. This can be written as a proximal mapping:

\[\operatorname{prox}_{h, t}(x)=\underset{z}{\operatorname{argmin}}~ \frac{1}{2 t}\|x-z\|_2^2+h(z)\]

Then the proximal gradient update can be written as $x^{(k)}=\operatorname{prox}_{h, t_k}\left(x^{(k-1)}-t_k \nabla g\left(x^{(k-1)}\right)\right)$

or similar to gradient descent:

$x^{(k)}=x^{(k-1)}-t_k \cdot G_{t_k}\left(x^{(k-1)}\right)$ where $G_t(x)=\frac{x-\operatorname{prox}_{h, t}(x-t \nabla g(x))}{t}$ is the generalized gradient of $f$.

For many important functions $h$, for example the $\ell_1$-norm of a vector or the $\ell_{2,1}$ or nuclear norm of a matrix, the proximal mapping $\operatorname{prox}_{h, t}$ has a closed form.
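The three closed-form proximal mappings mentioned here can be sketched in a few lines of NumPy. The function names are illustrative, and the $\ell_{2,1}$ norm is assumed to be the sum of row-wise $\ell_2$ norms; other conventions would sum over columns instead.

```python
import numpy as np

def prox_l1(x, t):
    """prox of t * ||x||_1: element-wise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_l21(X, t):
    """prox of t * (sum of row l2 norms): shrink each row toward zero."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # The floor on the norms only avoids division by zero for all-zero rows.
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return scale * X

def prox_nuclear(X, t):
    """prox of t * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

v = prox_l1(np.array([3.0, -0.5, 1.0]), 1.0)
M = prox_l21(np.array([[3.0, 4.0], [0.1, 0.0]]), 1.0)
N = prox_nuclear(np.diag([3.0, 0.5]), 1.0)
```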

FISTA

For a convex optimization problem:

\[\min_x f(x) = g(x) + h(x)\]

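where $\nabla g$ is $L$-Lipschitz, write $p_L(y)=\operatorname{prox}_{h, 1/L}\left(y-\frac{1}{L}\nabla g(y)\right)$ for one proximal gradient step from $y$ with step size $1/L$. FISTA can then be described as follows (here $t_k$ is the momentum sequence, not a step size):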

Step 0:

Take $y_1=x_0 \in \mathbb{R}^n, t_1=1$.

Step $k(k \geq 1)$:

Compute

$x_k=p_L\left(y_k\right)$

$t_{k+1} =\frac{1+\sqrt{1+4 t_k^2}}{2}$

$y_{k+1} =x_k+\left(\frac{t_k-1}{t_{k+1}}\right)\left(x_k-x_{k-1}\right)$
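As an illustration, here is a sketch of the constant-step scheme for the lasso, $g(x)=\frac{1}{2}\|Ax-b\|_2^2$ and $h(x)=\lambda\|x\|_1$, next to plain ISTA. The function names, the synthetic data, and the choice of $\lambda$ are assumptions made for this example; $L$ is taken as the squared spectral norm of $A$.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def lasso_objective(A, b, lam, x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def ista(A, b, lam, iters=200):
    L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of grad g
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x

def fista(A, b, lam, iters=200):
    L = np.linalg.norm(A, 2) ** 2
    x_prev = np.zeros(A.shape[1])
    y, t = x_prev.copy(), 1.0                     # y_1 = x_0, t_1 = 1
    for _ in range(iters):
        x = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)   # x_k = p_L(y_k)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev

# Synthetic lasso instance (hypothetical): sparse ground truth plus small noise.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(50)
lam = 0.1 * np.max(np.abs(A.T @ b))

f_zero = lasso_objective(A, b, lam, np.zeros(100))
f_ista = lasso_objective(A, b, lam, ista(A, b, lam))
f_fista = lasso_objective(A, b, lam, fista(A, b, lam))
```

The only difference between the two loops is the extrapolated point `y`: ISTA takes the proximal step from the last iterate, FISTA from a momentum-shifted one.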

The step size can also be determined using a backtracking rule:

Step 0:

Take $y_1=x_0 \in \mathbb{R}^n, t_1=1$, $\eta >1$, $L_0>0$.

Step $k(k \geq 1)$:

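Find the smallest nonnegative integer $i_k$ such that with $\bar{L}=\eta^{i_k} L_{k-1}$,

$f\left(p_{\bar{L}}\left(y_k\right)\right) \leq Q_{\bar{L}}\left(p_{\bar{L}}\left(y_k\right), y_k\right)$, where $Q_L(x, y)=g(y)+\nabla g(y)^T(x-y)+\frac{L}{2}\|x-y\|_2^2+h(x)$ is the quadratic model minimized by $p_L(y)$.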

Set $L_k=\eta^{i_k} L_{k-1}$ and compute

$x_k =p_{L_k}\left(y_k\right)$

$t_{k+1} =\frac{1+\sqrt{1+4 t_k^2}}{2}$

$y_{k+1} =x_k+\left(\frac{t_k-1}{t_{k+1}}\right)\left(x_k-x_{k-1}\right)$
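The backtracking variant can be sketched generically in terms of callables for $g$, $\nabla g$, and the proximal mapping of $h$. The interface below (argument names, the `prox_h(v, step)` signature, the small floating-point slack in the acceptance test) is an assumption made for this illustration, as is the lasso instance used to exercise it. The acceptance test compares $g$ at the candidate point with its quadratic upper model around $y_k$; the $h$ term appears on both sides of the inequality and cancels.

```python
import numpy as np

def fista_backtracking(g, grad_g, prox_h, x0, L0=1.0, eta=2.0, iters=200):
    """FISTA where L is multiplied by eta until the quadratic model is an upper bound."""
    x_prev, y, t, L = x0.copy(), x0.copy(), 1.0, L0
    for _ in range(iters):
        g_y, grad_y = g(y), grad_g(y)
        while True:
            x = prox_h(y - grad_y / L, 1.0 / L)   # candidate p_L(y_k)
            d = x - y
            model = g_y + grad_y @ d + 0.5 * L * (d @ d)
            # Slack term guards against floating-point rounding only.
            if g(x) <= model + 1e-12 * max(1.0, abs(g_y)):
                break
            L *= eta
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev, L

# Hypothetical lasso instance used only to exercise the routine.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
b = A[:, :5] @ np.ones(5)
lam = 0.1 * np.max(np.abs(A.T @ b))
g = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_g = lambda x: A.T @ (A @ x - b)
prox_h = lambda v, step: np.sign(v) * np.maximum(np.abs(v) - lam * step, 0.0)
objective = lambda x: g(x) + lam * np.sum(np.abs(x))

x_hat, L_final = fista_backtracking(g, grad_g, prox_h, np.zeros(100))
```

Because $L$ is only ever increased, the returned `L_final` never exceeds $\eta$ times the true Lipschitz constant when `L0` starts below it, which is the source of the extra factor in the backtracking rate.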

FISTA with fixed step size $t \leq 1 / L$ satisfies $f\left(x^{(k)}\right)-f^{\star} \leq \frac{2\left\|x^{(0)}-x^{\star}\right\|_2^2}{t(k+1)^2}$, and the same result holds for backtracking with $t$ replaced by $1/(\eta L)$. This means that FISTA achieves the optimal rate of $O(\frac{1}{k^2})$ for first-order methods, or equivalently needs $O(\frac{1}{\sqrt \epsilon})$ iterations to reach accuracy $\epsilon$.

The figures below show the comparison between ISTA and FISTA on lasso regression and lasso logistic regression. In both cases, $n=100, p=500$.

Figure 1: lasso regression

Figure 2: lasso logistic regression

References

Beck, Amir, and Marc Teboulle. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM Journal on Imaging Sciences, vol. 2, no. 1, Mar. 2009, pp. 183–202, https://doi.org/10.1137/080716542.
Ryan Tibshirani, “Proximal gradient descent”, Convex Optimization: Fall 2018, https://www.stat.cmu.edu/~ryantibs/convexopt-F18/