Matrix factorization is a fundamental tool in data mining and machine learning, with applications in clustering, topic modeling, recommender systems, and interpretable representation learning. Given a data matrix $\mathbf{X}\in\mathbb{R}^{d\times n}$, a typical factorization model seeks a low-rank approximation
\[\newcommand{\xm}{\mathbf{X}} \newcommand{\fm}{\mathbf{F}} \newcommand{\gm}{\mathbf{G}} \newcommand{\sm}{\mathbf{S}} \newcommand{\um}{\mathbf{U}} \newcommand{\am}{\mathbf{A}} \newcommand{\ym}{\mathbf{Y}} \newcommand{\zm}{\mathbf{Z}} \newcommand{\vx}{\mathbf{x}} \newcommand{\vf}{\mathbf{f}} \newcommand{\vg}{\mathbf{g}} \newcommand{\vq}{\mathbf{q}} \newcommand{\vu}{\mathbf{u}} \newcommand{\vv}{\mathbf{v}} \newcommand{\va}{\mathbf{a}} \newcommand{\ve}{\mathbf{e}} \newcommand{\rank}{\operatorname{rank}} \newcommand{\Tr}{\operatorname{Tr}} \newcommand{\relint}{\operatorname{relint}} \newcommand{\prox}{\operatorname{prox}} \newcommand{\SVT}{\operatorname{SVT}} \newcommand{\Phiobj}{\Phi} \newcommand{\Domega}{D_{\omega}} \newcommand{\norm}[1]{\left\|#1\right\|} \newcommand{\inner}[2]{\left\langle #1,#2 \right\rangle} \newcommand{\pos}[1]{\left[#1\right]_+}\] \[\xm\approx \fm\gm,\]where $\fm\in\mathbb{R}^{d\times k}$ contains basis components and $\gm\in\mathbb{R}^{k\times n}$ contains low-dimensional representations of the samples. Classical nonnegative matrix factorization (NMF) improves interpretability by imposing nonnegativity on both factors, while many variants further introduce sparsity, orthogonality, or graph regularization.
Existing factorization methods still face two limitations. First, the learned basis components can be highly redundant. In NMF and related models, different columns of $\fm$ may become strongly correlated, making the learned representation less interpretable and less discriminative for downstream tasks such as clustering. Second, many classical algorithms, especially multiplicative-update methods, are tightly coupled with nonnegativity assumptions and are therefore less flexible for real-valued data or more general constrained factorization settings. These observations motivate a matrix factorization framework that learns diverse basis components, produces interpretable sample representations, and remains applicable beyond standard NMF.
Simplex-Constrained Incoherent Matrix Factorization
In this blog, we will discuss Simplex-Constrained Incoherent Matrix Factorization. The key idea is to represent each sample as a convex combination of basis vectors by constraining every column of $\gm$ to lie on the probability simplex:
\[\vg_j\in\Delta_k,\qquad \Delta_k=\{\vg\in\mathbb{R}^k:\vg\ge 0,\ \mathbf{1}^{\top}\vg=1\}.\]Thus, $ \vx_j\approx \fm\vg_j=\sum_{r=1}^k g_{rj}\vf_r, $ where $g_{rj}$ measures the contribution of the $r$-th basis component to the $j$-th sample. This simplex constraint gives $\gm$ a natural probabilistic interpretation as a soft cluster-assignment matrix and makes the representation useful for clustering and interpretable data analysis. In contrast, K-means assigns each sample to a hard one-hot indicator vector. Thus, our formulation can be viewed as a simplex-based soft relaxation of the hard assignment used in K-means.
In addition, to encourage diverse basis components, we introduce an incoherence regularizer based on the positive off-diagonal entries of the Gram matrix:
\[\left\| \left[ \fm^\top \fm-\operatorname{diag}(\fm^\top \fm) \right]_+ \right\|_F^2= \sum_{r\neq s} [\langle \vf_r,\vf_s\rangle]_+^2.\]Therefore, the regularizer discourages positive correlation among different basis vectors while not penalizing negative inner products. Compared with a linear inner-product penalty, the squared positive-part formulation places stronger emphasis on highly positively correlated basis pairs and provides a simple smooth penalty away from the hinge point.
We further impose a data-adaptive column norm constraint on $\fm$: $ |\vf_i|_2\le R, \ i=1,\dots,k, $ where $R$ can be chosen according to the scale of the data, such as $R=\max_j|\vx_j|_2$. This constraint controls the scale of the basis matrix without requiring data normalization. When nonnegative bases are desired, we additionally impose $\fm\ge 0$; for real-valued data, we simply remove this nonnegativity constraint. Therefore, the same model naturally covers both NMF-style and real-valued matrix factorization settings.
The proposed optimization problem is
\[\begin{aligned} \min_{\fm,\gm} \Phi(\fm,\gm) := \quad & \frac{1}{2}\|\xm-\fm\gm\|_F^2 + \lambda \left\| \left[ \fm^\top \fm-\operatorname{diag}(\fm^\top \fm) \right]_+ \right\|_F^2 \end{aligned}\]subject to
\[\fm\in\mathcal{C}_F,\qquad \vg_j\in\Delta_k,\quad j=1,\dots,n,\]where $\mathcal{C}_F$ denotes either the nonnegative scale-constrained set or the real-valued scale-constrained set. To optimize this model, we develop a hybrid projected-gradient and entropy mirror-descent algorithm as belows.
Algorithm: Hybrid Projected–Entropic Mirror Descent
Input: Data matrix $\xm$, rank $k$, parameters $\lambda, R, \epsilon$
Output: Factor matrices $\fm, \gm$
-
Initialize $\fm^0 \in \mathcal{C}_F$ and $\vg_j^0 \in \operatorname{relint}(\Delta_k)$ for $j = 1,\ldots,n$.
-
For $t = 0,1,2,\ldots$:
-
Compute
\[\sm^t = \left[ (\fm^t)^\top \fm^t - \operatorname{diag}\big((\fm^t)^\top \fm^t\big) \right]_+ .\] -
Compute
\[\nabla_{\fm}\Phi(\fm^t,\gm^t) = (\fm^t\gm^t-\xm)(\gm^t)^\top + 4\lambda \fm^t\sm^t .\] -
Set $\alpha_t = 1/\overline{L}_F$ and update
\[\fm^{t+1} = \Pi_{\mathcal{C}_F} \left( \fm^t-\alpha_t\nabla_{\fm}\Phi(\fm^t,\gm^t) \right).\] -
Set
\[\eta_t = \frac{1}{\|\fm^{t+1}\|_2^2+\epsilon}.\] -
For $j = 1,\ldots,n$:
-
Compute
\[\vq_j^{t+1} = (\fm^{t+1})^\top (\fm^{t+1}\vg_j^t-\vx_j).\] -
For $r = 1,\ldots,k$, update
\[g_{rj}^{t+1} = \frac{ g_{rj}^{t}\exp(-\eta_t q_{rj}^{t+1}) }{ \sum_{\ell=1}^{k} g_{\ell j}^{t}\exp(-\eta_t q_{\ell j}^{t+1}) }.\]
-
-
If a stopping criterion is satisfied, break.
-
-
Return $\fm^{t+1}, \gm^{t+1}$.
The proposed algorithm alternates between a projected-gradient step for $\fm$ and an entropy mirror-descent step for $\gm$. The $\fm$-update is Euclidean because $\mathcal{C}_F$ is a simple column-wise norm constraint. The $\gm$-update is entropic because each column $\vg_j$ lies on the probability simplex.
For the entropy mirror step, we use the negative entropy function
\[\omega(\vg) =\sum_{r=1}^{k} g_r\log g_r ,\]defined on the simplex. Its associated Bregman divergence is the Kullback–Leibler divergence
\[D_{\omega}(\vu,\vv)= \sum_{r=1}^{k} u_r\log\frac{u_r}{v_r}, \qquad \vu,\vv\in\Delta_k.\]For the $\fm$-update, we use the Euclidean mirror map
\[\omega(\fm)= \frac{1}{2}\|\fm\|_F^2,\]whose associated Bregman divergence is
\[D_{\omega}(\fm,\widetilde{\fm})= \frac{1}{2}\|\fm-\widetilde{\fm}\|_F^2.\]At iteration $t$, given $(\fm^t,\gm^t)$, we first update $\fm$ by
\[\fm^{t+1}= \Pi_{\mathcal{C}_F} \left( \fm^t-\alpha_t\nabla_{\fm}\Phi(\fm^t,\gm^t) \right),\]where $\Pi_{\mathcal{C}_F}$ is the Euclidean projection onto $\mathcal{C}_F$ which is separable across columns.
For the real-valued scale-constrained case,
\[\mathcal{C}_F= \left\{ \fm: \|\vf_r\|_2\leq R,\; r=1,\ldots,k \right\},\]the projection is
\[\Pi_{\mathcal{C}_F}(\va_r)= \frac{\va_r}{\max\{1,\|\va_r\|_2/R\}}, \qquad r=1,\ldots,k .\]For the nonnegative case,
\[\mathcal{C}_F= \left\{ \fm: \fm\geq 0,\; \|\vf_r\|_2\leq R,\; r=1,\ldots,k \right\},\]we first threshold negative entries and then project onto the $\ell_2$ ball:
\[\Pi_{\mathcal{C}_F}(\va_r)= \frac{[\va_r]_+}{\max\{1,\|[\va_r]_+\|_2/R\}} .\]For each column $\vg_j$, we first compute
\[\vq_j^{t+1}= \nabla_{\vg_j} \frac{1}{2}\|\vx_j-\fm^{t+1}\vg_j^t\|_2^2= (\fm^{t+1})^\top(\fm^{t+1}\vg_j^t-\vx_j),\]and update $\vg_j$ by entropy mirror descent:
\[\vg_j^{t+1}= \arg\min_{\vg\in\Delta_k} \left\{ \langle \vq_j^{t+1},\vg-\vg_j^t\rangle + \frac{1}{\eta_t} D_{\omega}(\vg,\vg_j^t) \right\}.\]This subproblem has the closed-form multiplicative-normalization update [1]:
\[g_{rj}^{t+1}= \frac{ g_{rj}^{t}\exp(-\eta_t q_{rj}^{t+1}) }{ \sum_{\ell=1}^{k} g_{\ell j}^{t}\exp(-\eta_t q_{\ell j}^{t+1}) }, \qquad r=1,\ldots,k .\]Thus, simplex feasibility is preserved automatically.
Convergence Analysis
We now introduce Pinsker’s inequality [2], which will be used in the convergence analysis. For any $\vu,\vv\in\Delta_k$,
\[D_{\omega}(\vu,\vv)= \sum_{r=1}^{k}u_r\log\frac{u_r}{v_r} \geq \frac{1}{2}\|\vu-\vv\|_1^2 \geq \frac{1}{2}\|\vu-\vv\|_2^2 .\]Lemma 1. Fix $\fm$ and let $\vg^+$ be generated from $\vg \in \Delta_k$ by
\[\vg^+= \arg\min_{\vu \in \Delta_k} \left\{ \langle \nabla \phi_{\fm}(\vg), \vu-\vg\rangle + \frac{1}{\eta}D_{\omega}(\vu,\vg) \right\}.\]If $0 < \eta < 1/L_G(\fm)$, then
\[\phi_{\fm}(\vg^+) \leq \phi_{\fm}(\vg)- \left( \frac{1}{\eta}- L_G(\fm) \right) D_{\omega}(\vg^+,\vg).\]Consequently,
\[\phi_{\fm}(\vg^+) \leq \phi_{\fm}(\vg)- \frac{1}{2} \left( \frac{1}{\eta}- L_G(\fm) \right) \|\vg^+-\vg\|_2^2.\]Proof.
By the optimality of the mirror-descent subproblem, comparing the objective value at $\vg^+$ and $\vg$ gives
Hence,
\[\langle \nabla\phi_{\fm}(\vg), \vg^+-\vg\rangle \leq- \frac{1}{\eta}D_{\omega}(\vg^+,\vg).\]Using the smoothness inequality with $\vu=\vg^+$ and $\vv=\vg$, we obtain
\[\begin{aligned} \phi_{\fm}(\vg^+) &\leq \phi_{\fm}(\vg)+ \langle \nabla\phi_{\fm}(\vg), \vg^+-\vg\rangle+ \frac{L_G(\fm)}{2}\|\vg^+-\vg\|_2^2 \\ &\leq \phi_{\fm}(\vg)- \frac{1}{\eta}D_{\omega}(\vg^+,\vg) + \frac{L_G(\fm)}{2}\|\vg^+-\vg\|_2^2 . \end{aligned}\]By Pinsker’s inequality,
\[\|\vg^+-\vg\|_2^2 \leq 2D_{\omega}(\vg^+,\vg).\]Therefore,
\[\phi_{\fm}(\vg^+) \leq \phi_{\fm}(\vg)- \left( \frac{1}{\eta}- L_G(\fm) \right) D_{\omega}(\vg^+,\vg),\]which proves the first part. Applying Pinsker’s inequality once more gives the second part.
Corollary 1: Fix $\fm$ and update all columns of $\gm$, if $0<\eta<1/|\fm|_2^2$, then
\[\frac{1}{2}\|\xm-\fm\gm^+\|_F^2 \leq \frac{1}{2}\|\xm-\fm\gm\|_F^2- \left( \frac{1}{\eta}- \|\fm\|_2^2 \right) \sum_{j=1}^{n} D_{\omega}(\vg_j^+,\vg_j).\]Consequently,
\[\frac{1}{2}\|\xm-\fm\gm^+\|_F^2 \leq \frac{1}{2}\|\xm-\fm\gm\|_F^2- \frac{1}{2} \left( \frac{1}{\eta}- \|\fm\|_2^2 \right) \|\gm^+-\gm\|_F^2 .\]Lemma 2: Let
\[\fm^+= \Pi_{\mathcal{C}_F} \left( \fm-\alpha\nabla_{\fm}\Phi(\fm,\gm) \right).\]If $0<\alpha<2/L_F(\gm)$, then [3]
\[\Phi(\fm^+,\gm) \leq \Phi(\fm,\gm)- \left( \frac{1}{\alpha}- \frac{L_F(\gm)}{2} \right) \|\fm^+-\fm\|_F^2 .\]Theorem 1: Suppose that the stepsizes satisfy
\[0<\alpha_t<\frac{2}{L_F(\gm^t)} \quad\text{and}\quad 0<\eta_t<\frac{1}{\|\fm^{t+1}\|_2^2}.\]Then the sequence generated by Algorithm 1 satisfies
\[\Phi(\fm^{t+1},\gm^{t+1}) \leq \Phi(\fm^t,\gm^t).\]More precisely,
\[\Phi(\fm^{t+1},\gm^{t+1}) \leq \Phi(\fm^t,\gm^t)- c_F^t\|\fm^{t+1}-\fm^t\|_F^2- c_G^t\|\gm^{t+1}-\gm^t\|_F^2 ,\]where
\[c_F^t= \frac{1}{\alpha_t}- \frac{L_F(\gm^t)}{2} >0, \qquad c_G^t = \frac{1}{2} \left( \frac{1}{\eta_t} - \|\fm^{t+1}\|_2^2 \right) >0 .\]Lemma 3: A uniform upper bound for $\fm$ update is
\[L_F(\gm) \le \overline{L}_F := n+12\lambda kR^2 .\]Theorem 2: Assume that the stepsizes are chosen such that there exist constants $c_F>0$ and $c_G>0$ satisfying
\[c_F^t\geq c_F>0, \qquad c_G^t\geq c_G>0\]for all $t$. Then the sequence ${(\fm^t,\gm^t)}$ generated by Algorithm 1 is bounded, the objective values ${\Phi(\fm^t,\gm^t)}$ converge, and
\[\|\fm^{t+1}-\fm^t\|_F\rightarrow 0, \qquad \|\gm^{t+1}-\gm^t\|_F\rightarrow 0 .\]Moreover, every accumulation point $(\fm^\star,\gm^\star)$ is a stationary point in the sense that
\[\left\langle \nabla_{\fm}\Phi(\fm^\star,\gm^\star), \fm-\fm^\star \right\rangle \geq 0, \qquad \forall \fm\in\mathcal{C}_F,\]and
\[\left\langle \nabla_{\vg_j} \frac{1}{2}\|\vx_j-\fm^\star\vg_j^\star\|_2^2, \vg-\vg_j^\star \right\rangle \geq 0, \ \forall \vg\in\Delta_k,\ j=1,\ldots,n .\]Experimental Results
References
[1] Nisheeth Vishnoi. “Algorithms for convex optimization.” Cambridge University Press, 2021.
[2] Clément L. Canonne. “A short note on an inequality between KL and TV”.
[3] Amir Beck. “First-order methods in optimization”.