The diffusion map is an extensively used dimensionality reduction technique. It represents high-dimensional data points using an underlying graph, which non-linearly encodes the geometric similarities between data points. Despite its popularity, few tutorials introduce the diffusion map with the precision that I find satisfying. This blog post delves deep into the derivation.

To formalize, we let $\mathcal{X} \coloneqq \{ \boldsymbol{x}_1, \ldots, \boldsymbol{x}_m\} \subset \mathbb{R}^N$ be a given set of high-dimensional data. In this note, we will construct the diffusion maps

\begin{equation*} \begin{split} \Phi^n: \mathcal{X} & \to \mathbb{R}^{n} \\ \boldsymbol{x} & \mapsto \Phi^n(\boldsymbol{x}), \end{split} \end{equation*}

where $n \leq \min(m - 1, N)$. Specifically, to construct the diffusion maps $\Phi^n$, we take three steps:

  1. View the high-dimensional points in $\mathcal{X}$ as vertices in an underlying graph (Section 1).
  2. Define a Markov chain (MC) on the graph, so that the local connectivity between vertices of the graph has a probability interpretation (Section 2).
  3. Assign each data point in $\mathcal{X}$ a low-dimensional vector, with respect to which the graph-based local connectivity is approximately preserved (Section 3).

1. Graph representation of data points

To start the diffusion maps machinery, we first describe each high-dimensional data point $\boldsymbol{x}_i \in \mathbb{R}^{N}$ by its relationship with the other data points in the given dataset $\mathcal{X} \coloneqq \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_m\}$. To this end, we assume that each data point $\boldsymbol{x}_i$ is emitted from a vertex $v_i$ of an underlying directed weighted graph $G = (V, E, \boldsymbol{W})$. Here, $V \coloneqq \{v_1, \ldots, v_m\}$ is the set of vertices, $E \subset V \times V$ is a set of ordered pairs of vertices representing edges, and $\boldsymbol{W}$ is the adjacency matrix such that $\boldsymbol{W}_{ij} > 0$ if $(i,j) \in E$ and $\boldsymbol{W}_{ij} = 0$ otherwise.

One advantage of having an underlying graph representation of the data is that the similarity between two high-dimensional points $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$ can be encapsulated by $\boldsymbol{W}_{ij}$, that is, the connection strength or weight between the two vertices $v_i$ and $v_j$. But how do we specify these weights? In diffusion maps, we let each connection strength be specified through a kernel function $K$:

\begin{aligned} K: \mathbb{R}^N \times \mathbb{R}^N & \to \mathbb{R}_{+} \cr (\boldsymbol{x}_i, \boldsymbol{x}_j) & \mapsto K(\boldsymbol{x}_i, \boldsymbol{x}_j). \end{aligned}

Note that here we assume that the kernel $K$ takes only positive values. This is a convenient choice, as it guarantees that the Markov chain to be defined later has a unique stationary distribution. One common choice of the kernel is the radial basis kernel $K(\boldsymbol{x}_i, \boldsymbol{x}_j)=\exp \left(-\frac{\|\boldsymbol{x}_i-\boldsymbol{x}_j\|^{2}}{\epsilon}\right)$ with $\epsilon > 0$. In this way, we construct an adjacency matrix with components $\boldsymbol{W}_{ij} \coloneqq K(\boldsymbol{x}_i, \boldsymbol{x}_j)$. The whole adjacency matrix $\boldsymbol{W} \in \mathbb{R}^{m \times m}$ has the form

\boldsymbol{W} = \begin{bmatrix} K(\boldsymbol{x}_1, \boldsymbol{x}_1) & \dots & K(\boldsymbol{x}_1, \boldsymbol{x}_m) \\ \vdots & \ddots & \vdots \\ K(\boldsymbol{x}_m, \boldsymbol{x}_1) & \dots & K(\boldsymbol{x}_m, \boldsymbol{x}_m) \end{bmatrix} \in \mathbb{R}^{m \times m}.

Another important concept for a graph is the degree matrix. Recall that the degree of a vertex is defined as the sum of the weights of its outgoing edges. The degree matrix is simply a diagonal matrix that collects the degree of each vertex:

\boldsymbol{D} \coloneqq \begin{bmatrix} \sum_{k = 1}^m K( \boldsymbol{x}_1, \boldsymbol{x}_k) & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sum_{k = 1}^m K( \boldsymbol{x}_m, \boldsymbol{x}_k) \end{bmatrix} \in \mathbb{R}^{m \times m}.
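
To make the construction concrete, here is a minimal NumPy sketch that builds the adjacency matrix $\boldsymbol{W}$ and the degree matrix $\boldsymbol{D}$ for a small synthetic dataset. The toy data `X`, the bandwidth `eps`, and all variable names are illustrative choices for this post, not part of the diffusion maps definition.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # m = 20 toy points in R^3 (illustrative data)
eps = 1.0                      # kernel bandwidth (illustrative choice)

# radial basis kernel: W_ij = K(x_i, x_j) = exp(-||x_i - x_j||^2 / eps)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-sq_dists / eps)

# degree matrix: D_ii = sum_k K(x_i, x_k)
D = np.diag(W.sum(axis=1))

print(W.shape, D.shape)        # (20, 20) (20, 20)
```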

2. Markov chain on the graph

Now, with the weighted graph $G = (V, E, \boldsymbol{W})$ from the previous section, we construct a stochastic process $(X_t)_{t \in \mathbb{N}}$ taking values in the vertex set $V$. Specifically, we let the stochastic process be a time-homogeneous Markov chain, where the transition probability between two vertices is specified via the normalized edge weights attached to these two vertices:

\begin{aligned} \forall t \in \mathbb{N}: \quad \mathbb{P}\left(X_{t+1}= v_j \mid X_{t}= v_i \right) = \frac{ K ( \boldsymbol{x}_i, \boldsymbol{x}_j)}{ \sum_{k = 1}^m K( \boldsymbol{x}_i, \boldsymbol{x}_k) } \eqqcolon p(v_i, v_j). \end{aligned}

We arrange these transition probabilities $p(v_i, v_j)$ into an $m \times m$ transition matrix

\boldsymbol{M} \coloneqq \begin{bmatrix} p(v_1, v_1) & \dots & p(v_1, v_m) \\ \vdots & \ddots & \vdots \\ p(v_m, v_1) & \dots & p(v_m, v_m) \end{bmatrix} \in \mathbb{R}^{m \times m}.

It is easy to verify that $\boldsymbol{M} = \boldsymbol{D}^{-1} \boldsymbol{W}$ and that

\begin{aligned} \forall t \in \mathbb{N}: \quad \mathbb{P}\left(X_{t}= v_j \mid X_{0} = v_i \right ) = \left( \boldsymbol{M}^{t}\right)_{i j}. \end{aligned}
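
As a quick sanity check of these two identities, the sketch below builds $\boldsymbol{M} = \boldsymbol{D}^{-1}\boldsymbol{W}$ by row normalization and confirms that both $\boldsymbol{M}$ and $\boldsymbol{M}^t$ are row-stochastic. The toy data and the variable names are the same illustrative choices as in the previous snippet, rebuilt here so that the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # same toy data as before
W = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / 1.0)
degrees = W.sum(axis=1)                          # diagonal of D

# M = D^{-1} W: divide each row of W by the corresponding degree
M = W / degrees[:, None]
assert np.allclose(M.sum(axis=1), 1.0)           # each row is a probability distribution

# t-step transition probabilities: P(X_t = v_j | X_0 = v_i) = (M^t)_{ij}
t = 3
M_t = np.linalg.matrix_power(M, t)
assert np.allclose(M_t.sum(axis=1), 1.0)         # M^t is also row-stochastic
```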

Characterizing the transition probabilities with the matrix $\boldsymbol{M}$ is helpful, as it allows us to study the MC using tools from linear algebra. For instance, the stationary distribution of the MC $(X_t)_{t \in \mathbb{N}}$ is encoded in a probability vector $\boldsymbol{\pi}$ such that

\begin{equation} \boldsymbol{\pi}^\top \boldsymbol{M} = \boldsymbol{\pi}^\top. \end{equation}

The stationary probability vector $\boldsymbol{\pi}$ is thus a left eigenvector of $\boldsymbol{M}$ associated with the eigenvalue $1$. If we let $\boldsymbol{l}_1, \ldots, \boldsymbol{l}_m$ denote unit-norm left eigenvectors of $\boldsymbol{M}$, ordered so that their eigenvalues are decreasing (the largest eigenvalue of a row-stochastic matrix is $1$), then $\boldsymbol{\pi} = \boldsymbol{l}_1/ \| \boldsymbol{l}_1\|_1$. We remark that, by construction, the stationary distribution $\boldsymbol{\pi} = \boldsymbol{l}_1/ \| \boldsymbol{l}_1\|_1$ is unique. This is because we have assumed that the kernel $K$ takes only positive values, an assumption which implies that the Markov chain is irreducible. Additionally, via Equation (1), it is straightforward to verify that $\boldsymbol{\pi}$ is explicitly defined through the degree matrix $\boldsymbol{D}$:

\begin{aligned} \boldsymbol{\pi} = \text{diag}(\boldsymbol{D})/ \text{trace}(\boldsymbol{D}). \end{aligned}
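
The closed-form expression for $\boldsymbol{\pi}$ is easy to verify numerically. The sketch below (same illustrative toy setup as before) computes $\boldsymbol{\pi} = \text{diag}(\boldsymbol{D})/\text{trace}(\boldsymbol{D})$ and checks the left-eigenvector property of Equation (1).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / 1.0)
degrees = W.sum(axis=1)
M = W / degrees[:, None]

# pi = diag(D) / trace(D)
pi = degrees / degrees.sum()
assert np.isclose(pi.sum(), 1.0)                 # pi is a probability vector

# stationarity (Equation (1)): pi^T M = pi^T
assert np.allclose(pi @ M, pi)
```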

3. Dimensionality reduction by preserving diffusion distance

Based on the MC on the graph, we define a diffusion distance between two vertices, where we write $[m] \coloneqq \{1, \ldots, m\}$ for the index set of the vertices:

\begin{equation*} d_{t}\left(v_i, v_j\right)^{2} \coloneqq \sum_{k \in [m]} \Big [ \mathbb{P} (X_{t} = v_k \mid X_0 = v_i) - \mathbb{P} (X_{t} = v_k \mid X_0 = v_j) \Big ]^{2} \Big/ \boldsymbol{\pi}[k]. \end{equation*}

Intuitively, this diffusion distance $d_{t}$ measures the local connectivity between two vertices at a timescale prescribed by $t$. Writing the conditional probabilities in terms of the entries of the transition matrix $\boldsymbol{M}$, we find

\begin{equation*} d_{t}\left(v_i, v_j\right)^{2} =\sum_{k \in [m]}\Big [ (\boldsymbol{M}^{t})_{ik}-(\boldsymbol{M}^{t})_{jk}\Big ]^{2} \Big/ \boldsymbol{\pi}[k]. \end{equation*}

The above equation is a linear-algebraic way of computing the distance $d_{t}\left(v_i, v_j\right)^{2}$ between transition probabilities. However, evaluating $d_{t}\left(v_i, v_j\right)^{2}$ in this way is still inconvenient, as it involves the power of a matrix. In what follows, we show that the distance $d_{t}\left(v_i, v_j\right)^{2}$ can be significantly simplified by using the eigendecomposition of $\boldsymbol{M}$.
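
Before turning to the eigendecomposition, here is the direct matrix-power evaluation of $d_t(v_i, v_j)^2$ on the toy setup; the helper name `diffusion_distance_sq` is just a label for this sketch. It serves as the reference value that the spectral formula derived below should reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / 1.0)
degrees = W.sum(axis=1)
M = W / degrees[:, None]
pi = degrees / degrees.sum()

def diffusion_distance_sq(M, pi, i, j, t):
    """d_t(v_i, v_j)^2 computed directly from the t-step transition probabilities."""
    M_t = np.linalg.matrix_power(M, t)
    return np.sum((M_t[i] - M_t[j]) ** 2 / pi)

print(diffusion_distance_sq(M, pi, 0, 1, t=2))
```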

We proceed to write out the right and left eigenvectors of $\boldsymbol{M}$ and examine their relationship. To this end, we first consider a “normalized” version of $\boldsymbol{M}$, which has the form

\begin{equation*} \overline{\boldsymbol{M}} \coloneqq \boldsymbol{D}^{\frac{1}{2}} \boldsymbol{M} \boldsymbol{D}^{-\frac{1}{2}} . \end{equation*}

One nice property of $\overline{\boldsymbol{M}}$ is that it is positive semidefinite (PSD):

\begin{aligned} \forall \boldsymbol{q} \in \mathbb{R}^m, \quad \boldsymbol{q}^\top \overline{\boldsymbol{M}} \boldsymbol{q} & = \boldsymbol{q}^\top \boldsymbol{D}^{\frac{1}{2}} \boldsymbol{M} \boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{q} \cr & = \boldsymbol{q}^\top \boldsymbol{D}^{\frac{1}{2}} \boldsymbol{D}^{-1} \boldsymbol{W} \boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{q} \cr & = \boldsymbol{q}^\top \boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{W} \boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{q} \cr & = (\boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{q})^\top \boldsymbol{W} (\boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{q}) \cr & \geq 0, \end{aligned}

where in the last step we used the fact that the kernel matrix $\boldsymbol{W}$ is PSD. The chain of equalities above also shows that $\overline{\boldsymbol{M}} = \boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{W} \boldsymbol{D}^{-\frac{1}{2}}$ is symmetric. In contrast, the original, unnormalized matrix $\boldsymbol{M}$ is not symmetric and hence not PSD.
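
Numerically, both properties are easy to confirm on the toy example: $\overline{\boldsymbol{M}} = \boldsymbol{D}^{-1/2} \boldsymbol{W} \boldsymbol{D}^{-1/2}$ is symmetric, and its eigenvalues are non-negative up to round-off. This is only a sanity-check sketch for the Gaussian kernel used here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / 1.0)
d = W.sum(axis=1)

# normalized matrix: M_bar = D^{1/2} M D^{-1/2} = D^{-1/2} W D^{-1/2}
M_bar = W / np.sqrt(np.outer(d, d))

assert np.allclose(M_bar, M_bar.T)               # symmetric
eigvals = np.linalg.eigvalsh(M_bar)
assert np.all(eigvals >= -1e-8)                  # PSD, up to numerical round-off
```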

Since $\overline{\boldsymbol{M}}$ is symmetric and PSD, we may write $\overline{\boldsymbol{M}} = \boldsymbol{U} \boldsymbol{S} \boldsymbol{U}^\top$, where $\boldsymbol{U}$ is an orthonormal matrix whose columns $\boldsymbol{u}_1, \ldots, \boldsymbol{u}_m$ are eigenvectors of $\overline{\boldsymbol{M}}$ and $\boldsymbol{S}$ is a diagonal matrix whose entries $s_1 \geq s_2 \geq \cdots \geq s_m \geq 0$ are the non-negative eigenvalues of $\overline{\boldsymbol{M}}$, sorted in decreasing order; since $\boldsymbol{M}$ is row-stochastic and similar to $\overline{\boldsymbol{M}}$, the leading eigenvalue is $s_1 = 1$. As a consequence, we have

\begin{aligned} {\boldsymbol{M}} & = \boldsymbol{D}^{-\frac{1}{2}} \overline{\boldsymbol{M}} \boldsymbol{D}^{\frac{1}{2}}\cr & = \underbrace{\boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{U}}_{\coloneqq \boldsymbol{R}} \boldsymbol{S} \boldsymbol{U}^\top \boldsymbol{D}^{\frac{1}{2}} = \boldsymbol{R} \boldsymbol{S} \boldsymbol{R}^{-1}. \end{aligned}

By the identity $\boldsymbol{M} = \boldsymbol{R} \boldsymbol{S} \boldsymbol{R}^{-1}$, it is clear that the columns of $\boldsymbol{R}$ are right eigenvectors of $\boldsymbol{M}$ and the rows of $\boldsymbol{R}^{-1}$ are left eigenvectors of $\boldsymbol{M}$. For convenience, we write

\boldsymbol{R} = \begin{bmatrix} \vert && \vert \\ \boldsymbol{r}_1 & \cdots & \boldsymbol{r}_m \\ \vert & & \vert \end{bmatrix} \quad {\text{~and~}} \quad \boldsymbol{R}^{-1} = \begin{bmatrix} \text{---} & \boldsymbol{l}_1^\top & \text{---} \\ & \vdots & \\ \text{---} & \boldsymbol{l}_m^ \top & \text{---} \end{bmatrix}

and use $\boldsymbol{r}_i$ and $\boldsymbol{l}_i$ to denote the $i$-th right and left eigenvectors of $\boldsymbol{M}$ associated with the eigenvalue $s_i$. Since $\boldsymbol{R} \coloneqq \boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{U}$ and $\boldsymbol{R}^{-1} = \boldsymbol{U}^\top \boldsymbol{D}^{\frac{1}{2}}$, we see that $\boldsymbol{r}_i$ and $\boldsymbol{l}_i$ are linked through $\boldsymbol{u}_i$:

\begin{equation*} \forall i \in [m]: \quad \boldsymbol{r}_i = \boldsymbol{D}^{- \frac{1}{2}} \boldsymbol{u}_i \quad \text{~and~} \quad \boldsymbol{l}_i = \boldsymbol{D}^{\frac{1}{2}} \boldsymbol{u}_i. \end{equation*}
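
These relations between $\boldsymbol{u}_i$, $\boldsymbol{r}_i$, and $\boldsymbol{l}_i$ translate directly into code. The sketch below eigendecomposes $\overline{\boldsymbol{M}}$ with `numpy.linalg.eigh`, forms the right and left eigenvectors of $\boldsymbol{M}$, and checks the eigenvector and biorthogonality identities; the variable names `R` and `L_mat` are only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / 1.0)
d = W.sum(axis=1)
M = W / d[:, None]

# eigendecomposition of the symmetric matrix M_bar = D^{-1/2} W D^{-1/2}
M_bar = W / np.sqrt(np.outer(d, d))
s, U = np.linalg.eigh(M_bar)                     # eigh returns ascending eigenvalues
order = np.argsort(s)[::-1]                      # re-sort so that s_1 = 1 comes first
s, U = s[order], U[:, order]

R = U / np.sqrt(d)[:, None]                      # columns r_i = D^{-1/2} u_i
L_mat = U * np.sqrt(d)[:, None]                  # columns l_i = D^{1/2} u_i, so L_mat.T = R^{-1}

assert np.allclose(M @ R, R * s[None, :])               # M r_i = s_i r_i
assert np.allclose(L_mat.T @ M, s[:, None] * L_mat.T)   # l_i^T M = s_i l_i^T
assert np.allclose(L_mat.T @ R, np.eye(len(d)))         # biorthogonality: R^{-1} R = I
```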

Additionally, we note that $\boldsymbol{M}^t = (\boldsymbol{R} \boldsymbol{S} \boldsymbol{R}^{-1})^{t} = \boldsymbol{R} \boldsymbol{S}^t \boldsymbol{R}^{-1} = \sum_{h=1}^{m} s_{h}^t \boldsymbol{r}_h \boldsymbol{l}_h^\top$, which implies $(\boldsymbol{M}^t)_{ij} = \sum_{h=1}^m s_h^t \boldsymbol{r}_{h}[i] \boldsymbol{l}_{h}[j]$.

Hence

\begin{aligned} d_{t}\left(v_i, v_j\right)^{2} &=\sum_{k \in [m]} {\Big [ (\boldsymbol{M}^{t})_{ik}-(\boldsymbol{M}^{t})_{jk}\Big ]^{2}} \Big/ \boldsymbol{\pi}[k] \cr &=\sum_{k \in [m]} \Big [ \sum_{h=1}^m s_h^t \boldsymbol{r}_{h}[i] \boldsymbol{l}_{h}[k] - \sum_{h=1}^m s_h^t \boldsymbol{r}_{h}[j] \boldsymbol{l}_{h}[k] \Big ]^{2} \Big/ \boldsymbol{\pi}[k] \cr &=\sum_{k \in [m]} \Big [ \sum_{h=1}^m s_h^t (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j]) \boldsymbol{l}_{h}[k] \Big ]^{2} \Big/ \boldsymbol{\pi}[k] \cr &=\sum_{k \in [m]} \sum_{h=1}^m \sum_{h'= 1}^m \Big ( s_h^t (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j]) \boldsymbol{l}_{h}[k] \Big ) \Big ( s_{h'}^t (\boldsymbol{r}_{h'}[i] - \boldsymbol{r}_{h'}[j]) \boldsymbol{l}_{h'}[k] \Big ) \Big/ \boldsymbol{\pi}[k] \cr &= \sum_{h=1}^m \sum_{h'= 1}^m \left( s_h^t (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j]) \right) \left( s_{h'}^t (\boldsymbol{r}_{h'}[i] - \boldsymbol{r}_{h'}[j]) \right) \sum_{k \in [m]} \frac{\boldsymbol{l}_{h}[k] \boldsymbol{l}_{h'}[k]}{\boldsymbol{\pi}[k]}. \end{aligned}

As remarked in the second section, there is an explicit expression for the stationary distribution over vertices: $\boldsymbol{\pi}[k] = \frac{\boldsymbol{D}[k,k]}{\text{trace}(\boldsymbol{D})}$. Plugging this expression into the above equation, we get

\begin{aligned} \sum_{k \in [m]} \frac{\boldsymbol{l}_{h}[k] \boldsymbol{l}_{h'}[k]}{\boldsymbol{\pi}[k]} & = \sum_{k \in [m]} \frac{ \text{trace}(\boldsymbol{D}) \cdot \boldsymbol{l}_{h}[k] \cdot \boldsymbol{l}_{h'}[k]}{ \boldsymbol{D}[k, k]} \cr & = \sum_{k \in [m]} \frac{ \text{trace}(\boldsymbol{D}) \cdot \left (\boldsymbol{D}^{\frac{1}{2}} \boldsymbol{u}_{h} \right )[k] \cdot \boldsymbol{l}_{h'}[k]}{ \boldsymbol{D}[k, k]} \cr & = \text{trace}(\boldsymbol{D}) \sum_{k \in [m]} \left (\boldsymbol{D}^{- \frac{1}{2}} \boldsymbol{u}_{h} \right )[k] \cdot \boldsymbol{l}_{h'}[k] \cr & = \text{trace}(\boldsymbol{D}) \sum_{k \in [m]} \boldsymbol{r}_{h}[k] \cdot \boldsymbol{l}_{h'}[k] \cr & = \text{trace}(\boldsymbol{D}) \cdot \boldsymbol{l}_{h'}^\top \boldsymbol{r}_{h} \cr & = \begin{cases} 0 & \text{if } h \neq h', \cr \text{trace}(\boldsymbol{D}) & \text{otherwise}, \end{cases} \end{aligned}

where the last equality uses the biorthogonality relation $\boldsymbol{l}_{h'}^\top \boldsymbol{r}_{h} = \delta_{h h'}$, which follows immediately from $\boldsymbol{R}^{-1} \boldsymbol{R} = \boldsymbol{I}$.

As a result, we have

\begin{aligned} d_{t}(v_i, v_j)^{2} & = \text{trace}(\boldsymbol{D}) \sum_{h=1}^m s_h^{2t} (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j])^2 \\ & = \text{trace}(\boldsymbol{D}) \sum_{h=2}^m s_h^{2t} (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j])^2 \\ & \propto \sum_{h=2}^m s_h^{2t} (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j])^2, \end{aligned}

where in the second-to-last step we used the fact that $\boldsymbol{r}_1$ is a constant eigenvector (it is the right eigenvector of the row-stochastic matrix $\boldsymbol{M}$ associated with the eigenvalue $s_1 = 1$), and therefore the entrywise difference $(\boldsymbol{r}_{1}[i] - \boldsymbol{r}_{1}[j])^2$ vanishes.

To summarize, we proved:

\begin{equation} \boxed{\quad d_{t}(v_i, v_j)^{2} \propto \sum_{h=2}^m s_h^{2t} (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j])^2. \quad } \end{equation}
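
As a numerical check of this boxed identity, including the proportionality constant $\text{trace}(\boldsymbol{D})$ derived above, the sketch below compares the direct matrix-power computation of $d_t(v_i, v_j)^2$ with the spectral formula on the same toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / 1.0)
d = W.sum(axis=1)
M = W / d[:, None]
pi = d / d.sum()

M_bar = W / np.sqrt(np.outer(d, d))
s, U = np.linalg.eigh(M_bar)
order = np.argsort(s)[::-1]
s, U = s[order], U[:, order]
R = U / np.sqrt(d)[:, None]                      # right eigenvectors r_h as columns

i, j, t = 0, 1, 2

# direct computation: d_t^2 = sum_k [(M^t)_ik - (M^t)_jk]^2 / pi_k
M_t = np.linalg.matrix_power(M, t)
direct = np.sum((M_t[i] - M_t[j]) ** 2 / pi)

# spectral formula: d_t^2 = trace(D) * sum_{h >= 2} s_h^{2t} (r_h[i] - r_h[j])^2
spectral = d.sum() * np.sum(s[1:] ** (2 * t) * (R[i, 1:] - R[j, 1:]) ** 2)

assert np.allclose(direct, spectral)
```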

Finally, we define the diffusion map with truncated order $n$ as

\begin{equation} \boxed{ \begin{split} \quad \Phi_t^n: \mathcal{X} & \to \mathbb{R}^{n} \quad \\ \boldsymbol{x}_i & \mapsto \begin{pmatrix} s_2^{t} \boldsymbol{r}_2[i] \\ s_3^{t} \boldsymbol{r}_3[i] \\ \vdots \\ s_{n+1}^{t} \boldsymbol{r}_{n+1}[i] \end{pmatrix}. \quad \end{split}} \end{equation}

Then, through Equation (2), we see that if we choose $n = m - 1$, the Euclidean distance between the diffusion maps of two points $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$ reconstructs the diffusion distance up to a scalar:

\begin{equation*} \| \Phi_t^{m-1}(\boldsymbol{x}_i) - \Phi_t^{m-1}(\boldsymbol{x}_j) \|_2^2 \propto d_{t}(v_i, v_j)^{2}. \end{equation*}

In the interest of reducing the dimensionality of the data, we may instead choose $n \ll \min(m-1, N)$, keeping only the coordinates associated with the largest eigenvalues $s_2, \ldots, s_{n+1}$.
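
Putting the pieces together, here is a minimal end-to-end sketch of the construction. The function name `diffusion_map`, the synthetic data, and the parameter choices are illustrative assumptions for this post; the embedding simply stacks the coordinates $s_{h}^t \boldsymbol{r}_h[i]$ for $h = 2, \ldots, n+1$.

```python
import numpy as np

def diffusion_map(X, eps=1.0, n=2, t=1):
    """Minimal diffusion map sketch: embed each row of X into R^n."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / eps)                  # adjacency matrix W_ij = K(x_i, x_j)
    d = W.sum(axis=1)                            # vertex degrees
    M_bar = W / np.sqrt(np.outer(d, d))          # D^{-1/2} W D^{-1/2}
    s, U = np.linalg.eigh(M_bar)
    order = np.argsort(s)[::-1]                  # decreasing eigenvalues, s_1 = 1 first
    s, U = s[order], U[:, order]
    R = U / np.sqrt(d)[:, None]                  # right eigenvectors of M = D^{-1} W
    # drop the constant eigenvector r_1; keep the next n coordinates, scaled by s_h^t
    return R[:, 1:n + 1] * (s[1:n + 1] ** t)[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # m = 20 toy points in R^3
Phi = diffusion_map(X, eps=1.0, n=2, t=2)        # 2-dimensional embedding
print(Phi.shape)                                 # (20, 2)
```

With $n = m - 1$, the pairwise Euclidean distances of the embedded points reproduce the diffusion distances up to the factor $\text{trace}(\boldsymbol{D})$, as verified in the previous sketch.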