# A detailed derivation of the diffusion map

The diffusion map is a widely used dimensionality reduction technique. It represents high-dimensional data points through an underlying graph, which non-linearly encodes the geometric similarities between data points. Despite its popularity, few tutorials introduce the diffusion map with a level of precision that I find satisfying. This blog post walks through the derivation in detail.

To formalize, we let $$\mathcal{X} \coloneqq \{ \boldsymbol{x}_1, \ldots, \boldsymbol{x}_m\} \subset \mathbb{R}^N$$ be a given set of high-dimensional data. In this note, we will construct the diffusion maps

$\begin{equation*} \begin{split} \Phi^n: \mathcal{X} & \to \mathbb{R}^{n} \\ \boldsymbol{x} & \mapsto \Phi^n(\boldsymbol{x}), \end{split} \end{equation*}$

where $$n \leq \min(m - 1, N)$$. Specifically, to construct the diffusion maps $$\Phi^n$$, we take three steps:

1. View the high-dimensional points in $$\mathcal{X}$$ as vertices in an underlying graph (Section 1).
2. Define a Markov Chain (MC) on the graph, so that local connectivity between vertices of the graph has a probability interpretation (Section 2).
3. Assign each data point in $$\mathcal{X}$$ a low-dimensional vector, with respect to which the graph-based local connectivity is approximately preserved (Section 3).

## 1. Graph representation of data points

To start the diffusion maps machinery, we first describe each high-dimensional data point $$\boldsymbol{x}_i \in \mathbb{R}^{N}$$ by its relationship with other data points in the given dataset $$\mathcal{X} \coloneqq \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_m\}$$. To this end, we assume that each data point $$\boldsymbol{x}_i$$ is emitted from a vertex $$v_i$$ of an underlying directed weighted graph $$G = (V, E, \boldsymbol{W})$$. Here, $$V \coloneqq \{v_1, \ldots, v_m\}$$ is the set of vertices, $$E \subset V \times V$$ is a set of ordered pairs of vertices representing edges, and $$\boldsymbol{W}$$ is the adjacency matrix such that $$\boldsymbol{W}_{ij} > 0$$ if $$(v_i, v_j) \in E$$ and $$\boldsymbol{W}_{ij} = 0$$ otherwise.

One advantage of having an underlying graph representation of data is that the similarities of high-dimensional points $$\boldsymbol{x}_i$$ and $$\boldsymbol{x}_j$$ can be encapsulated by $$\boldsymbol{W}_{ij}$$, that is, the connection strength or the weight between two vertices $$v_i$$ and $$v_j$$. But how to specify these weights? In diffusion maps, we let each connection strength be specified through a kernel function $$K$$:

\begin{aligned} K: \mathbb{R}^N \times \mathbb{R}^N & \to \mathbb{R}_{+} \cr (\boldsymbol{x}_i, \boldsymbol{x}_j) & \mapsto K(\boldsymbol{x}_i, \boldsymbol{x}_j). \end{aligned}

Note that here we assume that the kernel $$K$$ only takes positive values. This is a convenient choice, as it guarantees that the Markov chain defined later has a unique stationary distribution. We also assume that $$K$$ is symmetric, i.e., $$K(\boldsymbol{x}_i, \boldsymbol{x}_j) = K(\boldsymbol{x}_j, \boldsymbol{x}_i)$$, a property we will rely on in Sections 2 and 3. One common choice satisfying both assumptions is the radial basis kernel $$K(\boldsymbol{x}_i, \boldsymbol{x}_j)=\exp \left(-\frac{\|\boldsymbol{x}_i-\boldsymbol{x}_j\|^{2}}{\epsilon}\right) \text{~with~} \epsilon > 0.$$ In this way, we construct an adjacency matrix with components $$\boldsymbol{W}_{ij} \coloneqq K(\boldsymbol{x}_i, \boldsymbol{x}_j)$$. The whole adjacency matrix $$\boldsymbol{W} \in \mathbb{R}^{m \times m}$$ has the form

$\boldsymbol{W} = \begin{bmatrix} K(\boldsymbol{x}_1, \boldsymbol{x}_1) & \dots & K(\boldsymbol{x}_1, \boldsymbol{x}_m) \\ \vdots & \ddots & \vdots \\ K(\boldsymbol{x}_m, \boldsymbol{x}_1) & \dots & K(\boldsymbol{x}_m, \boldsymbol{x}_m) \end{bmatrix} \in \mathbb{R}^{m \times m}.$

Another important concept for a graph is the degree matrix. Recall that the degree of a vertex is defined as the sum of the weights of its outgoing edges. The degree matrix is simply a diagonal matrix that collects the degree of each vertex:

$\boldsymbol{D} \coloneqq \begin{bmatrix} \sum_{k = 1}^m K( \boldsymbol{x}_1, \boldsymbol{x}_k) & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sum_{k = 1}^m K( \boldsymbol{x}_m, \boldsymbol{x}_k) \end{bmatrix} \in \mathbb{R}^{m \times m}.$
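To make this concrete, below is a minimal NumPy sketch that builds the adjacency matrix $$\boldsymbol{W}$$ and the degree matrix $$\boldsymbol{D}$$ from a toy dataset using the radial basis kernel. The dataset, the bandwidth $$\epsilon$$, and all variable names are illustrative choices of mine, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # toy data: m = 50 points in R^N with N = 3
epsilon = 1.0                  # kernel bandwidth

# Radial basis kernel: W[i, j] = exp(-||x_i - x_j||^2 / epsilon).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-sq_dists / epsilon)

# Degree matrix: D[i, i] is the sum of the weights of the edges leaving vertex v_i.
D = np.diag(W.sum(axis=1))
```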

## 2. Markov chain on the graph

Now, with the weighted graph $$G = (V, E, \boldsymbol{W})$$ from the previous section, we construct a stochastic process $$(X_t)_{t \in \mathbb{N}}$$ taking values in the vertices $$V$$. Specifically, we let the stochastic process be a time-homogeneous Markov Chain, where the transition probability between two vertices is specified via the normalized edge weights attached to these two vertices:

\begin{aligned} \forall t \in \mathbb{N}: \quad \mathbb{P}\left(X_{t+1}= v_j \mid X_{t}= v_i \right) = \frac{ K ( \boldsymbol{x}_i, \boldsymbol{x}_j)}{ \sum_{k = 1}^m K( \boldsymbol{x}_i, \boldsymbol{x}_k) } \eqqcolon p(v_i, v_j). \end{aligned}

We arrange these transition probabilities $$p(v_i, v_j)$$ into an $$m \times m$$ transition matrix

$\boldsymbol{M} \coloneqq \begin{bmatrix} p(v_1, v_1) & \dots & p(v_1, v_m) \\ \vdots & \ddots & \vdots \\ p(v_m, v_1) & \dots & p(v_m, v_m) \end{bmatrix} \in \mathbb{R}^{m \times m}.$

It is easy to verify that $$\boldsymbol{M} = \boldsymbol{D}^{-1} \boldsymbol{W}$$ and that

\begin{aligned} \forall t \in \mathbb{N}: \quad \mathbb{P}\left(X_{t}= v_j \mid X_{0} = v_i \right ) = \left( \boldsymbol{M}^{t}\right)_{i j}. \end{aligned}
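As a quick sanity check, the sketch below (reusing the toy setup from the previous snippet) forms $$\boldsymbol{M} = \boldsymbol{D}^{-1} \boldsymbol{W}$$ and verifies that every row of $$\boldsymbol{M}$$ and of $$\boldsymbol{M}^t$$ is a probability distribution; the choice $$t = 3$$ is arbitrary.

```python
import numpy as np

# Toy setup, as in the previous sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # epsilon = 1
D = np.diag(W.sum(axis=1))

# Transition matrix M = D^{-1} W; every row sums to one by construction.
M = np.linalg.inv(D) @ W
assert np.allclose(M.sum(axis=1), 1.0)

# t-step transition probabilities: P(X_t = v_j | X_0 = v_i) = (M^t)_{ij}.
t = 3
M_t = np.linalg.matrix_power(M, t)
assert np.allclose(M_t.sum(axis=1), 1.0)
```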

Characterizing transition probability with the matrix $$\boldsymbol{M}$$ is helpful as it allows us to study the MC using tools of linear algebra. For instance, the stationary distribution of the MC $$(X_t)_{t \in \mathbb{N}}$$ is encoded in a probability vector $$\boldsymbol{\pi}$$ such that

$\begin{equation} \boldsymbol{\pi}^\top \boldsymbol{M} = \boldsymbol{\pi}^\top. \end{equation}$

The stationary probability vector $$\boldsymbol{\pi}$$ is thus a left eigenvector of $$\boldsymbol{M}$$ associated with the eigenvalue $$1$$, which is the largest eigenvalue of $$\boldsymbol{M}$$. If $$\boldsymbol{l}_1$$ denotes such a left eigenvector, then $$\boldsymbol{\pi} = \boldsymbol{l}_1/ \| \boldsymbol{l}_1\|_1.$$ We remark that, by construction, the stationary distribution $$\boldsymbol{\pi}$$ is unique. This is because we have assumed the kernel $$K$$ only takes positive values, which implies that the Markov chain is irreducible. Additionally, via Equation (1), it is straightforward to verify that $$\boldsymbol{\pi}$$ is explicitly given in terms of the degree matrix $$\boldsymbol{D}$$:

\begin{aligned} \boldsymbol{\pi} = \text{diag}(\boldsymbol{D})/ \text{trace}(\boldsymbol{D}). \end{aligned}
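On the toy data, we can numerically confirm that this closed-form vector indeed satisfies Equation (1); the check below is a sketch under the same illustrative setup as before.

```python
import numpy as np

# Toy setup, as in the previous sketches.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
D = np.diag(W.sum(axis=1))
M = np.linalg.inv(D) @ W

# Closed-form stationary distribution: pi = diag(D) / trace(D).
pi = np.diag(D) / np.trace(D)

assert np.isclose(pi.sum(), 1.0)  # pi is a probability vector
assert np.allclose(pi @ M, pi)    # pi^T M = pi^T, i.e., Equation (1)
```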

## 3. Dimensionality reduction by preserving diffusion distance

Based on the MC on the graph, we define a diffusion distance between two vertices, where $$[m] \coloneqq \{1, \ldots, m\}$$ indexes the vertices:

$\begin{equation*} d_{t}\left(v_i, v_j\right)^{2} \coloneqq \sum_{k \in [m]} \Big [ \mathbb{P} (X_{t} = v_k \mid X_0 = v_i) - \mathbb{P} (X_{t} = v_k \mid X_0 = v_j) \Big ]^{2} \Big/ \boldsymbol{\pi}[k]. \end{equation*}$

Intuitively, this diffusion distance $$d_{t}$$ measures the local connectivity between two vertices at a timescale prescribed by $$t$$. Writing the conditional probabilities in terms of the entries of the transition matrix $$\boldsymbol{M}$$, we discover

$\begin{equation*} d_{t}\left(v_i, v_j\right)^{2} =\sum_{k \in [m]}\Big [ (\boldsymbol{M}^{t})_{ik}-(\boldsymbol{M}^{t})_{jk}\Big ]^{2} \Big/ \boldsymbol{\pi}[k]. \end{equation*}$

The above equation is a linear-algebraic way of computing the distance $$d_{t}\left(v_i, v_j\right)^{2}$$ between transition probabilities. However, evaluating $$d_{t}\left(v_i, v_j\right)^{2}$$ as above is still inconvenient, as it involves the power of a matrix. In what follows, we show that the distance $$d_{t}\left(v_i, v_j\right)^{2}$$ can be significantly simplified by using the eigendecomposition of $$\boldsymbol{M}$$.
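For illustration, here is the direct (and, as noted, inconvenient) computation of $$d_{t}(v_i, v_j)^{2}$$ via the matrix power $$\boldsymbol{M}^{t}$$; the toy data, the vertex pair, and the timescale $$t = 3$$ are arbitrary choices.

```python
import numpy as np

# Toy setup, as in the previous sketches.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
D = np.diag(W.sum(axis=1))
M = np.linalg.inv(D) @ W
pi = np.diag(D) / np.trace(D)

def diffusion_distance_sq(i, j, t):
    """Squared diffusion distance between v_i and v_j, computed directly from M^t."""
    M_t = np.linalg.matrix_power(M, t)
    return np.sum((M_t[i] - M_t[j]) ** 2 / pi)

print(diffusion_distance_sq(0, 1, t=3))
```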

We proceed to write out the right and left eigenvectors of $$\boldsymbol{M}$$ and examine their relationships. To this end, we first consider a “normalized” version of $$\boldsymbol{M}$$, which has the form

$\begin{equation*} \overline{\boldsymbol{M}} \coloneqq \boldsymbol{D}^{\frac{1}{2}} \boldsymbol{M} \boldsymbol{D}^{-\frac{1}{2}} . \end{equation*}$

One nice property of $$\overline{\boldsymbol{M}}$$ is that it is positive semidefinite (PSD):

\begin{aligned} \forall \boldsymbol{q} \in \mathbb{R}^m, \quad \boldsymbol{q}^\top \overline{\boldsymbol{M}} \boldsymbol{q} & = \boldsymbol{q}^\top \boldsymbol{D}^{\frac{1}{2}} \boldsymbol{M} \boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{q} \cr & = \boldsymbol{q}^\top \boldsymbol{D}^{\frac{1}{2}} \boldsymbol{D}^{-1} \boldsymbol{W} \boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{q} \cr & = \boldsymbol{q}^\top \boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{W} \boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{q} \cr & = (\boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{q})^\top \boldsymbol{W} (\boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{q}) \cr & \geq 0, \end{aligned}

where in the last step we used the fact that the kernel matrix $$\boldsymbol{W}$$ is PSD (which holds, for instance, for the radial basis kernel). Note also that $$\overline{\boldsymbol{M}} = \boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{W} \boldsymbol{D}^{-\frac{1}{2}}$$ is symmetric, since the kernel $$K$$ is symmetric; in contrast, the original, unnormalized matrix $$\boldsymbol{M}$$ is in general neither symmetric nor PSD.

Since $$\overline{\boldsymbol{M}}$$ is symmetric and PSD, we may write $$\overline{\boldsymbol{M}} = \boldsymbol{U} \boldsymbol{S} \boldsymbol{U}^\top,$$ where $$\boldsymbol{U}$$ is an orthogonal matrix whose columns $$\boldsymbol{u}_1, \ldots, \boldsymbol{u}_m$$ are eigenvectors of $$\overline{\boldsymbol{M}}$$ and $$\boldsymbol{S}$$ is a diagonal matrix whose entries are the non-negative eigenvalues of $$\overline{\boldsymbol{M}}$$. As a consequence, we have

\begin{aligned} {\boldsymbol{M}} & = \boldsymbol{D}^{-\frac{1}{2}} \overline{\boldsymbol{M}} \boldsymbol{D}^{\frac{1}{2}}\cr & = \underbrace{\boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{U}}_{\coloneqq \boldsymbol{R}} \boldsymbol{S} \boldsymbol{U}^\top \boldsymbol{D}^{\frac{1}{2}} = \boldsymbol{R} \boldsymbol{S} \boldsymbol{R}^{-1}. \end{aligned}

By the identity $$\boldsymbol{M} = \boldsymbol{R} \boldsymbol{S} \boldsymbol{R}^{-1},$$ it is clear that columns of $$\boldsymbol{R}$$ are right eigenvectors of $$\boldsymbol{M}$$ and rows of $$\boldsymbol{R}^{-1}$$ are left eigenvectors of $$\boldsymbol{M}.$$ For convenience, we write

$\boldsymbol{R} = \begin{bmatrix} \vert & & \vert \\ \boldsymbol{r}_1 & \cdots & \boldsymbol{r}_m \\ \vert & & \vert \end{bmatrix} \quad {\text{~and~}} \quad \boldsymbol{R}^{-1} = \begin{bmatrix} \text{---} & \boldsymbol{l}_1^\top & \text{---} \\ & \vdots & \\ \text{---} & \boldsymbol{l}_m^ \top & \text{---} \end{bmatrix}$

and use $$\boldsymbol{r}_i$$ and $$\boldsymbol{l}_i$$ to denote the $$i$$-th right and left eigenvectors of $$\boldsymbol{M}$$ associated with the eigenvalue $$s_i$$, where we order the eigenvalues so that $$1 = s_1 \geq s_2 \geq \cdots \geq s_m \geq 0$$. Since $$\boldsymbol{R} \coloneqq \boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{U}$$ and $$\boldsymbol{R}^{-1} = \boldsymbol{U}^\top \boldsymbol{D}^{\frac{1}{2}}$$, we see that $$\boldsymbol{r}_i$$ and $$\boldsymbol{l}_i$$ are linked through $$\boldsymbol{u}_i$$:

$\begin{equation*} \forall i \in [m]: \quad \boldsymbol{r}_i = \boldsymbol{D}^{- \frac{1}{2}} \boldsymbol{u}_i \quad \text{~and~} \quad \boldsymbol{l}_i = \boldsymbol{D}^{\frac{1}{2}} \boldsymbol{u}_i. \end{equation*}$
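The sketch below carries out this construction on the toy data: it forms $$\overline{\boldsymbol{M}}$$, eigendecomposes it, builds $$\boldsymbol{R}$$ and $$\boldsymbol{R}^{-1}$$, and checks the left/right eigenvector identities, as well as the fact (used shortly) that $$\boldsymbol{r}_1$$ is a constant vector.

```python
import numpy as np

# Toy setup, as in the previous sketches.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
d = W.sum(axis=1)                               # diagonal of the degree matrix D
M = W / d[:, None]                              # M = D^{-1} W

# Normalized matrix M_bar = D^{1/2} M D^{-1/2} = D^{-1/2} W D^{-1/2} (symmetric, PSD).
M_bar = W / np.sqrt(np.outer(d, d))

# Eigendecomposition of the symmetric M_bar; sort so that 1 = s_1 >= ... >= s_m >= 0.
s, U = np.linalg.eigh(M_bar)
order = np.argsort(s)[::-1]
s, U = s[order], U[:, order]

R = U / np.sqrt(d)[:, None]                     # R = D^{-1/2} U; columns are r_1, ..., r_m
L = U * np.sqrt(d)[:, None]                     # columns are l_h = D^{1/2} u_h; R^{-1} = L^T

assert np.allclose(M @ R, R * s)                # M r_h = s_h r_h   (right eigenvectors)
assert np.allclose(L.T @ M, s[:, None] * L.T)   # l_h^T M = s_h l_h^T (left eigenvectors)
assert np.allclose(R @ np.diag(s) @ L.T, M)     # M = R S R^{-1}
assert np.allclose(R[:, 0], R[0, 0])            # r_1 is a constant vector
```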

Additionally, we note that $$\boldsymbol{M}^t = (\boldsymbol{R} \boldsymbol{S} \boldsymbol{R}^{-1})^{t} = \boldsymbol{R} \boldsymbol{S}^t \boldsymbol{R}^{-1} = \sum_{h=1}^{m} s_{h}^t \boldsymbol{r}_h \boldsymbol{l}_h^\top,$$ which suggests $$(\boldsymbol{M}^t)_{ij} = \sum_{h=1}^m s_h^t \boldsymbol{r}_{h}[i] \boldsymbol{l}_{h}[j].$$

Hence

\begin{aligned} d_{t}\left(v_i, v_j\right)^{2} &=\sum_{k \in [m]} {\Big [ (\boldsymbol{M}^{t})_{ik}-(\boldsymbol{M}^{t})_{jk}\Big ]^{2}} \Big/ \boldsymbol{\pi}[k] \cr &=\sum_{k \in [m]} \Big [ \sum_{h=1}^m s_h^t \boldsymbol{r}_{h}[i] \boldsymbol{l}_{h}[k] - \sum_{h=1}^m s_h^t \boldsymbol{r}_{h}[j] \boldsymbol{l}_{h}[k] \Big ]^{2} \Big/ \boldsymbol{\pi}[k] \cr &=\sum_{k \in [m]} \Big [ \sum_{h=1}^m s_h^t (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j]) \boldsymbol{l}_{h}[k] \Big ]^{2} \Big/ \boldsymbol{\pi}[k] \cr &=\sum_{k \in [m]} \sum_{h=1}^m \sum_{h'= 1}^m \Big ( s_h^t (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j]) \boldsymbol{l}_{h}[k] \Big ) \Big ( s_{h'}^t (\boldsymbol{r}_{h'}[i] - \boldsymbol{r}_{h'}[j]) \boldsymbol{l}_{h'}[k] \Big ) \Big/ \boldsymbol{\pi}[k] \cr &= \sum_{h=1}^m \sum_{h'= 1}^m \left( s_h^t (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j]) \right) \left( s_{h'}^t (\boldsymbol{r}_{h'}[i] - \boldsymbol{r}_{h'}[j]) \right) \sum_{k \in [m]} \frac{\boldsymbol{l}_{h}[k] \boldsymbol{l}_{h'}[k]}{\boldsymbol{\pi}[k]}. \end{aligned}

As remarked in the second section, there is an explicit expression for the stationary distribution over vertices: $$\boldsymbol{\pi}[k] = \frac{\boldsymbol{D}[k,k]}{\text{trace}(\boldsymbol{D})}$$. Plugging this expression into the innermost sum above, we get

\begin{aligned} \sum_{k \in [m]} \frac{\boldsymbol{l}_{h}[k] \boldsymbol{l}_{h'}[k]}{\boldsymbol{\pi}[k]} & = \sum_{k \in [m]} \frac{ \text{trace}(\boldsymbol{D}) \cdot \boldsymbol{l}_{h}[k] \cdot \boldsymbol{l}_{h'}[k]}{ \boldsymbol{D}[k, k]} \cr & = \sum_{k \in [m]} \frac{ \text{trace}(\boldsymbol{D}) \cdot \left (\boldsymbol{D}^{\frac{1}{2}} \boldsymbol{u}_{h} \right )[k] \cdot \boldsymbol{l}_{h'}[k]}{ \boldsymbol{D}[k, k]} \cr & = \text{trace}(\boldsymbol{D}) \sum_{k \in [m]} \left (\boldsymbol{D}^{- \frac{1}{2}} \boldsymbol{u}_{h} \right )[k] \cdot \boldsymbol{l}_{h'}[k] \cr & = \text{trace}(\boldsymbol{D}) \sum_{k \in [m]} \boldsymbol{r}_{h}[k] \cdot \boldsymbol{l}_{h'}[k] \cr & = \text{trace}(\boldsymbol{D}) \cdot \boldsymbol{l}_{h'}^\top \boldsymbol{r}_{h} \cr & = \begin{cases} 0 & \text{if } h \neq h', \cr \text{trace}(\boldsymbol{D}) & \text{otherwise}, \end{cases} \end{aligned}

where the last equality follows from $$\boldsymbol{l}_{h'}^\top \boldsymbol{r}_{h} = (\boldsymbol{R}^{-1} \boldsymbol{R})[h', h] = \delta_{h' h}$$.

As a result, we have

\begin{aligned} d_{t}(v_i, v_j)^{2} & = \text{trace}(\boldsymbol{D}) \sum_{h=1}^m s_h^{2t} (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j])^2 \\ & = \text{trace}(\boldsymbol{D}) \sum_{h=2}^m s_h^{2t} (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j])^2 \\ & \propto \sum_{h=2}^m s_h^{2t} (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j])^2 \end{aligned}

where in the second-to-last step we used the fact that $$\boldsymbol{r}_1$$ is a constant vector (it is the right eigenvector of the row-stochastic matrix $$\boldsymbol{M}$$ associated with the eigenvalue $$s_1 = 1$$), and therefore the difference $$(\boldsymbol{r}_{1}[i] - \boldsymbol{r}_{1}[j])^2$$ vanishes.

To summarize, we proved:

$\begin{equation} \boxed{\quad d_{t}(v_i, v_j)^{2} \propto \sum_{h=2}^m s_h^{2t} (\boldsymbol{r}_{h}[i] - \boldsymbol{r}_{h}[j])^2. \quad } \end{equation}$

Finally, we define the diffusion map truncated at order $$n$$ as

$\begin{equation} \boxed{ \begin{split} \quad \Phi_t^n: \mathcal{X} & \to \mathbb{R}^{n} \quad \\ \boldsymbol{x}_i & \mapsto \begin{pmatrix} s_2^{t} \boldsymbol{r}_2[i] \\ s_3^{t} \boldsymbol{r}_3[i] \\ \vdots \\ s_{n+1}^{t} \boldsymbol{r}_{n+1}[i] \end{pmatrix}. \quad \end{split}} \end{equation}$
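Putting the pieces together, a compact implementation of the truncated diffusion map might look as follows. This is a sketch that follows the derivation above; the function name, the default bandwidth, and the toy usage example are my own choices.

```python
import numpy as np

def diffusion_map(X, n, t, epsilon=1.0):
    """Order-n diffusion map at timescale t, following the derivation above."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / epsilon)              # adjacency matrix via the RBF kernel
    d = W.sum(axis=1)                            # vertex degrees, i.e., diag of D
    M_bar = W / np.sqrt(np.outer(d, d))          # D^{-1/2} W D^{-1/2}
    s, U = np.linalg.eigh(M_bar)
    order = np.argsort(s)[::-1]                  # sort so that 1 = s_1 >= ... >= s_m
    s, U = s[order], U[:, order]
    R = U / np.sqrt(d)[:, None]                  # right eigenvectors of M = D^{-1} W
    # Row i of the output is Phi_t^n(x_i) = (s_2^t r_2[i], ..., s_{n+1}^t r_{n+1}[i]).
    return R[:, 1 : n + 1] * s[1 : n + 1] ** t

# Example usage on toy data: embed 50 points from R^3 into R^2 at timescale t = 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
embedding = diffusion_map(X, n=2, t=3)
print(embedding.shape)  # (50, 2)
```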

Then, through Equation (2), we see that if we choose $$n = m - 1$$, the Euclidean distance between the diffusion maps of two points $$\boldsymbol{x}_i$$ and $$\boldsymbol{x}_j$$ reconstructs the diffusion distance up to a scalar:

$\| \Phi_t^{m-1}(\boldsymbol{x}_i) - \Phi_t^{m-1}(\boldsymbol{x}_j) \|_2^2 \propto d_{t}(v_i, v_j)^{2}.$
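As a final numerical check on the toy data, the sketch below confirms that for $$n = m - 1$$ the squared Euclidean distance between embeddings matches the diffusion distance up to the factor $$\text{trace}(\boldsymbol{D})$$, exactly as Equation (2) predicts.

```python
import numpy as np

# Toy setup, as in the earlier sketches.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
d = W.sum(axis=1)
M = W / d[:, None]
pi = d / d.sum()
t = 3

# Full-order embedding (n = m - 1) from the spectral construction.
s, U = np.linalg.eigh(W / np.sqrt(np.outer(d, d)))
order = np.argsort(s)[::-1]
s, U = s[order], U[:, order]
R = U / np.sqrt(d)[:, None]
Phi = R[:, 1:] * s[1:] ** t

# Diffusion distance computed directly from the t-step transition probabilities.
M_t = np.linalg.matrix_power(M, t)
i, j = 0, 1
diff_dist_sq = np.sum((M_t[i] - M_t[j]) ** 2 / pi)
embed_dist_sq = np.sum((Phi[i] - Phi[j]) ** 2)

# The two quantities agree up to the constant trace(D) = sum of all vertex degrees.
assert np.isclose(diff_dist_sq, d.sum() * embed_dist_sq)
```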

In the interest of reducing the dimensionality of the data, we may choose $$n \ll \min(m-1, N)$$. Since the eigenvalues are ordered as $$1 = s_1 \geq s_2 \geq \cdots \geq s_m \geq 0$$ and enter the distance through the factor $$s_h^{2t}$$, the leading $$n$$ coordinates typically capture most of the diffusion distance.

Date: May 29, 2021