The diffusion map is an extensively used dimensionality reduction technique. It represents high-dimensional data points using an underlying graph, which non-linearly encodes the geometric similarities between data points. Despite its popularity, few tutorials introduce the diffusion map with the precision I find satisfying. This blog delves deep into the derivation.
To formalize, we let $\mathcal{X} = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^D$ be a given set of high-dimensional data. In this note, we will construct the diffusion maps
$$\Psi_t^{(k)}: \mathcal{X} \to \mathbb{R}^k,$$
where $k \ll D$ and $t$ is a timescale parameter to be introduced later. Specifically, to construct the diffusion maps $\Psi_t^{(k)}$, we take three steps:
- View the high-dimensional points in $\mathcal{X}$ as vertices in an underlying graph (Section 1).
- Define a Markov Chain (MC) on the graph, so that local connectivity between vertices of the graph has a probability interpretation (Section 2).
- Assign each data point in $\mathcal{X}$ a low-dimensional vector, with respect to which the graph-based local connectivity is approximately preserved (Section 3).
1. Graph representation of data points
To start the diffusion maps machinery, we first describe each high-dimensional data point by its relationship with the other data points in the given dataset $\mathcal{X}$. To this end, we assume that each data point $x_i$ is emitted from a vertex $i$ of an underlying directed weighted graph $G = (V, E, W)$. Here, $V = \{1, 2, \ldots, n\}$ is the set of vertices, $E \subseteq V \times V$ is a set of ordered pairs of vertices representing edges, and $W \in \mathbb{R}^{n \times n}$ is the adjacency matrix such that $W_{ij} > 0$ if $(i, j) \in E$ and $W_{ij} = 0$ otherwise.
One advantage of having an underlying graph representation of data is that the similarity between two high-dimensional points $x_i$ and $x_j$ can be encapsulated by $W_{ij}$, that is, the connection strength or weight between the two vertices $i$ and $j$. But how do we specify these weights? In diffusion maps, we let each connection strength be specified through a kernel function $\kappa: \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}_{>0}$:
$$W_{ij} = \kappa(x_i, x_j).$$
Note that here we assume that the kernel only takes positive values. This is a convenient choice, as it guarantees that the Markov chain to be defined in Section 2 has a unique stationary distribution. One common choice of the kernel is the radial basis kernel
$$\kappa(x, y) = \exp\left(-\frac{\lVert x - y \rVert^2}{\epsilon}\right),$$
where $\epsilon > 0$ is a bandwidth parameter.
In this way, we construct an adjacency matrix with components $W_{ij} = \kappa(x_i, x_j)$. The whole adjacency matrix has the form
$$W = \begin{pmatrix} \kappa(x_1, x_1) & \cdots & \kappa(x_1, x_n) \\ \vdots & \ddots & \vdots \\ \kappa(x_n, x_1) & \cdots & \kappa(x_n, x_n) \end{pmatrix}.$$
Another important concept for a graph is the degree matrix. Recall that the degree of a vertex is defined as the sum of the weights of its outgoing edges, $d_i = \sum_{j=1}^n W_{ij}$. The degree matrix is simply a diagonal matrix that collects the degree of each vertex:
$$D = \mathrm{diag}(d_1, d_2, \ldots, d_n).$$
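As a minimal sketch of this construction (assuming NumPy and the radial basis kernel; the bandwidth name `eps` is illustrative, not from the text):

```python
import numpy as np

def adjacency_and_degree(X, eps=1.0):
    """Return the RBF adjacency matrix W and degree matrix D for data X.

    `eps` is a hypothetical name for the kernel bandwidth.
    """
    # Pairwise squared Euclidean distances between the n data points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / eps)    # W_ij = k(x_i, x_j) > 0
    D = np.diag(W.sum(axis=1))     # degree d_i = sum_j W_ij on the diagonal
    return W, D

# Usage on small random data (illustrative only).
X = np.random.default_rng(0).normal(size=(5, 3))
W, D = adjacency_and_degree(X)
```

Since the RBF kernel is positive everywhere, every entry of $W$ is strictly positive, which is exactly the assumption used for uniqueness of the stationary distribution later.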
2. Markov chain on the graph
Now, with the weighted graph of the previous section, we construct a stochastic process $(X_t)_{t \ge 0}$ taking values in the vertices $V$. Specifically, we let the stochastic process be a time-homogeneous Markov Chain, where the transition probability between two vertices is specified via the normalized edge weight attached to these two vertices:
$$P_{ij} = \Pr(X_{t+1} = j \mid X_t = i) = \frac{W_{ij}}{d_i}. \tag{1}$$
We arrange these transition probabilities into an $n \times n$ sized transition matrix
$$P = D^{-1} W.$$
It is easy to verify that $P_{ij} > 0$ and that $\sum_{j=1}^n P_{ij} = 1$ for every vertex $i$; in other words, $P$ is a row-stochastic matrix.
Characterizing the transition probabilities with the matrix $P$ is helpful, as it allows us to study the MC using tools of linear algebra. For instance, the stationary distribution of the MC is encoded in a probability vector $\pi \in \mathbb{R}^n$ such that
$$\pi^\top P = \pi^\top.$$
The stationary probability vector $\pi$ is thus a left eigenvector of $P$ associated with eigenvalue $1$, normalized so that its entries sum to one. We remark that, by construction, the stationary distribution is unique. This is because we have assumed the kernel only takes positive values, an assumption implying that the Markov chain is irreducible. Additionally, via Equation (1), it is straightforward to verify that $\pi$ is explicitly determined by the degrees collected in the degree matrix $D$:
$$\pi_i = \frac{d_i}{\sum_{j=1}^n d_j}.$$
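The construction of $P$ and the explicit stationary distribution can be checked numerically. A sketch assuming NumPy (data and bandwidth are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))        # illustrative data, n = 6

# Adjacency via the radial basis kernel (Section 1), bandwidth 1.0.
W = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
d = W.sum(axis=1)                  # vertex degrees

P = W / d[:, None]                 # transition matrix P = D^{-1} W
pi = d / d.sum()                   # claimed stationary distribution

# Each row of P is a probability distribution ...
assert np.allclose(P.sum(axis=1), 1.0)
# ... and pi is indeed a left eigenvector of P with eigenvalue 1.
assert np.allclose(pi @ P, pi)
```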
3. Dimensionality reduction by preserving diffusion distance
Based on the MC on the graph, we define a diffusion distance between two vertices $i$ and $j$:
$$D_t(i, j)^2 = \sum_{k=1}^n \frac{\big(\Pr(X_t = k \mid X_0 = i) - \Pr(X_t = k \mid X_0 = j)\big)^2}{\pi_k}.$$
Intuitively, this diffusion distance measures the local connectivity between two vertices at a timescale prescribed by $t$: two vertices are close whenever random walks started from them induce similar distributions over the graph after $t$ steps.
Writing the conditional probabilities in terms of the entries of the $t$-step transition matrix $P^t$, we discover
$$D_t(i, j)^2 = \sum_{k=1}^n \frac{\big((P^t)_{ik} - (P^t)_{jk}\big)^2}{\pi_k}.$$
The above equation is a linear-algebraic way of computing the distance between transition probabilities. However, evaluating the term $(P^t)_{ik}$ as above is still inconvenient, as it involves the power of a matrix. In what follows, we show that the diffusion distance can be significantly simplified by using the eigendecomposition of $P$.
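Before simplifying, the direct matrix-power route can serve as a baseline. A sketch assuming NumPy (not part of the original derivation):

```python
import numpy as np

def diffusion_distance(P, pi, i, j, t):
    """Diffusion distance D_t(i, j), computed directly from the
    t-step transition matrix P^t (the route the text calls inconvenient)."""
    Pt = np.linalg.matrix_power(P, t)
    return np.sqrt(np.sum((Pt[i] - Pt[j]) ** 2 / pi))

# Toy usage: in a 3-state chain with uniform transitions, every vertex
# induces the same distribution, so all diffusion distances are zero.
P = np.full((3, 3), 1.0 / 3.0)
pi = np.full(3, 1.0 / 3.0)
print(diffusion_distance(P, pi, 0, 1, t=2))  # 0.0
```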
We proceed to write out the right and left eigenvectors of $P$ and examine their relationships. To this end, we first consider a “normalized” version of $P$, which has the form
$$S = D^{1/2} P D^{-1/2} = D^{-1/2} W D^{-1/2}.$$
One nice property of $S$ is that it is symmetric and positive semidefinite (PSD): for any $v \in \mathbb{R}^n$,
$$v^\top S v = v^\top D^{-1/2} W D^{-1/2} v = \big(D^{-1/2} v\big)^\top W \big(D^{-1/2} v\big) \ge 0,$$
where in the last step we used the fact that the kernel matrix $W$ is PSD (which holds, for example, for the radial basis kernel). Note that the original, unnormalized matrix $P$ is in general not symmetric, so this argument does not directly apply to $P$.
Since $S$ is symmetric and PSD, we may write
$$S = U \Lambda U^\top,$$
where $U$ is an orthogonal matrix whose columns are eigenvectors of $S$ and $\Lambda$ is a diagonal matrix whose entries are the non-negative eigenvalues of $S$. As a consequence, we have
$$P = D^{-1/2} S D^{1/2} = \big(D^{-1/2} U\big)\, \Lambda\, \big(D^{1/2} U\big)^\top.$$
By this identity, it is clear that the columns of $D^{-1/2} U$ are right eigenvectors of $P$ and the rows of $\big(D^{1/2} U\big)^\top$ (equivalently, the columns of $D^{1/2} U$) are left eigenvectors of $P$. For convenience, we write
$$\Psi = D^{-1/2} U, \qquad \Phi = D^{1/2} U,$$
and use $\psi_\ell$ and $\phi_\ell$ to denote the $\ell$-th right and left eigenvectors of $P$ (the $\ell$-th columns of $\Psi$ and $\Phi$), both associated to the eigenvalue $\lambda_\ell = \Lambda_{\ell\ell}$; we order the eigenvalues so that $1 = \lambda_0 \ge \lambda_1 \ge \cdots \ge \lambda_{n-1} \ge 0$. Since $\Psi = D^{-1/2} U$ and $\Phi = D^{1/2} U$, we see that $\psi_\ell$ and $\phi_\ell$ are linked through $D$:
$$\phi_\ell = D \psi_\ell.$$
Additionally, we note that the right and left eigenvectors are bi-orthogonal, $\Phi^\top \Psi = U^\top D^{1/2} D^{-1/2} U = I$, so the matrix power takes the simple form $P^t = \Psi \Lambda^t \Phi^\top$, i.e., $(P^t)_{ik} = \sum_{\ell=0}^{n-1} \lambda_\ell^t\, \psi_\ell(i)\, \phi_\ell(k)$. Moreover, the left eigenvectors satisfy
$$\sum_{k=1}^n \frac{\phi_\ell(k)\, \phi_{\ell'}(k)}{\pi_k} = \sum_{k=1}^n \frac{d_k\, U_{k\ell}\, U_{k\ell'}}{\pi_k}.$$
As remarked in the second section, there is an explicit expression for the stationary distribution over vertices: $\pi_k = d_k / \sum_{j=1}^n d_j$. Plugging this expression into the above equation, we get
$$\sum_{k=1}^n \frac{\phi_\ell(k)\, \phi_{\ell'}(k)}{\pi_k} = \Big(\sum_{j=1}^n d_j\Big) \sum_{k=1}^n U_{k\ell}\, U_{k\ell'} = \Big(\sum_{j=1}^n d_j\Big)\, \delta_{\ell\ell'},$$
where $\delta_{\ell\ell'}$ is the Kronecker delta and the last equality uses the orthonormality of the columns of $U$.
As a result, we have
$$\begin{aligned} D_t(i, j)^2 &= \sum_{k=1}^n \frac{1}{\pi_k}\big((P^t)_{ik} - (P^t)_{jk}\big)^2 \\ &= \sum_{k=1}^n \frac{1}{\pi_k}\Big(\sum_{\ell=0}^{n-1} \lambda_\ell^t\big(\psi_\ell(i) - \psi_\ell(j)\big)\,\phi_\ell(k)\Big)^2 \\ &= \Big(\sum_{j'=1}^n d_{j'}\Big) \sum_{\ell=0}^{n-1} \lambda_\ell^{2t}\big(\psi_\ell(i) - \psi_\ell(j)\big)^2 \\ &= \Big(\sum_{j'=1}^n d_{j'}\Big) \sum_{\ell=1}^{n-1} \lambda_\ell^{2t}\big(\psi_\ell(i) - \psi_\ell(j)\big)^2, \end{aligned}$$
where the second-last step expands the square and uses the orthogonality relation $\sum_{k} \phi_\ell(k)\phi_{\ell'}(k)/\pi_k = \big(\sum_{j'} d_{j'}\big)\delta_{\ell\ell'}$, and the last step uses the fact that $\psi_0$ is a constant eigenvector (it is proportional to the all-ones vector, since $P\mathbf{1} = \mathbf{1}$) and therefore the entrywise difference $\psi_0(i) - \psi_0(j)$ vanishes.
To summarize, we proved:
$$D_t(i, j)^2 = \Big(\sum_{j'=1}^n d_{j'}\Big) \sum_{\ell=1}^{n-1} \lambda_\ell^{2t}\big(\psi_\ell(i) - \psi_\ell(j)\big)^2. \tag{2}$$
Finally, we define the diffusion map with truncated order $k$ as
$$\Psi_t^{(k)}(x_i) = \big(\lambda_1^t\,\psi_1(i),\; \lambda_2^t\,\psi_2(i),\; \ldots,\; \lambda_k^t\,\psi_k(i)\big)^\top \in \mathbb{R}^k.$$
Then, through Equation (2), we see that if we choose $k = n - 1$, the Euclidean distance between the diffusion maps of two points $x_i$ and $x_j$ reconstructs the diffusion distance up to a scalar:
$$\big\|\Psi_t^{(n-1)}(x_i) - \Psi_t^{(n-1)}(x_j)\big\|^2 = \frac{D_t(i, j)^2}{\sum_{j'=1}^n d_{j'}}.$$
In the interest of reducing the dimensionality of the data, we may choose $k \ll n - 1$ and keep only the coordinates associated with the largest non-trivial eigenvalues. Since the discarded terms in Equation (2) are damped by the factors $\lambda_\ell^{2t} \le 1$, the truncated diffusion map still approximately preserves the diffusion distance, and the approximation improves as $t$ grows.
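The whole pipeline can be put together and Equation (2) verified numerically for the untruncated map. A sketch assuming NumPy (the data, bandwidth, timescale, and vertex pair are all illustrative):

```python
import numpy as np

# Illustrative data: n = 8 points in R^3 (not from the original post).
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
n, t = X.shape[0], 3

# Section 1: RBF adjacency matrix and degrees (bandwidth 1.0 is arbitrary).
W = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
d = W.sum(axis=1)

# Section 2: transition matrix P = D^{-1} W and stationary distribution.
P = W / d[:, None]
pi = d / d.sum()

# Section 3: eigendecomposition of S = D^{-1/2} W D^{-1/2}.
S = W / np.sqrt(np.outer(d, d))
lam, U = np.linalg.eigh(S)               # ascending order
order = np.argsort(lam)[::-1]            # reorder so lambda_0 = 1 comes first
lam, U = lam[order], U[:, order]
Psi = U / np.sqrt(d)[:, None]            # right eigenvectors of P (columns)

# Untruncated diffusion map: drop the constant coordinate psi_0.
emb = (lam[1:] ** t) * Psi[:, 1:]

# Verify Equation (2): embedding distance matches the diffusion distance
# computed directly from the t-step transition matrix.
Pt = np.linalg.matrix_power(P, t)
i, j = 0, 4
D_t = np.sqrt(np.sum((Pt[i] - Pt[j]) ** 2 / pi))
lhs = np.sqrt(d.sum()) * np.linalg.norm(emb[i] - emb[j])
assert np.isclose(lhs, D_t)
```

Truncating `emb` to its first few columns gives the low-dimensional representation; the dropped columns contribute only terms damped by $\lambda_\ell^{2t}$.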