JL Lemma, a demo

draft

Imports and Setup

Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
Code
!sudo apt install cm-super dvipng texlive-latex-extra texlive-latex-recommended
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cm-super is already the newest version (0.3.4-17).
dvipng is already the newest version (1.15-1.1).
texlive-latex-extra is already the newest version (2021.20220204-1).
texlive-latex-recommended is already the newest version (2021.20220204-1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Code
plt.rcParams['text.usetex'] = True
plt.rcParams['figure.dpi'] = 200
plt.rcParams['text.latex.preamble'] = r"\usepackage{amsfonts} \usepackage{amsmath}"

To ensure reproducibility, we will create a NumPy random number generator with a fixed seed.

Code
rng = np.random.default_rng(seed = 1023)

JL From Scratch

In this section, we will start with a toy dataset and verify the claim made by the JL lemma. First, we state the lemma again:

Let \(\displaystyle D\) be a dataset of \(\displaystyle n\) points in \(\displaystyle \mathbb{R}^{d}\). For any \(\displaystyle \epsilon \in ( 0,1)\) and \(\displaystyle k\geqslant \frac{8\log( n)}{\epsilon ^{2}}\), there exists a linear map \(\displaystyle f:\mathbb{R}^{d}\rightarrow \mathbb{R}^{k}\), such that for any \(\displaystyle x,y\in D\):

\[ \begin{equation*} 1-\epsilon \leqslant \frac{||f( x) -f( y) ||^{2}}{||x-y||^{2}} \leqslant 1+\epsilon \end{equation*} \]

We start with a function to generate a data matrix \(X\) of shape \(d \times n\). Here, each element is sampled from a uniform distribution. Note that the JL lemma is agnostic to the nature of the data. The claim holds for any set of \(n\) points in \(\mathbb{R}^{d}\).

Code
def get_data(d, n):
    X = rng.uniform(size=(d, n))
    return X

Next, we have a function to compute the pair-wise squared distances between all the \(n\) points in the dataset. The resulting \(n \times n\) distance matrix is symmetric, with zeros on the diagonal since each diagonal entry measures the distance of a point from itself. We keep only the non-zero entries of its upper triangle, which leaves exactly one squared distance per pair of distinct points.

Code
def get_dist(X):
    # Pair-wise squared Euclidean distances between the columns of X
    dist = euclidean_distances(
        X.T,
        squared=True
    )
    # Keep one entry per pair of distinct points:
    # take the upper triangle and drop the zeros
    dist = np.triu(dist)
    nonzero = (dist != 0)
    return dist[nonzero]
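
As a quick sanity check (a small sketch with illustrative toy sizes), get_dist should return \(\binom{n}{2}\) entries, one squared distance per pair of distinct points:

Code
# Toy sizes, for illustration only
X_toy = get_data(d=50, n=10)
print(get_dist(X_toy).shape)  # expected (45,), since C(10, 2) = 45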

Set a value for \(\epsilon\), the quantity that controls the extent of distortion that we are willing to tolerate. Using \(\epsilon\) and \(n\), compute the lower bound for \(k\) as given by the JL lemma.

Code
def get_k_theory(n, eps):
    k = 8 * np.log(n) / (eps**2)
    return int(np.ceil(k))

We can now write the function that generates the random projection matrix. Its entries are sampled from \(N(0, 1)\) and the whole matrix is scaled by \(1/\sqrt{k}\), so that squared norms, and hence squared distances, are preserved in expectation:

Code
def get_matrix(k, d):
    R = rng.normal(
        size=(k,d)
    )
    return R / (k**0.5)
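
To see why the \(1/\sqrt{k}\) scaling is the natural choice, here is a small sketch (with illustrative sizes) checking that the squared norm of a projected vector matches the original in expectation:

Code
# Illustrative check: with entries drawn from N(0, 1) and scaling by 1/sqrt(k),
# E[||Rx||^2] = ||x||^2 for any fixed vector x
x = rng.uniform(size=(200, 1))
norms_sq = [np.linalg.norm(get_matrix(50, 200) @ x)**2 for _ in range(2_000)]
print(np.mean(norms_sq), np.linalg.norm(x)**2)  # the two values should be close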

With the matrix in place, we can perform the transformation.

Code
def transform(X, M):
    return M @ X

Next, we have a function that computes, for every pair of points, the ratio of the squared distance after the projection to the squared distance before it.

Code
def get_ratio(orig_dist, trans_dist):
    return trans_dist / orig_dist

The following function checks whether the distortions are within the allowed limits for all pairs of points in the dataset. It returns False if even a single pair of points suffers a distortion outside the given range, and True for a valid \(\epsilon\)-isometry.

Code
def is_eps_isometry(ratio_dist, eps):
    return (np.min(ratio_dist) >= 1 - eps and
            np.max(ratio_dist) <= 1 + eps)
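
A quick illustration on hypothetical ratio arrays (the numbers below are made up purely for the example):

Code
print(is_eps_isometry(np.array([0.9, 1.05, 1.15]), eps=0.2))  # True: all ratios lie in [0.8, 1.2]
print(is_eps_isometry(np.array([0.9, 1.25]), eps=0.2))        # False: 1.25 > 1.2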

All the functions we need are in place. We now search for an \(\epsilon\)-isometry: we keep drawing random matrices until one of them satisfies the distortion bounds for every pair of points.

Code
# Initialization
d, n = 10_000, 5_000
X = get_data(d, n)
orig_dist = get_dist(X)
eps = 0.2
k = get_k_theory(n, eps)

print(f'Number of data-points = {n}')
print(f'Dimension of parent space = {d}')
print(f'Distortion limit, eps = {eps}')
print(f'Dimension of projected space = {k}')
Number of data-points = 5000
Dimension of parent space = 10000
Distortion limit, eps = 0.2
Dimension of projected space = 1704
Code
# Computing the eps-isometry
ratio_dist = np.array([2, ])   # sentinel that fails the check, so the loop runs at least once
n_iter = 0
while not is_eps_isometry(ratio_dist, eps):
    M = get_matrix(k, d)              # draw a fresh random projection matrix
    X_t = transform(X, M)             # project the data down to k dimensions
    trans_dist = get_dist(X_t)        # pair-wise squared distances after projection
    ratio_dist = get_ratio(orig_dist, trans_dist)
    n_iter += 1

print(f'Approximate Isometry found after {n_iter} iterations!')
print(f'Maximum distortion = {ratio_dist.max()}')
print(f'Minimum distortion = {ratio_dist.min()}')
Approximate Isometry found after 2 iterations!
Maximum distortion = 1.1980556470711285
Minimum distortion = 0.823782054955422
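
For comparison, scikit-learn ships the same machinery: johnson_lindenstrauss_min_dim computes a minimum \(k\) (using a slightly different constant than the \(8\log(n)/\epsilon^{2}\) bound above), and GaussianRandomProjection draws and applies the random matrix. A minimal sketch; note that scikit-learn expects data of shape \(n \times d\), the transpose of our convention:

Code
from sklearn.random_projection import (
    GaussianRandomProjection,
    johnson_lindenstrauss_min_dim,
)

# Minimum k according to scikit-learn's version of the bound
print(johnson_lindenstrauss_min_dim(n_samples=n, eps=eps))

# Project the data; rows are samples in scikit-learn, so pass X.T
grp = GaussianRandomProjection(n_components=k, random_state=1023)
X_t_sk = grp.fit_transform(X.T)
print(X_t_sk.shape)  # (n, k)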

Effect of \(k\) on pair-wise distances

We will now turn to the effect of the projection dimension \(k\) on the pair-wise distances, this time using a sparse binary dataset. For the choice of \(n\) and \(\epsilon\) below, the dimension prescribed by the JL lemma turns out to exceed the original dimension \(d\), so we instead try a few smaller values of \(k\) and study the distortions empirically. For each \(k\), we plot the scatter of the pair-wise distances before and after the projection and, alongside it, the histogram of the ratio of these distances.

Code
plt.rcParams['figure.figsize'] = [8, 8]
Code
# Generate a sparse data-matrix
def get_sparse_data(d, n):
    return rng.choice(
        [0, 1],
        p=[0.8, 0.2],
        size=(d, n)
    )
Code
# Initialization
d, n = 2_000, 1000
X = get_sparse_data(d, n)
orig_dist = get_dist(X)
eps = 0.1
k = get_k_theory(n, eps)

print(f'Number of data-points = {n}')
print(f'Dimension of parent space = {d}')
print(f'Distortion limit, eps = {eps}')
print(f'Dimension of projected space = {k}')
Number of data-points = 1000
Dimension of parent space = 2000
Distortion limit, eps = 0.1
Dimension of projected space = 5527
Code
# Run experiment for different values of k
for i, k in enumerate([100, 400, 1000]):
    M = get_matrix(k, d)
    X_t = transform(X, M)
    orig_dist = get_dist(X)
    trans_dist = get_dist(X_t)
    scale = max(orig_dist.max(), trans_dist.max())
    ratio_dist = get_ratio(orig_dist, trans_dist)

    # Scatter of original-transformed pair-wise distances
    plt.subplot(3, 2, 2*i + 1)
    x_ = np.linspace(0, 1, 2)
    plt.plot(x_, x_, alpha=0.9, linestyle='--', color='black')
    plt.axhline(linestyle='--', color='black',
                alpha=0.5)
    plt.axvline(linestyle='--', color='black',
                alpha=0.5)
    plt.scatter(orig_dist/scale, trans_dist/scale,
                alpha=0.05)
    plt.xlabel('Original distances')
    plt.ylabel('Transformed distances')
    plt.axis('equal')
    plt.title(f'k = {k}')

    # Histogram of distance ratios
    plt.subplot(3, 2, 2*i + 2)
    plt.hist(ratio_dist,
             bins=list(np.linspace(0, 2, 20)),
             edgecolor='black')
    plt.title('Histogram of ' + r'$\cfrac{d_{\text{trans}}}{d_{\text{orig}}}$')
    plt.tight_layout()

A few quick checks on the last run: the maximum distortion, the number of distinct pairs (\(\binom{1000}{2} = 499500\)), and one of the original squared distances.

Code
ratio_dist.max()
np.float64(1.2226945370508673)
Code
orig_dist.shape
(499500,)
Code
orig_dist[0]
np.float64(653.0)

Visualizing random projections

We will now study the effect of multiplying a random matrix with a vector. To keep things simple, we will consider a \(2 \times 2\) matrix \(M\) whose entries are sampled from \(N(0, 1)\) and scaled by \(1/\sqrt{2}\) (exactly as in get_matrix with \(k = 2\)), and see what it does to the unit vector \(e = (1, 0)\). Since \(e\) is the first standard basis vector, \(Me\) is simply the first column of \(M\). We will sample a number of random matrices and see where \(Me\) lands.

Code
def mass_transform(m):
    proj = np.zeros((2, m))
    ang = np.zeros((m, ))
    for i in range(m):
        # Random 2x2 matrix with entries N(0, 1), scaled by 1/sqrt(2)
        M = rng.normal(size=(2,2))/(2**0.5)
        # Me is the first column of M, since e = (1, 0)
        proj[:, i] = M[:, 0]
        # Angle between e and Me, converted to degrees
        ang[i] = np.arccos(
            proj[:, i][0]/np.linalg.norm(proj[:, i])
        )*(360/(2*np.pi))
        # arccos returns values in [0, 180]; restore the sign from the y-coordinate
        if proj[:, i][1] < 0:
            ang[i] *= (-1)
    return proj, ang
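
As an aside, the same signed angle can be obtained more directly with np.arctan2; a minimal equivalent sketch:

Code
# Equivalent angle computation: arctan2 returns the signed angle in (-180, 180] degrees
def angle_deg(v):
    return np.degrees(np.arctan2(v[1], v[0]))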
Code
plt.rcParams['figure.figsize'] = [6, 6]
Code
# Distribution of Me
def plot_transforms():
    # plt.subplots(2, 2, constrained_layout=True)
    for i, m in enumerate([10, 100, 1_000, 10_000]):
        alpha = [0.5, 0.2, 0.1, 0.05]
        plt.subplot(2, 2, i + 1)
        th = np.linspace(0, 2*np.pi, 100)
        plt.plot(np.cos(th), np.sin(th),
                linestyle='--', color='black',
                alpha=0.5)
        plt.axhline(linestyle='--', color='black',
                    alpha=0.5)
        plt.axvline(linestyle='--', color='black',
                    alpha=0.5)
        plt.scatter([1], [0], color='red', alpha=1)

        proj, ang = mass_transform(m)
        plt.scatter(proj[0], proj[1], alpha=alpha[i])
        plt.xlim([-2, 2])
        plt.ylim([-2, 2])
        plt.title(f'{m} matrices')
    # plt.tight_layout()
plot_transforms()

We will now study the distribution of \(||Me||^2\) using 100,000 random samples. Since the entries of \(M\) are scaled by \(1/\sqrt{2}\), we expect \(\mathbb{E}\left[||Me||^2\right] = 1\). We will also look at how the angle between \(e\) and \(Me\) is distributed.

Code
m = 100_000
proj, ang = mass_transform(m)
Code
# Histogram of squared norm of Me
plt.rcParams['figure.figsize'] = [4, 4]
norm_sq = np.linalg.norm(proj, axis=0)**2
plt.hist(norm_sq, bins=100, edgecolor='black')
plt.title(f'{m} matrices, ' +
          r'$\mathbb{E}\left[||Me||^2 \right]$ ' +
          f'= {round(norm_sq.mean(), 3)}');
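
As an optional check (a small sketch based on the scaling above), \(||Me||^2\) is the sum of squares of two independent \(N(0, 1/2)\) variables, which follows an Exponential distribution with mean \(1\); we can overlay its density on a normalized histogram:

Code
# ||Me||^2 = Z1^2 + Z2^2 with Z_i ~ N(0, 1/2), i.e. (1/2) * chi-squared(2),
# which is the Exponential(1) distribution with density e^(-x)
x_ = np.linspace(0, norm_sq.max(), 200)
plt.hist(norm_sq, bins=100, density=True, edgecolor='black')
plt.plot(x_, np.exp(-x_), color='red')
plt.title(r'$||Me||^2$ against the $e^{-x}$ density');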

Code
# Histogram of angle between e and Me
plt.rcParams['figure.figsize'] = [4, 4]
plt.hist(ang, bins=30, edgecolor='black')
plt.title('Angle between $e$ and $Me$, in deg');

Interplay between \(\epsilon, n\) and \(k\)

Recall that \(k \geqslant \cfrac{8 \log(n)}{\epsilon^2}\). We will use this bound to see how these three quantities impact each other.

Code
# Plot of k_min vs n
plt.rcParams['figure.figsize'] = [5, 5]

# range of admissible distortions
eps_range = np.linspace(0.1, 0.99, 5)
colors = plt.cm.Blues(np.linspace(0.3, 1.0, len(eps_range)))

# range of number of samples (observation) to embed
n_samples_range = np.logspace(1, 9, 9)

for eps, color in zip(eps_range, colors):
    k_min = 8 * np.log(n_samples_range) / (eps**2)
    plt.loglog(n_samples_range, k_min, color=color)

plt.legend([f"eps = {eps:0.1f}" for eps in eps_range], loc="lower right")
plt.xlabel(r"$n$")
plt.ylabel(r"$k_{\text{min}}$")
plt.title(r"JL bounds: $k_{\text{min}}$ vs $n$");

Code
# range of admissible distortions
eps_range = np.linspace(0.01, 0.99, 100)

# range of number of samples (observation) to embed
n_range = np.logspace(2, 6, 5)
colors = plt.cm.Blues(np.linspace(0.3, 1.0, len(n_range)))

plt.figure()
for n, color in zip(n_range, colors):
    k_min = 8 * np.log(n) / (eps_range**2)
    plt.semilogy(eps_range, k_min, color=color)

plt.legend([r'$n$' + f' = {int(n)}' for n in n_range], loc="upper right")
plt.xlabel(r"$\epsilon$")
plt.ylabel(r"$k_{\text{min}}$")
plt.title(r"JL bounds: $k_{\text{min}}$ vs $\epsilon$")
plt.show()