PCA Preprocessing¶

PCA is a standard step in many RNA-seq analysis pipelines, where it serves as a denoising and dimensionality-reduction layer ahead of downstream tasks like clustering and visualization. Glass Box UMAP exposes a pca_components kwarg that folds this step into fit / transform, so raw features can be passed through directly.

This notebook walks through what pca_components does to feature contributions, and how to reproduce its behavior by hand if you’d rather control the PCA step yourself.

We’ll use the Bhattacharjee 2001 lung cancer microarray dataset for the demo. It’s hosted on OpenML so loading is a one-liner.

Warning

We do not take a position on whether PCA preprocessing is the right choice for your data. We simply recognize that it is a standard, field-specific step that many pipelines already include, and show how Glass Box UMAP fits into that workflow either way.

Loading the data¶

fetch_openml downloads and caches the dataset. Class labels come back as integer codes. We map them to the lung subtype names from the original paper, inferred by matching class sizes.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml

CLASS_NAME_BY_SIZE = {
    139: "Adenocarcinoma",
    21: "Squamous cell",
    20: "Carcinoid",
    17: "Normal lung",
    6: "Small cell",
}

lung = fetch_openml(data_id=45093, as_frame=True, parser="auto")
df = lung.frame
y_int = df["type"].astype(int).to_numpy()
counts = pd.Series(y_int).value_counts().to_dict()
int_to_name = {code: CLASS_NAME_BY_SIZE[n] for code, n in counts.items()}
labels = np.array([int_to_name[c] for c in y_int])

features = df.drop(columns="type")
feature_names = features.columns.tolist()
X = np.log2(np.maximum(features.to_numpy(dtype=np.float32), 1.0))

print(f"X shape: {X.shape}")
print(f"classes: {sorted(set(labels))}")

X shape: (203, 12600)
classes: ['Adenocarcinoma', 'Carcinoid', 'Normal lung', 'Small cell', 'Squamous cell']

Approach 1: let `pca_components` handle it¶

Set pca_components and pass the raw input straight in. Glass Box UMAP fits its own PCA inside fit, projects the input down, and trains the encoder on the reduced features.

from glass_box_umap import GlassBoxUMAP

reducer_a = GlassBoxUMAP(
    pca_components=50,
    random_state=42,
    quiet=True,
)
reducer_a.fit(X)
embedding_a = reducer_a.transform(X)
print(f"embedding shape: {embedding_a.shape}")

embedding shape: (203, 2)

Where do contributions live?¶

The encoder works in PCA space, but compute_contributions returns contributions in the original feature space. Glass Box UMAP projects the encoder’s Jacobian back through the PCA basis before forming the gradient-times-input product.

contributions_a = reducer_a.compute_contributions(X)
print(f"contributions shape: {contributions_a.shape}")

contributions shape: (203, 2, 12600)

We can use the bokeh plotting helper bundled with Glass Box UMAP to inspect the embedding interactively.

Install the plotting extras

glass_box_umap.plotting is an optional dependency, that can be installed like so:

pip install "glass-box-umap[plotting]"
# or
uv pip install "glass-box-umap[plotting]"

from glass_box_umap.plotting import output_notebook, plot_embedding, show

output_notebook(hide_banner=True)

show(
    plot_embedding(
        Z=embedding_a,
        contributions=contributions_a,
        group_names=labels,
        feature_names=feature_names,
        feature_values=X,
    )
)

Approach 2: do PCA yourself, then use `project_jacobian`¶

If you already PCA-reduce upstream of UMAP (a common shape for RNA-seq pipelines), you can skip pca_components and pass the reduced features in directly. The encoder is then trained on those features, and compute_contributions returns contributions in PCA component space instead of the original feature space.

from sklearn.decomposition import PCA

pca = PCA(n_components=50, random_state=42)
X_pca = pca.fit_transform(X).astype(np.float32)

reducer_b = GlassBoxUMAP(
    pca_components=None,
    random_state=42,
    quiet=True,
)
reducer_b.fit(X_pca)
embedding_b = reducer_b.transform(X_pca)

contributions_b = reducer_b.compute_contributions(X_pca)
print(f"X_pca shape: {X_pca.shape}")
print(f"contributions shape: {contributions_b.shape}")

X_pca shape: (203, 50)
contributions shape: (203, 2, 50)

To get back to the original feature space, replicate what pca_components does internally. The encoder was trained on mean-centered PCA features. Sklearn’s PCA mean-centers its input by construction, so X_pca is already near-zero-mean, and Glass Box UMAP applies its own centering pass on top of whatever you hand it. We mirror that here, subtracting X_pca.mean(axis=0) before computing the Jacobian so we’re querying the encoder at the same point it saw during training. From there, project_jacobian chains the PCA components matrix onto the resulting Jacobian, taking it back to the original feature space. Multiplying by the mean-centered input then gives feature-space contributions.

compute_jacobian / project_jacobian API

From the API docs:

compute_jacobian(model: Module, x: Tensor, batch_size: int = 1024) → Tensor[source]

Compute the Jacobian of a model using vmap + jacrev with functional_call.

Parameters:

model¶ -- Encoder network (will be deep-copied and set to eval mode).
x¶ -- Input tensor of shape (n, in_dim).
batch_size¶ -- Number of samples per Jacobian batch.

Returns:

Jacobian tensor of shape (n, out_dim, in_dim).

Return type:

Tensor

project_jacobian(jacobian: Tensor, proj_tensor: Tensor) → Tensor[source]

Map a Jacobian’s input axis through a linear projection.

Used to express a Jacobian computed in a reduced input space (e.g. PCA components) in terms of the original features, by right-multiplying with the projection matrix that maps reduced-space inputs back to original features.

Parameters:

jacobian¶ -- Jacobian tensor of shape (n, out_dim, in_dim_reduced).
proj_tensor¶ -- Projection matrix of shape (in_dim_reduced, in_dim_original), e.g. pca.components_.

Returns:

Jacobian of shape (n, out_dim, in_dim_original).

Return type:

Tensor

import torch
from glass_box_umap.jacobian import project_jacobian

X_pca_centered = torch.from_numpy(X_pca - X_pca.mean(axis=0)).to(reducer_b._device)
J_pca = reducer_b.compute_jacobian(X_pca_centered)

P = torch.tensor(pca.components_, dtype=torch.float32, device=reducer_b._device)
J_features = project_jacobian(J_pca, P)

X_centered = torch.from_numpy(X - pca.mean_).to(reducer_b._device).unsqueeze(1)
contributions_b_features = (J_features * X_centered).cpu().numpy()
print(f"recovered contributions shape: {contributions_b_features.shape}")

recovered contributions shape: (203, 2, 12600)

The recovered contributions are nearly identical to the ones approach 1 produced.

mean_abs_diff = np.mean(np.abs(contributions_a - contributions_b_features))
print(f"mean abs diff: {mean_abs_diff:.7f}")

mean abs diff: 0.0000182

Takeaways¶

pca_components is the convenient default. Pass raw features in, and compute_contributions returns contributions in the same raw feature space. If you’d rather run PCA yourself, Glass Box UMAP will train on the PCA features as is, and contributions will live in PCA component space. To get back to feature-space contributions in that case, project from PCA space yourself with compute_jacobian and project_jacobian. That’s the same recipe Glass Box UMAP runs internally when pca_components is set.