glass_box_umap

Overview

Classes

GlassBoxUMAP

Glass Box UMAP model.

ParametricUMAP

Parametric UMAP model.

Classes

class GlassBoxUMAP(*, n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', n_components: int = 2, negative_sample_rate: int = 5, repulsion_strength: float = 1.0, pca_components: int | None = None, encoder_name: str = 'default', encoder_kwargs: dict[str, Any] = dict(), lr: float = 0.001, epochs: int = 200, batch_size: int = 10000, num_batches: int | None = None, num_workers: int = 0, checkpoint_dir: Path | None = None, restore_best_weights: bool = True, random_state: int | None = None, quiet: bool = False, extra_callbacks: list[pl.Callback] = list())

Glass Box UMAP model.

Base Classes:

ParametricUMAP

Attributes:

n_neighbors

Number of nearest neighbors used to construct the high-dimensional graph.

min_dist

Minimum distance between points in the low-dimensional embedding.

metric

Distance metric used for computing nearest neighbors.

n_components

Dimensionality of the learned embedding.

random_state

Random seed for reproducibility. If None, no seed is set.

encoder_kwargs

Additional keyword arguments passed to the encoder constructor.

pca_components

Number of PCA components for input preprocessing. If None, no PCA is applied. PCA requires 2D input (n_samples, n_features); leave this None when fitting on multi-dimensional data (e.g. images for a convolutional encoder).

lr

Learning rate for the optimizer.

epochs

Number of training epochs.

batch_size

Batch size for training and (default) inference.

negative_sample_rate

Number of negative samples per positive edge in the UMAP loss.

repulsion_strength

Weighting of the repulsive term in the UMAP loss.

num_workers

Number of data loading workers.

checkpoint_dir

Directory for saving training checkpoints. If None, a temporary directory is used.

Methods:

compute_contributions(X: NDArray[floating] | Tensor, batch_size: int | None = None, reduction: Literal['l2'] | None = None) -> NDArray[float32]

Compute per-feature contributions to the embedding via Gradient × Input.

Projects gradients back to raw feature space if PCA preprocessing was used.

Parameters:
  • X : NDArray[floating] | Tensor

    The input data (same format as passed to fit/transform). Shape: (n_samples, n_features).

  • batch_size : int | None

    Batch size for Jacobian computation. Defaults to self.batch_size.

  • reduction : Literal['l2'] | None

    How to reduce contributions across embedding dimensions. If "l2", takes the L2 norm across components, returning shape (n_samples, n_features). If None, returns the full (n_samples, n_components, n_features) array.

Returns:

Feature contributions array. Shape is (n_samples, n_components, n_features) when reduction is None, or (n_samples, n_features) when a reduction is applied.

Return type:

NDArray[float32]
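The shapes and the "l2" reduction described above can be illustrated with a small NumPy sketch. It does not call the library: the linear "encoder" `W`, the stand-in PCA axes `pcs`, and the choice to multiply by the uncentered input are all assumptions made for illustration, not the package's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_pcs, n_components = 6, 5, 3, 2

X = rng.normal(size=(n_samples, n_features)).astype(np.float32)

# Hypothetical PCA axes: orthonormal rows, shape (n_pcs, n_features).
q, _ = np.linalg.qr(rng.normal(size=(n_features, n_pcs)))
pcs = q.T.astype(np.float32)

# Hypothetical linear encoder from PCA space to the embedding.
W = rng.normal(size=(n_components, n_pcs)).astype(np.float32)

# Jacobian w.r.t. PCA coordinates; for a linear encoder it equals W for every sample.
jac_pca = np.broadcast_to(W, (n_samples, n_components, n_pcs))

# Chain rule: project the Jacobian back to raw feature space.
jac_raw = jac_pca @ pcs  # (n_samples, n_components, n_features)

# Gradient x Input: scale each partial derivative by the matching input feature.
contrib = jac_raw * X[:, None, :]  # (n_samples, n_components, n_features)

# reduction="l2": L2 norm across the embedding components.
contrib_l2 = np.linalg.norm(contrib, axis=1)  # (n_samples, n_features)
```

With `reduction=None` the result corresponds to `contrib`; with `reduction="l2"` it corresponds to `contrib_l2`.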

compute_jacobian(x: Tensor, batch_size: int = 1024) -> Tensor

Compute the Jacobian of a model using vmap + jacrev with functional_call.

See glass_box_umap.jacobian.compute_jacobian() for details.

Return type:

Tensor
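The vmap + jacrev + functional_call pattern named above can be sketched with `torch.func` as follows. This is an illustration of the general technique, not the body of `glass_box_umap.jacobian.compute_jacobian()`; the helper name `batched_jacobian` and the toy encoder are hypothetical.

```python
import torch
from torch.func import functional_call, jacrev, vmap


def batched_jacobian(model, x, batch_size=1024):
    """Per-sample Jacobian of model(x_i) w.r.t. x_i, computed in chunks."""
    # Pass parameters and buffers explicitly so the model call is a pure function.
    state = {**dict(model.named_parameters()), **dict(model.named_buffers())}
    state = {name: tensor.detach() for name, tensor in state.items()}

    def single(x_i):
        # The model expects a batch dimension; add it, then drop it again.
        return functional_call(model, state, (x_i.unsqueeze(0),)).squeeze(0)

    per_sample = vmap(jacrev(single))
    # Chunking bounds peak memory; each chunk yields (chunk, n_out, n_in).
    chunks = [per_sample(chunk) for chunk in x.split(batch_size)]
    return torch.cat(chunks, dim=0)


torch.manual_seed(0)
encoder = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.Tanh(), torch.nn.Linear(8, 2)
)
x = torch.randn(10, 4)
jac = batched_jacobian(encoder, x, batch_size=4)  # shape (10, 2, 4)
```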

class ParametricUMAP(*, n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', n_components: int = 2, negative_sample_rate: int = 5, repulsion_strength: float = 1.0, pca_components: int | None = None, encoder_name: str = 'default', encoder_kwargs: dict[str, Any] = dict(), lr: float = 0.001, epochs: int = 200, batch_size: int = 10000, num_batches: int | None = None, num_workers: int = 0, checkpoint_dir: Path | None = None, restore_best_weights: bool = True, random_state: int | None = None, quiet: bool = False, extra_callbacks: list[Callback] = list())

Parametric UMAP model.

Attributes:

n_neighbors : int

Number of nearest neighbors used to construct the high-dimensional graph.

min_dist : float

Minimum distance between points in the low-dimensional embedding.

metric : str

Distance metric used for computing nearest neighbors.

n_components : int

Dimensionality of the learned embedding.

negative_sample_rate : int

Number of negative samples per positive edge in the UMAP loss.

repulsion_strength : float

Weighting of the repulsive term in the UMAP loss.

pca_components : int | None

Number of PCA components for input preprocessing. If None, no PCA is applied. PCA requires 2D input (n_samples, n_features); leave this None when fitting on multi-dimensional data (e.g. images for a convolutional encoder).

encoder_name : str

Name of the registered encoder architecture.

encoder_kwargs : dict[str, Any]

Additional keyword arguments passed to the encoder constructor.

lr : float

Learning rate for the optimizer.

epochs : int

Number of training epochs.

batch_size : int

Batch size for training and (default) inference.

num_batches : int | None

Cap the number of batches per epoch. Useful for large graphs where a full pass would be prohibitively long. If None, trains on all batches.

num_workers : int

Number of data loading workers.

checkpoint_dir : Path | None

Directory for saving training checkpoints. If None, a temporary directory is used.

restore_best_weights : bool

If True, restore the model weights from the epoch with the lowest loss after training. If False, keep the weights from the final epoch.

random_state : int | None

Random seed for reproducibility. If None, no seed is set.

quiet : bool

If True, suppress Lightning logs and progress output.

extra_callbacks : list[pl.Callback]

Additional Lightning callbacks to attach to the trainer.
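The semantics of `num_batches` and `restore_best_weights` described above can be pictured with a toy training loop. This is a pure-Python sketch under stated assumptions: the function `toy_fit`, the one-parameter model, and the data are hypothetical and unrelated to the Lightning-based trainer the library uses.

```python
from itertools import islice


def toy_fit(batches, epochs, lr, num_batches=None, restore_best_weights=True):
    """Fit y = w * x by SGD; returns (w, batches consumed in the last epoch)."""
    w = 0.0
    best_w, best_loss = w, float("inf")
    seen = 0
    for _ in range(epochs):
        # num_batches caps how many batches one epoch consumes (None = all).
        epoch_batches = list(islice(batches, num_batches))
        seen = len(epoch_batches)
        total = 0.0
        for x, y in epoch_batches:
            err = w * x - y
            total += err * err
            w -= lr * 2 * err * x  # gradient step on the squared error
        epoch_loss = total / seen
        # Remember the weights from the epoch with the lowest mean loss.
        if epoch_loss < best_loss:
            best_loss, best_w = epoch_loss, w
    return (best_w if restore_best_weights else w), seen


data = [(1.0, 3.0)] * 4
# A deliberately oversized learning rate makes the loss grow each epoch,
# so the best and final weights differ.
w_best, _ = toy_fit(data, epochs=3, lr=1.1, restore_best_weights=True)
w_last, _ = toy_fit(data, epochs=3, lr=1.1, restore_best_weights=False)
_, seen_capped = toy_fit(data, epochs=1, lr=0.1, num_batches=2)
```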

Methods:

to(device: str | device) -> Self

Move the model (if initialized) and update the target device.

Return type:

Self

fit(X: NDArray[floating] | Tensor) -> Self

Return type:

Self

transform(X: NDArray[floating] | Tensor, batch_size: int | None = None) -> NDArray[floating]

Return type:

NDArray[floating]

fit_transform(X: NDArray[floating] | Tensor) -> NDArray[floating]

Return type:

NDArray[floating]

save(path: Path) -> None

classmethod load(path: Path) -> Self

Return type:

Self
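A possible fit, save, load, transform round trip, based only on the signatures listed above, might look like the sketch below. The helper name `embed_and_reload` is hypothetical, the on-disk format expected by `save` is not described here, and the import path `glass_box_umap.ParametricUMAP` is assumed from the module overview.

```python
from pathlib import Path

import numpy as np


def embed_and_reload(X: np.ndarray, path: Path) -> np.ndarray:
    """Fit on X, save the model, reload it, and embed X with the reloaded copy."""
    # Imported lazily so the sketch can be read without the package installed.
    from glass_box_umap import ParametricUMAP

    # All arguments are keyword-only, per the constructor signature.
    model = ParametricUMAP(n_components=2, epochs=10, random_state=0, quiet=True)
    model.fit(X)
    model.save(path)

    # load is a classmethod and returns a new instance.
    restored = ParametricUMAP.load(path)
    return restored.transform(X)
```

Calling it requires the package and a writable path, for example `embed_and_reload(X, Path("model_checkpoint"))`, where the file name is a placeholder.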