{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# PCA Preprocessing\n", "\n", "PCA is a standard step in many RNA-seq analysis pipelines, where it serves as a denoising and dimensionality-reduction layer ahead of downstream tasks like clustering and visualization. Glass Box UMAP exposes a `pca_components` kwarg that folds this step into `fit` / `transform`, so raw features can be passed through directly.\n", "\n", "This notebook walks through what `pca_components` does to feature contributions, and how to reproduce its behavior by hand if you'd rather control the PCA step yourself.\n", "\n", "We'll use the [Bhattacharjee 2001 lung cancer microarray dataset](https://www.pnas.org/doi/abs/10.1073/pnas.191502998) for the demo. It's hosted on OpenML so loading is a one-liner.\n", "\n", ":::{warning}\n", "We do not take a position on whether PCA preprocessing is the right choice for your data. We simply recognize that it is a standard, field-specific step that many pipelines already include, and show how Glass Box UMAP fits into that workflow either way.\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the data\n", "\n", "`fetch_openml` downloads and caches the dataset. Class labels come back as integer codes. We map them to the lung subtype names from the original paper, inferred by matching class sizes." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2026-05-11T22:17:21.685189Z", "iopub.status.busy": "2026-05-11T22:17:21.685110Z", "iopub.status.idle": "2026-05-11T22:17:22.698903Z", "shell.execute_reply": "2026-05-11T22:17:22.698430Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X shape: (203, 12600)\n", "classes: ['Adenocarcinoma', 'Carcinoid', 'Normal lung', 'Small cell', 'Squamous cell']\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.datasets import fetch_openml\n", "\n", "CLASS_NAME_BY_SIZE = {\n", " 139: \"Adenocarcinoma\",\n", " 21: \"Squamous cell\",\n", " 20: \"Carcinoid\",\n", " 17: \"Normal lung\",\n", " 6: \"Small cell\",\n", "}\n", "\n", "lung = fetch_openml(data_id=45093, as_frame=True, parser=\"auto\")\n", "df = lung.frame\n", "y_int = df[\"type\"].astype(int).to_numpy()\n", "counts = pd.Series(y_int).value_counts().to_dict()\n", "int_to_name = {code: CLASS_NAME_BY_SIZE[n] for code, n in counts.items()}\n", "labels = np.array([int_to_name[c] for c in y_int])\n", "\n", "features = df.drop(columns=\"type\")\n", "feature_names = features.columns.tolist()\n", "X = np.log2(np.maximum(features.to_numpy(dtype=np.float32), 1.0))\n", "\n", "print(f\"X shape: {X.shape}\")\n", "print(f\"classes: {sorted(set(labels))}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Approach 1: let `pca_components` handle it\n", "\n", "Set `pca_components` and pass the raw input straight in. Glass Box UMAP fits its own PCA inside `fit`, projects the input down, and trains the encoder on the reduced features." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2026-05-11T22:17:22.700238Z", "iopub.status.busy": "2026-05-11T22:17:22.700137Z", "iopub.status.idle": "2026-05-11T22:17:37.899981Z", "shell.execute_reply": "2026-05-11T22:17:37.899582Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "embedding shape: (203, 2)\n" ] } ], "source": [ "from glass_box_umap import GlassBoxUMAP\n", "\n", "reducer_a = GlassBoxUMAP(\n", " pca_components=50,\n", " random_state=42,\n", " quiet=True,\n", ")\n", "reducer_a.fit(X)\n", "embedding_a = reducer_a.transform(X)\n", "print(f\"embedding shape: {embedding_a.shape}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Where do contributions live?\n", "\n", "The encoder works in PCA space, but `compute_contributions` returns contributions in the **original feature space**. Glass Box UMAP projects the encoder's Jacobian back through the PCA basis before forming the gradient-times-input product." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2026-05-11T22:17:37.901418Z", "iopub.status.busy": "2026-05-11T22:17:37.901312Z", "iopub.status.idle": "2026-05-11T22:17:37.997047Z", "shell.execute_reply": "2026-05-11T22:17:37.996688Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "contributions shape: (203, 2, 12600)\n" ] } ], "source": [ "contributions_a = reducer_a.compute_contributions(X)\n", "print(f\"contributions shape: {contributions_a.shape}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the Bokeh plotting helper bundled with Glass Box UMAP to inspect the embedding interactively.\n", "\n", ":::{admonition} Install the plotting extras\n", ":class: tip dropdown\n", "\n", "`glass_box_umap.plotting` relies on optional dependencies, which can be installed like so:\n", "\n", "```bash\n", "pip install \"glass-box-umap[plotting]\"\n", "# or\n", "uv pip install \"glass-box-umap[plotting]\"\n", "```\n", ":::" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2026-05-11T22:17:37.998379Z", "iopub.status.busy": "2026-05-11T22:17:37.998290Z", "iopub.status.idle": "2026-05-11T22:17:38.315185Z", "shell.execute_reply": "2026-05-11T22:17:38.314808Z" } }, "outputs": [ { "data": { "application/javascript": [ "'use strict';\n", "(function(root) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " const force = true;\n", "\n", " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", " root._bokeh_onload_callbacks = [];\n", " root._bokeh_is_loading = undefined;\n", " }\n", "\n", "const JS_MIME_TYPE = 'application/javascript';\n", " const HTML_MIME_TYPE = 'text/html';\n", " const EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", " const CLASS_NAME = 'output_bokeh rendered_html';\n", "\n", " /**\n", " * Render data to the DOM node\n", " */\n", " function render(props, 
node) {\n", " const script = document.createElement(\"script\");\n", " node.appendChild(script);\n", " }\n", "\n", " /**\n", " * Handle when an output is cleared or removed\n", " */\n", " function handleClearOutput(event, handle) {\n", " function drop(id) {\n", " const view = Bokeh.index.get_by_id(id)\n", " if (view != null) {\n", " view.model.document.clear()\n", " Bokeh.index.delete(view)\n", " }\n", " }\n", "\n", " const cell = handle.cell;\n", "\n", " const id = cell.output_area._bokeh_element_id;\n", " const server_id = cell.output_area._bokeh_server_id;\n", "\n", " // Clean up Bokeh references\n", " if (id != null) {\n", " drop(id)\n", " }\n", "\n", " if (server_id !== undefined) {\n", " // Clean up Bokeh references\n", " const cmd_clean = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", " cell.notebook.kernel.execute(cmd_clean, {\n", " iopub: {\n", " output: function(msg) {\n", " const id = msg.content.text.trim()\n", " drop(id)\n", " }\n", " }\n", " });\n", " // Destroy server and session\n", " const cmd_destroy = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", " cell.notebook.kernel.execute(cmd_destroy);\n", " }\n", " }\n", "\n", " /**\n", " * Handle when a new output is added\n", " */\n", " function handleAddOutput(event, handle) {\n", " const output_area = handle.output_area;\n", " const output = handle.output;\n", "\n", " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", " if ((output.output_type != \"display_data\") || (!Object.prototype.hasOwnProperty.call(output.data, EXEC_MIME_TYPE))) {\n", " return\n", " }\n", "\n", " const toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", "\n", " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", " // store reference to embed id on 
output_area\n", " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", " }\n", " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", " const bk_div = document.createElement(\"div\");\n", " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", " const script_attrs = bk_div.children[0].attributes;\n", " for (let i = 0; i < script_attrs.length; i++) {\n", " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", " toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n", " }\n", " // store reference to server id on output_area\n", " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", " }\n", " }\n", "\n", " function register_renderer(events, OutputArea) {\n", "\n", " function append_mime(data, metadata, element) {\n", " // create a DOM node to render to\n", " const toinsert = this.create_output_subarea(\n", " metadata,\n", " CLASS_NAME,\n", " EXEC_MIME_TYPE\n", " );\n", " this.keyboard_manager.register_events(toinsert);\n", " // Render to node\n", " const props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", " render(props, toinsert[toinsert.length - 1]);\n", " element.append(toinsert);\n", " return toinsert\n", " }\n", "\n", " /* Handle when an output is cleared or removed */\n", " events.on('clear_output.CodeCell', handleClearOutput);\n", " events.on('delete.Cell', handleClearOutput);\n", "\n", " /* Handle when a new output is added */\n", " events.on('output_added.OutputArea', handleAddOutput);\n", "\n", " /**\n", " * Register the mime type and append_mime function with output_area\n", " */\n", " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", " /* Is output safe? 
*/\n", " safe: true,\n", " /* Index of renderer in `output_area.display_order` */\n", " index: 0\n", " });\n", " }\n", "\n", " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", " if (root.Jupyter !== undefined) {\n", " const events = require('base/js/events');\n", " const OutputArea = require('notebook/js/outputarea').OutputArea;\n", "\n", " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", " register_renderer(events, OutputArea);\n", " }\n", " }\n", " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", " root._bokeh_timeout = Date.now() + 5000;\n", " root._bokeh_failed_load = false;\n", " }\n", "\n", " const NB_LOAD_WARNING = {'data': {'text/html':\n", " \"
\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"\\n\"+\n",
" \"\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"
\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"