In this demo, we’ll be analyzing 5K-plex Xenium (10x Genomics) data on a tissue section from a human breast cancer biopsy, derived from here. These include the following elements:
two images (histology and immunofluorescence)
two shapes (nuclei and membrane boundaries)
one table (gene \(\times\) cell data)
one point (RNA transcripts)
Preamble
Dataset
A size-reduced example dataset has been deposited on Zenodo. We’ll retrieve the compressed SpatialData.zarr store together with a reference scRNA-seq dataset in .h5ad format (Janesick et al. 2023), decompress both archives, and stash their on-disk file paths for later:
Let’s start out by visualization the hematoxylin and eosin (H&E) staining. H&E provides a high-contrast visualization of tissue microarchitecture and cellular morphology through differential staining of nuclear and cytoplasmic components. To date, whole-slide H&E images (WSIs) remain the clinical gold standard for cancer diagnosis, including tumor grading, staging, and prognosis.
plotSpatialData() +plotImage(sd)
We can also zoom into a particular region by specifying a bounding box and using crop() to subset the data accordingly across all layers:
The underlying image is a 3D array where the first dimension are channels, and the second and third dimensions are height and width; we can verify this by viewing the axes() specified in the .zattrs associated with the image:
do.call(rbind, axes(img))
name type
[1,] "c" "channel"
[2,] "y" "space"
[3,] "x" "space"
cat(dim(img))
4 2111 1787
Let’s visualize the four-plex composite image overlaying these markers; for consistent visualization, we also specify colors to use for each channel:
Alternatively, we can visualize each channel separately; here, using the smaller region define above.
Note that plotImage() will auto-adjust contrast limits for each channel when rendering images. Channels might hence appear differently when visualized jointly/separately.
# plot channels one by onelapply(chs, \(.) {plotSpatialData() +ggtitle(.) +plotImage(sp, i=2, ch=., c=pal[.])}) |># arrange in 2x2 gridwrap_plots(nrow=2) &theme_void() &theme(legend.position="none",plot.title=element_text(hjust=0.5))
Segmentation
We can also zoom in further to inspect how well cell segmentation boundaries align with histopathology and the immunofluorescent signals used to get them:
# crop to 2,000 x 3,000 bounding boxbb <-list(xmin=15e3, xmax=17e3,ymin=42e3, ymax=45e3)sq <-crop(sd, bb)# visualize HE & IF side-by-sideplotSpatialData() +plotImage(sq, 1) +plotShape(sq, fill=NA, col="black", linewidth=1/5) +plotSpatialData() +plotImage(sq, 2, ch=chs, c=pal[chs]) +plotShape(sq, fill=NA, col="white", linewidth=1/5)
We can also go a step further, visualizing both cell nuclei (in gold) and membrane (in grey) boundaries togeher on an ever smaller region:
Going all the way, let’s zoom in to a couple cells in order to visualize 4,000 individual transcript molecules as well; we’ll color these by whether or not they have been determined to overlaps_nucleus (according to 10x):
Visual inspection is essential. Systematic assessment of transcript-to-cell assignments would, however, require inspection of transcript localization in 3D across thousands of genes, which is infeasible for modern high-plex panels. SegTraQ provides a collection of metrics for Segmentation and Transcript Assignment Quality Control.
While SegTraQ is implemented in Python, reticulate and Quarto enable us to execute both programming languages in one session. We can thus compute QC metrics in Python and subsequently switch back to R.
Python setup
We can instruct our R session to use an existing Python environment with our required dependencies (here, SegTraQ) via reticulate. Below we demonstrate two common approaches:
Alternatively, we can use a local virtual environment created with uv (install via brew install uv) and point reticulate directly to the corresponding Python executable:
SegTraQ organizes metrics into six modules. Here, we begin with the baseline (bl) module, which computes a set of per-cell and -sample summary statistics such as transcript counts, detected genes, cell area, and gene-wise fraction of unassigned transcripts. Metrics can be computed either individually (e.g., st.bl.transcripts_per_cell()), or jointly by running the full bl module with st.run_baseline().
st = segtraq.SegTraQ(sdata, tables_centroid_x_key="centroid_x", tables_centroid_y_key="centroid_y", images_key="morphology_focus")st.run_baseline()
Of the six SegTraQ modules, only the supervised (sp) module requires scRNA-seq reference data. Label transfer can either be performed using a lightweight SegTraQ helper, or can be supplied by the user directly. Here, we use the scRNA-seq reference from Janesick et al. (2023) retrieved above and transfer labels to the spatial data using the SegTraQ helper.
We use this reference to derive positive and negative cell type markers – i.e., genes expected to be present or absent in a given cell type – and pass these to the sp module in order to identify potentially contaminated cells. For demonstration, we compute two metrics from the sp module.
# subset to common genesadata = st.sdata.tables["table"]common_genes = adata.var_names[adata.var_names.isin(ref.var_names)]ref = ref[:, common_genes].copy()# get marker genesmarkers = segtraq.markers_from_reference( ref, cell_type_key="celltype", n_jobs=16)# compute exemplary metrics_ = st.sp.neighbor_contamination( markers=markers, cell_type_key="transferred_cell_type")
[34mINFO [0m Creating graph using `generic` coordinates and `[3;35mNone[0m` transform and `[1;36m1[0m`
libraries.
In the absence of reference scRNA-seq data, SegTraQ provides the region similarity (rs) module that identifies contaminated cells based on the expression similarity between subcellular regions and their local neighborhood. For demonstration, we here view one exemplary metric from the rs module.
We can visualize the H&E image, the immunofluorescence channel DAPI, as well as cell boundaries colored by the transferred cell type label or SegTraQ metrics.
[33mWARNING [0m render_shapes: Found [1;36m3591[0m NaN values in color data. These observations
will be colored with the [32m'na_color'[0m.
After computing required QC metrics in Python, we’ll move back to R to explore and visualize these in more detail.
Back to R
Let’s retrieve the table – including SegTraQ results – from Python, and transfer it to the SpatialData object in R. We set the table to annotate cell boundaries using setTable(); here, ik specifies the spatialdata’s instance_key used to link observations between table and spatial element.
# set Python table to annotate # cell segmentation boundariesse <- py$adata$as_SingleCellExperiment()# remove old tablesd <- sd[c("images", "points", "shapes")](sd <-setTable(sd, y=se, i="cell_boundaries", ik="cell_id"))
Around 9.3% of transcripts were not assigned to any cell.
The mean transcript count per cell is 145.1.
The mean gene count per cell is 99.9.
It is also relevant to inspect the mean transcript count per gene per cell. We’ll define a wrapper .boxplot to plot features per cell type as boxplots:
Stromal cells show higher elongation, whereas T and B cells tend to have less elongated shapes.
On median, cells express only 1–2 transcripts per gene.
We can also inspect the transcript_density, i.e., transcript abundance realtive to cellular area. Some workflows (see below) recommend filtering cells based on transcript density. However, this can bias filtering towards specific cell types, such as macrophages, mast cells, or dendritic cells, which tend to be large but exhibit comparatively low transcript counts.
After inspecting baseline metrics, we next assess potential contamination. Here, SegTraQ summarizes directed contamination between source and target cell types based on negative marker expression.
NA values indicate pairs for which contamination can’t be assessed, either because the cell types are not spatial neighbors, or because they do not share informative positive/negative marker combinations.
The dataset contains only few tumor cells (about 0.53%), which would explain why little or no contamination from them is being detected.
About 19% of myoepithelial cells appear to be contaminated by DCIS2 cells, which are common spatial neighbors. 3% of T cells are contaminated by B cells.
For demonstration, let’s inspect mutually exclusive markers between B and T cells. Here, we recover well-known B cell markers such as MS4A1 and CD19, which are not expected to be expressed in T cells.
Let’s visualize the expression of MS4A1 in T and B cells. To this end, we crop the dataset to a region with a high density of these cell types. Visually, MS4A1 transcripts appear closer to the cell membrane in T cells than in B cells, which would be consistent with local transcript spillover.
# crop to 2,000 x 2,000 bounding boxbb <-list(xmin=14e3, xmax=16e3,ymin=45e3, ymax=47e3)sq <-crop(sd, bb)# visualize MS4A1 transcripts over labeled cellsplotSpatialData() +plotShape(sq, col=NA, fill="transferred_cell_type", alpha=0.8) +scale_fill_manual(values=pal, na.value="#808080") +plotPoint(sq, key="MS4A1", size=0.4) +ggtitle("MS4A1 transcripts in T/B cells")
Let’s evaluate this with SegTraQ, which provides the point statistics (ps) module to quantify subcellular transcript distributions within cells. For example, we can compute transcript distances to the cell membrane either across all genes and cells or for selected genes and cell types. For this, let’s briefly jump back into Python.
Let’s jump back to R and plot the distance of MS4A1 in B and T cells. Indeed, MS4A1 is closer to the cell membrane in T than in B cells. This aligns with these signals arising from spillover rather than true biological expression:
se <- py$adata$as_SingleCellExperiment()sd <-setTable(sd, y=se, i="cell_boundaries", ik="cell_id")
SegTraQ can support the identification and filtering of low-quality cells or samples based on morphological abnormalities, low transcript counts, or increased contamination. It can also be used to benchmark segmentation methods and help identify the most suitable segmentation strategy for a given dataset.
To explore additional SegTraQ metrics, please refer to the SegTraQ documentation. SegTraQ is still under active development; additional metrics and improvements will keep being introduced in future releases.
Processing
The SpatialData object contains a single table element – represented as a SingleCellExperiment – annotating cell_boundaries. This table contains the single-cell level data: a gene \(\times\) cell count matrix (assay slot "X"), gene/cell metadata (row/colData), as well as spatial coordinates (reducedDim slot "spatial").
Because data are represented as a SCE, we can perform standard processing tasks using tools available in R/Bioc (e.g., scrapper) such as filtering little informative cells based on, e.g., transcript_density:
# exclude cells <2 MADs in log-counts per areacpa <-log(se$transcript_counts/se$cell_area)mad <-mad(cpa, na.rm=TRUE)med <-median(cpa, na.rm=TRUE)se$ex <-is.na(cpa) | (cpa < med-2*mad)
We’ll here filter out cells with comparatively low transcript density, and proceed with processing steps standard for scRNA-seq data, namely:
log-library size normalization
highly variable gene (HVG) selection
principal component analysis (PCA)
shared-nearest neighbor (SNN) graph construction
community detection using the Leiden algorithm
base::table(se$ex)
FALSE TRUE
16953 1673
se <- se[, !se$ex]assayNames(se) <-"counts"se <-normalizeRnaCounts.se(se)se <-chooseRnaHvgs.se(se)se <-runPca.se(se, features=rowData(se)$hvg)se <-clusterGraph.se(se, method="leiden", resolution=0.5)nk <-nlevels(se$clusters)table(sd) <- se
Let’s visualize the resulting cluster assignments spatially; from this, we can observe that malignant regions are mostly made up of a single (large) cluster:
The R/Bioconductor package sosta(Gunz, Crowell, and Robinson 2025) provides tools to identify and quantitatively characterize multi-cellular anatomical structures by modeling cells as a ‘marker’ point process, and delineating spatial structures based on a threshold on cellular ‘intensity’.
Structures could, for example, represent germinal centers (GC) in tonsil, metastases or tertiary lymphoid structures (TLS) in cancer, etc.
library(sosta)library(spatstat.geom)
We’ll start by construction a point pattern process (ppp) of cell centroids, using cluster assignments as marks. Note that we first transform boundaries to align with the coordinate space shared between all elements; alternatively, this could be done after reconstruction of spatial structures.
We can now use cellular density (more precisely, the ppp’s intensity) to identify multi-cellular anatomical substructures using sosta; we’ll then assign these to our object as a new shape element:
use <-which.max(table(se$clusters))roi <-reconstructShapeDensity(pp, markSelect=use, dim=400)st_geometry(roi) <-"geometry"roi <-SpatialDataShape(roi)shape(sd, "roi") <- roi
Next, we can use regions to aggregate cell-level measurements into pseudo-bulks using mask() (argument how specifies how to do that). Masking will generate an additional genes \(\times\) regions table in which cell metadata (colData) i/n_instances are the indices/number of cells contributing to each aggregate, and the first observation (column "0") captures data on out-of-region cells.
Structures comprise 1-932 cells (144 on average), whereas 12058 cells have not been assigned to any region:
ns <- pb$n_instancesrange(ns[-1]) # per region
[1] 1 932
mean(ns[-1]) # on average
[1] 143.9706
ns[1] # unassigned
[1] 12058
While sosta has identified 34 regions, we’ll filter those that comprise at least 10 cells in order to rid ourselves of spurious findings:
pb <- pb[, pb$n_instances >=10]
Since we have tracked the indices of cells falling within each region, we can also inspect the composition of these regions in terms of cell types; here, using transferred_cell_type labels rather than Leiden clusters:
In a multi-section setting, one could for example compare structure-level pseudobulks across sections, in terms of differential expression or abundance of cell types (T cell exhaustion, immune infiltration, etc.).
AI tells us “X-box binding protein 1 (XBP1) is a key transcription factor, activated by the IRE1/XBP1 branch of the Unfolded Protein Response (UPR), which promotes breast cancer development, progression, and resistance to therapy.”; so let’s believe that and be proud.
Gunz, Samuel, Helena Lucia Crowell, and Mark D Robinson. 2025. “Analysis of anatomical multi-cellular structures from spatial omics data using sosta.”bioRxiv, 2025.10.13.682065. https://doi.org/10.1101/2025.10.13.682065.
Janesick, Amanda, Robert Shelansky, Andrew D Gottscho, Florian Wagner, Stephen R Williams, Morgane Rouault, Ghezal Beliakoff, et al. 2023. “High resolution mapping of the tumor microenvironment using integrated single-cell, spatial and in situ analysis.”Nature Communications 14 (1): 8353. https://doi.org/10.1038/s41467-023-43458-x.