In this demo, we’ll be analyzing 5K-plex Xenium (10x Genomics) data on a tissue section from a human breast cancer biopsy, derived from here. These include the following elements:
gene \(\times\) cell transcript counts
(based on “cell_boundaries”)
one point object:
individual RNA transcripts
(across 5,101 genes)
Preamble
Dataset
A size-reduced example dataset has been deposited on Zenodo. We’ll retrieve the compressed SpatialData.zarr store together with a reference scRNA-seq dataset in .h5ad format (Janesick et al. 2023), decompress both archives, and stash their on-disk file paths for later:
Let’s start out by visualization the hematoxylin and eosin (H&E) staining. H&E provides a high-contrast visualization of tissue microarchitecture and cellular morphology through differential staining of nuclear and cytoplasmic components. To date, whole-slide H&E images (WSIs) remain the clinical gold standard for cancer diagnosis, including tumor grading, staging, and prognosis.
plotSpatialData() +plotImage(sd)
We can also zoom into a particular region by specifying a bounding box and using crop() to subset the data accordingly across all layers:
The underlying image is a 3D array where the first dimension are channels, and the second and third dimensions are height and width; we can verify this by viewing the axes() specified in the .zattrs associated with the image:
do.call(rbind, axes(img))
name type
[1,] "c" "channel"
[2,] "y" "space"
[3,] "x" "space"
cat(dim(img))
4 2111 1787
Let’s visualize the four-plex composite image overlaying these markers; for consistent visualization, we also specify colors to use for each channel:
Alternatively, we can visualize each channel separately; here, using the smaller region define above.
Note that plotImage() will auto-adjust contrast limits for each channel when rendering images. Channels might hence appear differently when visualized jointly/separately.
# plot channels one by onelapply(chs, \(.) {plotSpatialData() +ggtitle(.) +plotImage(sp, i=2, ch=., c=pal[.])}) |># arrange in 2x2 gridwrap_plots(nrow=2) &theme_void() &theme(legend.position="none",plot.title=element_text(hjust=0.5))
Segmentation
We can also zoom in further to inspect how well cell segmentation boundaries align with histopathology and the immunofluorescent signals used to get them:
# crop to 2,000 x 3,000 bounding boxbb <-list(xmin=15e3, xmax=17e3,ymin=42e3, ymax=45e3)sq <-crop(sd, bb)# visualize HE & IF side-by-sideplotSpatialData() +plotImage(sq, 1) +plotShape(sq, fill=NA, col="black", linewidth=1/5) +plotSpatialData() +plotImage(sq, 2, ch=chs, c=pal[chs]) +plotShape(sq, fill=NA, col="white", linewidth=1/5)
We can also go a step further, visualizing both cell nuclei (in gold) and membrane (in grey) boundaries togeher on an ever smaller region:
Going all the way, let’s zoom in to a couple cells in order to visualize 4,000 individual transcript molecules as well; we’ll color these by whether or not they have been determined to overlaps_nucleus (according to 10x):
Important plotPoint arguments:
key = feature(s) to plot (used further below)
n = number of points to plot (will downsample at random)
optional ggplot aesthetics, such as color, alpha, etc.
Visual inspection is essential. Systematic assessment of transcript-to-cell assignments would, however, require inspection of transcript localization in 3D across thousands of genes, which is infeasible for modern high-plex panels. SegTraQ provides a collection of metrics for Segmentation and Transcript Assignment Quality Control.
While SegTraQ is implemented in Python, reticulate and Quarto enable us to execute both programming languages in one session. We can thus compute QC metrics in Python and subsequently switch back to R.
Python setup
We can instruct our R session to use an existing Python environment with our required dependencies (here, SegTraQ) via reticulate. Below we demonstrate two common approaches:
Alternatively, we can use a local virtual environment created with uv (install via brew install uv) and point reticulate directly to the corresponding Python executable:
SegTraQ organizes metrics into six modules. Here, we begin with the baseline (bl) module, which computes a set of per-cell and -sample summary statistics such as transcript counts, detected genes, cell area, and gene-wise fraction of unassigned transcripts. Metrics can be computed either individually (e.g., st.bl.transcripts_per_cell()), or jointly by running st.run_baseline().
st = segtraq.SegTraQ(sdata, tables_centroid_x_key="centroid_x", tables_centroid_y_key="centroid_y", images_key="morphology_focus")st.run_baseline()
Of the six SegTraQ modules, only the supervised (sp) module requires scRNA-seq reference data. Label transfer can either be performed using a lightweight SegTraQ helper, or can be supplied by the user directly. Here, we use the scRNA-seq reference from Janesick et al. (2023) retrieved above and transfer labels to the spatial data using the SegTraQ helper.
We use this reference to derive positive and negative cell type markers – i.e., genes expected to be present or absent in a given cell type – and pass these to the sp module in order to identify potentially contaminated cells. For demonstration, we compute two metrics from the sp module.
# subset to common genesadata = st.sdata.tables["table"]common_genes = adata.var_names[adata.var_names.isin(ref.var_names)]ref = ref[:, common_genes].copy()# get marker genesmarkers = segtraq.markers_from_reference( ref, cell_type_key="celltype", n_jobs=16)# compute exemplary metrics_ = st.sp.neighbor_contamination( markers=markers, cell_type_key="transferred_cell_type")
[34mINFO [0m Creating graph using `generic` coordinates and `[3;35mNone[0m` transform and `[1;36m1[0m`
libraries.
In the absence of reference scRNA-seq data, SegTraQ provides the region similarity (rs) module that identifies contaminated cells based on the expression similarity between subcellular regions and their local neighborhood. For demonstration, we here view one exemplary metric from the rs module.
We can visualize the H&E image, the immunofluorescence channel DAPI, as well as cell boundaries colored by the transferred cell type label or SegTraQ metrics.
[33mWARNING [0m render_shapes: Found [1;36m3591[0m NaN values in color data. These observations
will be colored with the [32m'na_color'[0m.
After computing required QC metrics in Python, we’ll move back to R to explore and visualize these in more detail.
Back to R
Let’s retrieve the table – including SegTraQ results – from Python, and transfer it to the SpatialData object in R. We set the table to annotate cell boundaries using setTable(); here, ik specifies the spatialdata’s instance_key used to link observations between table and spatial element.
# set Python table to annotate # cell segmentation boundariesse <- py$adata$as_SingleCellExperiment()# remove old tablesd <- sd[c("images", "points", "shapes")](sd <-setTable(sd, y=se, i="cell_boundaries", ik="cell_id"))
Around 9.3% of transcripts were not assigned to any cell.
The mean transcript count per cell is 145.1.
The mean gene count per cell is 99.9.
It is also relevant to inspect the mean transcript count per gene per cell. We’ll define a wrapper .boxplot to plot features per cell type as boxplots:
Stromal cells show higher elongation, whereas T and B cells tend to have less elongated shapes.
On median, cells express only 1–2 transcripts per gene.
We can also inspect the transcript_density, i.e., transcript abundance realtive to cellular area. Some workflows (see below) recommend filtering cells based on transcript density. However, this can bias filtering towards specific cell types, such as macrophages, mast cells, or dendritic cells, which tend to be large but exhibit comparatively low transcript counts.
After inspecting baseline metrics, we next assess potential contamination. Here, SegTraQ summarizes directed contamination between source and target cell types based on negative marker expression.
NA values indicate pairs for which contamination can’t be assessed, either because the cell types are not spatial neighbors, or because they do not share informative positive/negative marker combinations.
The dataset contains only few tumor cells (about 0.53%), which would explain why little or no contamination from them is being detected.
About 19% of myoepithelial cells appear to be contaminated by DCIS2 cells, which are common spatial neighbors. 3% of T cells are contaminated by B cells.
For demonstration, let’s inspect mutually exclusive markers between B and T cells. Here, we recover well-known B cell markers such as MS4A1 and CD19, which are not expected to be expressed in T cells.
Let’s visualize the expression of MS4A1 in T and B cells. To this end, we crop the dataset to a region with a high density of these cell types. Visually, MS4A1 transcripts appear closer to the cell membrane in T cells than in B cells, which would be consistent with local transcript spillover.
# crop to 2,000 x 2,000 bounding boxbb <-list(xmin=14e3, xmax=16e3,ymin=45e3, ymax=47e3)sq <-crop(sd, bb)# visualize MS4A1 transcripts over labeled cellsplotSpatialData() +plotShape(sq, col=NA, fill="transferred_cell_type", alpha=0.8) +scale_fill_manual(values=pal, na.value="#808080") +plotPoint(sq, key="MS4A1", size=0.4) +ggtitle("MS4A1 transcripts in T/B cells")
Let’s evaluate this with SegTraQ, which provides the point statistics (ps) module to quantify subcellular transcript distributions within cells. E.g., we can compute transcript-to-membrane distances either across genes and cells, or for selected genes and cell types. For this, we briefly jump back into Python.
Briefly, st.ps.distance_to_membrane quantifies transcript localization relative to cellular boundaries (e.g., nucleus or membrane) by computing distances of all transcripts to their reference polygon, normalizing for cell area, and reporting the average boundary distance for each transcript species. (Note: negative distances reflect localization outside of a cell’s polygon.)
Let’s jump back to R and plot the distance of MS4A1 in B and T cells. Indeed, MS4A1 is closer to the cell membrane in T than in B cells. This aligns with these signals arising from spillover rather than true biological expression:
se <- py$adata$as_SingleCellExperiment()sd <-setTable(sd, y=se, i="cell_boundaries", ik="cell_id")
SegTraQ can support the identification and filtering of low-quality cells or samples based on morphological abnormalities, low transcript counts, or increased contamination. It can also be used to benchmark segmentation methods and help identify the most suitable segmentation strategy for a given dataset.
To explore additional SegTraQ metrics, please refer to the SegTraQ documentation. SegTraQ is still under active development; additional metrics and improvements will keep being introduced in future releases.
Filtering
The SpatialData object contains a single table element – represented as a SingleCellExperiment – annotating cell_boundaries. This table contains the single-cell level data: a gene \(\times\) cell count matrix (assay slot "X"), gene/cell metadata (row/colData), as well as spatial coordinates (reducedDim slot "spatial").
Because data are represented as a SCE, we can perform standard processing tasks using tools available in R/Bioc (e.g., scrapper) such as filtering little informative cells based on, e.g., transcript_density:
# exclude cells <2 MADs in log-counts per areacpa <-log(se$transcript_counts/se$cell_area)mad <-mad(cpa, na.rm=TRUE)med <-median(cpa, na.rm=TRUE)se$ex <-is.na(cpa) | (cpa < med-2*mad)
We’ll here filter out cells with comparatively low transcript density, and proceed with processing steps standard for scRNA-seq data, namely:
log-library size normalization
highly variable gene (HVG) selection
principal component analysis (PCA)
shared-nearest neighbor (SNN) graph construction
community detection using the Leiden algorithm
base::table(se$ex)
FALSE TRUE
16953 1673
se <- se[, !se$ex]assayNames(se) <-"counts"se <-normalizeRnaCounts.se(se)se <-chooseRnaHvgs.se(se, top=2e3)se <-runPca.se(se, features=rowData(se)$hvg)se <-clusterGraph.se(se, method="leiden", resolution=0.4)nk <-nlevels(se$clusters)table(sd) <- se
Let’s visualize the resulting cluster assignments spatially; from this, we can observe that malignant regions are mostly made up of a single (large) cluster:
A common step in spatial omics data analysis is to evaluate the colocalization of cell types. This can be done using a variety of tools; for dydactic purposes, we’ll do in manually here by i) using the RANN to identify nearest neighbors (NNs) in a radial neighborhood around each cell; and ii) quantifying the relative abundance of cell types around each cell type.
library(RANN)
For such type of analysis, it’s important to get a sense of how many neighbors are typically found within a given radius; this can help decide on a meaningful distance threshold to use for colocalization analysis.
Caution: RANN’s nn2() function provides options for "standard" and distance-thresholded kNN search (searchtype="radius"). In any case, k determines the maximum number of NNs to compute (even if there are more!). Although a k equal to the number of query observations allows an exhaustive search, it is typically set much lower in order to reduce computational effort.
# get spatial coordinatesxy <-centroids(SpatialData::transform(shape(sd)))# get cells in radial neighborhoodnn <-nn2(xy, searchtype="radius", r=r <-100, k=k <-51) is <- nn$nn.idx[, -1] # exclude selfns <-rowSums(is >0) # number of NNssummary(ns)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 9.00 13.00 15.38 23.00 46.00
Here, we find that cells have 0-46 neighbors (15 on average) within a radius of 100 units. This seems adequeate enough to capture local neighborhoods without being too large.
More about nearest neighbors
The number of NNs within a given distance (or, the distance yielding a given number of NNs) can vary greatly between tissue types. It is generally a good idea to perform a NNs search with a reasonably large k, and to inspect the number of NNs across cells (e.g., hist(rowSums(...$nn.idx > 0))). When data are large (millions of cells), this could use a few representative sections.
Then, k should be set such that this distribution tails off, i.e., there are few if any cells with more neighbors (or, r should be made smaller). Non-quantitatively speaking, we’d like to have enough NNs for quantification, but they should fall close enough to be meaningful (e.g., paracrine signaling).
hist(ns, breaks=seq(0, k+1)-0.5, col="steelblue", border="lavender", main="", xlab="# NNs within 100 units distance", ylab="# cells")
We can now quantify the relative subpopulation composition around each cell, and summarize this at the cluster level to get a sense of which cell types tend to attract/avoid one another.
# quantify cell type frequencies in neighborhoodis[is ==0] <-NAks <-factor(se$transferred_cell_type)mx <-matrix(ks[is], nrow=ncol(se))df <-lapply(levels(ks), \(k) { p <-rowMeans(mx == k, na.rm=TRUE)data.frame(i=ks, j=k, p)}) |>do.call(what=rbind)head(df)
i j p
1 T B 0.07692308
2 T B 0.06451613
3 dendritic B 0.07142857
4 T B 0.03448276
5 T B 0.11111111
6 T B 0.10344828
We can observe ~3 colocalization modules, namely: an immune module (T, B, DC), a tumor/stromal module (DCIS1/2, tumor, myoepithelia), and a stromal/myeloid module (mast, macrophages, endothelial -EC- and perivascular -PV- cells). Within this latter module, EC/PV are especially attraced to one another.
EC/PV attraction is consistent with the fact that EC line the interior of blood vessels, while PV (pericytes and smooth muscle cells) envelop them. Together, they form a functional unit for tissue homeostasis that provides stability and helps regulate blood vessle formation as well as permability.
Code
# score colocalization between cell types# (average & z-scale by target cell type)fd <- df |>filter(!is.na(i)) |>group_by(i, j) |>summarise_at("p", mean, na.rm=TRUE) |>group_by(j) |>mutate_at("p", base::scale)# hierarchical clustering of cell types # based on colocalization patternsmx <-pivot_wider(fd, names_from="i", values_from="p")hc <-hclust(dist(as.matrix(mx[, -1])))xo <- mx$j[hc$order]ggplot(fd, aes(i, j, fill=p)) +geom_tile() +scale_fill_gradient2(NULL, low="blue", high="red") +ggtitle(paste0("z-scaled frequency of X\n","in radial neighborhood Y")) +scale_y_discrete(limits=xo) +scale_x_discrete(limits=xo) +coord_equal(expand=FALSE) +theme_bw() +theme(axis.title=element_blank(),legend.key.height=unit(1, "lines"),legend.key.width=unit(.5, "lines"),plot.title=element_text(hjust=0.5),axis.text.x=element_text(angle=45, hjust=1))
Structures
The R/Bioconductor package sosta(Gunz, Crowell, and Robinson 2025) provides tools to identify and quantitatively characterize multi-cellular anatomical structures by modeling cells as a ‘marker’ point process, and delineating spatial structures based on a threshold on cellular ‘intensity’.
Structures could, for example, represent germinal centers (GC) in tonsil, metastases or tertiary lymphoid structures (TLS) in cancer, etc.
library(sosta)library(spatstat.geom)
We’ll start by construction a point pattern process (ppp) of cell centroids, using cluster assignments as marks. Note that we first transform boundaries to align with the coordinate space shared between all elements; alternatively, this could be done after reconstruction of spatial structures.
We can now use cellular density (more precisely, the ppp’s intensity) to identify multi-cellular anatomical substructures using sosta; we’ll then assign these to our object as a new shape element. Here, we based structure calling on malignant subpopulations:
DCIS = ductal carcinoma in situ is a very early form of breast cancer.
# reconstruct ROIs based on cellular densityroi <-reconstructShapeDensity(pp, markSelect=use, dim=400)st_geometry(roi) <-"geometry"roi <-SpatialDataShape(roi)shape(sd, "roi") <- roi
Next, we can use regions to aggregate cell-level measurements into pseudo-bulks using mask() (argument how specifies how to do that). Masking will generate an additional genes \(\times\) regions table in which cell metadata (colData) i/n_instances are the indices/number of cells contributing to each aggregate, and the first observation (column "0") captures data on out-of-region cells.
Structures comprise 1-1079 cells (115 on average), whereas 10860 cells have not been assigned to any region:
ns <- pb$n_instancesrange(ns[-1]) # per region
[1] 1 1079
mean(ns[-1]) # on average
[1] 114.9623
ns[1] # unassigned
[1] 10860
While sosta has identified 53 regions, we’ll filter out those comprising fewer than 100 cells in order to rid ourselves of spurious findings:
ncol(pb <- pb[, pb$n_instances >=100])
[1] 13
Since we have tracked the indices of cells falling within each region, we can also inspect the composition of these regions in terms of cell types; here, using transferred_cell_type labels rather than Leiden clusters:
In a multi-section setting, one could compare structure-level pseudobulks across sections, e.g., in terms of differential expression or cell type abundances (T cell exhaustion, immune infiltration, etc.).
In conclusing, we’ll briefly investigate the principle components (PCs) of region-level pseudobulks. This allows us to highlight features that drive transcriptional variability between them; note that these are necessarily also affected by compositional differences between regions in terms of cell types.
# library-size normalizationpb <-normalizeRnaCounts.se(pb, log=FALSE, assay.type="sum",output.name="normcounts")# principal component analysis (PCA)pc <-runPca(assay(pb, "normcounts"))# top-loaded features for PCs 1-3apply(pc$rotation[, 1:3], 2, \(.) { . <-sort(abs(.), decreasing=TRUE)head(names(.), 10)})
AI tells us “X-box binding protein 1 (XBP1) is a key transcription factor, activated by the IRE1/XBP1 branch of the Unfolded Protein Response (UPR), which promotes breast cancer development, progression, and resistance to therapy.”; so let’s believe that makes sense, be proud, and move on.
Gunz, Samuel, Helena Lucia Crowell, and Mark D Robinson. 2025. “Analysis of anatomical multi-cellular structures from spatial omics data using sosta.”bioRxiv, 2025.10.13.682065. https://doi.org/10.1101/2025.10.13.682065.
Janesick, Amanda, Robert Shelansky, Andrew D Gottscho, Florian Wagner, Stephen R Williams, Morgane Rouault, Ghezal Beliakoff, et al. 2023. “High resolution mapping of the tumor microenvironment using integrated single-cell, spatial and in situ analysis.”Nature Communications 14 (1): 8353. https://doi.org/10.1038/s41467-023-43458-x.