CSAMA

In this demo, we’ll be analyzing 5K-plex Xenium (10x Genomics) data on a tissue section from a human breast cancer biopsy, derived from here. These include the following elements:

  • two images (histology and immunofluorescence)
  • two shapes (nuclei and membrane boundaries)
  • one table (gene \(\times\) cell data)
  • one point (RNA transcripts)

Preamble

Dataset

A size-reduced example dataset has been deposited on Zenodo. We’ll retrieve the compressed SpatialData .zarr store together with a reference scRNA-seq dataset in .h5ad format (Janesick et al. 2023), decompress both archives, and stash their on-disk file paths for later:

# scRNA-seq data (.h5ad)
url_h5 <- file.path(
    "https://zenodo.org/records/20178278/files",
    "BC_scRNAseq_subset.h5ad.zip?download=1")

# retrieve
zip_h5 <- tempfile(fileext=".h5ad.zip")
download.file(url_h5, zip_h5, quiet=TRUE)

# decompress
fnm_h5 <- unzip(zip_h5, exdir=dirname(zip_h5))
h5 <- grep("h5ad$", fnm_h5, value=TRUE)
cat(basename(h5))
BC_scRNAseq_subset.h5ad
# SpatialData store (.zarr)
url_zs <- file.path(
    "https://zenodo.org/records/20180722/files",
    "BC_10x_xenium_sdata.zarr.zip?download=1")

# retrieve
zip_zs <- tempfile(fileext=".zarr.zip")
download.file(url_zs, zip_zs, quiet=TRUE)

# decompress
fnm_zs <- unzip(zip_zs, exdir=dirname(zip_zs))
zs <- grep("zarr$", dirname(fnm_zs), value=TRUE)
cat(basename(zs))
BC_10x_xenium_sdata.zarr
├── images
│   ├── he_image
│   │   ├── s0
│   │   ├── s1
│   │   ├── s2
│   │   ├── s3
│   │   └── zarr.json
│   ├── morphology_focus
│   │   ├── s0
│   │   ├── s1
│   │   ├── s2
│   │   ├── s3
│   │   └── zarr.json
│   └── zarr.json
├── points
│   ├── transcripts
│   │   ├── points.parquet
│   │   └── zarr.json
│   └── zarr.json
├── shapes
│   ├── cell_boundaries
│   │   ├── shapes.parquet
│   │   └── zarr.json
│   ├── nucleus_boundaries
│   │   ├── shapes.parquet
│   │   └── zarr.json
│   └── zarr.json
├── tables
│   ├── table
│   │   ├── X
│   │   ├── layers
│   │   ├── obs
│   │   ├── obsm
│   │   ├── obsp
│   │   ├── raw
│   │   ├── uns
│   │   ├── var
│   │   ├── varm
│   │   ├── varp
│   │   └── zarr.json
│   └── zarr.json
└── zarr.json

Libraries

Throughout this demo, we’ll require the following R dependencies; these expect R v4.6.0 and Bioconductor v3.24 (devel/pre-release).

library(sf)
library(dplyr)
library(tidyr)
library(ggplot2)
library(anndataR)
library(scrapper)
library(patchwork)
library(ggnewscale)
library(reticulate)
library(SpatialData)
library(SpatialData.data)
library(SpatialData.plot)
library(SingleCellExperiment)

Visualization

Let’s load up the SpatialData object in R.

(sd <- readSpatialData(zs))
class: SpatialData
- images(2):
  - he_image (3,2141,1822)
  - morphology_focus (4,2111,1787)
- labels(0):
- points(1):
  - transcripts (2978451)
- shapes(2):
  - cell_boundaries (18626,circle)
  - nucleus_boundaries (18389,circle)
- tables(1):
  - table (5101,18626) [cell_boundaries]
coordinate systems(1):
- global(5): he_image morphology_focus cell_boundaries
  nucleus_boundaries transcripts

Histopathology

Let’s start out by visualization the hematoxylin and eosin (H&E) staining. H&E provides a high-contrast visualization of tissue microarchitecture and cellular morphology through differential staining of nuclear and cytoplasmic components. To date, whole-slide H&E images (WSIs) remain the clinical gold standard for cancer diagnosis, including tumor grading, staging, and prognosis.

plotSpatialData() + plotImage(sd)

We can also zoom into a particular region by specifying a bounding box and using crop() to subset the data accordingly across all layers:

bb <- list(
    xmin=15e3, xmax=18e3, 
    ymin=44e3, ymax=46e3)
sp <- crop(sd, bb)
plotSpatialData() + plotImage(sp) 

Immunofluorescence

In addition to the H&E, the dataset includes 4 immunofluorescence (IF) markers:

  • DAPI (nuclei) marks all cell nuclei and is used for cell segmentation
  • ATP1A1/CD45/E-Cadherin (membrane) marks cell boundaries and major lineages
  • 18S (cytoplasm) marks the cell body/cytoplasmic RNA and overall cell extent
  • αSMA/Vimentin (stroma) marks mesenchymal/stromal tissue, including fibroblasts

These channels are part of the 10x segmentation kit and are used in the multimodal cell segmentation workflow.
img <- image(sd, 2)
as.vector(channels(img))
[1] "DAPI"                   "ATP1A1/CD45/E-Cadherin" "18S"                   
[4] "AlphaSMA/Vimentin"     

The underlying image is a 3D array where the first dimension are channels, and the second and third dimensions are height and width; we can verify this by viewing the axes() specified in the .zattrs associated with the image:

do.call(rbind, axes(img))
     name type     
[1,] "c"  "channel"
[2,] "y"  "space"  
[3,] "x"  "space"  
cat(dim(img))
4 2111 1787

Let’s visualize the four-plex composite image overlaying these markers; for consistent visualization, we also specify colors to use for each channel:

pal <- c("blue", "green", "cyan", "magenta")
chs <- names(pal) <- channels(img)
plotSpatialData() + plotImage(sd, 2, ch=chs, c=pal[chs])

Alternatively, we can visualize each channel separately; here, using the smaller region define above.

Note that plotImage() will auto-adjust contrast limits for each channel when rendering images. Channels might hence appear differently when visualized jointly/separately.
# plot channels one by one
lapply(chs, \(.) {
    plotSpatialData() + ggtitle(.) +
    plotImage(sp, i=2, ch=., c=pal[.])
}) |>
# arrange in 2x2 grid
wrap_plots(nrow=2) &
    theme_void() & theme(
        legend.position="none",
        plot.title=element_text(hjust=0.5))

Segmentation

We can also zoom in further to inspect how well cell segmentation boundaries align with histopathology and the immunofluorescent signals used to get them:

# crop to 2,000 x 3,000 bounding box
bb <- list(
    xmin=15e3, xmax=17e3,
    ymin=42e3, ymax=45e3)
sq <- crop(sd, bb)
# visualize HE & IF side-by-side
plotSpatialData() + 
    plotImage(sq, 1) +
    plotShape(sq, fill=NA, col="black", linewidth=1/5) +
plotSpatialData() + 
    plotImage(sq, 2, ch=chs, c=pal[chs]) +
    plotShape(sq, fill=NA, col="white", linewidth=1/5)

We can also go a step further, visualizing both cell nuclei (in gold) and membrane (in grey) boundaries togeher on an ever smaller region:

# crop to 1,000 x 1,000 bounding box
bb <- list(
    xmin=15e3, xmax=16e3,
    ymin=43e3, ymax=44e3)
sq <- crop(sd, bb)
# visualize IF, membrane & nuclei boundaries
plotSpatialData() + 
    plotImage(sq, 2, ch=chs, c=pal[chs]) +
    plotShape(sq, 1, fill=NA, col="grey", linewidth=1/3) +
    plotShape(sq, 2, fill=NA, col="gold", linewidth=1/3)

Transcripts

Going all the way, let’s zoom in to a couple cells in order to visualize 4,000 individual transcript molecules as well; we’ll color these by whether or not they have been determined to overlaps_nucleus (according to 10x):

# crop to 600 x 600 bounding box
bb <- list(
    xmin=15200, xmax=15800,
    ymin=43200, ymax=43800)
sq <- crop(sd, bb)

# 'overlaps_nucleus' is 0/1 (no/yes);
# convert to logical for discrete coloring
point(sq) <- mutate(point(sq), overlaps_nucleus=as.logical(overlaps_nucleus))

# visualiye IF, boundaries & transcripts
plotSpatialData() + 
    plotImage(sq, 2, ch=chs, c=pal[chs]) +
    plotShape(sq, 1, fill=NA, col="grey", linewidth=1/3) +
    plotShape(sq, 2, fill=NA, col="gold", linewidth=1/3) +
    plotPoint(sq, n=4e3, col="overlaps_nucleus", size=0) +
    guides(col=guide_legend(override.aes=list(size=2))) +
    scale_color_manual(values=c("grey", "red"))

Going back to a larger scale, let’s finish with an aesthetically pleasing plot:

Code
bb <- list(xmin=16e3, xmax=18e3, ymin=45e3, ymax=46e3)
sq <- crop(sd[c("images", "points"), list(2, 1)], bb)
plotSpatialData() + 
    plotImage(sq, ch="DAPI", c="purple") +
    plotPoint(sq, n=8e3, col="ivory", size=0) +
    theme_void() + theme(legend.position="none")

Quality Control

Visual inspection is essential. Systematic assessment of transcript-to-cell assignments would, however, require inspection of transcript localization in 3D across thousands of genes, which is infeasible for modern high-plex panels. SegTraQ provides a collection of metrics for Segmentation and Transcript Assignment Quality Control.

While SegTraQ is implemented in Python, reticulate and Quarto enable us to execute both programming languages in one session. We can thus compute QC metrics in Python and subsequently switch back to R.

Python setup

We can instruct our R session to use an existing Python environment with our required dependencies (here, SegTraQ) via reticulate. Below we demonstrate two common approaches:

  • a conda/mamba environment
  • a lightweight uv virtual environment

We can instruct reticulate to use an existing conda/mamba environment via:

# path to conda binary
bin <- "~/software/mambaforge/bin/conda" 
options(reticulate.conda_binary=bin)

# activate environment
use_condaenv("segtraq")

This environment was created previously from the command line via:

conda create -n segtraq python=3.11
conda activate segtraq
pip install segtraq

Alternatively, we can use a local virtual environment created with uv (install via brew install uv) and point reticulate directly to the corresponding Python executable:

venv_path <- file.path(getwd(), ".venv/bin/python")
use_python(venv_path, required=TRUE)

This environment was created previously from the command line via:

uv venv .venv --python 3.11
source .venv/bin/activate
uv pip install segtraq

Once configured, required Python modules can be imported using Python syntax:

import segtraq
import anndata as ad
import spatialdata_plot
import spatialdata as sd
import matplotlib.pyplot as plt

Running SegTraQ

We first load the SpatialData object from the .zarr store in Python:

sdata = sd.read_zarr(r.zs)
sdata
SpatialData object, with associated Zarr store: /private/var/folders/6b/rqd50fp900g27n53gnfr_2800000gn/T/RtmpYplW2d/BC_10x_xenium_sdata.zarr
├── Images
│     ├── 'he_image': DataTree[cyx] (3, 2141, 1822), (3, 1070, 911), (3, 535, 455), (3, 267, 227)
│     └── 'morphology_focus': DataTree[cyx] (4, 2111, 1787), (4, 1055, 893), (4, 527, 446), (4, 263, 223)
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 13) (3D points)
├── Shapes
│     ├── 'cell_boundaries': GeoDataFrame shape: (18626, 1) (2D shapes)
│     └── 'nucleus_boundaries': GeoDataFrame shape: (18389, 1) (2D shapes)
└── Tables
      └── 'table': AnnData (18626, 5101)
with coordinate systems:
    ▸ 'global', with elements:
        he_image (Images), morphology_focus (Images), transcripts (Points), cell_boundaries (Shapes), nucleus_boundaries (Shapes)

SegTraQ organizes metrics into six modules. Here, we begin with the baseline (bl) module, which computes a set of per-cell and -sample summary statistics such as transcript counts, detected genes, cell area, and gene-wise fraction of unassigned transcripts. Metrics can be computed either individually (e.g., st.bl.transcripts_per_cell()), or jointly by running the full bl module with st.run_baseline().

st = segtraq.SegTraQ(sdata,
    tables_centroid_x_key="centroid_x",
    tables_centroid_y_key="centroid_y",
    images_key="morphology_focus")

st.run_baseline()

Of the six SegTraQ modules, only the supervised (sp) module requires scRNA-seq reference data. Label transfer can either be performed using a lightweight SegTraQ helper, or can be supplied by the user directly. Here, we use the scRNA-seq reference from Janesick et al. (2023) retrieved above and transfer labels to the spatial data using the SegTraQ helper.

ref = ad.read_h5ad(r.h5)

st.run_label_transfer(
    ref, 
    ref_cell_type="celltype", 
    ref_ensemble_key=None, 
    query_ensemble_key=None)

We use this reference to derive positive and negative cell type markers – i.e., genes expected to be present or absent in a given cell type – and pass these to the sp module in order to identify potentially contaminated cells. For demonstration, we compute two metrics from the sp module.

# subset to common genes
adata = st.sdata.tables["table"]
common_genes = adata.var_names[adata.var_names.isin(ref.var_names)]
ref = ref[:, common_genes].copy()

# get marker genes
markers = segtraq.markers_from_reference(
    ref,
    cell_type_key="celltype",
    n_jobs=16)

# compute exemplary metrics
_ = st.sp.neighbor_contamination(
    markers=markers, 
    cell_type_key="transferred_cell_type")
INFO     Creating graph using `generic` coordinates and `None` transform and `1`
         libraries.                                                             

_ = st.sp.marker_purity(
    markers=markers,
    cell_type_key="transferred_cell_type")

In the absence of reference scRNA-seq data, SegTraQ provides the region similarity (rs) module that identifies contaminated cells based on the expression similarity between subcellular regions and their local neighborhood. For demonstration, we here view one exemplary metric from the rs module.

st.rs.similarity_border_neighborhood().head()
      cell_id  similarity_border_neighborhood
0  acceakgj-1                        0.103752
1  accebped-1                        0.127066
2  accedpdh-1                        0.043554
3  acceejoe-1                        0.124630
4  acceekkh-1                        0.126224

We can visualize the H&E image, the immunofluorescence channel DAPI, as well as cell boundaries colored by the transferred cell type label or SegTraQ metrics.

Code
# define color map
col_celltype = {
    "T": "#fb8072",
    "B": "#bc80bd",
    "macro": "#910290",
    "dendritic": "#fdb462",
    "mast": "#959059",
    "perivas": "#fed9a6",
    "endo": "#a6cee3",
    "myoepi": "#2782bb",
    "DCIS1": "#3c7761",
    "DCIS2": "#66a61e",
    "tumor": "#66c2a5",
    "stromal": "#d45943",
    "nan": "#808080"
}

# compute per-cell color
labels = sdata.tables["table"].obs["transferred_cell_type"].unique().astype(str).tolist()
cols = [col_celltype[lab] for lab in labels]

fg, ax = plt.subplots(2, 3, figsize=(14, 8), constrained_layout=True)
fg.subplots_adjust(wspace=0.5)

# plot histopathology
st.sdata.pl.render_images(
    "he_image"
).pl.show(
    ax=ax[0, 0], 
    title="H&E")

# plot immunofluorescence
st.sdata.pl.render_images(
    "morphology_focus", 
    channel="DAPI"
).pl.show(
    ax=ax[0, 1], 
    title="DAPI")

# plot cell boundaries colored by cell type
st.sdata.pl.render_shapes(
    element="cell_boundaries",
    color="transferred_cell_type",
    palette=cols,
    groups=labels,
    method="matplotlib"
).pl.show(
    ax=ax[0, 2], 
    colorbar=True,
    title="Cell type")

# plot cell boundaries colored by elongation
st.sdata.pl.render_shapes(
    element="cell_boundaries",
    color="elongation",
    method="matplotlib"
).pl.show(
    ax=ax[1, 0], 
    colorbar=True,
    title="Elongation")

# plot cell boundaries colored by transcript count
st.sdata.pl.render_shapes(
    element="cell_boundaries",
    color="transcript_count",
    method="matplotlib"
).pl.show(
    ax=ax[1, 1], 
    colorbar=True,
    title="Transcript count")
    
# plot cell boundaries colored by negative marker counts
st.sdata.pl.render_shapes(
    element="cell_boundaries",
    color="negative_marker_contamination_counts",
    method="matplotlib"
).pl.show(
    ax=ax[1, 2], 
    colorbar=True,
    title="Negative marker (contamination) count")
WARNING  render_shapes: Found 3591 NaN values in color data. These observations 
         will be colored with the 'na_color'.                                   

After computing required QC metrics in Python, we’ll move back to R to explore and visualize these in more detail.

Back to R

Let’s retrieve the table – including SegTraQ results – from Python, and transfer it to the SpatialData object in R. We set the table to annotate cell boundaries using setTable(); here, ik specifies the spatialdata’s instance_key used to link observations between table and spatial element.

# set Python table to annotate 
# cell segmentation boundaries
se <- py$adata$as_SingleCellExperiment()

# remove old table
sd <- sd[c("images", "points", "shapes")]
(sd <- setTable(sd, y=se, 
    i="cell_boundaries", 
    ik="cell_id"))
class: SpatialData
- images(2):
  - he_image (3,2141,1822)
  - morphology_focus (4,2111,1787)
- labels(0):
- points(1):
  - transcripts (2978451)
- shapes(2):
  - cell_boundaries (18626,circle)
  - nucleus_boundaries (18389,circle)
- tables(1):
  - cell_boundaries_table (5101,18626) [cell_boundaries]
coordinate systems(1):
- global(5): he_image morphology_focus cell_boundaries
  nucleus_boundaries transcripts

Let’s being by recreating the Python spatialdata_plot from above in R; we can also retrieve the palette used for coloring cell types in Python:

# get color mapping used in Python
pal <- py$col_celltype
Code
# base plot wrapper
.p <- \(x, y=FALSE) {
    plotSpatialData() + ggtitle(x) +
    theme(axis.title=element_blank()) +
    if (y) scale_fill_viridis_c(NULL, na.value="#808080")
}

p1 <- .p("H&E") + plotImage(sd)
p2 <- .p("DAPI") + plotImage(sd, 2, ch="DAPI", c="yellow")

p3 <- .p("cell type") + 
    plotShape(sd, col=NA, fill="transferred_cell_type") +
    scale_fill_manual(values=pal, na.value="#808080")

p4 <- .p("elongation", TRUE) + plotShape(sd, col=NA, fill="elongation")
p5 <- .p("# transcripts", TRUE) + plotShape(sd, col=NA, fill="transcript_count")
p6 <- .p("contamination", TRUE) + plotShape(sd, col=NA, fill="negative_marker_contamination_counts")

ps <- list(p1, p2, p3, p4, p5, p6)

wrap_plots(ps, nrow=2) & theme(
    legend.justification=c(0, 0.5),
    plot.margin=margin(2,2, unit="mm"),
    plot.title=element_text(hjust=0.5),
    axis.text.x=element_text(angle=30, hjust=1))

Now, let’s look at a summary of baseline metrics:

Code
se <- table(sd)

n_gs <- mean(se$gene_count, na.rm=TRUE)
n_tx <- mean(se$transcript_count, na.rm=TRUE)
p_tx <- metadata(se)$perc_unassigned_transcripts

data.frame(
    metric=c("# cells", "μ(# genes)", "μ(# transcripts)", "% unassigned"),
    value=round(c(ncol(se), n_gs, n_tx, p_tx), 2))
            metric    value
1          # cells 18626.00
2       μ(# genes)    99.91
3 μ(# transcripts)   145.09
4     % unassigned     9.27
  • Around 9.3% of transcripts were not assigned to any cell.
  • The mean transcript count per cell is 145.1.
  • The mean gene count per cell is 99.9.

It is also relevant to inspect the mean transcript count per gene per cell. We’ll define a wrapper .boxplot to plot features per cell type as boxplots:

.boxplot <- \(y) {
    df <- colData(se) |>
        as.data.frame() |>
        # simplify naming
        dplyr::rename(x=transferred_cell_type) |>
        # exclude unassigned
        filter(!is.na(x), !is.na(.data[[y]])) |>
        # order ascendingly
        mutate(x=reorder(x, .data[[y]], FUN=median, na.rm=TRUE))
    # visualize 'y' across cell types
    ggplot(df, aes(.data[[y]], x, fill=x)) +
        geom_boxplot(outlier.size=0.4) +
        scale_fill_manual(values=pal) +
        ggtitle(y) + theme_bw() + theme(
            aspect.ratio=1,
            legend.position="none",
            axis.title=element_blank(),
            plot.title=element_text(hjust=0.5))
}
  • Stromal cells show higher elongation, whereas T and B cells tend to have less elongated shapes.
  • On median, cells express only 1–2 transcripts per gene.
  • We can also inspect the transcript_density, i.e., transcript abundance realtive to cellular area. Some workflows (see below) recommend filtering cells based on transcript density. However, this can bias filtering towards specific cell types, such as macrophages, mast cells, or dendritic cells, which tend to be large but exhibit comparatively low transcript counts.
.boxplot("mean_transcripts_per_gene") +
.boxplot("elongation") +
.boxplot("transcript_density") 

After inspecting baseline metrics, we next assess potential contamination. Here, SegTraQ summarizes directed contamination between source and target cell types based on negative marker expression.

  • NA values indicate pairs for which contamination can’t be assessed, either because the cell types are not spatial neighbors, or because they do not share informative positive/negative marker combinations.

  • The dataset contains only few tumor cells (about 0.53%), which would explain why little or no contamination from them is being detected.

  • About 19% of myoepithelial cells appear to be contaminated by DCIS2 cells, which are common spatial neighbors. 3% of T cells are contaminated by B cells.

Code
md <- metadata(table(sd))
mx <- md$negative_marker_contamination_binary

# wrangling
df <- mx |>
    as.data.frame() |>
    tibble::rownames_to_column("source_cell_type") |>
    pivot_longer(
        -source_cell_type,
        names_to="target_cell_type",
        values_to="contamination")

# plotting
ggplot(df, aes(
    target_cell_type, source_cell_type, fill=contamination)) + geom_tile() + 
    scale_fill_gradient(low="white", high="red", na.value="lightgrey") +
    geom_text(
        aes(label=sprintf("%.2f", contamination)), 
        filter(df, !is.na(df$contamination)), size=2) +
    coord_equal(expand=FALSE) +
    theme_bw() + theme(
        panel.grid=element_blank(),
        axis.text.x=element_text(angle=45, hjust=1))

For demonstration, let’s inspect mutually exclusive markers between B and T cells. Here, we recover well-known B cell markers such as MS4A1 and CD19, which are not expected to be expressed in T cells.

mk <- py$markers

B_pos <- unlist(mk$B$positive)
T_neg <- unlist(mk$T$negative)

sort(intersect(B_pos, T_neg))
 [1] "ADAM28"    "AIM2"      "BANK1"     "BLK"       "BLNK"      "CD180"    
 [7] "CD19"      "CD22"      "CD38"      "CD40"      "CD79A"     "CD79B"    
[13] "CDK14"     "CNR2"      "COL4A3"    "COL4A4"    "CR1"       "DERL3"    
[19] "FCRL1"     "FCRL2"     "FCRL5"     "FCRLA"     "IRF4"      "KCNN4"    
[25] "LRRK2"     "MS4A1"     "NCF1"      "NIBAN3"    "PAX5"      "PLCG2"    
[31] "POU2AF1"   "RELT"      "SCIMP"     "SPIB"      "STAP1"     "TENT5C"   
[37] "TLR1"      "TLR6"      "TNFRSF13B" "TNFRSF13C" "TNFRSF17" 

Let’s visualize the expression of MS4A1 in T and B cells. To this end, we crop the dataset to a region with a high density of these cell types. Visually, MS4A1 transcripts appear closer to the cell membrane in T cells than in B cells, which would be consistent with local transcript spillover.

# crop to 2,000 x 2,000 bounding box
bb <- list(
    xmin=14e3, xmax=16e3,
    ymin=45e3, ymax=47e3)
sq <- crop(sd, bb)

# visualize MS4A1 transcripts over labeled cells
plotSpatialData() +
    plotShape(sq, col=NA, fill="transferred_cell_type", alpha=0.8) +
    scale_fill_manual(values=pal, na.value="#808080") +
    plotPoint(sq, key="MS4A1", size=0.4) +
    ggtitle("MS4A1 transcripts in T/B cells")

Let’s evaluate this with SegTraQ, which provides the point statistics (ps) module to quantify subcellular transcript distributions within cells. For example, we can compute transcript distances to the cell membrane either across all genes and cells or for selected genes and cell types. For this, let’s briefly jump back into Python.

_ = st.ps.distance_to_membrane(
    genes=["MS4A1"], 
    cell_type_key="transferred_cell_type",
    cell_type_query=["B", "T"])

Let’s jump back to R and plot the distance of MS4A1 in B and T cells. Indeed, MS4A1 is closer to the cell membrane in T than in B cells. This aligns with these signals arising from spillover rather than true biological expression:

se <- py$adata$as_SingleCellExperiment()
sd <- setTable(sd, y=se, 
    i="cell_boundaries", 
    ik="cell_id")
Code
df <- colData(se) |>
    as.data.frame() |>
    filter(
        transferred_cell_type %in% c("B", "T"),
        !is.na(distance_to_cell_membrane_norm_MS4A1))

ggplot(df, aes(
    transferred_cell_type,
    distance_to_cell_membrane_norm_MS4A1,
    fill=transferred_cell_type)) +
    geom_boxplot(outlier.size=1/3, alpha=1/3) +
    geom_jitter(
        aes(color=transferred_cell_type),
        width=1/5, size=1/3, alpha=2/3) +
    scale_fill_manual(values=pal) +
    scale_color_manual(values=pal) +
    theme_bw() + theme(
        aspect.ratio=3/2,
        legend.position="none",
        axis.title=element_blank(),
        plot.title=element_text(hjust=0.5)) +
    ggtitle("distance of MS4A1\nto cell membrane") 

SegTraQ can support the identification and filtering of low-quality cells or samples based on morphological abnormalities, low transcript counts, or increased contamination. It can also be used to benchmark segmentation methods and help identify the most suitable segmentation strategy for a given dataset.

To explore additional SegTraQ metrics, please refer to the SegTraQ documentation. SegTraQ is still under active development; additional metrics and improvements will keep being introduced in future releases.

Processing

The SpatialData object contains a single table element – represented as a SingleCellExperiment – annotating cell_boundaries. This table contains the single-cell level data: a gene \(\times\) cell count matrix (assay slot "X"), gene/cell metadata (row/colData), as well as spatial coordinates (reducedDim slot "spatial").

(se <- table(sd))
class: SingleCellExperiment 
dim: 5101 18626 
metadata(8): spatialdata_attrs num_cells ...
  negative_marker_contamination negative_marker_contamination_binary
assays(1): X
rownames(5101): A2ML1 AAMP ... ZYG11B ZYX
rowData names(6): gene_ids feature_types ... unassigned perc_unassigned
colnames(18626): 6430 6431 ... 687177 687178
colData names(43): cell_id transcript_counts ...
  distance_to_cell_membrane_norm_MS4A1
  distance_to_cell_membrane_inverse_MS4A1
reducedDimNames(1): spatial
mainExpName: NULL
altExpNames(0):

Filtering

Because data are represented as a SCE, we can perform standard processing tasks using tools available in R/Bioc (e.g., scrapper) such as filtering little informative cells based on, e.g., transcript_density:

# exclude cells <2 MADs in log-counts per area
cpa <- log(se$transcript_counts/se$cell_area)
mad <- mad(cpa, na.rm=TRUE)
med <- median(cpa, na.rm=TRUE)
se$ex <- is.na(cpa) | (cpa < med-2*mad) 
Code
plotSpatialData() + 
    plotShape(`table<-`(sd, value=se), col=NA, fill="ex") + 
    scale_fill_manual("exclude", values=c("grey", "red"))

We’ll here filter out cells with comparatively low transcript density, and proceed with processing steps standard for scRNA-seq data, namely:

  • log-library size normalization
  • highly variable gene (HVG) selection
  • principal component analysis (PCA)
  • shared-nearest neighbor (SNN) graph construction
  • community detection using the Leiden algorithm
base::table(se$ex)

FALSE  TRUE 
16953  1673 
se <- se[, !se$ex]
assayNames(se) <- "counts"
se <- normalizeRnaCounts.se(se)
se <- chooseRnaHvgs.se(se)
se <- runPca.se(se, features=rowData(se)$hvg)
se <- clusterGraph.se(se, method="leiden", resolution=0.5)
nk <- nlevels(se$clusters)
table(sd) <- se

Let’s visualize the resulting cluster assignments spatially; from this, we can observe that malignant regions are mostly made up of a single (large) cluster:

plotSpatialData() + 
    plotShape(sd, col=NA, fill="clusters") +
    scale_fill_manual(values=hcl.colors(nk, "Spectral")) 

Downstream

Structures

The R/Bioconductor package sosta (Gunz, Crowell, and Robinson 2025) provides tools to identify and quantitatively characterize multi-cellular anatomical structures by modeling cells as a ‘marker’ point process, and delineating spatial structures based on a threshold on cellular ‘intensity’.

Structures could, for example, represent germinal centers (GC) in tonsil, metastases or tertiary lymphoid structures (TLS) in cancer, etc.
library(sosta)
library(spatstat.geom)

We’ll start by construction a point pattern process (ppp) of cell centroids, using cluster assignments as marks. Note that we first transform boundaries to align with the coordinate space shared between all elements; alternatively, this could be done after reconstruction of spatial structures.

xy <- centroids(SpatialData::transform(shape(sd)))
ow <- owin(xrange=range(xy[, 1]), yrange=range(xy[, 2]))
pp <- ppp(x=xy[, 1], y=xy[, 2], window=ow, marks=se$clusters)

We can now use cellular density (more precisely, the ppp’s intensity) to identify multi-cellular anatomical substructures using sosta; we’ll then assign these to our object as a new shape element:

use <- which.max(table(se$clusters))
roi <- reconstructShapeDensity(pp, markSelect=use, dim=400)
st_geometry(roi) <- "geometry"
roi <- SpatialDataShape(roi)
shape(sd, "roi") <- roi
Code
plotSpatialData() + 
    plotShape(sd, col=NA, fill="clusters") + 
    plotShape(sd, "roi", col="red", fill=NA, linewidth=1) +
    scale_fill_manual(values=hcl.colors(nk, "Spectral")) 

Next, we can use regions to aggregate cell-level measurements into pseudo-bulks using mask() (argument how specifies how to do that). Masking will generate an additional genes \(\times\) regions table in which cell metadata (colData) i/n_instances are the indices/number of cells contributing to each aggregate, and the first observation (column "0") captures data on out-of-region cells.

sd <- mask(sd, i="cell_boundaries", j="roi", how="sum")
(pb <- table(sd, "cell_boundaries_by_roi"))
class: SingleCellExperiment 
dim: 5101 35 
metadata(0):
assays(1): sum
rownames(5101): A2ML1 AAMP ... ZYG11B ZYX
rowData names(0):
colnames(35): 0 1 ... 33 34
colData names(2): i_instances n_instances
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

Structures comprise 1-932 cells (144 on average), whereas 12058 cells have not been assigned to any region:

ns <- pb$n_instances
range(ns[-1]) # per region
[1]   1 932
mean(ns[-1])  # on average
[1] 143.9706
ns[1]         # unassigned
[1] 12058

While sosta has identified 34 regions, we’ll filter those that comprise at least 10 cells in order to rid ourselves of spurious findings:

pb <- pb[, pb$n_instances >= 10]

Since we have tracked the indices of cells falling within each region, we can also inspect the composition of these regions in terms of cell types; here, using transferred_cell_type labels rather than Leiden clusters:

In a multi-section setting, one could for example compare structure-level pseudobulks across sections, in terms of differential expression or abundance of cell types (T cell exhaustion, immune infiltration, etc.).

Code
# count clusters by region
is <- pb$i_instances[-1]            # cell indices by region
ks <- se$transferred_cell_type      # cell type assignments
ns <- sapply(is, \(i) table(ks[i]))

# wrangling
df <- as.data.frame(ns) |>
    tibble::rownames_to_column("cluster") |>
    pivot_longer(-cluster, names_to="region", values_to="n_cells")

# plotting
p0 <- ggplot(df, aes(
    reorder(region, n_cells),
    x=n_cells, fill=cluster))
p1 <- p0 +  
    geom_col(position="identity") + 
    labs(x="# cells", y="region") 
p2 <- p0 +  
    geom_col(position="fill") + 
    labs(x="% cells", y="region") 

# arrangement
wrap_plots(p1, p2, guides="collect") &
    scale_fill_manual(values=pal) &
    coord_cartesian(expand=FALSE) &
    theme_bw() & theme(
        aspect.ratio=3/2,
        panel.grid.minor=element_blank(),
        panel.grid.major.y=element_blank(),
        legend.key.size=unit(2/3, "lines"))

pb <- normalizeRnaCounts.se(pb, 
    log=FALSE, 
    assay.type="sum",
    output.name="normcounts")
pc <- runPca(assay(pb, "normcounts"))
# top-loaded features for PCs 1-3
apply(pc$rotation[, 1:3], 2, \(.) {
    . <- sort(abs(.), decreasing=TRUE)
    head(names(.), 10)
})
      [,1]        [,2]      [,3]       
 [1,] "XBP1"      "XBP1"    "RAB11FIP1"
 [2,] "TIMP3"     "CA12"    "EEF1G"    
 [3,] "NPY1R"     "CCND1"   "HSPB1"    
 [4,] "CA12"      "TSPAN13" "TIMP3"    
 [5,] "ANPEP"     "TIMP3"   "XBP1"     
 [6,] "CCND1"     "TCIM"    "TCIM"     
 [7,] "IL6ST"     "PDZK1"   "CA12"     
 [8,] "TCIM"      "GATA3"   "WNT9A"    
 [9,] "TMEM101"   "SEPTIN9" "CACNG4"   
[10,] "RAB11FIP1" "PDCD4"   "ECM1"     
Code
plotSpatialData() + 
    plotShape(sd, col=NA, fill="XBP1", assay="logcounts") +
    scale_fill_gradientn(colors=rev(hcl.colors(9, "Roma")))

AI tells us “X-box binding protein 1 (XBP1) is a key transcription factor, activated by the IRE1/XBP1 branch of the Unfolded Protein Response (UPR), which promotes breast cancer development, progression, and resistance to therapy.”; so let’s believe that and be proud.

appendix

sessionInfo()
R version 4.6.0 (2026-04-24)
Platform: aarch64-apple-darwin23
Running under: macOS Sequoia 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] spatstat.geom_3.7-3         spatstat.univar_3.1-7      
 [3] spatstat.data_3.1-9         sosta_1.5.0                
 [5] SingleCellExperiment_1.35.0 SummarizedExperiment_1.43.0
 [7] Biobase_2.73.1              GenomicRanges_1.65.0       
 [9] Seqinfo_1.3.0               IRanges_2.47.0             
[11] S4Vectors_0.51.1            BiocGenerics_0.59.0        
[13] generics_0.1.4              MatrixGenerics_1.25.0      
[15] matrixStats_1.5.0           SpatialData.plot_0.99.6    
[17] SpatialData.data_0.99.6     SpatialData_0.99.36        
[19] reticulate_1.46.0           ggnewscale_0.5.2           
[21] patchwork_1.3.2             scrapper_1.7.0             
[23] anndataR_1.3.0              ggplot2_4.0.3              
[25] tidyr_1.3.2                 dplyr_1.2.1                
[27] sf_1.1-1                   

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3       rstudioapi_0.18.0        jsonlite_2.0.0          
  [4] wk_0.9.5                 magrittr_2.0.5           spatstat.utils_3.2-3    
  [7] magick_2.9.1             farver_2.1.2             rmarkdown_2.31          
 [10] fs_2.1.0                 vctrs_0.7.3              spatstat.explore_3.8-0  
 [13] RCurl_1.98-1.18          terra_1.9-27             duckspatial_1.0.0       
 [16] htmltools_0.5.9          S4Arrays_1.13.0          curl_7.1.0              
 [19] BiocNeighbors_2.7.0      SparseArray_1.13.2       KernSmooth_2.23-26      
 [22] htmlwidgets_1.6.4        basilisk_1.25.0          httr2_1.2.2             
 [25] uuid_1.2-2               lifecycle_1.0.5          pkgconfig_2.0.3         
 [28] Matrix_1.7-5             R6_2.6.1                 fastmap_1.2.0           
 [31] digest_0.6.39            paws.storage_0.9.0       tensor_1.5.1            
 [34] beachmat_2.29.0          filelock_1.0.3           labeling_0.4.3          
 [37] Rarr_2.1.7               spatstat.sparse_3.1-0    polyclip_1.10-7         
 [40] abind_1.4-8              compiler_4.6.0           proxy_0.4-29            
 [43] withr_3.0.2              S7_0.2.2                 tiff_0.1-12             
 [46] DBI_1.3.0                ggforce_0.5.0            R.utils_2.13.0          
 [49] grumpy_0.1.0             duckdb_1.5.2             MASS_7.3-65             
 [52] rappdirs_0.3.4           DelayedArray_0.39.1      rjson_0.2.23            
 [55] classInt_0.4-11          tools_4.6.0              units_1.0-1             
 [58] otel_0.2.0               goftest_1.2-3            R.oo_1.27.1             
 [61] glue_1.8.1               nlme_3.1-169             EBImage_4.55.0          
 [64] grid_4.6.0               gtable_0.3.6             R.methodsS3_1.8.2       
 [67] class_7.3-23             ZarrArray_1.1.0          XVector_0.53.0          
 [70] pillar_1.11.1            nanoarrow_0.8.0          tweenr_2.0.3            
 [73] lattice_0.22-9           deldir_2.0-4             tidyselect_1.2.1        
 [76] paws.common_0.8.9        RBGL_1.89.0              locfit_1.5-9.12         
 [79] knitr_1.51               xfun_0.57                smoothr_1.3.0           
 [82] fftwtools_0.9-11         yaml_2.3.12              codetools_0.2-20        
 [85] evaluate_1.0.5           tibble_3.3.1             BiocManager_1.30.27     
 [88] graph_1.91.0             cli_3.6.6                dichromat_2.0-0.1       
 [91] Rcpp_1.1.1-1.1           spatstat.random_3.4-5    dir.expiry_1.21.0       
 [94] dbplyr_2.5.2             png_0.1-9                parallel_4.6.0          
 [97] blob_1.3.0               jpeg_0.1-11              bitops_1.0-9            
[100] SpatialExperiment_1.23.0 geoarrow_0.4.2           viridisLite_0.4.3       
[103] scales_1.4.0             e1071_1.7-17             purrr_1.2.2             
[106] crayon_1.5.3             BiocStyle_2.41.0         rlang_1.2.0             

References

Gunz, Samuel, Helena Lucia Crowell, and Mark D Robinson. 2025. Analysis of anatomical multi-cellular structures from spatial omics data using sosta.” bioRxiv, 2025.10.13.682065. https://doi.org/10.1101/2025.10.13.682065.
Janesick, Amanda, Robert Shelansky, Andrew D Gottscho, Florian Wagner, Stephen R Williams, Morgane Rouault, Ghezal Beliakoff, et al. 2023. High resolution mapping of the tumor microenvironment using integrated single-cell, spatial and in situ analysis.” Nature Communications 14 (1): 8353. https://doi.org/10.1038/s41467-023-43458-x.