Experimental design

Table of Contents

- Experimental design
  - Samples
  - The hybrid BAG platform
  - Clustering RNA and DNA layers
- Integrated Multi-omics Reveals Tumor Heterogeneity
  - RNA layer
- Profiling genome and transcriptome of tumor samples using hybrid data
- Hybrid DNA-RNA Analysis Reveals Transcriptional Plasticity in Uterine Tumors
- Integrated Analysis Reveals Cellular Relationships and Novel Subtypes
- Analysis of chromosome X loss in somatic cells of the primary tumor 2

Samples

We obtained fresh tissue samples of primary tumors from five patients with uterine cancer (Additional file 1: Table S1), and in three cases, distal “normal” endometrium (see Additional file 2: Table S2 for a detailed description). The tumor types surveyed include two carcinosarcomas, a serous carcinoma, an endometrioid adenocarcinoma, and a leiomyosarcoma. Each sample was frozen and pulverized into a powder. From this powder, nuclei were extracted for the single-cell DNA, single-cell RNA, and single-cell DNA-RNA (“hybrid”) protocols of the BAG platform. We also used the same source material to perform whole-genome sequencing (WGS). Using the same pulverized material for every assay ensured that all cell types were proportionately represented in each method of analysis. To refine our methodology and study doublet collisions, we mixed powders from different patients prior to single-nucleus sorting, as discussed later. Additional file 2: Table S2 provides a comprehensive overview of the datasets used in our study, detailing the sample origin (including unique setups such as the mixed-powder experiment), the associated protocols, and the respective experimental parameters.

The hybrid BAG platform

Our study uses the BAG platform, a versatile method that captures templates from a single-cell entity, either whole cells or nuclei. The platform allows the reagent customization needed to capture DNA and RNA from the same single cells in a high-throughput manner. As shown in Fig. 1A-D, nucleic acid templates (or simply “templates”) are captured through hybridization to Acrydite-anchored primers embedded in balls of acrylamide gel, shortened to BAGs. This step is followed by primer extension, which transcribes the template information of each single cell onto primers securely tethered to a single BAG [35]. To establish cell identity, we used a pool-and-split synthesis approach to affix a BAG tag to each template, randomly assigning one of roughly a million identities (96³ = 884,736) to every BAG. During pool-and-split, we also introduced a template tag (also known as a varietal tag or UMI) to each template. For the hybrid protocol, we used both oligo-TG primers and oligo-T primers to capture DNA and RNA templates, and then used DNA polymerase and reverse transcriptase simultaneously to transcribe the templates onto the anchored primers. We then prepared sequencing libraries by tagmentation.
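The size of the BAG tag space, and how often two BAGs would be expected to share a tag by chance, can be sketched as follows. This is a minimal illustration, assuming three rounds of 96 tags drawn uniformly at random; the figure of 10,000 BAGs is a hypothetical example, not a number from this study.

```python
TAGS_PER_ROUND = 96  # tags available per pool-and-split round (from the text)
ROUNDS = 3           # number of rounds implied by 96^3


def barcode_space(tags_per_round: int = TAGS_PER_ROUND, rounds: int = ROUNDS) -> int:
    """Number of distinct BAG tags that the rounds can generate."""
    return tags_per_round ** rounds


def expected_shared_fraction(n_bags: int, space: int) -> float:
    """Expected fraction of BAGs whose tag is also carried by at least one
    other BAG, assuming every BAG draws its tag uniformly at random."""
    return 1.0 - (1.0 - 1.0 / space) ** (n_bags - 1)


space = barcode_space()
print(space)  # 884736

# Hypothetical experiment size, chosen only for illustration.
print(f"{expected_shared_fraction(10_000, space):.4f}")
```

Under these assumptions the tag space is just under one million, so tag sharing stays rare as long as the number of BAGs per experiment is small relative to that space.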


Fig. 1 Overview of hybrid BAG-seq protocol and performance on cell-mixture experiment. A–D Key steps for hybrid BAG-seq pipeline. A Encapsulation: Individual cells or nuclei are encapsulated within droplets containing acrylamide and Acrydite-modified primers that are designed to capture both mRNA and genomic DNA. B Polymerization and primer extension: gel polymerization is followed by primer hybridization. Acrydite primers are extended by reverse transcriptase and DNA polymerase. C Split-and-pool barcoding:

Clustering RNA and DNA layers

We used the Seurat package [43] for our single-cell sequencing analysis, a tool widely recognized for its utility in gene expression clustering. For the RNA layer, we followed the standard methodology for expression clustering via the “RunUMAP” and “FindClusters” functions. We extended the application of Seurat to cluster the DNA layer. We explored a range of DNA bin sizes and ultimately used a total of 300 bins for the DNA clustering and copy number analysis, as this provided a good balance between genomic resolution and average per-bin counts, giving a high signal-to-noise ratio and reliable copy number estimates. These empirical bins range from 4.7 to 36.5 Mbp, with a median of 9.5 Mbp, and we treated the normalized raw DNA template count within each bin as a “gene input” for the clustering process and heatmap plotting. The segmentation information was used only for cluster-based copy number profile plotting, as shown in Fig. 2 and Additional file 3: Figs. S6-S9, panel B. This approach allowed us to identify shared copy number profiles that we could leverage in a manner similar to gene expression clustering (further detailed in the Methods section).
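The idea of treating per-bin DNA counts as a “gene input” can be sketched as a nuclei-by-bins matrix that is normalized per nucleus before clustering. The normalization below (scaling each nucleus so that its mean bin value is 1) is an assumption made for illustration; the exact normalization used in the study is described in its Methods.

```python
import numpy as np


def normalize_bin_counts(counts: np.ndarray) -> np.ndarray:
    """Scale a nuclei x bins matrix of raw template counts so that every
    nucleus (row) has a mean bin value of 1.

    Assumed normalization for illustration only.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    if np.any(totals == 0):
        raise ValueError("every nucleus needs at least one DNA template")
    n_bins = counts.shape[1]
    return counts / totals * n_bins


# Toy data: 5 nuclei x 300 bins of simulated counts (not real data).
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=20, size=(5, 300))
normalized = normalize_bin_counts(toy_counts)
print(normalized.shape)         # (5, 300)
print(normalized.mean(axis=1))  # each row mean is 1 up to rounding
```

A matrix of this shape can then be handed to a clustering tool in place of a gene expression matrix, with bins playing the role of genes.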

We initially applied the molecular-layer concept to a mixture experiment involving two human cell lines: a normal fibroblast, SKN1, and a breast cancer cell line, SKBR3. The distributions of the basic parameters and downsampling curves from this experiment are shown in Additional file 3: Fig. S3B-G. We illustrate the clustering results and heatmaps based on copy number and gene expression in Fig. 1E-I. The genomic and transcriptomic features from the hybrid protocol successfully recapitulate the published features of these two cell lines [35, 42]. The alluvial diagram (Fig. 1G) shows the projection of the genomic clones into the expression clusters. As expected, we observed a good one-to-one correspondence between the genome and transcriptome of each cell type. Only 1.06% (10 out of 941) of the cells from the DNA cluster of one cell type (either SKN1 or SKBR3) projected to the RNA cluster of the other cell type, probably due to cell doublets.

A high degree of concordance was observed between the hybrid protocol and the DNA-only and RNA-only protocols. Specifically, copy number clustering showed 95.6% concordance with the DNA-only protocol (Additional file 3: Fig. S4), and gene expression clustering showed 95.2% concordance with the RNA-only protocol (Additional file 3: Fig. S5), when restricting the analysis to the RNA layer or using all of the RNA templates mapped within transcripts.

Integrated Multi-omics Reveals Tumor Heterogeneity

The clustering results are displayed in UMAP space, with the hybrid data (in the green box) above and the DNA-only data (in the red box) below. Each point represents a single nucleus, color-coded by its DNA copy number cluster. Both methods resolved six clusters, which we manually aligned based on pattern similarity. The N cluster includes cells with a typical diploid profile, whereas the Nx cluster represents a subpopulation of diploid cells with loss of an X chromosome. The remaining clusters (A, B, C, and D) exhibit varied aneuploid copy number profiles. The adjacent heatmaps illustrate the distribution of copy number changes across the genome, with deletions in blue, amplifications in red, and the diploid state in white. Each single cell is represented by a column, and the cells are grouped by their DNA cluster.

To verify the congruence of profiles between platforms, we compared the average copy number profiles for each cluster in Fig. 2B, with DNA-only data in red and the DNA layer of the hybrid data in green. To quantify the similarity of clustering results between the two protocols, we used the multinomial wheel approach to measure the proximity of every single tumor nucleus to the centroids of tumor clones determined both by its own protocol (either hybrid or DNA-only) and by the other protocol. As shown in Additional file 3: Fig. S10, projecting DNA-only data to either DNA-only multinomial states or hybrid multinomial states showed no signal reduction (84.5% versus 84.5% of nuclei within 2 units of the centroids), and similarly high concordance was observed when projecting hybrid data to either hybrid multinomial states or DNA-only multinomial states (77.2% versus 68.9% of nuclei within 2 units of the centroids). Additionally, we examined heterozygous SNPs in both platforms and found similar patterns of loss of heterozygosity (LOH) and allele imbalance. These allele imbalance patterns (Additional file 3: Fig. S11) align with the copy number calls, in that when the copy number is an odd integer, allele imbalance is always observed.
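The link between odd copy number and allele imbalance follows from simple counting: a segment with n copies carries a copies of one allele and n - a of the other, and the two can only be equal when n is even. The short sketch below enumerates the possible splits; the function names are illustrative and not taken from the study's code.

```python
def allele_splits(copy_number: int) -> list[tuple[int, int]]:
    """All ways to divide `copy_number` copies between two alleles."""
    return [(a, copy_number - a) for a in range(copy_number + 1)]


def can_be_balanced(copy_number: int) -> bool:
    """True if some split gives both alleles the same number of copies."""
    return any(a == b for a, b in allele_splits(copy_number))


for n in range(1, 6):
    print(n, can_be_balanced(n))
```

Even copy numbers can be either balanced or imbalanced, so the copy number call constrains the allele pattern only in the odd case.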

RNA layer

We next analyzed the RNA layer of hybrid data compared to the RNA-only protocol. We first combined all the nuclei from both the hybrid and RNA-only platforms and clustered the integrated dataset into eight clusters, as shown in Fig. 2C (leftmost “merged UMAP” plot). While each cluster is labeled with a unique identifier, hybrid nuclei are shown in blue, and RNA-only nuclei in red. We then split the merged dataset by experimental origin, with hybrid nuclei shown above and RNA-only nuclei below. The rightmost plots in the panel reflect the clustering of each dataset independently into the same eight identified categories.

We reserve a discussion of the differentially expressed genes for later, but currently label the clusters as monocytes, T-cells, F (fibroblasts), EC (endothelial), EP (epithelial), plasma cells, and two distinct tumor RNA clusters, Ta and Tb. The central agreement matrices, formatted as heatmaps, show the consistency of cell classification across platforms within the merged dataset. The top matrix compares hybrid cluster assignments to merged dataset classifications, while the bottom does the same for RNA-only data. For both datasets, a large majority (95%) of cells align diagonally, confirming that cluster identities are well preserved across the two platforms.
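An agreement matrix of this kind can be sketched as a cross-tabulation of two label vectors for the same nuclei, with the diagonal fraction summarizing how many nuclei keep the same identity. The labels below are toy values chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def agreement_matrix(labels_a, labels_b, categories):
    """Count nuclei for every (label in A, label in B) combination.

    Rows follow `labels_a`, columns follow `labels_b`, both ordered
    as in `categories`.
    """
    index = {name: i for i, name in enumerate(categories)}
    matrix = np.zeros((len(categories), len(categories)), dtype=int)
    for a, b in zip(labels_a, labels_b):
        matrix[index[a], index[b]] += 1
    return matrix


def diagonal_fraction(matrix: np.ndarray) -> float:
    """Fraction of nuclei that receive the same label in both clusterings."""
    return float(np.trace(matrix) / matrix.sum())


# Toy example: independent clustering versus merged-dataset clustering.
categories = ["Ta", "Tb", "F"]
independent = ["Ta", "Ta", "Tb", "F"]
merged = ["Ta", "Tb", "Tb", "F"]

m = agreement_matrix(independent, merged, categories)
print(m)
print(diagonal_fraction(m))  # 0.75
```

Dividing each row by its sum turns the counts into per-cluster proportions, which is the form usually drawn as a heatmap.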

Profiling genome and transcriptome of tumor samples using hybrid data

For each of the five tumors, we clustered the DNA and RNA layers of the hybrid data separately (Fig. 3). The DNA clusters are presented on the far left, accompanied by copy number heatmaps similar to the previous illustrations. On the far right, RNA clusters are displayed, along with a heatmap that illustrates the relative expression levels across sets of differentially expressed genes (blue for low expression, red f

Hybrid DNA-RNA Analysis Reveals Transcriptional Plasticity in Uterine Tumors

The analysis of five uterine tumor cases using a hybrid DNA-RNA protocol revealed varying degrees of correspondence between genomic and transcriptional profiles. Tumors 2 and 5 demonstrated strong alignment between DNA and RNA clusters, indicated by higher Rand Index and Adjusted Rand Index values. Conversely, Tumors 1, 3, and 4 exhibited more mixed projections across modalities, suggesting greater transcriptional plasticity. Detailed findings for each case are summarized below.
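For readers unfamiliar with these measures, the Rand Index counts the fraction of nucleus pairs on which two clusterings agree (together in both or apart in both), and the Adjusted Rand Index corrects that count for chance agreement. The sketch below uses the standard pair-counting definitions on toy labels; it is not the study's own code.

```python
from collections import Counter
from math import comb


def rand_indices(labels_a, labels_b):
    """Return (Rand Index, Adjusted Rand Index) for two labelings of the
    same nuclei, using the standard pair-counting formulas."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two labelings of the same length (at least 2)")
    pairs = comb(n, 2)
    # Pairs placed together by both, by A only counted in same_a, by B in same_b.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Agreements = together in both + apart in both.
    agreements = pairs + 2 * same_both - same_a - same_b
    rand = agreements / pairs
    expected = same_a * same_b / pairs
    maximum = 0.5 * (same_a + same_b)
    adjusted = 1.0 if maximum == expected else (same_both - expected) / (maximum - expected)
    return rand, adjusted


# Toy labels: DNA clones versus RNA clusters for four nuclei.
print(rand_indices(["A", "A", "B", "B"], ["Ta", "Ta", "Tb", "Tb"]))  # (1.0, 1.0)
print(rand_indices(["A", "A", "B", "B"], ["Ta", "Tb", "Ta", "Tb"]))
```

In the first toy case each DNA clone maps to exactly one RNA cluster, so both indices equal 1. In the second, clones project evenly onto both RNA clusters and the adjusted index falls below zero.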

Tumor 1 (Uterine Carcinosarcoma): This biphasic tumor, containing carcinomatous and sarcomatous components, showed distinct gene expression patterns in RNA clusters Ta₁ and Tb₁. Ta₁ exhibited high expression of fibroblast-specific genes (FGFR3, COL9A2, COL27A1), aligning with the sarcomatous component, while Tb₁ showed lower expression of these genes. This separation was consistent across multiple platforms. Two tumor clones, differing in chromosome 13 copy number, projected equally to both RNA clusters, with no discernible correlation between DNA and RNA cluster assignments.

Tumor 2 (Uterine Serous Carcinoma): This tumor presented a more complex landscape. DNA clones A₂ and B₂ primarily projected to RNA cluster Ta₂, while clones C₂ and D₂ corresponded to Tb₂. Analysis focused on biologically meaningful RNA clusters consistent between the hybrid and RNA-only protocols. Subgroup analysis within Ta₂ (Ta₂-A₂ and Ta₂-B₂) revealed that genes differentiating these subgroups were not necessarily located in regions with copy number differences between the DNA clones.

Tumor 3 (Endometrial Adenocarcinoma): This tumor contained a single DNA clone projecting to two distinct RNA clusters: one estrogen receptor (ER) positive and the other ER negative. This observation mirrored pathological findings, with immunohistochemistry showing spatial intermixing of ER-positive and ER-negative cells within the tissue.

Integrated Analysis Reveals Cellular Relationships and Novel Subtypes

Our gene expression clustering procedure incorporates several novel features. We included 3500 “empty” cells derived from the RNA layer of DNA-only BAGs, which clustered together (Fig. 4A) and also attracted hybrid BAGs with severe RNA template depletion, representing 2.4% of hybrid cells.

We used Seurat’s FindClusters and RunUMAP functions to cluster and visualize 40,149 nuclei (Fig. 4A). Following initial clustering, we further refined the diploid cells to identify subpopulations, focusing on four sub-regions of the UMAP: blood elements, fibroblasts, endothelial cells, and epithelial cells. This iterative clustering approach is well-suited to the diverse cell type composition of the single-cell pool, which includes tumor, epithelial, endothelial, myeloid, and other cell types. Tumor subclusters were defined based on case-specific information established previously.

UMAP coordinates provide a planar depiction of the cells, and a multinomial analysis was employed to color cells by expression type. This involved creating a gene probability vector for each subcluster by normalizing the number of templates mapping to each gene by the total number of templates mapping to any gene. The resulting probability vector defines a multinomial distribution for each cluster, which in turn assigns a cluster probability to each single cell based on its gene counts. In Fig. 4A, each point’s color represents the cluster with the highest probability of generating its gene expression profile.
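The multinomial coloring step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the small pseudocount (added so that no gene has probability zero) and the equal prior over clusters are choices made here for the example, and the toy counts and cluster names are not real data.

```python
import numpy as np


def cluster_gene_probabilities(counts, labels, pseudocount=1.0):
    """Pool template counts per cluster and normalize to a gene
    probability vector. `pseudocount` is an assumption of this sketch."""
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    clusters = sorted(set(labels.tolist()))
    probs = []
    for cluster in clusters:
        pooled = counts[labels == cluster].sum(axis=0) + pseudocount
        probs.append(pooled / pooled.sum())
    return clusters, np.vstack(probs)


def assign_cells(counts, clusters, probs):
    """Assign each cell to the cluster whose multinomial gives its gene
    counts the highest likelihood (equal cluster priors assumed)."""
    # The multinomial coefficient depends only on the cell, so it can be
    # dropped when comparing clusters for the same cell.
    log_lik = np.asarray(counts, dtype=float) @ np.log(probs).T
    return [clusters[i] for i in log_lik.argmax(axis=1)]


# Toy reference: 4 cells x 3 genes with known cluster labels.
reference = [[9, 1, 0], [8, 2, 0], [0, 1, 9], [1, 0, 9]]
labels = ["EP", "EP", "F", "F"]
clusters, probs = cluster_gene_probabilities(reference, labels)

new_cells = [[7, 1, 1], [0, 2, 8]]
print(assign_cells(new_cells, clusters, probs))  # ['EP', 'F']
```

Working with log-likelihoods avoids numerical underflow when cells carry thousands of templates.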

Integrated cluster analysis and unique cluster identification were performed using aggregate tumor and normal tissue data from all patients (Fig. 4A & B). A neighbor-joining tree illustrates the relationships among stromal and tumor subtypes, computed from inter-cluster distances based on multinomial distributions (Fig. 4B). The source of nuclei from the hybrid protocol is shown in Fig. 4C, and combined DNA clustering after removing nuclei clustered to the “empty” state is presented in Fig. 4D. Projections of tumor-genome (blue) and normal-genome (red) nuclei onto the RNA UMAP space for six tissue samples are shown in Fig. 4E-J, with unique stromal components specific to certain tissues circled and indicated by arrows (Fig. 4I & J) and marked on the tree (Fig. 4B). Table 1 details the co-clustering counts across all samples, presenting the distribution of nuclei for each patient and tissue sample categorized by DNA cluster (diploid, diploid with one X chromosome, and aneuploid) and further broken down by expression cluster.

### Quantitative RNA Cluster Analysis and Identification of Crossover Nuclei

The neighbor-joining tree (Fig. 4B) provides a more quantitative view of these diverse stromal and tumor RNA clusters. The branches of the tree largely preserve cell-type categories. The blood elements share a common branch (blue labels), with a myeloid-derived sub-branch (dark blue) distinct from lymphocytes (light blue). Separate branches hold the epithelial cells (orange), fibroblasts (purple), and endothelial cells (green). Most of the subclusters fall on their expected branch. The exception is a single sub-branch containing two epithelial subclusters, EP₄ and EP₅, and osteoclasts, a myeloid cell type. In general, clusters that are close together in the UMAP (Fig. 4A) share a common branch in the tree (Fig. 4B). The major exception is cluster F₅, B-cell-like fibroblasts, which are near the B-cells in the UMAP but nearer to the fibroblasts in the tree.

The tumor clones from each patient occupy distinct sub-branches in the tree. The uterine leiomyosarcoma (purple), a muscle-derived tumor, has expression subclusters on the fibroblast branch of the tree. The other four tumors share a deep branch with the epithelial cells. One sub-branch contains the two uterine carcinosarcomas (red and dark red). Nearer the epithelial cells in the tree is the endometrial adenocarcinoma (green), and nearer still is the uterine serous carcinoma (blue). The branch lengths provide a relative measure of similarity, showing that Ta₁ and Tb₁ are highly similar, as are Ta₃ and Tb₃. In contrast, the subclones of Ta₂ and Tb₂ are far apart.

### Multinomial Wheel and Crossovers

In the previous section, we observed that the majority of nuclei exhibit concordant DNA and RNA profiles: diploid DNA with stromal RNA expression (flat, *N*) or complex DNA patterns with tumor RNA expression (CN +, T). These concordant nuclei constitute the expected biological behavior; however, a subset of cells does not match this pattern. Additional file 8: Table S6 shows the counts for each patient of cells that are diploid or complex (flat or CN +) and map to normal clusters or tumor clusters (*N* or T). Across all patients and tissue samples, 1-5% of nuclei have flat copy number profiles with tumor expression patterns, or copy number variations with stromal expression patterns. Although they account for a small proportion of the total dataset, these crossover nuclei may represent an interesting population. Alternatively, they may be the result of unresolved doublet collisions [44]. Additional file 9: Table S7 summarizes the counts and calculates the proportion of concordant and crossover nuclei per patient.

To determine whether crossover nuclei are a unique biological state or collision artifacts, we employed the multinomial wheel to differentiate mixed states. Integrating DNA-layer data across all hybrid-protocol experiments, we constructed a multinomial wheel with tumor clusters, diploid cells, and diploid cells with X-loss as the individual vertices of the wheel (Additional file 3: Fig. S18A). Some tumor DNA clusters are so similar (A₁ and B₁, A₄ and B₄) that we collapsed each of those pairs into a single node (A₁/B₁ and A₄/B₄). For each pair of the 11 vertices, we created 9 equally spaced sampling states, resulting in a multinomial wheel with (11 choose 2) = 55 spokes and 9 × (11 choose 2) = 495 intermediate states.
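The state count of the wheel can be checked with a small construction. This sketch assumes that each intermediate state is a linear mixture of the two vertex probability vectors at fractions 0.1 through 0.9; the vertex names and the random probability vectors are placeholders for illustration only.

```python
from itertools import combinations
from math import comb

import numpy as np


def build_wheel(vertex_probs, steps=9):
    """Return a list of (vertex_a, vertex_b, fraction_of_b, probabilities).

    Pure states come first, followed by `steps` equally spaced mixtures
    on every spoke. Linear mixing is an assumption of this sketch.
    """
    states = [(name, name, 0.0, np.asarray(p, dtype=float))
              for name, p in vertex_probs.items()]
    for a, b in combinations(vertex_probs, 2):
        p_a = np.asarray(vertex_probs[a], dtype=float)
        p_b = np.asarray(vertex_probs[b], dtype=float)
        for k in range(1, steps + 1):
            w = k / (steps + 1)
            states.append((a, b, w, (1.0 - w) * p_a + w * p_b))
    return states


# Placeholder vertices: 11 random probability vectors over 300 bins.
rng = np.random.default_rng(0)
vertices = {f"V{i + 1}": p for i, p in enumerate(rng.dirichlet(np.ones(300), size=11))}

wheel = build_wheel(vertices)
print(comb(11, 2))               # 55 spokes
print(len(wheel) - len(vertices))  # 495 intermediate states
print(len(wheel))                # 506 states in total
```

Because every mixture of two probability vectors is itself a probability vector, each state on the wheel defines a valid multinomial distribution.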

From the DNA counts of each nucleus, we evaluate the probability that those observations are derived from each of the (495 + 11) = 506 multinomial distributions in the wheel. We assign each nucleus to the node with the highest posterior probability. Notably, more than 60% of nuclei align with a “pure” cluster on the multinomial wheel by residing on a vertex, and 85% are situated within two units’ distance from a pure cluster (Additiona
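The assignment step can be illustrated on a single spoke. The sketch below assumes a flat prior over states (so the highest posterior equals the highest likelihood), linear mixing between the two vertex profiles, and strictly positive bin probabilities; the three-bin profiles and counts are toy values.

```python
import numpy as np


def spoke_states(p_a, p_b, steps=9):
    """Vertex A, `steps` equally spaced mixtures, then vertex B."""
    fractions = np.arange(steps + 2) / (steps + 1)  # 0.0, 0.1, ..., 1.0
    states = np.outer(1.0 - fractions, p_a) + np.outer(fractions, p_b)
    return fractions, states


def assign_on_spoke(counts, p_a, p_b, steps=9):
    """Return (state index, distance in units to the nearest vertex) for
    the state with the highest multinomial likelihood."""
    _, states = spoke_states(p_a, p_b, steps)
    log_lik = np.log(states) @ np.asarray(counts, dtype=float)
    k = int(np.argmax(log_lik))
    return k, min(k, steps + 1 - k)


# Toy bin probabilities for two clones over three bins.
p_a = np.array([0.5, 0.3, 0.2])
p_b = np.array([0.2, 0.3, 0.5])

print(assign_on_spoke([50, 30, 20], p_a, p_b))  # (0, 0): sits on vertex A
print(assign_on_spoke([35, 30, 35], p_a, p_b))  # (5, 5): mid-spoke mixture
```

A nucleus that lands on or near a vertex looks like a single clone, while one that lands mid-spoke has counts consistent with an even mixture of two clones, as would be expected from a doublet.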

Analysis of chromosome X loss in somatic cells of the primary tumor 2

A Bar plot showing the ratio of X-haplotype observations in the X-loss populations of plasma (Plasma-Nx) and T-cell (T-cell-Nx) nuclei from patient 2. Tumor subclones A2 and B2, which have only one copy of chromosome X, are used to phase the X chromosome SNPs in the Plasma-Nx and T-cell-Nx populations as belonging to one haplotype (red, match A2/B2) or the other (gray, mismatch A2/B2). T-cell-Nx nuclei exhibit a balanced distribution of SNVs from both haplotypes, while Plasma-Nx nuclei show a pronounced bias toward the A2/B2 haplotype. B A volcano plot shows the genes with statistically significant expression differences between Plasma-Nx and T-cell-Nx nuclei. Each dot represents a gene, with the x-axis showing the log2 fold change in expression and the y-axis showing the -log10 p-value. Genes with a significant increase in expression in Plasma-Nx nuclei are plotted to the right, while those with a significant decrease are plotted to the left. Genes with a p-value below the significance threshold are highlighted in red. As previously reported [49, 50, 51], these results also verify that nuclei from the “Nx” DNA clone lost their inactivated X chromosomes.

Hybrid BAG-seq: Genomic and Transcriptomic Interactions in Human Tumors