Introduction to scMultiMap
Chang Su
2025-03-01
scMultiMap.Rmd
This vignette shows two examples of applying scMultiMap
to infer cell-type-specific peak-gene associations with single-cell
multimodal data.
library(scMultiMap)
library(Signac)
library(Seurat)
# for consistency analysis
library(GenomicRanges)
library(rtracklayer)
1. Toy example
To illustrate the basic functionality of scMultiMap
, we
use a toy Seurat object small_obj
and a data frame of
candidate peak-gene pairs small_pairs_df
provided in the R
package to run scMultiMap
:
# a Seurat object
small_obj
#> An object of class Seurat
#> 350 features across 100 samples within 2 assays
#> Active assay: RNA (100 features, 0 variable features)
#> 1 layer present: counts
#> 1 other assay present: peak
# candidate peak-gene pairs
head(small_pairs_df)
#> gene peak
#> 1 MALAT1 chr11-65013281-65014841
#> 2 MALAT1 chr11-65026934-65027962
#> 3 MALAT1 chr11-65040583-65041780
#> 4 MALAT1 chr11-65083457-65084665
#> 5 MALAT1 chr11-65095679-65096871
#> 6 MALAT1 chr11-65110172-65110785
dim(small_pairs_df)
#> [1] 250 2
With these example data, we then run scMultiMap
:
res <- scMultiMap(small_obj, small_pairs_df) # < 0.1 seconds
#> Start step 1: IRLS
#> Start IRLS for RNA
#> Start IRLS for peak
#> Start step 2: WLS
#> There are 10 unique genes in the peak-gene pairs.
#> scMultiMap elapsed time: 00:00 (mm:ss)
head(res)
#> gene peak pval test_stat covar padj
#> 1 MALAT1 chr11-65013281-65014841 0.15215553 1.4319591 3.100282e-05 0.8273097
#> 2 MALAT1 chr11-65026934-65027962 0.79384145 0.2613256 5.204042e-06 0.9753992
#> 3 MALAT1 chr11-65040583-65041780 0.68116942 0.4108679 8.623731e-06 0.9472305
#> 4 MALAT1 chr11-65083457-65084665 0.03236884 2.1398259 4.879823e-05 0.7196890
#> 5 MALAT1 chr11-65095679-65096871 0.02264609 -2.2793530 -4.308216e-05 0.7196890
#> 6 MALAT1 chr11-65110172-65110785 0.88810490 -0.1407026 -1.919029e-06 0.9881934
#> cor
#> 1 0.00000000
#> 2 0.10216095
#> 3 0.29533510
#> 4 0.62091902
#> 5 -0.57664094
#> 6 -0.08605519
dim(res)
#> [1] 250 7
Here, scMultiMap takes a Seurat object with two modalities: RNA and peak, and evaluates the associations between 250 peak-gene pairs. We assume that this Seurat object contains the cells from a specific cell type of interest, such that the inferred associations are cell-type-specific.
2. PBMC example, 10x Multiome data
2.1 Note: fragment counts are required for scMultiMap
Before demonstrating scMultiMap
on real data, we
highlight an important requirement for peak counts. As discussed in
“Discussion” in scMultiMap’s manuscript,
scMultiMap is designed to model fragment counts, which
are more suitable for count-based modeling and improve downstream
analysis performance compared to insertion counts or read counts (Miao and Kim,
2024, Martens et
al., 2024). Therefore, if you plan to use scMultiMap
,
ensure the your scATAC-seq data consist of fragment counts instead of
read counts. We recommend using MACS2 for peak calling, as it generates
fragment counts (see Calling
peaks). If you only have read counts and do not have access to the
ATAC-seq fragment files, fragment counts can be estimated following the
approach in Martens et
al., 2024.
To demonstrate this, we downloaded a 10x Multiome dataset (paired
scRNA-seq and scATAC-seq) from 10x, used MACS2 to call peaks by cell
types, and generated processed_seurat_obj_PBMC_10k.rds
. The
source code for these steps can be found at scMultiMap
reproducibility Github repo. Calling peak by cell type is to
generate cell-type-specific peaks that better represent candidate
cell-type-specific enhancers than those identified from merged cell
types.
2.2 Demo of scMultiMap on CD14 Monocytes
# set it to the directory where data are saved
data_dir <- '../../../data'
pbmc <- readRDS(sprintf('%s/processed_seurat_obj_PBMC_10k_nextgem.rds', data_dir))
# in this example, we focus on CD14 monocytes:
pbmc_cd14 <- subset(pbmc, predicted.id == 'CD14 Mono')
print(pbmc_cd14)
We choose to study the top 2000 highly expressed genes and top 20000 highly accessible peaks in CD14 Monocytes, and use a cis-region of width 1Mb to define candidate peak-gene pairs.
pairs_df <- get_top_peak_gene_pairs(pbmc_cd14,
gene_top=2000, peak_top=20000,
distance = 5e+5,
gene_assay = 'RNA', # name of the gene assay
peak_assay = 'peaks_ct') # name of the peak assay
head(pairs_df)
#> gene peak
#> 1 MALAT1 chr11-65614961-65616132
#> 2 MALAT1 chr11-65918502-65920142
#> 3 MALAT1 chr11-65856895-65858482
#> 4 MALAT1 chr11-65496994-65497909
#> 5 MALAT1 chr11-65569536-65570572
#> 6 MALAT1 chr11-65711306-65712797
dim(pairs_df)
#> [1] 32296 2
Here, we generated 32296 candidate peak-gene pairs generated based on the criterion above. Generally, the pairs to be studied could also be customized and supplied by users.
Nect, we run scMultiMap
to assess peak-gene
associations.
scMultiMap_res <- scMultiMap(pbmc_cd14, pairs_df,
gene_assay = 'RNA', # name of the gene assay
peak_assay = 'peaks_ct') # name of the peak assay
#> Start step 1: IRLS
#> Start IRLS for RNA
#> Start IRLS for peaks_ct
#> Start step 2: WLS
#> There are 1909 unique genes in the peak-gene pairs.
#> scMultiMap elapsed time: 00:19 (mm:ss)
head(scMultiMap_res)
#> gene peak pval test_stat covar padj
#> 1 MALAT1 chr11-65614961-65616132 0.90652298 0.1174254 1.743388e-09 0.9836402
#> 2 MALAT1 chr11-65918502-65920142 0.03734129 2.0820129 3.005705e-08 0.3367015
#> 3 MALAT1 chr11-65856895-65858482 0.26443486 1.1159703 1.604796e-08 0.7232544
#> 4 MALAT1 chr11-65496994-65497909 0.08524652 1.7210237 2.334859e-08 0.4901409
#> 5 MALAT1 chr11-65569536-65570572 0.18919341 1.3129690 1.776141e-08 0.6525889
#> 6 MALAT1 chr11-65711306-65712797 0.25426633 1.1400479 1.496624e-08 0.7139586
#> cor
#> 1 0.01180729
#> 2 0.16766414
#> 3 0.07195509
#> 4 0.15467261
#> 5 0.09259916
#> 6 0.09048044
- Running
scMultiMap
on this dataset took ~20-25 seconds using a single core of an Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz. . -
res
is a dataframe with association statistics and p-values for each of 32296 peak-gene pairs.
2.3 Downstream analysis: consistency analysis
To demonstrate scMultiMap
is powerful for identifying
cell-type-specific enhancer-gene pairs, we consider the examples
presented in Figure~2b of the manuscript
and evaluate the consistency (overlap) between results from
scMultiMap
and that from promoter capture Hi-C (PCHiC).
# Data required for liftover form hg19 to hg38
# from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
chain = import.chain(sprintf("%s/hg19ToHg38.over.chain", data_dir))
First, we preprocess PCHiC data and lift it over from hg19 to hg38 to match the genome version used in the 10x Multiome data (hg38).
# Obtain promoter capture Hi-C data from http://dx.doi.org/10.1016/j.cell.2016.09.037.
# supplementary data "DATA_S1/PCHiC_peak_matrix_cutoff5.tsv"
PCHiC <- read.table(sprintf('%s/PCHiC_peak_matrix_cutoff5.tsv', data_dir), header = T)
# focus on CD14 monocytes as denoted by the column 'Mon'
PCHiC <- PCHiC[, c('oeChr', 'oeStart', 'oeEnd', # peak info
'baitName', # gene name,
'Mon')] # CHiCAGO scores for CD14 monocytes
colnames(PCHiC)[1:3] <- c('chr', 'start', 'end')
grange_hg19 <- GenomicRanges::makeGRangesFromDataFrame(df = PCHiC, keep.extra.columns = T)
seqlevelsStyle(grange_hg19) = "UCSC"
grange_hg38 <- rtracklayer::liftOver(grange_hg19, chain)
grange_hg38 <- unlist(grange_hg38)
genome(grange_hg38) = "hg38"
# significantly associated peak-gene pairs
# defined by CHiCAGO scores >=5 (reference: http://dx.doi.org/10.1016/j.cell.2016.09.037)
sig_assay_gr <- grange_hg38[grange_hg38@elementMetadata[['Mon']] > 5]
sig_assay_gr$gene <- sig_assay_gr$baitName
scMultiMap
We then evaluate its overlap with scMultiMap
’s
results.
scMultiMap_overlap <- validate_with_assay(scMultiMap_res, sig_assay_gr, pvar = "padj", p_cutoff = 0.2)
#> [1] "p value from one-sided Fisher exact test: 5.39e-12, #overlap: 246, odds ratio: 1.74"
Signac
In comparison, we also evaluated Signac
’s results.
# By default, Signac uses normalized data to compute peak-gene associations
pbmc_cd14 <- NormalizeData(object = pbmc_cd14)
#####
##### The code below could take 3 hours to run, as
##### (1) it consider all peaks instead of peaks with high abundance
##### (2) constructing the sampling-based null distribution is computationally intensive
# Run Signac
signac_res <- LinkPeaks(pbmc_cd14, 'peaks_ct', 'RNA',
genes.use = unique(pairs_df$gene), # consider the same genes as above
pvalue_cutoff = 1, score_cutoff = 0) # return the results on all pairs
# |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=03h 17m 21s
######
# Bug in Signac's p-value (01/2025):
# the two-sided p-values are evaluated with only one tail of the Gaussian distribution
# https://github.com/stuart-lab/signac/blob/8ecdde2/R/links.R#L481C29-L481C30
summary(Links(signac_res)$pvalue) # the maximum is smaller than 1
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0000 0.1167 0.2449 0.2449 0.3719 0.5000
# manually adjust this
Links(signac_res)$pvalue <- 2*Links(signac_res)$pvalue
signac_res <- as.data.frame(Links(signac_res))
# consider the same set of peak-gene pairs as scMultiMap
scMultiMap_res$pg <- paste(scMultiMap_res$peak, scMultiMap_res$gene, sep = ':')
signac_res$pg <- paste(signac_res$peak, signac_res$gene, sep = ':')
signac_res <- signac_res[match(scMultiMap_res$pg, signac_res$pg), ]
# consistent with evaluating scMultiMap's result, we use BH procedure and set the significance cutoff to be 0.2
signac_res$padj <- p.adjust(signac_res$pvalue, method = 'BH')
signac_overlap <- validate_with_assay(signac_res, sig_assay_gr, pvar = "padj", p_cutoff = 0.2)
#> [1] "p value from one-sided Fisher exact test: 7.61e-02, #overlap: 59, odds ratio: 1.23"
SCENT
We further evaluate SCENT, another method benchmarked in scMultiMap’s manuscript, you could try the code below. Please note that it could extremely computationally intensive due to the use of boostrapping. Running the codes below took 17 hours, 34 minutes using 8 cores of an Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz. The high computational costs were also demonstrated by Figure 1c in scMultiMap’s manuscript.
# Tutorial for running SCENT:
# https://github.com/immunogenomics/SCENT
if (!requireNamespace("SCENT", quietly = TRUE)) devtools::install_github("immunogenomics/SCENT")
library(SCENT)
# extract count matrices
peak_counts <- pbmc_cd14[['peaks_ct']]@counts[unique(pairs_df$peak),]
rna_counts <- pbmc_cd14[['RNA']]@counts[unique(pairs_df$gene),]
# construct metadata for cell-level covariates
mdata <- pbmc_cd14@meta.data
mdata$cell <- rownames(mdata)
mdata$log_total_counts <- log(colSums(pbmc_cd14[['RNA']]@counts))
covariates_set <- "log_total_counts"
SCENT_obj <- CreateSCENTObj(rna = rna_counts, atac = peak_counts,
meta.data = mdata,
peak.info = pairs_df,
covariates = covariates_set,
celltypes = 'predicted.id') # column name for the cell type label in `mdata`
start_time <- Sys.time()
SCENT_obj <- SCENT_algorithm(object = SCENT_obj, celltype = 'CD14 Mono',
ncores=8, # for parallelism
regr = 'poisson', bin = TRUE) # default
end_time <- Sys.time()
SCENT.result <- SCENT_obj@SCENT.result
# print elapsed time
elapsed <- difftime(end_time, start_time, units = "secs")
message(sprintf("scMultiMap elapsed time: %02d:%02d (mm:ss)\n",
(as.integer(elapsed) %% 3600) %/% 60, # Minutes
as.integer(elapsed) %% 60)) # Seconds
SCENT.result$padj <- p.adjust(SCENT.result$boot_basic_p, method = 'BH')
SCENT_overlap <- validate_with_assay(SCENT.result, sig_assay_gr, pvar = "padj", p_cutoff = 0.2)
#> [1] "p value from one-sided Fisher exact test: 5.77e-01, #overlap: 149, odds ratio: 0.98"
Summary of comparison
In this example, scMultiMap
identified 246 pairs
overlapped with PCHiC data on CD14 monocytes, while Signac
and SCENT
only identified 59 and 149 pairs, respectively.
Moreover, scMultiMap
also gave the highest enrichment (odds
ratio=1.74) compared to Signac
(1.23) and
SCENT
(0.98). In particular, SCENT
has a odds
ratio smaller than 1, potentially due to a large number of false
positive discoveries that are not consistent with PCHiC.
This is one example that demonstrates the improved performance by
scMultiMap
compared to existing methods. Please refer to
scMultiMap’s manuscript
for more examples, including additional consistency analysis with
orthogonal assays, reproducibility analysis across independent
single-cell multimodal datasets, and systematic evaluation of type I
error and power.
Note that the numerical results (i.e., the numbers of reproduced pairs and odds ratio) presented here differ slightly from those in the paper, as the published analysis combined four single-cell Multiome PBMC datasets to increase power (see Methods in manuscript). However, the overall conclusions regarding the comparison across methods remain unchanged.
More vignettes
Please visit vignettes: scMultiMap for
disease-control studies and scMultiMap for
integrative analysis with GWAS results if you are interested in more
examples of analysis with scMultiMap
!