12.2 Calling Peaks w/ Macs2
Prior to calling peaks in ArchR, you must run the addGroupCoverages()
function! ArchR relies on these group coverage objects to perform peak calling.
As mentioned above, we generate a reproducible peak set in ArchR using the addReproduciblePeakSet()
function. By default ArchR attempts to call peaks using MACS2 and this is absolutely the recommended peak caller. ArchR also implements its own native peak caller but we do not recommend using it - this alternative peak calling method is described in the next section.
To call peaks using MACS2, ArchR must be able to find the MACS2 executable. First, ArchR looks in your PATH
environment variable. If this is unsuccessful, ArchR attempts to determine if you have installed MACS2 with either pip
or pip3
. If neither of these is successful, ArchR gives up and provides an error message. If you have MACS2 installed and ArchR cannot find it, you should provide the path to the addReproduciblePeakSet()
function via the pathToMacs2
parameter. It is important to make sure that the MACS2 executable that ArchR is using is what you expect it to use. Some users have multiple instances of MACS2 installed and ArchR may not automatically select the one that you expect.
## Searching For MACS2..
## Found with $PATH at /usr/local/bin/macs2
You could, of course, manually designate the path to the MACS2 executable instead. This may be necessary on some computational environments.
With the path to MACS2 identified, we can then create a reproducible merged peak set w/ MACS2 (5-10 minutes). To avoid bias from pseudo-bulk replicates that have very few cells, we can provide a cutoff for the upper limit of the number of peaks called per cell via the peaksPerCell
parameter. This prevents clusters with very few cells from contributing lots of low quality peaks to the merged peak set. There are many other parameters that can be tweaked in addReproduciblePeakSet()
- try ?addReproduciblePeakSet
or check out the online function definition for more information. Of particular note, the reproducibility
parameter defines what is meant by a “reproducible peak”. For example, reproducibility = "2"
means at least 2 pseudo-bulk replicates must have a peak call at this locus and reproducibility = "(n+1)/2"
means that the majority of pseudo-bulk replicates must have a peak call at this locus.
projHeme4 <- addReproduciblePeakSet(
ArchRProj = projHeme4,
groupBy = "Clusters2",
pathToMacs2 = pathToMacs2
)
## ArchR logging to : ArchRLogs/ArchR-addReproduciblePeakSet-938fc9c7d-Date-2025-02-06_Time-01-57-22.608531.log
## If there is an issue, please report to github with logFile!
## Calling Peaks with Macs2
## 2025-02-06 01:57:23.144445 : Peak Calling Parameters!, 0.009 mins elapsed.
## Group nCells nCellsUsed nReplicates nMin nMax maxPeaks
## B B 421 416 2 162 254 150000
## CD4.M CD4.M 633 586 2 86 500 150000
## CD4.N CD4.N 1250 548 2 48 500 150000
## CLP CLP 386 386 2 88 298 150000
## Erythroid Erythroid 702 671 2 171 500 150000
## GMP GMP 1261 810 2 310 500 150000
## Mono Mono 2652 1000 2 500 500 150000
## NK NK 903 822 2 322 500 150000
## pDC pDC 314 304 2 144 160 150000
## PreB PreB 351 351 2 40 311 150000
## Progenitor Progenitor 1377 639 2 139 500 150000
## 2025-02-06 01:57:23.179426 : Batching Peak Calls!, 0.009 mins elapsed.
## 2025-02-06 01:57:23.321666 : Batch Execution w/ safelapply!, 0 mins elapsed.
## 2025-02-06 01:58:25.138741 : Identifying Reproducible Peaks!, 1.042 mins elapsed.
##
## .rds"
## 2025-02-06 01:58:30.502697 : Creating Union Peak Set!, 1.132 mins elapsed.
## Converged after 9 iterations!
## Plotting Ggplot!
## 2025-02-06 01:58:38.413358 : Finished Creating Union Peak Set (147407)!, 1.263 mins elapsed.
The iterative overlap procedure that ArchR uses to create this reproducible peak set occurs in two stages. During the first step of addReproduciblePeakSet()
, a reproducible peak set is created for each group indicated by groupBy
. Using the psuedobulk replicates generated by addGroupCoverages()
, ArchR identifies peaks that are reproducible across the replicates for each group indicated by the groupBy
argument. The score of those peak calls is stored in the score
column and the quantile rank of that score for the individual pseudobulk replicate that it was identified in is stored in replicateScoreQuantile
. It is this quantile normalization that accounts for differences in number of fragments per group etc. that affect peak calling. For each of these groups, ArchR saves the group-specific peak set as a GRanges
object in the PeakCalls
subdirectory of the ArchRProject
. Additionally, ArchR saves a a GRanges
object that contains the summit calls for each pseudobulk replicate within the ./PeakCalls/ReplicateCalls
directory.
list.files(path=paste0(getOutputDirectory(ArchRProj = projHeme4),"/PeakCalls"))
## [1] "B-reproduciblePeaks.gr.rds" "CD4.M-reproduciblePeaks.gr.rds"
## [3] "CD4.N-reproduciblePeaks.gr.rds" "CLP-reproduciblePeaks.gr.rds"
## [5] "Clusters2" "Erythroid-reproduciblePeaks.gr.rds"
## [7] "GMP-reproduciblePeaks.gr.rds" "Mono-reproduciblePeaks.gr.rds"
## [9] "NK-reproduciblePeaks.gr.rds" "pDC-reproduciblePeaks.gr.rds"
## [11] "PreB-reproduciblePeaks.gr.rds" "Progenitor-reproduciblePeaks.gr.rds"
During the second step of addReproduciblePeakSet()
, ArchR merges these group-specific peak sets into a single union peak set. The quantile rank of the peak score
across the group is stored in groupScoreQuantile
and the group it comes from is stored in GroupReplicate
. GroupReplicate
is an annotation for the group from which each peak originated during the process of iterative overlap peak merging. It is important to note that these annotations do not inherently mean that the given peak was only called in that cell group, rather that the annotated group had the highest normalized significance for that peak call.
Each ArchRProject
object can only contain a single peak set at a time. As such, we assign the output of addReproduciblePeakSet()
to our desired ArchRProject
. If you would like to experiment with different peak sets, you have two options. Option #1 - you can save multiple copies of your ArchRProject
using saveArchRProject()
. Option #2 - you can create multiple peak sets, store them as GenomicRanges
objects, and change your peak set using the addPeakSet()
function.
The addPeakSet()
function can be used to set any arbitrary peak set as the peak set for your ArchRProject
.
To retrieve the peak set stored in your ArchRProject
as a GRanges
object, we use the getPeakSet()
function.
getPeakSet(projHeme4)
## GRanges object with 147407 ranges and 13 metadata columns:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## Mono chr1 752503-753003 * | 41.29950
## B chr1 762688-763188 * | 46.92020
## GMP chr1 773648-774148 * | 6.32548
## CD4.M chr1 779906-780406 * | 8.42088
## B chr1 801002-801502 * | 11.64360
## ... ... ... ... . ...
## NK chrX 154807254-154807754 * | 4.70336
## PreB chrX 154840939-154841439 * | 1.45332
## PreB chrX 154841881-154842381 * | 4.49383
## NK chrX 154842390-154842890 * | 19.94720
## Progenitor chrX 154862036-154862536 * | 13.05070
## replicateScoreQuantile groupScoreQuantile Reproducibility
## <numeric> <numeric> <numeric>
## Mono 0.855 0.733 2
## B 0.936 0.860 2
## GMP 0.585 0.242 2
## CD4.M 0.394 0.130 2
## B 0.608 0.318 2
## ... ... ... ...
## NK 0.399 0.131 2
## PreB 0.242 0.052 2
## PreB 0.441 0.108 2
## NK 0.770 0.603 2
## Progenitor 0.710 0.256 2
## GroupReplicate distToGeneStart nearestGene peakType
## <character> <integer> <character> <character>
## Mono Mono._.scATAC_BMMC_R1 10148 LINC00115 Distal
## B B._.scATAC_PBMC_R1 32 LINC01128 Promoter
## GMP GMP._.scATAC_CD34_BM.. 10926 LINC01128 Intronic
## CD4.M CD4.M._.scATAC_PBMC_R1 17184 LINC01128 Intronic
## B B._.scATAC_PBMC_R1 10929 FAM41C Distal
## ... ... ... ... ...
## NK NK._.scATAC_BMMC_R1 35117 TMLHE Intronic
## PreB PreB._.Rep2 1432 TMLHE Intronic
## PreB PreB._.Rep1 490 TMLHE Intronic
## NK NK._.scATAC_BMMC_R1 17 TMLHE Promoter
## Progenitor Progenitor._.scATAC_.. 19663 TMLHE Distal
## distToTSS nearestTSS GC idx N
## <integer> <character> <numeric> <integer> <numeric>
## Mono 10148 uc010nxx.2 0.4830 1 0
## B 32 uc031pjj.1 0.6966 2 0
## GMP 10719 uc031pjk.1 0.4571 3 0
## CD4.M 16977 uc031pjk.1 0.4850 4 0
## B 10929 uc001abt.4 0.4371 5 0
## ... ... ... ... ... ...
## NK 35117 uc004fnn.3 0.4990 3538 0
## PreB 1432 uc004fnn.3 0.4271 3539 0
## PreB 490 uc004fnn.3 0.5729 3540 0
## NK 17 uc004fnn.3 0.6008 3541 0
## Progenitor 19663 uc004fnn.3 0.4251 3542 0
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
The peak sets from addReproduciblePeakSet()
are annotated with lots of metadata about each peak region. The score
represents the normalized -log10(pval)
returned by MACS2. There are other annotations such as distToGeneStart
and distToTSS
which are determined based on the contents of getGeneAnnotation(ArchRProj)$genes
and getGeneAnnotation(ArchRProj)$TSS
respectively. While it may seem like these should be identical, a single gene can have multiple TSSs and ArchR needs to deal with this multi-mapping issue, which we have done by including both annotations. For example, in ArchR’s default hg38 geneAnnotation
, there are 23,274 genes and 49,052 TSSs. One confusing piece of information annotated in the peak set is the idx
column which does not correspond to a unique peak identifier. Instead the idx
column is actually only unique per-chromosome, so for a peak located on “chr1” with and idx
of “5”, this means that this is the 5th peak in sequential order on “chr1” (but there would also be a peak with idx
of “5” on “chr2”).