12.2 Calling Peaks w/ Macs2
Prior to calling peaks in ArchR, you must run the addGroupCoverages()
function! ArchR relies on these group coverage objects to perform peak calling.
As mentioned above, we generate a reproducible peak set in ArchR using the addReproduciblePeakSet()
function. By default ArchR attempts to call peaks using MACS2 and this is absolutely the recommended peak caller. ArchR also implements its own native peak caller but we do not recommend using it - this alternative peak calling method is described in the next section.
To call peaks using MACS2, ArchR must be able to find the MACS2 executable. First, ArchR looks in your PATH
environment variable. If this is unsuccessful, ArchR attempts to determine if you have installed MACS2 with either pip
or pip3
. If neither of these is successful, ArchR gives up and provides an error message. If you have MACS2 installed and ArchR cannot find it, you should provide the path to the addReproduciblePeakSet()
function via the pathToMacs2
parameter. It is important to make sure that the MACS2 executable that ArchR is using is what you expect it to use. Some users have multiple instances of MACS2 installed and ArchR may not automatically select the one that you expect.
<- findMacs2() pathToMacs2
You could, of course, manually designate the path to the MACS2 executable instead. This may be necessary on some computational environments.
With the path to MACS2 identified, we can then create a reproducible merged peak set w/ MACS2 (5-10 minutes). To avoid bias from pseudo-bulk replicates that have very few cells, we can provide a cutoff for the upper limit of the number of peaks called per cell via the peaksPerCell
parameter. This prevents clusters with very few cells from contributing lots of low quality peaks to the merged peak set. There are many other parameters that can be tweaked in addReproduciblePeakSet()
- try ?addReproduciblePeakSet
or check out the online function definition for more information. Of particular note, the reproducibility
parameter defines what is meant by a “reproducible peak”. For example, reproducibility = "2"
means at least 2 pseudo-bulk replicates must have a peak call at this locus and reproducibility = "(n+1)/2"
means that the majority of pseudo-bulk replicates must have a peak call at this locus.
<- addReproduciblePeakSet(
projHeme4 ArchRProj = projHeme4,
groupBy = "Clusters2",
pathToMacs2 = pathToMacs2
)## ArchR logging to : ArchRLogs/ArchR-addReproduciblePeakSet-371b0f34d5c2-Date-2022-12-23_Time-07-30-35.log
## If there is an issue, please report to github with logFile!
## Calling Peaks with Macs2
## 2022-12-23 07:30:37 : Peak Calling Parameters!, 0.037 mins elapsed.
## Group nCells nCellsUsed nReplicates nMin nMax maxPeaks
## B B 428 423 2 166 257 150000
## CD4.M CD4.M 614 584 2 84 500 150000
## CD4.N CD4.N 1239 540 2 40 500 150000
## CLP CLP 383 383 2 87 296 150000
## Erythroid Erythroid 709 670 2 170 500 150000
## GMP GMP 1200 796 2 296 500 150000
## Mono Mono 2672 1000 2 500 500 150000
## NK NK 932 830 2 330 500 150000
## pDC pDC 315 305 2 151 154 150000
## PreB PreB 354 354 2 40 314 150000
## Progenitor Progenitor 1404 638 2 138 500 150000
## 2022-12-23 07:30:37 : Batching Peak Calls!, 0.037 mins elapsed.
## 2022-12-23 07:30:37 : Batch Execution w/ safelapply!, 0 mins elapsed.
## 2022-12-23 07:32:20 : Identifying Reproducible Peaks!, 1.751 mins elapsed.
## 2022-12-23 07:32:30 : Creating Union Peak Set!, 1.912 mins elapsed.
## Converged after 8 iterations!
## Plotting Ggplot!
## 2022-12-23 07:32:36 : Finished Creating Union Peak Set (142475)!, 2.021 mins elapsed.
The iterative overlap procedure that ArchR uses to create this reproducible peak set occurs in two stages. During the first step of addReproduciblePeakSet()
, a reproducible peak set is created for each group indicated by groupBy
. Using the psuedobulk replicates generated by addGroupCoverages()
, ArchR identifies peaks that are reproducible across the replicates for each group indicated by the groupBy
argument. The score of those peak calls is stored in the score
column and the quantile rank of that score for the individual pseudobulk replicate that it was identified in is stored in replicateScoreQuantile
. It is this quantile normalization that accounts for differences in number of fragments per group etc. that affect peak calling. For each of these groups, ArchR saves the group-specific peak set as a GRanges
object in the PeakCalls
subdirectory of the ArchRProject
. Additionally, ArchR saves a a GRanges
object that contains the summit calls for each pseudobulk replicate within the ./PeakCalls/ReplicateCalls
directory.
list.files(path=paste0(getOutputDirectory(ArchRProj = projHeme4),"/PeakCalls"))
## [1] "B-reproduciblePeaks.gr.rds" "CD4.M-reproduciblePeaks.gr.rds"
## [3] "CD4.N-reproduciblePeaks.gr.rds" "CLP-reproduciblePeaks.gr.rds"
## [5] "Clusters2" "Erythroid-reproduciblePeaks.gr.rds"
## [7] "GMP-reproduciblePeaks.gr.rds" "Mono-reproduciblePeaks.gr.rds"
## [9] "NK-reproduciblePeaks.gr.rds" "pDC-reproduciblePeaks.gr.rds"
## [11] "PreB-reproduciblePeaks.gr.rds" "Progenitor-reproduciblePeaks.gr.rds"
During the second step of addReproduciblePeakSet()
, ArchR merges these group-specific peak sets into a single union peak set. The quantile rank of the peak score
across the group is stored in groupScoreQuantile
and the group it comes from is stored in GroupReplicate
. GroupReplicate
is an annotation for the group from which each peak originated during the process of iterative overlap peak merging. It is important to note that these annotations do not inherently mean that the given peak was only called in that cell group, rather that the annotated group had the highest normalized significance for that peak call.
Each ArchRProject
object can only contain a single peak set at a time. As such, we assign the output of addReproduciblePeakSet()
to our desired ArchRProject
. If you would like to experiment with different peak sets, you have two options. Option #1 - you can save multiple copies of your ArchRProject
using saveArchRProject()
. Option #2 - you can create multiple peak sets, store them as GenomicRanges
objects, and change your peak set using the addPeakSet()
function.
The addPeakSet()
function can be used to set any arbitrary peak set as the peak set for your ArchRProject
.
To retrieve the peak set stored in your ArchRProject
as a GRanges
object, we use the getPeakSet()
function.
getPeakSet(projHeme4)
## GRanges object with 142475 ranges and 13 metadata columns:
## seqnames ranges strand | score replicateScoreQuantile
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## Mono chr1 752494-752994 * | 35.0551 0.829
## B chr1 762696-763196 * | 48.7983 0.939
## B chr1 801002-801502 * | 11.4240 0.609
## GMP chr1 805065-805565 * | 30.1400 0.953
## CLP chr1 845326-845826 * | 2.7681 0.704
## ... ... ... ... . ... ...
## CD4.M chrX 154493043-154493543 * | 7.04640 0.334
## NK chrX 154493558-154494058 * | 28.07310 0.831
## NK chrX 154807254-154807754 * | 4.58050 0.405
## GMP chrX 154842383-154842883 * | 8.14111 0.750
## GMP chrX 154996993-154997493 * | 4.89826 0.507
## groupScoreQuantile Reproducibility GroupReplicate
## <numeric> <numeric> <character>
## Mono 0.685 2 Mono._.scATAC_BMMC_R1
## B 0.861 2 B._.scATAC_PBMC_R1
## B 0.316 2 B._.scATAC_PBMC_R1
## GMP 0.892 2 GMP._.scATAC_BMMC_R1
## CLP 0.310 2 CLP._.scATAC_BMMC_R1
## ... ... ... ...
## CD4.M 0.061 2 CD4.M._.scATAC_PBMC_R1
## NK 0.709 2 NK._.scATAC_BMMC_R1
## NK 0.133 2 NK._.scATAC_BMMC_R1
## GMP 0.475 2 GMP._.scATAC_BMMC_R1
## GMP 0.200 2 GMP._.scATAC_CD34_BM..
## distToGeneStart nearestGene peakType distToTSS nearestTSS GC
## <integer> <character> <character> <integer> <character> <numeric>
## Mono 10157 LINC00115 Distal 10157 uc010nxx.2 0.4850
## B 24 LINC01128 Promoter 24 uc031pjj.1 0.6966
## B 10929 FAM41C Distal 10929 uc001abt.4 0.4371
## GMP 6866 FAM41C Intronic 6866 uc001abt.4 0.7345
## CLP 9240 LINC02593 Distal 1238 uc001abu.1 0.7904
## ... ... ... ... ... ... ...
## CD4.M 558 RAB39B Intronic 558 uc004fne.3 0.5609
## NK 43 RAB39B Promoter 43 uc004fne.3 0.5828
## NK 35117 TMLHE Intronic 35117 uc004fnn.3 0.4990
## GMP 10 TMLHE Promoter 10 uc004fnn.3 0.6088
## GMP 154620 TMLHE Distal 207 uc004fnq.1 0.5449
## idx N
## <integer> <numeric>
## Mono 1 0
## B 2 0
## B 3 0
## GMP 4 0
## CLP 5 0
## ... ... ...
## CD4.M 3416 0
## NK 3417 0
## NK 3418 0
## GMP 3419 0
## GMP 3420 0
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
The peak sets from addReproduciblePeakSet()
are annotated with lots of metadata about each peak region. The score
represents the normalized -log10(pval)
returned by MACS2. There are other annotations such as distToGeneStart
and distToTSS
which are determined based on the contents of getGeneAnnotation(ArchRProj)$genes
and getGeneAnnotation(ArchRProj)$TSS
respectively. While it may seem like these should be identical, a single gene can have multiple TSSs and ArchR needs to deal with this multi-mapping issue, which we have done by including both annotations. For example, in ArchR’s default hg38 geneAnnotation
, there are 23,274 genes and 49,052 TSSs. One confusing piece of information annotated in the peak set is the idx
column which does not correspond to a unique peak identifier. Instead the idx
column is actually only unique per-chromosome, so for a peak located on “chr1” with and idx
of “5”, this means that this is the 5th peak in sequential order on “chr1” (but there would also be a peak with idx
of “5” on “chr2”).