12.2 Calling Peaks w/ Macs2

Prior to calling peaks in ArchR, you must run the addGroupCoverages() function! ArchR relies on these group coverage objects to perform peak calling.

As mentioned above, we generate a reproducible peak set in ArchR using the addReproduciblePeakSet() function. By default ArchR attempts to call peaks using MACS2 and this is absolutely the recommended peak caller. ArchR also implements its own native peak caller but we do not recommend using it - this alternative peak calling method is described in the next section.

To call peaks using MACS2, ArchR must be able to find the MACS2 executable. First, ArchR looks in your PATH environment variable. If this is unsuccessful, ArchR attempts to determine if you have installed MACS2 with either pip or pip3. If neither of these is successful, ArchR gives up and provides an error message. If you have MACS2 installed and ArchR cannot find it, you should provide the path to the addReproduciblePeakSet() function via the pathToMacs2 parameter. It is important to make sure that the MACS2 executable that ArchR is using is what you expect it to use. Some users have multiple instances of MACS2 installed and ArchR may not automatically select the one that you expect.

pathToMacs2 <- findMacs2()

## Searching For MACS2..

## Found with $PATH at /usr/local/bin/macs2

You could, of course, manually designate the path to the MACS2 executable instead. This may be necessary on some computational environments.

With the path to MACS2 identified, we can then create a reproducible merged peak set w/ MACS2 (5-10 minutes). To avoid bias from pseudo-bulk replicates that have very few cells, we can provide a cutoff for the upper limit of the number of peaks called per cell via the peaksPerCell parameter. This prevents clusters with very few cells from contributing lots of low quality peaks to the merged peak set. There are many other parameters that can be tweaked in addReproduciblePeakSet() - try ?addReproduciblePeakSet or check out the online function definition for more information. Of particular note, the reproducibility parameter defines what is meant by a “reproducible peak”. For example, reproducibility = "2" means at least 2 pseudo-bulk replicates must have a peak call at this locus and reproducibility = "(n+1)/2" means that the majority of pseudo-bulk replicates must have a peak call at this locus.

projHeme4 <- addReproduciblePeakSet(
    ArchRProj = projHeme4, 
    groupBy = "Clusters2", 
    pathToMacs2 = pathToMacs2
)
## ArchR logging to : ArchRLogs/ArchR-addReproduciblePeakSet-938fc9c7d-Date-2025-02-06_Time-01-57-22.608531.log
## If there is an issue, please report to github with logFile!
## Calling Peaks with Macs2
## 2025-02-06 01:57:23.144445 : Peak Calling Parameters!, 0.009 mins elapsed.
##                 Group nCells nCellsUsed nReplicates nMin nMax maxPeaks
## B                   B    421        416           2  162  254   150000
## CD4.M           CD4.M    633        586           2   86  500   150000
## CD4.N           CD4.N   1250        548           2   48  500   150000
## CLP               CLP    386        386           2   88  298   150000
## Erythroid   Erythroid    702        671           2  171  500   150000
## GMP               GMP   1261        810           2  310  500   150000
## Mono             Mono   2652       1000           2  500  500   150000
## NK                 NK    903        822           2  322  500   150000
## pDC               pDC    314        304           2  144  160   150000
## PreB             PreB    351        351           2   40  311   150000
## Progenitor Progenitor   1377        639           2  139  500   150000
## 2025-02-06 01:57:23.179426 : Batching Peak Calls!, 0.009 mins elapsed.
## 2025-02-06 01:57:23.321666 : Batch Execution w/ safelapply!, 0 mins elapsed.
## 2025-02-06 01:58:25.138741 : Identifying Reproducible Peaks!, 1.042 mins elapsed.
## 
## .rds"
## 2025-02-06 01:58:30.502697 : Creating Union Peak Set!, 1.132 mins elapsed.
## Converged after 9 iterations!
## Plotting Ggplot!
## 2025-02-06 01:58:38.413358 : Finished Creating Union Peak Set (147407)!, 1.263 mins elapsed.

The iterative overlap procedure that ArchR uses to create this reproducible peak set occurs in two stages. During the first step of addReproduciblePeakSet(), a reproducible peak set is created for each group indicated by groupBy. Using the psuedobulk replicates generated by addGroupCoverages(), ArchR identifies peaks that are reproducible across the replicates for each group indicated by the groupBy argument. The score of those peak calls is stored in the score column and the quantile rank of that score for the individual pseudobulk replicate that it was identified in is stored in replicateScoreQuantile. It is this quantile normalization that accounts for differences in number of fragments per group etc. that affect peak calling. For each of these groups, ArchR saves the group-specific peak set as a GRanges object in the PeakCalls subdirectory of the ArchRProject. Additionally, ArchR saves a a GRanges object that contains the summit calls for each pseudobulk replicate within the ./PeakCalls/ReplicateCalls directory.

list.files(path=paste0(getOutputDirectory(ArchRProj = projHeme4),"/PeakCalls"))
##  [1] "B-reproduciblePeaks.gr.rds"          "CD4.M-reproduciblePeaks.gr.rds"     
##  [3] "CD4.N-reproduciblePeaks.gr.rds"      "CLP-reproduciblePeaks.gr.rds"       
##  [5] "Clusters2"                           "Erythroid-reproduciblePeaks.gr.rds" 
##  [7] "GMP-reproduciblePeaks.gr.rds"        "Mono-reproduciblePeaks.gr.rds"      
##  [9] "NK-reproduciblePeaks.gr.rds"         "pDC-reproduciblePeaks.gr.rds"       
## [11] "PreB-reproduciblePeaks.gr.rds"       "Progenitor-reproduciblePeaks.gr.rds"

During the second step of addReproduciblePeakSet(), ArchR merges these group-specific peak sets into a single union peak set. The quantile rank of the peak score across the group is stored in groupScoreQuantile and the group it comes from is stored in GroupReplicate. GroupReplicate is an annotation for the group from which each peak originated during the process of iterative overlap peak merging. It is important to note that these annotations do not inherently mean that the given peak was only called in that cell group, rather that the annotated group had the highest normalized significance for that peak call.

Each ArchRProject object can only contain a single peak set at a time. As such, we assign the output of addReproduciblePeakSet() to our desired ArchRProject. If you would like to experiment with different peak sets, you have two options. Option #1 - you can save multiple copies of your ArchRProject using saveArchRProject(). Option #2 - you can create multiple peak sets, store them as GenomicRanges objects, and change your peak set using the addPeakSet() function.

The addPeakSet() function can be used to set any arbitrary peak set as the peak set for your ArchRProject.

To retrieve the peak set stored in your ArchRProject as a GRanges object, we use the getPeakSet() function.

getPeakSet(projHeme4)
## GRanges object with 147407 ranges and 13 metadata columns:
##              seqnames              ranges strand |     score
##                 <Rle>           <IRanges>  <Rle> | <numeric>
##         Mono     chr1       752503-753003      * |  41.29950
##            B     chr1       762688-763188      * |  46.92020
##          GMP     chr1       773648-774148      * |   6.32548
##        CD4.M     chr1       779906-780406      * |   8.42088
##            B     chr1       801002-801502      * |  11.64360
##          ...      ...                 ...    ... .       ...
##           NK     chrX 154807254-154807754      * |   4.70336
##         PreB     chrX 154840939-154841439      * |   1.45332
##         PreB     chrX 154841881-154842381      * |   4.49383
##           NK     chrX 154842390-154842890      * |  19.94720
##   Progenitor     chrX 154862036-154862536      * |  13.05070
##              replicateScoreQuantile groupScoreQuantile Reproducibility
##                           <numeric>          <numeric>       <numeric>
##         Mono                  0.855              0.733               2
##            B                  0.936              0.860               2
##          GMP                  0.585              0.242               2
##        CD4.M                  0.394              0.130               2
##            B                  0.608              0.318               2
##          ...                    ...                ...             ...
##           NK                  0.399              0.131               2
##         PreB                  0.242              0.052               2
##         PreB                  0.441              0.108               2
##           NK                  0.770              0.603               2
##   Progenitor                  0.710              0.256               2
##                      GroupReplicate distToGeneStart nearestGene    peakType
##                         <character>       <integer> <character> <character>
##         Mono  Mono._.scATAC_BMMC_R1           10148   LINC00115      Distal
##            B     B._.scATAC_PBMC_R1              32   LINC01128    Promoter
##          GMP GMP._.scATAC_CD34_BM..           10926   LINC01128    Intronic
##        CD4.M CD4.M._.scATAC_PBMC_R1           17184   LINC01128    Intronic
##            B     B._.scATAC_PBMC_R1           10929      FAM41C      Distal
##          ...                    ...             ...         ...         ...
##           NK    NK._.scATAC_BMMC_R1           35117       TMLHE    Intronic
##         PreB            PreB._.Rep2            1432       TMLHE    Intronic
##         PreB            PreB._.Rep1             490       TMLHE    Intronic
##           NK    NK._.scATAC_BMMC_R1              17       TMLHE    Promoter
##   Progenitor Progenitor._.scATAC_..           19663       TMLHE      Distal
##              distToTSS  nearestTSS        GC       idx         N
##              <integer> <character> <numeric> <integer> <numeric>
##         Mono     10148  uc010nxx.2    0.4830         1         0
##            B        32  uc031pjj.1    0.6966         2         0
##          GMP     10719  uc031pjk.1    0.4571         3         0
##        CD4.M     16977  uc031pjk.1    0.4850         4         0
##            B     10929  uc001abt.4    0.4371         5         0
##          ...       ...         ...       ...       ...       ...
##           NK     35117  uc004fnn.3    0.4990      3538         0
##         PreB      1432  uc004fnn.3    0.4271      3539         0
##         PreB       490  uc004fnn.3    0.5729      3540         0
##           NK        17  uc004fnn.3    0.6008      3541         0
##   Progenitor     19663  uc004fnn.3    0.4251      3542         0
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths

The peak sets from addReproduciblePeakSet() are annotated with lots of metadata about each peak region. The score represents the normalized -log10(pval) returned by MACS2. There are other annotations such as distToGeneStart and distToTSS which are determined based on the contents of getGeneAnnotation(ArchRProj)$genes and getGeneAnnotation(ArchRProj)$TSS respectively. While it may seem like these should be identical, a single gene can have multiple TSSs and ArchR needs to deal with this multi-mapping issue, which we have done by including both annotations. For example, in ArchR’s default hg38 geneAnnotation, there are 23,274 genes and 49,052 TSSs. One confusing piece of information annotated in the peak set is the idx column which does not correspond to a unique peak identifier. Instead the idx column is actually only unique per-chromosome, so for a peak located on “chr1” with and idx of “5”, this means that this is the 5th peak in sequential order on “chr1” (but there would also be a peak with idx of “5” on “chr2”).