12.2 Calling Peaks w/ Macs2

Prior to calling peaks in ArchR, you must run the addGroupCoverages() function! ArchR relies on these group coverage objects to perform peak calling.

As mentioned above, we generate a reproducible peak set in ArchR using the addReproduciblePeakSet() function. By default ArchR attempts to call peaks using MACS2 and this is absolutely the recommended peak caller. ArchR also implements its own native peak caller but we do not recommend using it - this alternative peak calling method is described in the next section.

To call peaks using MACS2, ArchR must be able to find the MACS2 executable. First, ArchR looks in your PATH environment variable. If this is unsuccessful, ArchR attempts to determine if you have installed MACS2 with either pip or pip3. If neither of these is successful, ArchR gives up and provides an error message. If you have MACS2 installed and ArchR cannot find it, you should provide the path to the addReproduciblePeakSet() function via the pathToMacs2 parameter. It is important to make sure that the MACS2 executable that ArchR is using is what you expect it to use. Some users have multiple instances of MACS2 installed and ArchR may not automatically select the one that you expect.

pathToMacs2 <- findMacs2()

You could, of course, manually designate the path to the MACS2 executable instead. This may be necessary on some computational environments.

With the path to MACS2 identified, we can then create a reproducible merged peak set w/ MACS2 (5-10 minutes). To avoid bias from pseudo-bulk replicates that have very few cells, we can provide a cutoff for the upper limit of the number of peaks called per cell via the peaksPerCell parameter. This prevents clusters with very few cells from contributing lots of low quality peaks to the merged peak set. There are many other parameters that can be tweaked in addReproduciblePeakSet() - try ?addReproduciblePeakSet or check out the online function definition for more information. Of particular note, the reproducibility parameter defines what is meant by a “reproducible peak”. For example, reproducibility = "2" means at least 2 pseudo-bulk replicates must have a peak call at this locus and reproducibility = "(n+1)/2" means that the majority of pseudo-bulk replicates must have a peak call at this locus.

projHeme4 <- addReproduciblePeakSet(
    ArchRProj = projHeme4, 
    groupBy = "Clusters2", 
    pathToMacs2 = pathToMacs2
)
## ArchR logging to : ArchRLogs/ArchR-addReproduciblePeakSet-371b0f34d5c2-Date-2022-12-23_Time-07-30-35.log
## If there is an issue, please report to github with logFile!
## Calling Peaks with Macs2
## 2022-12-23 07:30:37 : Peak Calling Parameters!, 0.037 mins elapsed.
##                 Group nCells nCellsUsed nReplicates nMin nMax maxPeaks
## B                   B    428        423           2  166  257   150000
## CD4.M           CD4.M    614        584           2   84  500   150000
## CD4.N           CD4.N   1239        540           2   40  500   150000
## CLP               CLP    383        383           2   87  296   150000
## Erythroid   Erythroid    709        670           2  170  500   150000
## GMP               GMP   1200        796           2  296  500   150000
## Mono             Mono   2672       1000           2  500  500   150000
## NK                 NK    932        830           2  330  500   150000
## pDC               pDC    315        305           2  151  154   150000
## PreB             PreB    354        354           2   40  314   150000
## Progenitor Progenitor   1404        638           2  138  500   150000
## 2022-12-23 07:30:37 : Batching Peak Calls!, 0.037 mins elapsed.
## 2022-12-23 07:30:37 : Batch Execution w/ safelapply!, 0 mins elapsed.
## 2022-12-23 07:32:20 : Identifying Reproducible Peaks!, 1.751 mins elapsed.
## 2022-12-23 07:32:30 : Creating Union Peak Set!, 1.912 mins elapsed.
## Converged after 8 iterations!
## Plotting Ggplot!
## 2022-12-23 07:32:36 : Finished Creating Union Peak Set (142475)!, 2.021 mins elapsed.

The iterative overlap procedure that ArchR uses to create this reproducible peak set occurs in two stages. During the first step of addReproduciblePeakSet(), a reproducible peak set is created for each group indicated by groupBy. Using the psuedobulk replicates generated by addGroupCoverages(), ArchR identifies peaks that are reproducible across the replicates for each group indicated by the groupBy argument. The score of those peak calls is stored in the score column and the quantile rank of that score for the individual pseudobulk replicate that it was identified in is stored in replicateScoreQuantile. It is this quantile normalization that accounts for differences in number of fragments per group etc. that affect peak calling. For each of these groups, ArchR saves the group-specific peak set as a GRanges object in the PeakCalls subdirectory of the ArchRProject. Additionally, ArchR saves a a GRanges object that contains the summit calls for each pseudobulk replicate within the ./PeakCalls/ReplicateCalls directory.

list.files(path=paste0(getOutputDirectory(ArchRProj = projHeme4),"/PeakCalls"))
##  [1] "B-reproduciblePeaks.gr.rds"          "CD4.M-reproduciblePeaks.gr.rds"     
##  [3] "CD4.N-reproduciblePeaks.gr.rds"      "CLP-reproduciblePeaks.gr.rds"       
##  [5] "Clusters2"                           "Erythroid-reproduciblePeaks.gr.rds" 
##  [7] "GMP-reproduciblePeaks.gr.rds"        "Mono-reproduciblePeaks.gr.rds"      
##  [9] "NK-reproduciblePeaks.gr.rds"         "pDC-reproduciblePeaks.gr.rds"       
## [11] "PreB-reproduciblePeaks.gr.rds"       "Progenitor-reproduciblePeaks.gr.rds"

During the second step of addReproduciblePeakSet(), ArchR merges these group-specific peak sets into a single union peak set. The quantile rank of the peak score across the group is stored in groupScoreQuantile and the group it comes from is stored in GroupReplicate. GroupReplicate is an annotation for the group from which each peak originated during the process of iterative overlap peak merging. It is important to note that these annotations do not inherently mean that the given peak was only called in that cell group, rather that the annotated group had the highest normalized significance for that peak call.

Each ArchRProject object can only contain a single peak set at a time. As such, we assign the output of addReproduciblePeakSet() to our desired ArchRProject. If you would like to experiment with different peak sets, you have two options. Option #1 - you can save multiple copies of your ArchRProject using saveArchRProject(). Option #2 - you can create multiple peak sets, store them as GenomicRanges objects, and change your peak set using the addPeakSet() function.

The addPeakSet() function can be used to set any arbitrary peak set as the peak set for your ArchRProject.

To retrieve the peak set stored in your ArchRProject as a GRanges object, we use the getPeakSet() function.

getPeakSet(projHeme4)
## GRanges object with 142475 ranges and 13 metadata columns:
##         seqnames              ranges strand |     score replicateScoreQuantile
##            <Rle>           <IRanges>  <Rle> | <numeric>              <numeric>
##    Mono     chr1       752494-752994      * |   35.0551                  0.829
##       B     chr1       762696-763196      * |   48.7983                  0.939
##       B     chr1       801002-801502      * |   11.4240                  0.609
##     GMP     chr1       805065-805565      * |   30.1400                  0.953
##     CLP     chr1       845326-845826      * |    2.7681                  0.704
##     ...      ...                 ...    ... .       ...                    ...
##   CD4.M     chrX 154493043-154493543      * |   7.04640                  0.334
##      NK     chrX 154493558-154494058      * |  28.07310                  0.831
##      NK     chrX 154807254-154807754      * |   4.58050                  0.405
##     GMP     chrX 154842383-154842883      * |   8.14111                  0.750
##     GMP     chrX 154996993-154997493      * |   4.89826                  0.507
##         groupScoreQuantile Reproducibility         GroupReplicate
##                  <numeric>       <numeric>            <character>
##    Mono              0.685               2  Mono._.scATAC_BMMC_R1
##       B              0.861               2     B._.scATAC_PBMC_R1
##       B              0.316               2     B._.scATAC_PBMC_R1
##     GMP              0.892               2   GMP._.scATAC_BMMC_R1
##     CLP              0.310               2   CLP._.scATAC_BMMC_R1
##     ...                ...             ...                    ...
##   CD4.M              0.061               2 CD4.M._.scATAC_PBMC_R1
##      NK              0.709               2    NK._.scATAC_BMMC_R1
##      NK              0.133               2    NK._.scATAC_BMMC_R1
##     GMP              0.475               2   GMP._.scATAC_BMMC_R1
##     GMP              0.200               2 GMP._.scATAC_CD34_BM..
##         distToGeneStart nearestGene    peakType distToTSS  nearestTSS        GC
##               <integer> <character> <character> <integer> <character> <numeric>
##    Mono           10157   LINC00115      Distal     10157  uc010nxx.2    0.4850
##       B              24   LINC01128    Promoter        24  uc031pjj.1    0.6966
##       B           10929      FAM41C      Distal     10929  uc001abt.4    0.4371
##     GMP            6866      FAM41C    Intronic      6866  uc001abt.4    0.7345
##     CLP            9240   LINC02593      Distal      1238  uc001abu.1    0.7904
##     ...             ...         ...         ...       ...         ...       ...
##   CD4.M             558      RAB39B    Intronic       558  uc004fne.3    0.5609
##      NK              43      RAB39B    Promoter        43  uc004fne.3    0.5828
##      NK           35117       TMLHE    Intronic     35117  uc004fnn.3    0.4990
##     GMP              10       TMLHE    Promoter        10  uc004fnn.3    0.6088
##     GMP          154620       TMLHE      Distal       207  uc004fnq.1    0.5449
##               idx         N
##         <integer> <numeric>
##    Mono         1         0
##       B         2         0
##       B         3         0
##     GMP         4         0
##     CLP         5         0
##     ...       ...       ...
##   CD4.M      3416         0
##      NK      3417         0
##      NK      3418         0
##     GMP      3419         0
##     GMP      3420         0
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths

The peak sets from addReproduciblePeakSet() are annotated with lots of metadata about each peak region. The score represents the normalized -log10(pval) returned by MACS2. There are other annotations such as distToGeneStart and distToTSS which are determined based on the contents of getGeneAnnotation(ArchRProj)$genes and getGeneAnnotation(ArchRProj)$TSS respectively. While it may seem like these should be identical, a single gene can have multiple TSSs and ArchR needs to deal with this multi-mapping issue, which we have done by including both annotations. For example, in ArchR’s default hg38 geneAnnotation, there are 23,274 genes and 49,052 TSSs. One confusing piece of information annotated in the peak set is the idx column which does not correspond to a unique peak identifier. Instead the idx column is actually only unique per-chromosome, so for a peak located on “chr1” with and idx of “5”, this means that this is the 5th peak in sequential order on “chr1” (but there would also be a peak with idx of “5” on “chr2”).