17.3 Peak2GeneLinkage with ArchR

Similar to co-accessibility, ArchR can also identify so-called “peak-to-gene links”. The primary differences between peak-to-gene links and co-accessibility is that co-accessibility is an ATAC-seq-only analysis that looks for correlations in accessibility between two peaks while peak-to-gene linkage leverages integrated scRNA-seq data to look for correlations between peak accessibility and gene expression. These represent orthogonal approaches to a similar problem. However, because peak-to-gene linkage correlates scATAC-seq and scRNA-seq data, we often think of these links as more relevant to gene regulatory interactions.

To identify peak-to-gene links in ArchR, we use the addPeak2GeneLinks() function.

projHeme5 <- addPeak2GeneLinks(
    ArchRProj = projHeme5,
    reducedDims = "IterativeLSI"
)
## ArchR logging to : ArchRLogs/ArchR-addPeak2GeneLinks-371b06fcb7e2a-Date-2022-12-23_Time-08-48-41.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 08:48:42 : Getting Available Matrices, 0.016 mins elapsed.
## 2022-12-23 08:48:44 : Filtered Low Prediction Score Cells (1356 of 10250, 0.132), 0.02 mins elapsed.
## 2022-12-23 08:48:46 : Computing KNN, 0.04 mins elapsed.
## 2022-12-23 08:48:46 : Identifying Non-Overlapping KNN pairs, 0.042 mins elapsed.
## 2022-12-23 08:48:47 : Identified 490 Groupings!, 0.068 mins elapsed.
## 2022-12-23 08:48:47 : Getting Group RNA Matrix, 0.069 mins elapsed.
## 2022-12-23 08:49:11 : Getting Group ATAC Matrix, 0.471 mins elapsed.
## 2022-12-23 08:49:36 : Normalizing Group Matrices, 0.879 mins elapsed.
## 2022-12-23 08:49:40 : Finding Peak Gene Pairings, 0.949 mins elapsed.
## 2022-12-23 08:49:40 : Computing Correlations, 0.955 mins elapsed.
## 2022-12-23 08:49:46 : Completed Peak2Gene Correlations!, 1.047 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-addPeak2GeneLinks-371b06fcb7e2a-Date-2022-12-23_Time-08-48-41.log

We can then retrieve these peak-to-gene links in a similar fashion to how we retrieved co-accessibility interactions by using the getPeak2GeneLinks() function. As we saw previously, this function allows for a user-specified cutoff for correlation and resolution for linkages.

p2g <- getPeak2GeneLinks(
    ArchRProj = projHeme5,
    corCutOff = 0.45,
    resolution = 1,
    returnLoops = FALSE
)

When returnLoops is set to false, this function returns a DataFrame object anaolgous to the DataFrame object returned by getCoAccessibility(). The primary difference is that the indexes for the scATAC-seq peaks are stored in a column called idxATAC and the indexes for the scRNA-seq gene are stored in a column called idxRNA.

p2g
## DataFrame with 39363 rows and 6 columns
##         idxATAC    idxRNA Correlation          FDR  VarQATAC   VarQRNA
##       <integer> <integer>   <numeric>    <numeric> <numeric> <numeric>
## 1            38         2    0.547094  4.01302e-38  0.838919  0.452126
## 2            11         3    0.493587  3.66743e-30  0.922478  0.481211
## 3            51         4    0.452546  5.96334e-25  0.423078  0.364604
## 4             3         6    0.567426  1.50370e-41  0.279235  0.483522
## 5            61         6    0.492354  5.38928e-30  0.903450  0.483522
## ...         ...       ...         ...          ...       ...       ...
## 39359    142443     18588    0.884094 5.73980e-160  0.794596  0.379066
## 39360    142444     18588    0.565647  3.06978e-41  0.327250  0.379066
## 39361    142433     18590    0.533444  5.93603e-36  0.559986  0.958443
## 39362    142443     18590    0.663546  1.42706e-61  0.794596  0.958443
## 39363    142443     18591    0.474466  1.20547e-27  0.794596  0.953981

Here, idxATAC and idxRNA refer to the row indices of the peak or gene in the corresponding geneSet or peakSet which can be accessed via the metadata component of the p2g object.

metadata(p2g)
## $peakSet
## GRanges object with 142475 ranges and 0 metadata columns:
##            seqnames              ranges strand
##               <Rle>           <IRanges>  <Rle>
##        [1]     chr1       752494-752994      *
##        [2]     chr1       762696-763196      *
##        [3]     chr1       801002-801502      *
##        [4]     chr1       805065-805565      *
##        [5]     chr1       845326-845826      *
##        ...      ...                 ...    ...
##   [142471]     chrX 154493043-154493543      *
##   [142472]     chrX 154493558-154494058      *
##   [142473]     chrX 154807254-154807754      *
##   [142474]     chrX 154842383-154842883      *
##   [142475]     chrX 154996993-154997493      *
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths
## 
## $geneSet
## GRanges object with 18601 ranges and 2 metadata columns:
##           seqnames    ranges strand |        name       idx
##              <Rle> <IRanges>  <Rle> | <character> <integer>
##       [1]     chr1     69091      * |       OR4F5         1
##       [2]     chr1    762902      * |   LINC00115         2
##       [3]     chr1    812182      * |      FAM41C         3
##       [4]     chr1    860530      * |      SAMD11         4
##       [5]     chr1    894679      * |       NOC2L         5
##       ...      ...       ...    ... .         ...       ...
##   [18597]     chrX 154299695      * |       BRCC3       708
##   [18598]     chrX 154444701      * |        VBP1       709
##   [18599]     chrX 154493852      * |      RAB39B       710
##   [18600]     chrX 154563986      * |       CLIC2       711
##   [18601]     chrX 154842622      * |       TMLHE       712
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths
## 
## $seATAC
## [1] "/corces/home/rcorces/scripts/github/ArchR_Website_Testing/bookdown/Save-ProjHeme4/Peak2GeneLinks/seATAC-Group-KNN.rds"
## 
## $seRNA
## [1] "/corces/home/rcorces/scripts/github/ArchR_Website_Testing/bookdown/Save-ProjHeme4/Peak2GeneLinks/seRNA-Group-KNN.rds"

It is important not to confuse idxRNA with the value in the idx column of the geneSet. Within the geneSet, idx corresponds to the chronological position of the given gene across each chromosome. As such, idx is only unique across an individual chromosome and is not relevant for mapping idxRNA to a gene name.

So, if we wanted to add the gene name and peak coordinates to our p2g DataFrame object, we would do the following:

p2g$geneName <- mcols(metadata(p2g)$geneSet)$name[p2g$idxRNA]
p2g$peakName <- (metadata(p2g)$peakSet %>% {paste0(seqnames(.), "_", start(.), "_", end(.))})[p2g$idxATAC]
p2g
## DataFrame with 39363 rows and 8 columns
##         idxATAC    idxRNA Correlation          FDR  VarQATAC   VarQRNA
##       <integer> <integer>   <numeric>    <numeric> <numeric> <numeric>
## 1            38         2    0.547094  4.01302e-38  0.838919  0.452126
## 2            11         3    0.493587  3.66743e-30  0.922478  0.481211
## 3            51         4    0.452546  5.96334e-25  0.423078  0.364604
## 4             3         6    0.567426  1.50370e-41  0.279235  0.483522
## 5            61         6    0.492354  5.38928e-30  0.903450  0.483522
## ...         ...       ...         ...          ...       ...       ...
## 39359    142443     18588    0.884094 5.73980e-160  0.794596  0.379066
## 39360    142444     18588    0.565647  3.06978e-41  0.327250  0.379066
## 39361    142433     18590    0.533444  5.93603e-36  0.559986  0.958443
## 39362    142443     18590    0.663546  1.42706e-61  0.794596  0.958443
## 39363    142443     18591    0.474466  1.20547e-27  0.794596  0.953981
##          geneName               peakName
##       <character>            <character>
## 1       LINC00115     chr1_935257_935757
## 2          FAM41C     chr1_856376_856876
## 3          SAMD11     chr1_967727_968227
## 4          KLHL17     chr1_801002_801502
## 5          KLHL17   chr1_1003932_1004432
## ...           ...                    ...
## 39359       CTAG2 chrX_153979965_15398..
## 39360       CTAG2 chrX_153990114_15399..
## 39361        DKC1 chrX_153946099_15394..
## 39362        DKC1 chrX_153979965_15398..
## 39363        MPP1 chrX_153979965_15398..

You may also notice that there is other information stored within the metadata() of p2g including pointers to files called seATAC and seRNA which represent SummarizedExperiment objects for the ATAC-seq and RNA-seq data used to identify peak-to-gene linkages and these each include both the raw and normalized data matrices used in the analysis.

metadata(p2g)$seATAC
## [1] "/corces/home/rcorces/scripts/github/ArchR_Website_Testing/bookdown/Save-ProjHeme4/Peak2GeneLinks/seATAC-Group-KNN.rds"
metadata(p2g)$seRNA
## [1] "/corces/home/rcorces/scripts/github/ArchR_Website_Testing/bookdown/Save-ProjHeme4/Peak2GeneLinks/seRNA-Group-KNN.rds"

If we set returnLoops = TRUE, then getPeak2GeneLinks() will return a loop track GRanges object that connects the peak and gene. As for co-accessibility, the start and end of the IRanges object represent the position of the peak and gene being linked. When resolution = 1, this links the center of the peak to the single-base TSS of the gene.

p2g <- getPeak2GeneLinks(
    ArchRProj = projHeme5,
    corCutOff = 0.45,
    resolution = 1,
    returnLoops = TRUE
)

p2g[[1]]
## GRanges object with 39316 ranges and 2 metadata columns:
##           seqnames              ranges strand |     value          FDR
##              <Rle>           <IRanges>  <Rle> | <numeric>    <numeric>
##       [1]     chr1       762902-935507      * |  0.547094  4.01302e-38
##       [2]     chr1       801252-895967      * |  0.567426  1.50370e-41
##       [3]     chr1       812182-856626      * |  0.493587  3.66743e-30
##       [4]     chr1       860530-967977      * |  0.452546  5.96334e-25
##       [5]     chr1       894694-948847      * |  0.477327  5.18016e-28
##       ...      ...                 ...    ... .       ...          ...
##   [39312]     chrX 153881853-153980215      * |  0.884094 5.73980e-160
##   [39313]     chrX 153881853-153990364      * |  0.565647  3.06978e-41
##   [39314]     chrX 153946349-153991031      * |  0.533444  5.93603e-36
##   [39315]     chrX 153980215-153991031      * |  0.663546  1.42706e-61
##   [39316]     chrX 153980215-154033802      * |  0.474466  1.20547e-27
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths

We can alternatively decrease the resolution of these links by setting resolution = 1000. This is primarily useful for plotting the links as a browser tracks because there are instances where many nearby peaks all link to the same gene and this can be difficult to visualize.

p2g <- getPeak2GeneLinks(
    ArchRProj = projHeme5,
    corCutOff = 0.45,
    resolution = 1000,
    returnLoops = TRUE
)

p2g[[1]]
## GRanges object with 37944 ranges and 2 metadata columns:
##           seqnames              ranges strand |     value          FDR
##              <Rle>           <IRanges>  <Rle> | <numeric>    <numeric>
##       [1]     chr1       762500-935500      * |  0.547094  4.01302e-38
##       [2]     chr1       801500-895500      * |  0.567426  1.50370e-41
##       [3]     chr1       812500-856500      * |  0.493587  3.66743e-30
##       [4]     chr1       860500-967500      * |  0.452546  5.96334e-25
##       [5]     chr1       894500-948500      * |  0.477327  5.18016e-28
##       ...      ...                 ...    ... .       ...          ...
##   [37940]     chrX 153881500-153980500      * |  0.884094 5.73980e-160
##   [37941]     chrX 153881500-153990500      * |  0.565647  3.06978e-41
##   [37942]     chrX 153946500-153991500      * |  0.533444  5.93603e-36
##   [37943]     chrX 153980500-153991500      * |  0.663546  1.42706e-61
##   [37944]     chrX 153980500-154033500      * |  0.474466  1.20547e-27
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths

Decreasing the resolution even further also decreases the total number of peak-to-gene links identified.

p2g <- getPeak2GeneLinks(
    ArchRProj = projHeme5,
    corCutOff = 0.45,
    resolution = 10000,
    returnLoops = TRUE
)

p2g[[1]]
## GRanges object with 30685 ranges and 2 metadata columns:
##           seqnames              ranges strand |     value          FDR
##              <Rle>           <IRanges>  <Rle> | <numeric>    <numeric>
##       [1]     chr1       765000-935000      * |  0.547094  4.01302e-38
##       [2]     chr1       805000-895000      * |  0.567426  1.50370e-41
##       [3]     chr1       815000-855000      * |  0.493587  3.66743e-30
##       [4]     chr1       865000-965000      * |  0.452546  5.96334e-25
##       [5]     chr1       895000-905000      * |  0.540529  4.57253e-37
##       ...      ...                 ...    ... .       ...          ...
##   [30681]     chrX 153885000-153985000      * |  0.884094 5.73980e-160
##   [30682]     chrX 153885000-153995000      * |  0.565647  3.06978e-41
##   [30683]     chrX 153945000-153995000      * |  0.533444  5.93603e-36
##   [30684]     chrX 153985000-153995000      * |  0.663546  1.42706e-61
##   [30685]     chrX 153985000-154035000      * |  0.474466  1.20547e-27
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths

You may find that you would like to use a statistical approach to evaluate peak-to-gene associations. Such an approach was developed in Regner et al. 2021 and involves using permutation testing to empirically derive an FDR. This approach is quite computationally expensive but has been implemented as an option in addPeak2GeneLinks() via the addPermutedPval parameter. For more information on this approach, see this discussion post.