17.3 Peak2GeneLinkage with ArchR
Similar to co-accessibility, ArchR can also identify so-called “peak-to-gene links”. The primary differences between peak-to-gene links and co-accessibility is that co-accessibility is an ATAC-seq-only analysis that looks for correlations in accessibility between two peaks while peak-to-gene linkage leverages integrated scRNA-seq data to look for correlations between peak accessibility and gene expression. These represent orthogonal approaches to a similar problem. However, because peak-to-gene linkage correlates scATAC-seq and scRNA-seq data, we often think of these links as more relevant to gene regulatory interactions.
To identify peak-to-gene links in ArchR, we use the addPeak2GeneLinks()
function.
<- addPeak2GeneLinks(
projHeme5 ArchRProj = projHeme5,
reducedDims = "IterativeLSI"
)## ArchR logging to : ArchRLogs/ArchR-addPeak2GeneLinks-371b06fcb7e2a-Date-2022-12-23_Time-08-48-41.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 08:48:42 : Getting Available Matrices, 0.016 mins elapsed.
## 2022-12-23 08:48:44 : Filtered Low Prediction Score Cells (1356 of 10250, 0.132), 0.02 mins elapsed.
## 2022-12-23 08:48:46 : Computing KNN, 0.04 mins elapsed.
## 2022-12-23 08:48:46 : Identifying Non-Overlapping KNN pairs, 0.042 mins elapsed.
## 2022-12-23 08:48:47 : Identified 490 Groupings!, 0.068 mins elapsed.
## 2022-12-23 08:48:47 : Getting Group RNA Matrix, 0.069 mins elapsed.
## 2022-12-23 08:49:11 : Getting Group ATAC Matrix, 0.471 mins elapsed.
## 2022-12-23 08:49:36 : Normalizing Group Matrices, 0.879 mins elapsed.
## 2022-12-23 08:49:40 : Finding Peak Gene Pairings, 0.949 mins elapsed.
## 2022-12-23 08:49:40 : Computing Correlations, 0.955 mins elapsed.
## 2022-12-23 08:49:46 : Completed Peak2Gene Correlations!, 1.047 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-addPeak2GeneLinks-371b06fcb7e2a-Date-2022-12-23_Time-08-48-41.log
We can then retrieve these peak-to-gene links in a similar fashion to how we retrieved co-accessibility interactions by using the getPeak2GeneLinks()
function. As we saw previously, this function allows for a user-specified cutoff for correlation and resolution for linkages.
<- getPeak2GeneLinks(
p2g ArchRProj = projHeme5,
corCutOff = 0.45,
resolution = 1,
returnLoops = FALSE
)
When returnLoops
is set to false, this function returns a DataFrame
object anaolgous to the DataFrame
object returned by getCoAccessibility()
. The primary difference is that the indexes for the scATAC-seq peaks are stored in a column called idxATAC
and the indexes for the scRNA-seq gene are stored in a column called idxRNA
.
p2g## DataFrame with 39363 rows and 6 columns
## idxATAC idxRNA Correlation FDR VarQATAC VarQRNA
## <integer> <integer> <numeric> <numeric> <numeric> <numeric>
## 1 38 2 0.547094 4.01302e-38 0.838919 0.452126
## 2 11 3 0.493587 3.66743e-30 0.922478 0.481211
## 3 51 4 0.452546 5.96334e-25 0.423078 0.364604
## 4 3 6 0.567426 1.50370e-41 0.279235 0.483522
## 5 61 6 0.492354 5.38928e-30 0.903450 0.483522
## ... ... ... ... ... ... ...
## 39359 142443 18588 0.884094 5.73980e-160 0.794596 0.379066
## 39360 142444 18588 0.565647 3.06978e-41 0.327250 0.379066
## 39361 142433 18590 0.533444 5.93603e-36 0.559986 0.958443
## 39362 142443 18590 0.663546 1.42706e-61 0.794596 0.958443
## 39363 142443 18591 0.474466 1.20547e-27 0.794596 0.953981
Here, idxATAC
and idxRNA
refer to the row indices of the peak or gene in the corresponding geneSet
or peakSet
which can be accessed via the metadata component of the p2g
object.
metadata(p2g)
## $peakSet
## GRanges object with 142475 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 752494-752994 *
## [2] chr1 762696-763196 *
## [3] chr1 801002-801502 *
## [4] chr1 805065-805565 *
## [5] chr1 845326-845826 *
## ... ... ... ...
## [142471] chrX 154493043-154493543 *
## [142472] chrX 154493558-154494058 *
## [142473] chrX 154807254-154807754 *
## [142474] chrX 154842383-154842883 *
## [142475] chrX 154996993-154997493 *
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
##
## $geneSet
## GRanges object with 18601 ranges and 2 metadata columns:
## seqnames ranges strand | name idx
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 69091 * | OR4F5 1
## [2] chr1 762902 * | LINC00115 2
## [3] chr1 812182 * | FAM41C 3
## [4] chr1 860530 * | SAMD11 4
## [5] chr1 894679 * | NOC2L 5
## ... ... ... ... . ... ...
## [18597] chrX 154299695 * | BRCC3 708
## [18598] chrX 154444701 * | VBP1 709
## [18599] chrX 154493852 * | RAB39B 710
## [18600] chrX 154563986 * | CLIC2 711
## [18601] chrX 154842622 * | TMLHE 712
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
##
## $seATAC
## [1] "/corces/home/rcorces/scripts/github/ArchR_Website_Testing/bookdown/Save-ProjHeme4/Peak2GeneLinks/seATAC-Group-KNN.rds"
##
## $seRNA
## [1] "/corces/home/rcorces/scripts/github/ArchR_Website_Testing/bookdown/Save-ProjHeme4/Peak2GeneLinks/seRNA-Group-KNN.rds"
It is important not to confuse idxRNA
with the value in the idx
column of the geneSet
. Within the geneSet
, idx
corresponds to the chronological position of the given gene across each chromosome. As such, idx
is only unique across an individual chromosome and is not relevant for mapping idxRNA
to a gene name.
So, if we wanted to add the gene name and peak coordinates to our p2g
DataFrame
object, we would do the following:
$geneName <- mcols(metadata(p2g)$geneSet)$name[p2g$idxRNA]
p2g$peakName <- (metadata(p2g)$peakSet %>% {paste0(seqnames(.), "_", start(.), "_", end(.))})[p2g$idxATAC]
p2g
p2g## DataFrame with 39363 rows and 8 columns
## idxATAC idxRNA Correlation FDR VarQATAC VarQRNA
## <integer> <integer> <numeric> <numeric> <numeric> <numeric>
## 1 38 2 0.547094 4.01302e-38 0.838919 0.452126
## 2 11 3 0.493587 3.66743e-30 0.922478 0.481211
## 3 51 4 0.452546 5.96334e-25 0.423078 0.364604
## 4 3 6 0.567426 1.50370e-41 0.279235 0.483522
## 5 61 6 0.492354 5.38928e-30 0.903450 0.483522
## ... ... ... ... ... ... ...
## 39359 142443 18588 0.884094 5.73980e-160 0.794596 0.379066
## 39360 142444 18588 0.565647 3.06978e-41 0.327250 0.379066
## 39361 142433 18590 0.533444 5.93603e-36 0.559986 0.958443
## 39362 142443 18590 0.663546 1.42706e-61 0.794596 0.958443
## 39363 142443 18591 0.474466 1.20547e-27 0.794596 0.953981
## geneName peakName
## <character> <character>
## 1 LINC00115 chr1_935257_935757
## 2 FAM41C chr1_856376_856876
## 3 SAMD11 chr1_967727_968227
## 4 KLHL17 chr1_801002_801502
## 5 KLHL17 chr1_1003932_1004432
## ... ... ...
## 39359 CTAG2 chrX_153979965_15398..
## 39360 CTAG2 chrX_153990114_15399..
## 39361 DKC1 chrX_153946099_15394..
## 39362 DKC1 chrX_153979965_15398..
## 39363 MPP1 chrX_153979965_15398..
You may also notice that there is other information stored within the metadata()
of p2g
including pointers to files called seATAC
and seRNA
which represent SummarizedExperiment
objects for the ATAC-seq and RNA-seq data used to identify peak-to-gene linkages and these each include both the raw and normalized data matrices used in the analysis.
metadata(p2g)$seATAC
## [1] "/corces/home/rcorces/scripts/github/ArchR_Website_Testing/bookdown/Save-ProjHeme4/Peak2GeneLinks/seATAC-Group-KNN.rds"
metadata(p2g)$seRNA
## [1] "/corces/home/rcorces/scripts/github/ArchR_Website_Testing/bookdown/Save-ProjHeme4/Peak2GeneLinks/seRNA-Group-KNN.rds"
If we set returnLoops = TRUE
, then getPeak2GeneLinks()
will return a loop track GRanges
object that connects the peak and gene. As for co-accessibility, the start and end of the IRanges
object represent the position of the peak and gene being linked. When resolution = 1
, this links the center of the peak to the single-base TSS of the gene.
<- getPeak2GeneLinks(
p2g ArchRProj = projHeme5,
corCutOff = 0.45,
resolution = 1,
returnLoops = TRUE
)
1]]
p2g[[## GRanges object with 39316 ranges and 2 metadata columns:
## seqnames ranges strand | value FDR
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 762902-935507 * | 0.547094 4.01302e-38
## [2] chr1 801252-895967 * | 0.567426 1.50370e-41
## [3] chr1 812182-856626 * | 0.493587 3.66743e-30
## [4] chr1 860530-967977 * | 0.452546 5.96334e-25
## [5] chr1 894694-948847 * | 0.477327 5.18016e-28
## ... ... ... ... . ... ...
## [39312] chrX 153881853-153980215 * | 0.884094 5.73980e-160
## [39313] chrX 153881853-153990364 * | 0.565647 3.06978e-41
## [39314] chrX 153946349-153991031 * | 0.533444 5.93603e-36
## [39315] chrX 153980215-153991031 * | 0.663546 1.42706e-61
## [39316] chrX 153980215-154033802 * | 0.474466 1.20547e-27
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
We can alternatively decrease the resolution of these links by setting resolution = 1000
. This is primarily useful for plotting the links as a browser tracks because there are instances where many nearby peaks all link to the same gene and this can be difficult to visualize.
<- getPeak2GeneLinks(
p2g ArchRProj = projHeme5,
corCutOff = 0.45,
resolution = 1000,
returnLoops = TRUE
)
1]]
p2g[[## GRanges object with 37944 ranges and 2 metadata columns:
## seqnames ranges strand | value FDR
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 762500-935500 * | 0.547094 4.01302e-38
## [2] chr1 801500-895500 * | 0.567426 1.50370e-41
## [3] chr1 812500-856500 * | 0.493587 3.66743e-30
## [4] chr1 860500-967500 * | 0.452546 5.96334e-25
## [5] chr1 894500-948500 * | 0.477327 5.18016e-28
## ... ... ... ... . ... ...
## [37940] chrX 153881500-153980500 * | 0.884094 5.73980e-160
## [37941] chrX 153881500-153990500 * | 0.565647 3.06978e-41
## [37942] chrX 153946500-153991500 * | 0.533444 5.93603e-36
## [37943] chrX 153980500-153991500 * | 0.663546 1.42706e-61
## [37944] chrX 153980500-154033500 * | 0.474466 1.20547e-27
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
Decreasing the resolution even further also decreases the total number of peak-to-gene links identified.
<- getPeak2GeneLinks(
p2g ArchRProj = projHeme5,
corCutOff = 0.45,
resolution = 10000,
returnLoops = TRUE
)
1]]
p2g[[## GRanges object with 30685 ranges and 2 metadata columns:
## seqnames ranges strand | value FDR
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 765000-935000 * | 0.547094 4.01302e-38
## [2] chr1 805000-895000 * | 0.567426 1.50370e-41
## [3] chr1 815000-855000 * | 0.493587 3.66743e-30
## [4] chr1 865000-965000 * | 0.452546 5.96334e-25
## [5] chr1 895000-905000 * | 0.540529 4.57253e-37
## ... ... ... ... . ... ...
## [30681] chrX 153885000-153985000 * | 0.884094 5.73980e-160
## [30682] chrX 153885000-153995000 * | 0.565647 3.06978e-41
## [30683] chrX 153945000-153995000 * | 0.533444 5.93603e-36
## [30684] chrX 153985000-153995000 * | 0.663546 1.42706e-61
## [30685] chrX 153985000-154035000 * | 0.474466 1.20547e-27
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
You may find that you would like to use a statistical approach to evaluate peak-to-gene associations. Such an approach was developed in Regner et al. 2021 and involves using permutation testing to empirically derive an FDR. This approach is quite computationally expensive but has been implemented as an option in addPeak2GeneLinks()
via the addPermutedPval
parameter. For more information on this approach, see this discussion post.
17.3.1 Plotting browser tracks with peak-to-gene links
To plot these peak-to-gene links as a browser track, we use the same workflow shown for co-accessibility in the previous section. Here we use the plotBrowserTrack()
function
<- c(
markerGenes "CD34", #Early Progenitor
"GATA1", #Erythroid
"PAX5", "MS4A1", #B-Cell Trajectory
"CD14", #Monocytes
"CD3D", "CD8A", "TBX21", "IL7R" #TCells
)
<- plotBrowserTrack(
p ArchRProj = projHeme5,
groupBy = "Clusters2",
geneSymbol = markerGenes,
upstream = 50000,
downstream = 50000,
loops = getPeak2GeneLinks(projHeme5)
)## ArchR logging to : ArchRLogs/ArchR-plotBrowserTrack-371b03102842a-Date-2022-12-23_Time-08-49-47.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 08:49:48 : Validating Region, 0.016 mins elapsed.
## GRanges object with 9 ranges and 2 metadata columns:
## seqnames ranges strand | gene_id symbol
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 208059883-208084683 - | 947 CD34
## [2] chrX 48644982-48652717 + | 2623 GATA1
## [3] chr9 36838531-37034476 - | 5079 PAX5
## [4] chr11 60223282-60238225 + | 931 MS4A1
## [5] chr5 140011313-140013286 - | 929 CD14
## [6] chr11 118209789-118213459 - | 915 CD3D
## [7] chr2 87011728-87035519 - | 925 CD8A
## [8] chr17 45810610-45823485 + | 30009 TBX21
## [9] chr5 35856977-35879705 + | 3575 IL7R
## -------
## seqinfo: 24 sequences from hg19 genome
## 2022-12-23 08:49:48 : Adding Bulk Tracks (1 of 9), 0.018 mins elapsed.
## 2022-12-23 08:49:50 : Adding Feature Tracks (1 of 9), 0.048 mins elapsed.
## 2022-12-23 08:49:50 : Adding Loop Tracks (1 of 9), 0.049 mins elapsed.
## 2022-12-23 08:49:50 : Adding Gene Tracks (1 of 9), 0.051 mins elapsed.
## 2022-12-23 08:49:50 : Plotting, 0.054 mins elapsed.
## 2022-12-23 08:49:52 : Adding Bulk Tracks (2 of 9), 0.086 mins elapsed.
## 2022-12-23 08:49:54 : Adding Feature Tracks (2 of 9), 0.114 mins elapsed.
## 2022-12-23 08:49:54 : Adding Loop Tracks (2 of 9), 0.115 mins elapsed.
## 2022-12-23 08:49:54 : Adding Gene Tracks (2 of 9), 0.117 mins elapsed.
## 2022-12-23 08:49:54 : Plotting, 0.12 mins elapsed.
## 2022-12-23 08:49:56 : Adding Bulk Tracks (3 of 9), 0.146 mins elapsed.
## 2022-12-23 08:49:57 : Adding Feature Tracks (3 of 9), 0.174 mins elapsed.
## 2022-12-23 08:49:57 : Adding Loop Tracks (3 of 9), 0.175 mins elapsed.
## 2022-12-23 08:49:58 : Adding Gene Tracks (3 of 9), 0.179 mins elapsed.
## 2022-12-23 08:49:58 : Plotting, 0.182 mins elapsed.
## 2022-12-23 08:50:00 : Adding Bulk Tracks (4 of 9), 0.212 mins elapsed.
## 2022-12-23 08:50:01 : Adding Feature Tracks (4 of 9), 0.241 mins elapsed.
## 2022-12-23 08:50:01 : Adding Loop Tracks (4 of 9), 0.242 mins elapsed.
## 2022-12-23 08:50:02 : Adding Gene Tracks (4 of 9), 0.244 mins elapsed.
## 2022-12-23 08:50:02 : Plotting, 0.247 mins elapsed.
## 2022-12-23 08:50:03 : Adding Bulk Tracks (5 of 9), 0.268 mins elapsed.
## 2022-12-23 08:50:05 : Adding Feature Tracks (5 of 9), 0.296 mins elapsed.
## 2022-12-23 08:50:05 : Adding Loop Tracks (5 of 9), 0.297 mins elapsed.
## 2022-12-23 08:50:05 : Adding Gene Tracks (5 of 9), 0.302 mins elapsed.
## 2022-12-23 08:50:05 : Plotting, 0.305 mins elapsed.
## 2022-12-23 08:50:07 : Adding Bulk Tracks (6 of 9), 0.333 mins elapsed.
## 2022-12-23 08:50:09 : Adding Feature Tracks (6 of 9), 0.361 mins elapsed.
## 2022-12-23 08:50:09 : Adding Loop Tracks (6 of 9), 0.362 mins elapsed.
## 2022-12-23 08:50:09 : Adding Gene Tracks (6 of 9), 0.374 mins elapsed.
## 2022-12-23 08:50:10 : Plotting, 0.378 mins elapsed.
## 2022-12-23 08:50:11 : Adding Bulk Tracks (7 of 9), 0.399 mins elapsed.
## 2022-12-23 08:50:13 : Adding Feature Tracks (7 of 9), 0.428 mins elapsed.
## 2022-12-23 08:50:13 : Adding Loop Tracks (7 of 9), 0.429 mins elapsed.
## 2022-12-23 08:50:13 : Adding Gene Tracks (7 of 9), 0.437 mins elapsed.
## 2022-12-23 08:50:13 : Plotting, 0.44 mins elapsed.
## 2022-12-23 08:50:15 : Adding Bulk Tracks (8 of 9), 0.463 mins elapsed.
## 2022-12-23 08:50:17 : Adding Feature Tracks (8 of 9), 0.496 mins elapsed.
## 2022-12-23 08:50:17 : Adding Loop Tracks (8 of 9), 0.497 mins elapsed.
## 2022-12-23 08:50:17 : Adding Gene Tracks (8 of 9), 0.501 mins elapsed.
## 2022-12-23 08:50:17 : Plotting, 0.504 mins elapsed.
## 2022-12-23 08:50:19 : Adding Bulk Tracks (9 of 9), 0.535 mins elapsed.
## 2022-12-23 08:50:21 : Adding Feature Tracks (9 of 9), 0.563 mins elapsed.
## 2022-12-23 08:50:21 : Adding Loop Tracks (9 of 9), 0.564 mins elapsed.
## 2022-12-23 08:50:21 : Adding Gene Tracks (9 of 9), 0.567 mins elapsed.
## 2022-12-23 08:50:21 : Plotting, 0.571 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-plotBrowserTrack-371b03102842a-Date-2022-12-23_Time-08-49-47.log
To plot our browser track we use the grid.draw
function and select a specific marker gene by name using the $
accessor.
::grid.newpage()
grid::grid.draw(p$CD14) grid
To save an editable vectorized version of this plot, we use plotPDF()
.
plotPDF(plotList = p,
name = "Plot-Tracks-Marker-Genes-with-Peak2GeneLinks.pdf",
ArchRProj = projHeme5,
addDOC = FALSE, width = 5, height = 5)
## Plotting Gtable!
## NULL
## Plotting Gtable!
## NULL
## Plotting Gtable!
## NULL
## Plotting Gtable!
## NULL
## Plotting Gtable!
## NULL
## Plotting Gtable!
## NULL
## Plotting Gtable!
## NULL
## Plotting Gtable!
## NULL
## Plotting Gtable!
## NULL
17.3.2 Plotting a heatmap of peak-to-gene links
To visualize the correspondence of all of our peak-to-gene links, we can plot a peak-to-gene heatmap which contains two side-by-side heatmaps, one for our scATAC-seq data and one for our scRNA-seq data. To do this, we use the plotPeak2GeneHeatmap()
<- plotPeak2GeneHeatmap(ArchRProj = projHeme5, groupBy = "Clusters2")
p ## ArchR logging to : ArchRLogs/ArchR-plotPeak2GeneHeatmap-371b050dd435a-Date-2022-12-23_Time-08-50-27.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 08:50:32 : Determining KNN Groups!, 0.085 mins elapsed.
## 2022-12-23 08:50:34 : Ordering Peak2Gene Links!, 0.123 mins elapsed.
## 2022-12-23 08:50:43 : Constructing ATAC Heatmap!, 0.282 mins elapsed.
## Adding Annotations..
## Preparing Main Heatmap..
## 'magick' package is suggested to install to give better rasterization.
##
## Set `ht_opt$message = FALSE` to turn off this message.
## 2022-12-23 08:50:44 : Constructing RNA Heatmap!, 0.286 mins elapsed.
## Adding Annotations..
## Preparing Main Heatmap..
## 'magick' package is suggested to install to give better rasterization.
##
## Set `ht_opt$message = FALSE` to turn off this message.
## ArchR logging successful to : ArchRLogs/ArchR-plotPeak2GeneHeatmap-371b050dd435a-Date-2022-12-23_Time-08-50-27.log
The heatmap rows are clustered using k-means clustering based on the value passed to the parameter k
, which defaults to 25 as shown below.
p