15.2 ArchR and Custom Deviations

In the chapter on Peak Annotation Enrichment, we introduced how to create peak annotations for any set of genomic regions. This encluded (i) ArchR-supported region sets such as curated TF binding sites from ENCODE and peak sets from bulk ATAC-seq and (ii) custom user-supplied region sets. If you have not read this section, we recommend doing so to better understand how peak annotations work.

These peak annotations can be used in deviation calculations in the same way as motifs. Here we provide examples of how to run these analyses but note that the downstream analyses are identical to what was shown in the previous section for motifs and thus we do not provide extensive details on each step of the code. Once you create a deviations matrix in your Arrow files, the rest is the same.

15.2.1 Encode TFBS

In case you have not added an annotations matrix for the “EncodeTFBS” region set, lets do that now.

if("EncodeTFBS" %ni% names(projHeme5@peakAnnotation)){
    projHeme5 <- addArchRAnnotations(ArchRProj = projHeme5, collection = "EncodeTFBS")
}

Then, we create a deviations matrix, providing this peak annotation to the peakAnnotation parameter.

projHeme5 <- addDeviationsMatrix(
  ArchRProj = projHeme5, 
  peakAnnotation = "EncodeTFBS",
  force = TRUE
)
## Using Previous Background Peaks!
## ArchR logging to : ArchRLogs/ArchR-addDeviationsMatrix-371b073390571-Date-2022-12-23_Time-08-14-12.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 08:14:18 : Batch Execution w/ safelapply!, 0 mins elapsed.
## ###########
## 2022-12-23 08:21:45 : Completed Computing Deviations!, 7.536 mins elapsed.
## ###########
## ArchR logging successful to : ArchRLogs/ArchR-addDeviationsMatrix-371b073390571-Date-2022-12-23_Time-08-14-12.log

We can create a dot plot of the ranked deviations.

plotVarDev <- getVarDeviations(projHeme5, plot = TRUE, name = "EncodeTFBSMatrix")
## DataFrame with 6 rows and 6 columns
##      seqnames       idx                   name combinedVars combinedMeans
##         <Rle> <integer>            <character>    <numeric>     <numeric>
## f222        z       222    222.GATA2_S-K562...      22.1183   -0.02452491
## f542        z       542    542.TAL1_SC-K562...      19.5092   -0.00656765
## f498        z       498     498.GATA_2-K562...      16.2618   -0.01654390
## f584        z       584 584.GATA_1-PBDEFetal..      15.3336    0.00144969
## f497        z       497     497.GATA_1-K562...      12.9157    0.00417738
## f591        z       591    591.eGFP_GA-K562...      12.6567   -0.01163160
##           rank
##      <integer>
## f222         1
## f542         2
## f498         3
## f584         4
## f497         5
## f591         6
plotVarDev
## Warning: Removed 6 rows containing missing values (geom_point).
## Warning: ggrepel: 15 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

To save an editable vectorized version of this plot, we use the plotPDF() function.

plotPDF(plotVarDev, name = "Variable-EncodeTFBS-Deviation-Scores", width = 5, height = 5, ArchRProj = projHeme5, addDOC = FALSE)
## Plotting Ggplot!

Or we can subset these TF binding sites to particular motifs we are interested in and then plot their deviation z-scores per-cell on our UMAP embedding.

tfs <- c("GATA_1", "CEBPB", "EBF1", "IRF4", "TBX21", "PAX5")
getFeatures(projHeme5, select = paste(tfs, collapse="|"), useMatrix = "EncodeTFBSMatrix")
##  [1] "z:584.GATA_1-PBDEFetal..."          "z:582.GATA_1-PBDE..."              
##  [3] "z:497.GATA_1-K562..."               "z:477.CEBPB-K562..."               
##  [5] "z:462.CEBPB-IMR90..."               "z:427.CEBPB-HepG2..."              
##  [7] "z:426.CEBPB-HepG2..."               "z:379.CEBPB-HeLa_S3..."            
##  [9] "z:344.CEBPB-H1_hESC..."             "z:293.EBF1_SC-GM12878..."          
## [11] "z:278.CEBPB-A549..."                "z:213.CEBPB_S-K562..."             
## [13] "z:173.CEBPB_S-HepG2..."             "z:130.PAX5_C2-GM12892..."          
## [15] "z:123.PAX5_C2-GM12891..."           "z:102.PAX5_N1-GM12878..."          
## [17] "z:101.PAX5_C2-GM12878..."           "z:93.IRF4_SC-GM12878..."           
## [19] "z:87.EBF1_SC-GM12878..."            "z:86.CEBPB_S-GM12878..."           
## [21] "deviations:584.GATA_1-PBDEFetal..." "deviations:582.GATA_1-PBDE..."     
## [23] "deviations:497.GATA_1-K562..."      "deviations:477.CEBPB-K562..."      
## [25] "deviations:462.CEBPB-IMR90..."      "deviations:427.CEBPB-HepG2..."     
## [27] "deviations:426.CEBPB-HepG2..."      "deviations:379.CEBPB-HeLa_S3..."   
## [29] "deviations:344.CEBPB-H1_hESC..."    "deviations:293.EBF1_SC-GM12878..." 
## [31] "deviations:278.CEBPB-A549..."       "deviations:213.CEBPB_S-K562..."    
## [33] "deviations:173.CEBPB_S-HepG2..."    "deviations:130.PAX5_C2-GM12892..." 
## [35] "deviations:123.PAX5_C2-GM12891..."  "deviations:102.PAX5_N1-GM12878..." 
## [37] "deviations:101.PAX5_C2-GM12878..."  "deviations:93.IRF4_SC-GM12878..."  
## [39] "deviations:87.EBF1_SC-GM12878..."   "deviations:86.CEBPB_S-GM12878..."
markerTFs <- getFeatures(projHeme5, select = paste(tfs, collapse="|"), useMatrix = "EncodeTFBSMatrix")
markerTFs <- sort(grep("z:", markerTFs, value = TRUE))
TFnames <- stringr::str_split(stringr::str_split(markerTFs, pattern = "\\.", simplify=TRUE)[,2], pattern = "-", simplify = TRUE)[,1]
markerTFs <- markerTFs[!duplicated(TFnames)]
markerTFs
## [1] "z:101.PAX5_C2-GM12878..." "z:102.PAX5_N1-GM12878..."
## [3] "z:173.CEBPB_S-HepG2..."   "z:278.CEBPB-A549..."     
## [5] "z:293.EBF1_SC-GM12878..." "z:497.GATA_1-K562..."    
## [7] "z:93.IRF4_SC-GM12878..."
p <- plotEmbedding(
    ArchRProj = projHeme5, 
    colorBy = "EncodeTFBSMatrix", 
    name = markerTFs, 
    embedding = "UMAP",
    imputeWeights = getImputeWeights(projHeme5)
)
## Getting ImputeWeights
## No imputeWeights found, returning NULL
## ArchR logging to : ArchRLogs/ArchR-plotEmbedding-371b025ec5cd9-Date-2022-12-23_Time-08-21-52.log
## If there is an issue, please report to github with logFile!
## Getting UMAP Embedding
## ColorBy = EncodeTFBSMatrix
## Getting Matrix Values...
## 2022-12-23 08:21:54 :
## 
## Plotting Embedding
## 1 2 3 4 5 6 7 
## ArchR logging successful to : ArchRLogs/ArchR-plotEmbedding-371b025ec5cd9-Date-2022-12-23_Time-08-21-52.log
p2 <- lapply(p, function(x){
    x + guides(color = "none", fill = "none") + 
    theme_ArchR(baseSize = 6.5) +
    theme(plot.margin = unit(c(0, 0, 0, 0), "cm")) +
    theme(
        axis.text.x=element_blank(), 
        axis.ticks.x=element_blank(), 
        axis.text.y=element_blank(), 
        axis.ticks.y=element_blank()
    )
})
do.call(cowplot::plot_grid, c(list(ncol = 3),p2))

15.2.2 Bulk ATAC-seq

Similarly, we can use ArchR-curated bulk ATAC-seq peak sets for our motif deviation calculations.If you have not added motif annotations

In case you have not added an annotations matrix for the “EncodeTFBS” region set, lets do that now.

if("ATAC" %ni% names(projHeme5@peakAnnotation)){
    projHeme5 <- addArchRAnnotations(ArchRProj = projHeme5, collection = "ATAC")
}

Then, we create a deviations matrix, providing this peak annotation to the peakAnnotation parameter.

projHeme5 <- addDeviationsMatrix(
  ArchRProj = projHeme5, 
  peakAnnotation = "ATAC",
  force = TRUE
)
## Using Previous Background Peaks!
## ArchR logging to : ArchRLogs/ArchR-addDeviationsMatrix-371b0688994cb-Date-2022-12-23_Time-08-22-03.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 08:22:09 : Batch Execution w/ safelapply!, 0 mins elapsed.
## ###########
## 2022-12-23 08:26:49 : Completed Computing Deviations!, 4.77 mins elapsed.
## ###########
## ArchR logging successful to : ArchRLogs/ArchR-addDeviationsMatrix-371b0688994cb-Date-2022-12-23_Time-08-22-03.log

We can create a dot plot of the ranked deviations.

plotVarDev <- getVarDeviations(projHeme5, plot = TRUE, name = "ATACMatrix")
## DataFrame with 6 rows and 6 columns
##     seqnames       idx                   name combinedVars combinedMeans
##        <Rle> <integer>            <character>    <numeric>     <numeric>
## f22        z        22  IAtlas_T_CD8posCenMem      14.6856     -0.180982
## f24        z        24  IAtlas_T_FollicHelper      14.2771     -0.152118
## f21        z        21        IAtlas_T_CD8pos      14.0959     -0.172118
## f19        z        19 IAtlas_T_CD4posEffec..      14.0040     -0.154975
## f33        z        33  IAtlas_T_Th1Precursor      13.9651     -0.168027
## f26        z        26    IAtlas_T_MemoryTeff      13.9117     -0.163131
##          rank
##     <integer>
## f22         1
## f24         2
## f21         3
## f19         4
## f33         5
## f26         6
plotVarDev
## Warning: ggrepel: 18 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

To save an editable vectorized version of this plot, we use the plotPDF() function.

plotPDF(plotVarDev, name = "Variable-ATAC-Deviation-Scores", width = 5, height = 5, ArchRProj = projHeme5, addDOC = FALSE)
## Plotting Ggplot!

Or we can plot the deviation z-scores for each of these peak sets per-cell on our UMAP embedding.

ATACPeaks <- c("Heme_HSC", "Heme_LMPP", "Heme_Ery", "Heme_Mono", "Heme_CD4", "Heme_CD8", "Heme_B", "Heme_NK", "IAtlas_DC_Plasmacytoid")
markerATAC <- getFeatures(projHeme5, select = paste(ATACPeaks, collapse="|"), useMatrix = "ATACMatrix")
markerATAC <- sort(grep("z:", markerATAC, value = TRUE))
markerATAC
## [1] "z:Heme_B"                 "z:Heme_CD4"              
## [3] "z:Heme_CD8"               "z:Heme_Ery"              
## [5] "z:Heme_HSC"               "z:Heme_LMPP"             
## [7] "z:Heme_Mono"              "z:Heme_NK"               
## [9] "z:IAtlas_DC_Plasmacytoid"
p <- plotEmbedding(
    ArchRProj = projHeme5, 
    colorBy = "ATACMatrix", 
    name = markerATAC, 
    embedding = "UMAP",
    imputeWeights = getImputeWeights(projHeme5)
)
## Getting ImputeWeights
## No imputeWeights found, returning NULL
## ArchR logging to : ArchRLogs/ArchR-plotEmbedding-371b04807b084-Date-2022-12-23_Time-08-26-56.log
## If there is an issue, please report to github with logFile!
## Getting UMAP Embedding
## ColorBy = ATACMatrix
## Getting Matrix Values...
## 2022-12-23 08:26:58 :
## 
## Plotting Embedding
## 1 2 3 4 5 6 7 8 9 
## ArchR logging successful to : ArchRLogs/ArchR-plotEmbedding-371b04807b084-Date-2022-12-23_Time-08-26-56.log
p2 <- lapply(p, function(x){
    x + guides(color = "none", fill = "none") + 
    theme_ArchR(baseSize = 6.5) +
    theme(plot.margin = unit(c(0, 0, 0, 0), "cm")) +
    theme(
        axis.text.x=element_blank(), 
        axis.ticks.x=element_blank(), 
        axis.text.y=element_blank(), 
        axis.ticks.y=element_blank()
    )
})
do.call(cowplot::plot_grid, c(list(ncol = 3),p2))

15.2.3 Custom Deviations

Instead of using the ArchR-curated region sets described above, we can provide our own custom region sets as a peak annotation. These custom annotations can be used in exactly the same way as the ArchR-curated annotations.

First, in case you haven’t already created this “EncodePeaks” annotation in a previous chapter, lets create it by downloading some ENCODE peak sets and calling addPeakAnnotations().

EncodePeaks <- c(
  Encode_K562_GATA1 = "https://www.encodeproject.org/files/ENCFF632NQI/@@download/ENCFF632NQI.bed.gz",
  Encode_GM12878_CEBPB = "https://www.encodeproject.org/files/ENCFF761MGJ/@@download/ENCFF761MGJ.bed.gz",
  Encode_K562_Ebf1 = "https://www.encodeproject.org/files/ENCFF868VSY/@@download/ENCFF868VSY.bed.gz",
  Encode_K562_Pax5 = "https://www.encodeproject.org/files/ENCFF339KUO/@@download/ENCFF339KUO.bed.gz"
)
if("ChIP" %ni% names(projHeme5@peakAnnotation)){
    projHeme5 <- addPeakAnnotations(ArchRProj = projHeme5, regions = EncodePeaks, name = "ChIP")
}

Then, we make a deviations matrix from this peak annotation.

projHeme5 <- addDeviationsMatrix(
  ArchRProj = projHeme5, 
  peakAnnotation = "ChIP",
  force = TRUE
)
## Using Previous Background Peaks!
## ArchR logging to : ArchRLogs/ArchR-addDeviationsMatrix-371b01936d733-Date-2022-12-23_Time-08-27-08.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 08:27:14 : Batch Execution w/ safelapply!, 0 mins elapsed.
## ###########
## 2022-12-23 08:27:29 : Completed Computing Deviations!, 0.353 mins elapsed.
## ###########
## ArchR logging successful to : ArchRLogs/ArchR-addDeviationsMatrix-371b01936d733-Date-2022-12-23_Time-08-27-08.log

The rest of the analysis workflow is the same as what has now been presented multiple times above.

We can plot the ranked deviations.

plotVarDev <- getVarDeviations(projHeme5, plot = TRUE, name = "ChIPMatrix")
## DataFrame with 4 rows and 6 columns
##    seqnames       idx                 name combinedVars combinedMeans      rank
##       <Rle> <integer>          <character>    <numeric>     <numeric> <integer>
## f1        z         1    Encode_K562_GATA1     11.79756   -0.00871657         1
## f3        z         3     Encode_K562_Ebf1      4.83049    0.04633747         2
## f4        z         4     Encode_K562_Pax5      2.74405   -0.01775454         3
## f2        z         2 Encode_GM12878_CEBPB      1.39420   -0.00447118         4
plotVarDev
## Warning: Removed 21 rows containing missing values (geom_label_repel).

To save an editable vectorized version of this plot, we use the plotPDF() function.

plotPDF(plotVarDev, name = "Variable-ChIP-Deviation-Scores", width = 5, height = 5, ArchRProj = projHeme5, addDOC = FALSE)
## Plotting Ggplot!

And we can plot the deviation z-scores overlayed on our UMAP embedding.

markerChIP <- getFeatures(projHeme5, useMatrix = "ChIPMatrix")
markerChIP <- sort(grep("z:", markerChIP, value = TRUE))
markerChIP
## [1] "z:Encode_GM12878_CEBPB" "z:Encode_K562_Ebf1"     "z:Encode_K562_GATA1"   
## [4] "z:Encode_K562_Pax5"
p <- plotEmbedding(
    ArchRProj = projHeme5, 
    colorBy = "ChIPMatrix", 
    name = markerChIP, 
    embedding = "UMAP",
    imputeWeights = getImputeWeights(projHeme5)
)
## Getting ImputeWeights
## No imputeWeights found, returning NULL
## ArchR logging to : ArchRLogs/ArchR-plotEmbedding-371b05ee70cac-Date-2022-12-23_Time-08-27-36.log
## If there is an issue, please report to github with logFile!
## Getting UMAP Embedding
## ColorBy = ChIPMatrix
## Getting Matrix Values...
## 2022-12-23 08:27:38 :
## 
## Plotting Embedding
## 1 2 3 4 
## ArchR logging successful to : ArchRLogs/ArchR-plotEmbedding-371b05ee70cac-Date-2022-12-23_Time-08-27-36.log
p2 <- lapply(p, function(x){
    x + guides(color = "none", fill = "none") + 
    theme_ArchR(baseSize = 6.5) +
    theme(plot.margin = unit(c(0, 0, 0, 0), "cm")) +
    theme(
        axis.text.x=element_blank(), 
        axis.ticks.x=element_blank(), 
        axis.text.y=element_blank(), 
        axis.ticks.y=element_blank()
    )
})
do.call(cowplot::plot_grid, c(list(ncol = 2),p2))