14.1 Motif Enrichment

Continuing our analysis of differential peaks from the previous chapter, we can look for motifs that are enriched in peaks that are up or down in various cell types. To do this, we must first add these motif annotations to our ArchRProject. This effectively creates a binary matrix where the presence of a motif in each peak is indicated numerically. We do this using the addMotifAnnotations() function which determines motif presence in the peak set stored in the ArchRProject. This function allows you to use motif sets that have been curated from repositories such as JASPAR or CIS-BP. For JASPAR, ArchR retrieves these motif sets using the TFBSTools::getMatrixSet() function. For CIS-BP, ArchR retrieves motifs using the chromVARmotifs package. Because of this, certain species may or may not be available for any given motif set and you should familiarize yourself with the addMotifAnnotations() function docs if you are using anything other than human or mouse. To provide maximal flexibility, we allow users to provide their own motif set as a PWMatrixList object to the motifPWMs parameter. If you need to construct your own PWMatrixList object, check out the universalmotif package which provides for simplified conversion between various motif file formats (e.g. meme) and the PWMatrixList format from the TFBSTools package.

projHeme5 <- addMotifAnnotations(ArchRProj = projHeme5, motifSet = "cisbp", name = "Motif")
## ArchR logging to : ArchRLogs/ArchR-addMotifAnnotations-371b065f9506e-Date-2022-12-23_Time-07-55-20.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 07:55:21 : Getting Motif Set, Species : Homo sapiens, 0.019 mins elapsed.
## Using version 2 motifs!
## 2022-12-23 07:55:22 : Finding Motif Positions with motifmatchr!, 0.049 mins elapsed.
## 2022-12-23 07:58:33 : All Motifs Overlap at least 1 peak!, 3.222 mins elapsed.
## 2022-12-23 07:58:33 : Creating Motif Overlap Matrix, 3.222 mins elapsed.
## 2022-12-23 07:58:35 : Finished Getting Motif Info!, 3.254 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-addMotifAnnotations-371b065f9506e-Date-2022-12-23_Time-07-55-20.log

The motifSet used should be chosen carefully because annotations may not be available for your species.

14.1.1 Which motifs are found in a given peak?

Before diving into how to perform various motif enrichment analyses, we’ll take a slight detour to discuss how to figure out which motifs are present in a given peak. Lets say we are interested in understanding which TF motifs are present within the promoter peak of the CEBPA gene, located at chr19:33792929-33794030. First, we extract the peak set from our ArchRProject and create names for each peak based on the chromosome, start, and end positions.

pSet <- getPeakSet(ArchRProj = projHeme5)
pSet$name <- paste(seqnames(pSet), start(pSet), end(pSet), sep = "_")

Then, we extract a RangedSummarizedExperiment object containing the per-peak motif matches, create analogous names for that object, and then sort the matches object to ensure that the peaks are listed in the same order as the peaks from the peak set.

matches <- getMatches(ArchRProj = projHeme5, name = "Motif")
rownames(matches) <- paste(seqnames(matches), start(matches), end(matches), sep = "_")
matches <- matches[pSet$name]

Now, we just need to find the index of the CEBPA promoter peak of interest and find which motifs are present in that peak. Note that the below code only works when a single peak is being analyzed.

First, we make a GRanges object corresponding to the CEBPA promoter.

gr <- GRanges(seqnames = c("chr19"), ranges = IRanges(start = c(33792929), end = c(33794030)))

Then we find peaks from our peak set that overlap this region.

queryHits <- queryHits(findOverlaps(query = pSet, subject = gr, type = "within"))

Then we get the TFs that have motif matches in that peak.

colnames(matches)[which(assay(matches[queryHits,]))]
##  [1] "TFAP2D_2"    "HIF1A_24"    "TCFL5_25"    "HES1_34"     "TCF15_46"   
##  [6] "ATOH8_71"    "SCXB_93"     "HES4_95"     "SCXA_96"     "ZFX_158"    
## [11] "KLF6_165"    "ZFY_166"     "KLF5_175"    "ZNF423_176"  "CTCF_177"   
## [16] "SP4_180"     "KLF7_189"    "PLAGL1_190"  "EGR1_195"    "CTCFL_198"  
## [21] "SNAI1_199"   "GLIS2_201"   "KLF16_205"   "EGR4_207"    "ZNF148_222" 
## [26] "SP8_226"     "ZNF219_229"  "ZNF143_231"  "SP2_232"     "SP7_241"    
## [31] "SP3_247"     "GLIS1_250"   "ZBTB7A_258"  "ZBTB7C_265"  "WT1_266"    
## [36] "SP1_267"     "SP6_275"     "SP5_279"     "SP9_283"     "DNMT1_301"  
## [41] "ELF4_323"    "SPDEF_330"   "EHF_333"     "MBD2_644"    "MECP2_645"  
## [46] "TBP_793"     "EP300_804"   "NRF1_805"    "KLF2_846"    "ZFP161_850" 
## [51] "TFCP2L1_858" "SMAD5_866"

The same type of workflow could be used in conjunction with peak-to-gene links or co-accessibility, which will come up in later chapters, to identify TF motifs present in other sets of peaks. The key is matching peaks based on the chromosome, start and end positions as shown above.

14.1.2 Motif Enrichment in Differential Peaks

We can then use the differential testing SummarizedExperiment object markerTest which was generated in the previous chapter to define the set of significantly differential peaks that we are interested in testing for motif enrichment. In this case, we are looking for peaks that have an FDR <= 0.1 and a Log2FC >= 0.5. In the context of the differential comparison made in markerTest, these represnt peaks that are more accessible in “Erythroid” cells than “Progenitor” cells. We can test these differentially accessible peaks for enrichment of various motifs using the peakAnnoEnrichment() function. This function is a generalizable function that can be used for many different enrichment tests as we will demonstrate throughout this chapter.

motifsUp <- peakAnnoEnrichment(
    seMarker = markerTest,
    ArchRProj = projHeme5,
    peakAnnotation = "Motif",
    cutOff = "FDR <= 0.1 & Log2FC >= 0.5"
  )
## ArchR logging to : ArchRLogs/ArchR-peakAnnoEnrichment-371b01b4e03b9-Date-2022-12-23_Time-07-58-38.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 07:58:43 : Computing Enrichments 1 of 1, 0.09 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-peakAnnoEnrichment-371b01b4e03b9-Date-2022-12-23_Time-07-58-38.log

The output of peakAnnoEnrichment() is a SummarizedExperiment object containing multiple assays that store the results of enrichment testing with the hypergeometric test.

motifsUp
## class: SummarizedExperiment 
## dim: 870 1 
## metadata(0):
## assays(10): mlog10Padj mlog10p ... CompareFrequency feature
## rownames(870): TFAP2B_1 TFAP2D_2 ... TBX18_869 TBX22_870
## rowData names(0):
## colnames(1): Erythroid
## colData names(0):

To prepare this data for plotting with ggplot we can create a simplified data.frame object containing the motif names, the corrected p-values, and the significance rank.

df <- data.frame(TF = rownames(motifsUp), mlog10Padj = assay(motifsUp)[,1])
df <- df[order(df$mlog10Padj, decreasing = TRUE),]
df$rank <- seq_len(nrow(df))

As expected, the most enriched motifs in the peaks that are more accessible in “Erythroid” cells correspond to GATA transcription factors, consistent with the well-studied role of GATA1 in erythroid differentiation.

head(df)
##            TF mlog10Padj rank
## 384 GATA3_384   600.5159    1
## 383 GATA1_383   597.0116    2
## 388 GATA2_388   591.7144    3
## 385 GATA5_385   464.4175    4
## 386 GATA4_386   346.8489    5
## 387 GATA6_387   253.4357    6

Using ggplot we can plot the rank-sorted TF motifs and color them by the significance of their enrichment. Here we use ggrepel to label each TF motif.

ggUp <- ggplot(df, aes(rank, mlog10Padj, color = mlog10Padj)) + 
  geom_point(size = 1) +
  ggrepel::geom_label_repel(
        data = df[rev(seq_len(30)), ], aes(x = rank, y = mlog10Padj, label = TF), 
        size = 1.5,
        nudge_x = 2,
        color = "black"
  ) + theme_ArchR() + 
  ylab("-log10(P-adj) Motif Enrichment") + 
  xlab("Rank Sorted TFs Enriched") +
  scale_color_gradientn(colors = paletteContinuous(set = "comet"))

ggUp
## Warning: ggrepel: 22 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

We can perform the same analyses for the peaks that are more accessible in the “Progenitor” cells by using peaks with Log2FC <= -0.5.

motifsDo <- peakAnnoEnrichment(
    seMarker = markerTest,
    ArchRProj = projHeme5,
    peakAnnotation = "Motif",
    cutOff = "FDR <= 0.1 & Log2FC <= -0.5"
  )
## ArchR logging to : ArchRLogs/ArchR-peakAnnoEnrichment-371b014d17609-Date-2022-12-23_Time-07-58-44.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 07:58:49 : Computing Enrichments 1 of 1, 0.091 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-peakAnnoEnrichment-371b014d17609-Date-2022-12-23_Time-07-58-44.log
motifsDo
## class: SummarizedExperiment 
## dim: 870 1 
## metadata(0):
## assays(10): mlog10Padj mlog10p ... CompareFrequency feature
## rownames(870): TFAP2B_1 TFAP2D_2 ... TBX18_869 TBX22_870
## rowData names(0):
## colnames(1): Erythroid
## colData names(0):
df <- data.frame(TF = rownames(motifsDo), mlog10Padj = assay(motifsDo)[,1])
df <- df[order(df$mlog10Padj, decreasing = TRUE),]
df$rank <- seq_len(nrow(df))

In this case, the most enriched motifs in the peaks that are more accessible in “Progenitor” cells correspond to RUNX, ELF, and CBFB motifs.

head(df)
##            TF mlog10Padj rank
## 326  ELF2_326  105.93806    1
## 56   TCF12_56   80.50256    2
## 733 RUNX1_733   73.70756    3
## 801  CBFB_801   65.90006    4
## 336  SPIB_336   62.57146    5
## 41    MYOG_41   56.55206    6
ggDo <- ggplot(df, aes(rank, mlog10Padj, color = mlog10Padj)) + 
  geom_point(size = 1) +
  ggrepel::geom_label_repel(
        data = df[rev(seq_len(30)), ], aes(x = rank, y = mlog10Padj, label = TF), 
        size = 1.5,
        nudge_x = 2,
        color = "black"
  ) + theme_ArchR() + 
  ylab("-log10(FDR) Motif Enrichment") +
  xlab("Rank Sorted TFs Enriched") +
  scale_color_gradientn(colors = paletteContinuous(set = "comet"))

ggDo
## Warning: ggrepel: 18 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

To save an editable vectorized version of these plots, we use the plotPDF() function.

plotPDF(ggUp, ggDo, name = "Erythroid-vs-Progenitor-Markers-Motifs-Enriched", width = 5, height = 5, ArchRProj = projHeme5, addDOC = FALSE)
## Plotting Ggplot!
## Plotting Ggplot!

14.1.3 Motif Enrichment in Marker Peaks

Similar to the motif enrichment analyses performed on the differential peaks in the previous section, we can also perform motif enrichment on our marker peaks identified using getMarkerFeatures().

To do this, we pass our marker peak SummarizedExperiment (markerPeaks, created previously) to the peakAnnotationEnrichment() function.

enrichMotifs <- peakAnnoEnrichment(
    seMarker = markerPeaks,
    ArchRProj = projHeme5,
    peakAnnotation = "Motif",
    cutOff = "FDR <= 0.1 & Log2FC >= 0.5"
  )
## ArchR logging to : ArchRLogs/ArchR-peakAnnoEnrichment-371b02f9a82f8-Date-2022-12-23_Time-07-58-54.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 07:58:59 : Computing Enrichments 1 of 11, 0.09 mins elapsed.
## 2022-12-23 07:59:00 : Computing Enrichments 2 of 11, 0.093 mins elapsed.
## 2022-12-23 07:59:00 : Computing Enrichments 3 of 11, 0.097 mins elapsed.
## 2022-12-23 07:59:00 : Computing Enrichments 4 of 11, 0.101 mins elapsed.
## 2022-12-23 07:59:00 : Computing Enrichments 5 of 11, 0.104 mins elapsed.
## 2022-12-23 07:59:00 : Computing Enrichments 6 of 11, 0.108 mins elapsed.
## 2022-12-23 07:59:01 : Computing Enrichments 7 of 11, 0.112 mins elapsed.
## 2022-12-23 07:59:01 : Computing Enrichments 8 of 11, 0.116 mins elapsed.
## 2022-12-23 07:59:01 : Computing Enrichments 9 of 11, 0.12 mins elapsed.
## 2022-12-23 07:59:01 : Computing Enrichments 10 of 11, 0.123 mins elapsed.
## 2022-12-23 07:59:01 : Computing Enrichments 11 of 11, 0.126 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-peakAnnoEnrichment-371b02f9a82f8-Date-2022-12-23_Time-07-58-54.log

The output of peakAnnoEnrichment() is a SummarizedExperiment object containing multiple assays that store the results of enrichment testing with the hypergeometric test.

enrichMotifs
## class: SummarizedExperiment 
## dim: 870 11 
## metadata(0):
## assays(10): mlog10Padj mlog10p ... CompareFrequency feature
## rownames(870): TFAP2B_1 TFAP2D_2 ... TBX18_869 TBX22_870
## rowData names(0):
## colnames(11): B CD4.M ... PreB Progenitor
## colData names(0):

We can directly plot these motif enrichments across all cell groups using the plotEnrichHeatmap() function. In this function, we limit the total number of motifs shown per cell group using the n parameter.

heatmapEM <- plotEnrichHeatmap(enrichMotifs, n = 7, transpose = TRUE)
## ArchR logging to : ArchRLogs/ArchR-plotEnrichHeatmap-371b047877f48-Date-2022-12-23_Time-07-59-02.log
## If there is an issue, please report to github with logFile!
## Adding Annotations..
## Preparing Main Heatmap..
## 'magick' package is suggested to install to give better rasterization.
## 
## Set `ht_opt$message = FALSE` to turn off this message.

We can diplay this plot using ComplexHeatmap::draw().

ComplexHeatmap::draw(heatmapEM, heatmap_legend_side = "bot", annotation_legend_side = "bot")

To save an editable vectorized version of this plot, we use the plotPDF() function.

plotPDF(heatmapEM, name = "Motifs-Enriched-Marker-Heatmap", width = 8, height = 6, ArchRProj = projHeme5, addDOC = FALSE)
## Plotting ComplexHeatmap!

14.1.4 Motif enrichment in arbitrary regions

It is also possible to perform the same motif enrichment analyses using the hypergeometric test in an arbitrary user-defined set of regions. To do this, we would use the customEnrichment() function. The most important things to note about this custom enrichment is that the user-defined set of regions must overlap with the peakset of the project and a peakAnnotation matches object must already have been added to the ArchRProject via addMotifAnnotations(), addArchRAnnotations(), or addPeakAnnotations().

14.1.5 Plotting motif logos

Because different motif annotations will use different position weight matrices (PWMs), it’s often useful to plot the sequence logo of the exact motif being used. Multiple tools exist to do this but the one we prefer is ggseqlogo

To do this, we must first extract the relevant motif information as a PWMatrix object.

pwm <- getPeakAnnotation(projHeme5, "Motif")$motifs[["SOX6_868"]]
pwm
## An object of class PWMatrix
## ID: ENSG00000110693_LINE19574_SOX6_I_N7
## Name: SOX6
## Matrix Class: Unknown
## strand: *
## Pseudocounts: 
## Tags: 
## $ensembl
## [1] "ENSG00000110693"
## 
## Background: 
##    A    C    G    T 
## 0.25 0.25 0.25 0.25 
## Matrix: 
##         [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## A  0.2856363  1.236986  1.297203 -2.131381  1.340070  1.259222 -1.224160
## C -0.6196693 -2.173286 -3.473518  1.257839 -3.473518 -1.259465 -1.366327
## G -0.4525775 -1.628926 -1.447733 -1.817039 -2.592395 -1.817039 -3.473518
## T  0.4023151 -1.407137 -2.592395 -1.604399 -2.592395 -3.473518  1.229625
##         [,8]
## A  0.4500618
## C -0.4283456
## G  0.2154626
## T -0.6169845

Then, we convert that object to a position probability matrix. We do this with a function, which could be used in an lapply statement if you wanted to do this for many PWMatrix objects or a PWMatrixList.

PWMatrixToProbMatrix <- function(x){
  if (class(x) != "PWMatrix") stop("x must be a TFBSTools::PWMatrix object")
  m <- (exp(as(x, "matrix"))) * TFBSTools::bg(x)/sum(TFBSTools::bg(x))
  m <- t(t(m)/colSums(m))
  m
}

ppm <- PWMatrixToProbMatrix(pwm)
ppm
##        [,1]       [,2]        [,3]       [,4]        [,5]        [,6]
## A 0.3326521 0.86130341 0.914762277 0.02966834 0.954827787 0.880670100
## C 0.1345306 0.02845076 0.007751938 0.87945252 0.007751938 0.070951425
## G 0.1589967 0.04903503 0.058775647 0.04062654 0.018710138 0.040626537
## T 0.3738206 0.06121080 0.018710138 0.05025260 0.018710138 0.007751938
##          [,7]      [,8]
## A 0.073501136 0.3921023
## C 0.063760514 0.1628965
## G 0.007751938 0.3101089
## T 0.854986412 0.1348923

This Position Probability Matrix (PPM), has column sums that add to 1.

colSums(ppm) %>% range
## [1] 1 1

We can then pass this PPM to the ggseqlogo() function for plotting. We can plot as a PWM using method = "bits". Note that you need to have the ggseqlogo package installed for this to work.

library(ggseqlogo)
ggseqlogo(ppm, method = "bits")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

Or we can plot as a PPM using method = "prob".

ggseqlogo(ppm, method = "prob")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.