ATAC-seq allows for the unbiased identification of TFs that exhibit large changes in chromatin accessibility at sites containing their DNA binding motifs. However, families of TFs (for ex. GATA factors) share similar features in their binding motifs when looking in aggregate through position weight matrices (PWMs).
This motif similarity makes it challenging to identify the specific TFs that might be driving observed changes in chromatin accessibility at their predicted binding sites. To circumvent this challenge, we have previously ATAC-seq and RNA-seq to identify TFs whose gene expression is positively correlated to changes in the accessibility of their corresponding motif. We term these TFs “positive regulators”. However, this analysis relies on matched gene expression data which may not be readily available in all experiments. To overcome this dependency, ArchR can identify TFs whose inferred gene scores are correlated to their chromVAR TF deviation z-scores. To achieve this, ArchR correlates chromVAR deviation z-scores of TF motifs with gene activity scores of TF genes from the low-overlapping cell aggregates. When using scRNA-seq integration with ArchR, gene expression of the TF can be used instead of inferred gene activity score.
Step 1. Identify Deviant TF Motifs
The first part of identifying positive TF regulators is identification of deviant TF motifs. We performed this analysis in a previous chapter, creating a MotifMatrix
of chromVAR deviations and deviation z-scores for all motifs. We can obtain this data, aggregated by clusters, by using the getGroupSE()
function which returns a SummarizedExperiment
.
seGroupMotif <- getGroupSE(ArchRProj = projHeme5, useMatrix = "MotifMatrix", groupBy = "Clusters2")
## ArchR logging to : ArchRLogs/ArchR-getGroupSE-371b04341f344-Date-2022-12-23_Time-08-52-14.log
## If there is an issue, please report to github with logFile!
## Getting Group Matrix
## 2022-12-23 08:52:23 : Successfully Created Group Matrix, 0.116 mins elapsed.
## Normalizing by number of Cells
## ArchR logging successful to : ArchRLogs/ArchR-getGroupSE-371b04341f344-Date-2022-12-23_Time-08-52-14.log
Because this SummarizedExperiment
object comes from the MotifMatrix
is has two seqnames - “deviations” and “z” - corresponding to the raw deviations and deviation z-scores from chromVAR.
seGroupMotif
## class: SummarizedExperiment
## dim: 1740 11
## metadata(0):
## assays(1): MotifMatrix
## rownames(1740): f1 f2 ... f1739 f1740
## rowData names(3): seqnames idx name
## colnames(11): B CD4.M ... PreB Progenitor
## colData names(22): TSSEnrichment ReadsInTSS ... FRIP nCells
We can subset this SummarizedExperiment
to just the deviation z-scores.
seZ <- seGroupMotif[rowData(seGroupMotif)$seqnames=="z",]
Then we can identify the maximum delta in z-score between all clusters. This will be helpful in stratifying motifs based on the degree of variation observed across clusters.
rowData(seZ)$maxDelta <- lapply(seq_len(ncol(seZ)), function(x){
rowMaxs(assay(seZ) - assay(seZ)[,x])
}) %>% Reduce("cbind", .) %>% rowMaxs
Step 2. Identify Correlated TF Motifs and TF Gene Score/Expression
To identify TFs whose motif accessibility is correlated with with their own gene activity (either by gene score or gene expression), we use the correlateMatrices()
function and provide the two matrices that we are interested in, in this case the GeneScoreMatrix
and the MotifMatrix
. As mentioned previously, these correlations are determined across many low-overlapping cell aggregates determined in the lower dimension space specified in the reducedDims
parameter.
corGSM_MM <- correlateMatrices(
ArchRProj = projHeme5,
useMatrix1 = "GeneScoreMatrix",
useMatrix2 = "MotifMatrix",
reducedDims = "IterativeLSI"
)
## ArchR logging to : ArchRLogs/ArchR-correlateMatrices-371b01449776c-Date-2022-12-23_Time-08-52-25.log
## If there is an issue, please report to github with logFile!
## When accessing features from a matrix of class Sparse.Assays.Matrix it requires 1 seqname!
## Continuing with first seqname 'z'!
## If confused, try getFeatures(ArchRProj, 'MotifMatrix') to list out available seqnames for input!
## 2022-12-23 08:52:31 : Testing 825 Mappings!, 0.095 mins elapsed.
## 2022-12-23 08:52:31 : Computing KNN, 0.095 mins elapsed.
## 2022-12-23 08:52:31 : Identifying Non-Overlapping KNN pairs, 0.097 mins elapsed.
## 2022-12-23 08:52:33 : Identified 490 Groupings!, 0.124 mins elapsed.
## 2022-12-23 08:52:33 : Getting Group Matrix 1, 0.131 mins elapsed.
## 2022-12-23 08:52:51 : Getting Group Matrix 2, 0.422 mins elapsed.
## Some entries in groupMat2 are less than 0, continuing without Log2 Normalization.
## Most likely this assay is a deviations matrix.
## Getting Correlations...
## 2022-12-23 08:52:59 :
## Computing Correlation (250 of 825)
## Computing Correlation (500 of 825)
## Computing Correlation (750 of 825)
## ArchR logging successful to : ArchRLogs/ArchR-correlateMatrices-371b01449776c-Date-2022-12-23_Time-08-52-25.log
This function returns a DataFrame
object that that contains the elements from each matrix and the correlation across the low-overlapping cell aggregates.
corGSM_MM
## DataFrame with 825 rows and 14 columns
## GeneScoreMatrix_name MotifMatrix_name cor padj pval
## <character> <character> <numeric> <numeric> <numeric>
## 1 HES4 HES4_95 -0.0222476 1.00000e+00 6.23232e-01
## 2 HES5 HES5_98 0.1930514 1.37831e-02 1.68498e-05
## 3 PRDM16 PRDM16_211 0.4968602 5.34096e-29 6.52929e-32
## 4 TP73 TP73_705 0.4367128 2.52781e-21 3.09024e-24
## 5 TP73-AS1 TP73_705 -0.1481933 8.18935e-01 1.00114e-03
## ... ... ... ... ... ...
## 821 TFDP3 TFDP3_309 -0.00717456 1 0.874132
## 822 ZNF75D ZNF75D_272 0.00376659 1 0.933721
## 823 ZIC3 ZIC3_215 -0.03655746 1 0.419415
## 824 SOX3 SOX3_759 -0.00537698 1 0.905496
## 825 MECP2 MECP2_645 0.01997529 1 0.659150
## GeneScoreMatrix_seqnames GeneScoreMatrix_start GeneScoreMatrix_end
## <character> <integer> <integer>
## 1 chr1 935552 934342
## 2 chr1 2461684 2460184
## 3 chr1 2985742 3355185
## 4 chr1 3569129 3652765
## 5 chr1 3663937 3652548
## ... ... ... ...
## 821 chrX 132352376 132350697
## 822 chrX 134429965 134419723
## 823 chrX 136648346 136654259
## 824 chrX 139587225 139585152
## 825 chrX 153363188 153287264
## GeneScoreMatrix_strand GeneScoreMatrix_idx GeneScoreMatrix_matchName
## <integer> <integer> <character>
## 1 2 15 HES4
## 2 2 74 HES5
## 3 1 82 PRDM16
## 4 1 89 TP73
## 5 2 90 TP73
## ... ... ... ...
## 821 2 697 TFDP3
## 822 2 728 ZNF75D
## 823 1 753 ZIC3
## 824 2 765 SOX3
## 825 2 874 MECP2
## MotifMatrix_seqnames MotifMatrix_idx MotifMatrix_matchName
## <character> <integer> <character>
## 1 z 95 HES4
## 2 z 98 HES5
## 3 z 211 PRDM16
## 4 z 705 TP73
## 5 z 705 TP73
## ... ... ... ...
## 821 z 309 TFDP3
## 822 z 272 ZNF75D
## 823 z 215 ZIC3
## 824 z 759 SOX3
## 825 z 645 MECP2
We can perform the same analysis using the GeneIntegrationMatrix
instead of the GeneScoreMatrix
.
corGIM_MM <- correlateMatrices(
ArchRProj = projHeme5,
useMatrix1 = "GeneIntegrationMatrix",
useMatrix2 = "MotifMatrix",
reducedDims = "IterativeLSI"
)
## ArchR logging to : ArchRLogs/ArchR-correlateMatrices-371b05aa57e2d-Date-2022-12-23_Time-08-52-59.log
## If there is an issue, please report to github with logFile!
## When accessing features from a matrix of class Sparse.Assays.Matrix it requires 1 seqname!
## Continuing with first seqname 'z'!
## If confused, try getFeatures(ArchRProj, 'MotifMatrix') to list out available seqnames for input!
## 2022-12-23 08:53:05 : Testing 798 Mappings!, 0.086 mins elapsed.
## 2022-12-23 08:53:05 : Computing KNN, 0.087 mins elapsed.
## 2022-12-23 08:53:05 : Identifying Non-Overlapping KNN pairs, 0.089 mins elapsed.
## 2022-12-23 08:53:07 : Identified 490 Groupings!, 0.119 mins elapsed.
## 2022-12-23 08:53:07 : Getting Group Matrix 1, 0.127 mins elapsed.
## 2022-12-23 08:53:23 : Getting Group Matrix 2, 0.397 mins elapsed.
## Some entries in groupMat2 are less than 0, continuing without Log2 Normalization.
## Most likely this assay is a deviations matrix.
## Getting Correlations...
## 2022-12-23 08:53:32 :
## Computing Correlation (250 of 798)
## Computing Correlation (500 of 798)
## Computing Correlation (750 of 798)
## ArchR logging successful to : ArchRLogs/ArchR-correlateMatrices-371b05aa57e2d-Date-2022-12-23_Time-08-52-59.log
corGIM_MM
## DataFrame with 798 rows and 14 columns
## GeneIntegrationMatrix_name MotifMatrix_name cor padj
## <character> <character> <numeric> <numeric>
## 1 HES4 HES4_95 -0.3948468 5.14257e-17
## 2 HES5 HES5_98 -0.5198649 1.47068e-32
## 3 PRDM16 PRDM16_211 0.0870686 1.00000e+00
## 4 TP73 TP73_705 NA NA
## 5 HES2 HES2_19 0.2852406 6.53411e-08
## ... ... ... ... ...
## 794 TFDP3 TFDP3_309 NA NA
## 795 ZNF75D ZNF75D_272 0.474335 3.83736e-26
## 796 ZIC3 ZIC3_215 NA NA
## 797 SOX3 SOX3_759 NA NA
## 798 MECP2 MECP2_645 0.455404 9.57311e-24
## pval GeneIntegrationMatrix_seqnames GeneIntegrationMatrix_start
## <numeric> <character> <integer>
## 1 9.90860e-20 chr1 935552
## 2 2.83368e-35 chr1 2461684
## 3 5.40941e-02 chr1 2985742
## 4 NA chr1 3569129
## 5 1.25898e-10 chr1 6484730
## ... ... ... ...
## 794 NA chrX 132352376
## 795 7.39377e-29 chrX 134429965
## 796 NA chrX 136648346
## 797 NA chrX 139587225
## 798 1.84453e-26 chrX 153363188
## GeneIntegrationMatrix_end GeneIntegrationMatrix_strand
## <integer> <integer>
## 1 934342 2
## 2 2460184 2
## 3 3355185 1
## 4 3652765 1
## 5 6472498 2
## ... ... ...
## 794 132350697 2
## 795 134419723 2
## 796 136654259 1
## 797 139585152 2
## 798 153287264 2
## GeneIntegrationMatrix_idx GeneIntegrationMatrix_matchName
## <integer> <character>
## 1 8 HES4
## 2 53 HES5
## 3 59 PRDM16
## 4 64 TP73
## 5 81 HES2
## ... ... ...
## 794 562 TFDP3
## 795 576 ZNF75D
## 796 595 ZIC3
## 797 602 SOX3
## 798 680 MECP2
## MotifMatrix_seqnames MotifMatrix_idx MotifMatrix_matchName
## <character> <integer> <character>
## 1 z 95 HES4
## 2 z 98 HES5
## 3 z 211 PRDM16
## 4 z 705 TP73
## 5 z 19 HES2
## ... ... ... ...
## 794 z 309 TFDP3
## 795 z 272 ZNF75D
## 796 z 215 ZIC3
## 797 z 759 SOX3
## 798 z 645 MECP2
Step 3. Add Maximum Delta Deviation to the Correlation Data Frame
For each of these correlation analyses, we can annotate each motif with the maximum delta observed between clusters which we calculated in Step 1.
corGSM_MM$maxDelta <- rowData(seZ)[match(corGSM_MM$MotifMatrix_name, rowData(seZ)$name), "maxDelta"]
corGIM_MM$maxDelta <- rowData(seZ)[match(corGIM_MM$MotifMatrix_name, rowData(seZ)$name), "maxDelta"]
Step 4. Identify Positive TF Regulators
We can use all of this information to identify positive TF regulators. In the examples below, we consider positive regulators as those TFs whose correlation between motif and gene score (or gene expression) is greater than 0.5 with an adjusted p-value less than 0.01 and a maximum inter-cluster difference in deviation z-score that is in the top quartile.
We apply these selection criteria and do a little text juggling to isolate the TF names.
corGSM_MM <- corGSM_MM[order(abs(corGSM_MM$cor), decreasing = TRUE), ]
corGSM_MM <- corGSM_MM[which(!duplicated(gsub("\\-.*","",corGSM_MM[,"MotifMatrix_name"]))), ]
corGSM_MM$TFRegulator <- "NO"
corGSM_MM$TFRegulator[which(corGSM_MM$cor > 0.5 & corGSM_MM$padj < 0.01 & corGSM_MM$maxDelta > quantile(corGSM_MM$maxDelta, 0.75))] <- "YES"
sort(corGSM_MM[corGSM_MM$TFRegulator=="YES",1])
## [1] "ASCL1" "BCL11A" "CEBPA-DT" "CEBPB" "CEBPD" "EBF1"
## [7] "EGR1" "EGR2" "EGR4" "EOMES" "ERF" "ETS1"
## [13] "ETV6" "FUBP1" "GATA1" "GATA2" "GATA5" "IRF2"
## [19] "JDP2" "KLF11" "LEF1" "MECOM" "MEF2A" "MITF"
## [25] "NFE2" "NFIA" "NFIC" "NFIL3" "NFIX" "NFKB2"
## [31] "PAX5" "POU2F1" "REL" "SMAD1" "SPI1" "SPIB"
## [37] "TAL1" "TCF15" "TFAP2C" "YY1" "ZEB1-AS1"
Having identified these positive TF regulators from gene scores and motif deviation z-scores, we can highlight them in a dot plot.
p <- ggplot(data.frame(corGSM_MM), aes(cor, maxDelta, color = TFRegulator)) +
geom_point() +
theme_ArchR() +
geom_vline(xintercept = 0, lty = "dashed") +
scale_color_manual(values = c("NO"="darkgrey", "YES"="firebrick3")) +
xlab("Correlation To Gene Score") +
ylab("Max TF Motif Delta") +
scale_y_continuous(
expand = c(0,0),
limits = c(0, max(corGSM_MM$maxDelta)*1.05)
)
p
## Warning: Removed 7 rows containing missing values (geom_point).
We can perform the same analysis for the correlations derived from our GeneIntegrationMatrix
.
corGIM_MM <- corGIM_MM[order(abs(corGIM_MM$cor), decreasing = TRUE), ]
corGIM_MM <- corGIM_MM[which(!duplicated(gsub("\\-.*","",corGIM_MM[,"MotifMatrix_name"]))), ]
corGIM_MM$TFRegulator <- "NO"
corGIM_MM$TFRegulator[which(corGIM_MM$cor > 0.5 & corGIM_MM$padj < 0.01 & corGIM_MM$maxDelta > quantile(corGIM_MM$maxDelta, 0.75))] <- "YES"
sort(corGIM_MM[corGIM_MM$TFRegulator=="YES",1])
## [1] "BACH1" "CEBPA" "CEBPB" "CEBPD" "CEBPE" "CTCF" "EBF1"
## [8] "EOMES" "ETS1" "FOS" "FOSL2" "GATA1" "GATA2" "ID3"
## [15] "IRF7" "JDP2" "KLF2" "LEF1" "MECOM" "MEF2A" "MITF"
## [22] "NFE2" "NFIA" "NFIB" "NFIC" "NFIL3" "NFIX" "NFKB2"
## [29] "NR4A1" "PAX5" "POU2F1" "POU5F1B" "RFX3" "RUNX1" "SPI1"
## [36] "STAT2" "TCF3" "TCF4" "TCF7L1"
p <- ggplot(data.frame(corGIM_MM), aes(cor, maxDelta, color = TFRegulator)) +
geom_point() +
theme_ArchR() +
geom_vline(xintercept = 0, lty = "dashed") +
scale_color_manual(values = c("NO"="darkgrey", "YES"="firebrick3")) +
xlab("Correlation To Gene Expression") +
ylab("Max TF Motif Delta") +
scale_y_continuous(
expand = c(0,0),
limits = c(0, max(corGIM_MM$maxDelta)*1.05)
)
p
## Warning: Removed 260 rows containing missing values (geom_point).