17.4 Identification of Positive TF-Regulators

ATAC-seq allows for the unbiased identification of TFs that exhibit large changes in chromatin accessibility at sites containing their DNA binding motifs. However, families of TFs (for ex. GATA factors) share similar features in their binding motifs when looking in aggregate through position weight matrices (PWMs).

This motif similarity makes it challenging to identify the specific TFs that might be driving observed changes in chromatin accessibility at their predicted binding sites. To circumvent this challenge, we have previously ATAC-seq and RNA-seq to identify TFs whose gene expression is positively correlated to changes in the accessibility of their corresponding motif. We term these TFs “positive regulators”. However, this analysis relies on matched gene expression data which may not be readily available in all experiments. To overcome this dependency, ArchR can identify TFs whose inferred gene scores are correlated to their chromVAR TF deviation z-scores. To achieve this, ArchR correlates chromVAR deviation z-scores of TF motifs with gene activity scores of TF genes from the low-overlapping cell aggregates. When using scRNA-seq integration with ArchR, gene expression of the TF can be used instead of inferred gene activity score.

17.4.1 Step 1. Identify Deviant TF Motifs

The first part of identifying positive TF regulators is identification of deviant TF motifs. We performed this analysis in a previous chapter, creating a MotifMatrix of chromVAR deviations and deviation z-scores for all motifs. We can obtain this data, aggregated by clusters, by using the getGroupSE() function which returns a SummarizedExperiment.

seGroupMotif <- getGroupSE(ArchRProj = projHeme5, useMatrix = "MotifMatrix", groupBy = "Clusters2")
## ArchR logging to : ArchRLogs/ArchR-getGroupSE-371b04341f344-Date-2022-12-23_Time-08-52-14.log
## If there is an issue, please report to github with logFile!
## Getting Group Matrix
## 2022-12-23 08:52:23 : Successfully Created Group Matrix, 0.116 mins elapsed.
## Normalizing by number of Cells
## ArchR logging successful to : ArchRLogs/ArchR-getGroupSE-371b04341f344-Date-2022-12-23_Time-08-52-14.log

Because this SummarizedExperiment object comes from the MotifMatrix is has two seqnames - “deviations” and “z” - corresponding to the raw deviations and deviation z-scores from chromVAR.

seGroupMotif
## class: SummarizedExperiment 
## dim: 1740 11 
## metadata(0):
## assays(1): MotifMatrix
## rownames(1740): f1 f2 ... f1739 f1740
## rowData names(3): seqnames idx name
## colnames(11): B CD4.M ... PreB Progenitor
## colData names(22): TSSEnrichment ReadsInTSS ... FRIP nCells

We can subset this SummarizedExperiment to just the deviation z-scores.

seZ <- seGroupMotif[rowData(seGroupMotif)$seqnames=="z",]

Then we can identify the maximum delta in z-score between all clusters. This will be helpful in stratifying motifs based on the degree of variation observed across clusters.

rowData(seZ)$maxDelta <- lapply(seq_len(ncol(seZ)), function(x){
  rowMaxs(assay(seZ) - assay(seZ)[,x])
}) %>% Reduce("cbind", .) %>% rowMaxs

17.4.2 Step 2. Identify Correlated TF Motifs and TF Gene Score/Expression

To identify TFs whose motif accessibility is correlated with with their own gene activity (either by gene score or gene expression), we use the correlateMatrices() function and provide the two matrices that we are interested in, in this case the GeneScoreMatrix and the MotifMatrix. As mentioned previously, these correlations are determined across many low-overlapping cell aggregates determined in the lower dimension space specified in the reducedDims parameter.

corGSM_MM <- correlateMatrices(
    ArchRProj = projHeme5,
    useMatrix1 = "GeneScoreMatrix",
    useMatrix2 = "MotifMatrix",
    reducedDims = "IterativeLSI"
)
## ArchR logging to : ArchRLogs/ArchR-correlateMatrices-371b01449776c-Date-2022-12-23_Time-08-52-25.log
## If there is an issue, please report to github with logFile!
## When accessing features from a matrix of class Sparse.Assays.Matrix it requires 1 seqname!
## Continuing with first seqname 'z'!
## If confused, try getFeatures(ArchRProj, 'MotifMatrix') to list out available seqnames for input!
## 2022-12-23 08:52:31 : Testing 825 Mappings!, 0.095 mins elapsed.
## 2022-12-23 08:52:31 : Computing KNN, 0.095 mins elapsed.
## 2022-12-23 08:52:31 : Identifying Non-Overlapping KNN pairs, 0.097 mins elapsed.
## 2022-12-23 08:52:33 : Identified 490 Groupings!, 0.124 mins elapsed.
## 2022-12-23 08:52:33 : Getting Group Matrix 1, 0.131 mins elapsed.
## 2022-12-23 08:52:51 : Getting Group Matrix 2, 0.422 mins elapsed.
## Some entries in groupMat2 are less than 0, continuing without Log2 Normalization.
## Most likely this assay is a deviations matrix.
## Getting Correlations...
## 2022-12-23 08:52:59 :
## Computing Correlation (250 of 825)
## Computing Correlation (500 of 825)
## Computing Correlation (750 of 825)
## ArchR logging successful to : ArchRLogs/ArchR-correlateMatrices-371b01449776c-Date-2022-12-23_Time-08-52-25.log

This function returns a DataFrame object that that contains the elements from each matrix and the correlation across the low-overlapping cell aggregates.

corGSM_MM
## DataFrame with 825 rows and 14 columns
##     GeneScoreMatrix_name MotifMatrix_name         cor        padj        pval
##              <character>      <character>   <numeric>   <numeric>   <numeric>
## 1                   HES4          HES4_95  -0.0222476 1.00000e+00 6.23232e-01
## 2                   HES5          HES5_98   0.1930514 1.37831e-02 1.68498e-05
## 3                 PRDM16       PRDM16_211   0.4968602 5.34096e-29 6.52929e-32
## 4                   TP73         TP73_705   0.4367128 2.52781e-21 3.09024e-24
## 5               TP73-AS1         TP73_705  -0.1481933 8.18935e-01 1.00114e-03
## ...                  ...              ...         ...         ...         ...
## 821                TFDP3        TFDP3_309 -0.00717456           1    0.874132
## 822               ZNF75D       ZNF75D_272  0.00376659           1    0.933721
## 823                 ZIC3         ZIC3_215 -0.03655746           1    0.419415
## 824                 SOX3         SOX3_759 -0.00537698           1    0.905496
## 825                MECP2        MECP2_645  0.01997529           1    0.659150
##     GeneScoreMatrix_seqnames GeneScoreMatrix_start GeneScoreMatrix_end
##                  <character>             <integer>           <integer>
## 1                       chr1                935552              934342
## 2                       chr1               2461684             2460184
## 3                       chr1               2985742             3355185
## 4                       chr1               3569129             3652765
## 5                       chr1               3663937             3652548
## ...                      ...                   ...                 ...
## 821                     chrX             132352376           132350697
## 822                     chrX             134429965           134419723
## 823                     chrX             136648346           136654259
## 824                     chrX             139587225           139585152
## 825                     chrX             153363188           153287264
##     GeneScoreMatrix_strand GeneScoreMatrix_idx GeneScoreMatrix_matchName
##                  <integer>           <integer>               <character>
## 1                        2                  15                      HES4
## 2                        2                  74                      HES5
## 3                        1                  82                    PRDM16
## 4                        1                  89                      TP73
## 5                        2                  90                      TP73
## ...                    ...                 ...                       ...
## 821                      2                 697                     TFDP3
## 822                      2                 728                    ZNF75D
## 823                      1                 753                      ZIC3
## 824                      2                 765                      SOX3
## 825                      2                 874                     MECP2
##     MotifMatrix_seqnames MotifMatrix_idx MotifMatrix_matchName
##              <character>       <integer>           <character>
## 1                      z              95                  HES4
## 2                      z              98                  HES5
## 3                      z             211                PRDM16
## 4                      z             705                  TP73
## 5                      z             705                  TP73
## ...                  ...             ...                   ...
## 821                    z             309                 TFDP3
## 822                    z             272                ZNF75D
## 823                    z             215                  ZIC3
## 824                    z             759                  SOX3
## 825                    z             645                 MECP2

We can perform the same analysis using the GeneIntegrationMatrix instead of the GeneScoreMatrix.

corGIM_MM <- correlateMatrices(
    ArchRProj = projHeme5,
    useMatrix1 = "GeneIntegrationMatrix",
    useMatrix2 = "MotifMatrix",
    reducedDims = "IterativeLSI"
)
## ArchR logging to : ArchRLogs/ArchR-correlateMatrices-371b05aa57e2d-Date-2022-12-23_Time-08-52-59.log
## If there is an issue, please report to github with logFile!
## When accessing features from a matrix of class Sparse.Assays.Matrix it requires 1 seqname!
## Continuing with first seqname 'z'!
## If confused, try getFeatures(ArchRProj, 'MotifMatrix') to list out available seqnames for input!
## 2022-12-23 08:53:05 : Testing 798 Mappings!, 0.086 mins elapsed.
## 2022-12-23 08:53:05 : Computing KNN, 0.087 mins elapsed.
## 2022-12-23 08:53:05 : Identifying Non-Overlapping KNN pairs, 0.089 mins elapsed.
## 2022-12-23 08:53:07 : Identified 490 Groupings!, 0.119 mins elapsed.
## 2022-12-23 08:53:07 : Getting Group Matrix 1, 0.127 mins elapsed.
## 2022-12-23 08:53:23 : Getting Group Matrix 2, 0.397 mins elapsed.
## Some entries in groupMat2 are less than 0, continuing without Log2 Normalization.
## Most likely this assay is a deviations matrix.
## Getting Correlations...
## 2022-12-23 08:53:32 :
## Computing Correlation (250 of 798)
## Computing Correlation (500 of 798)
## Computing Correlation (750 of 798)
## ArchR logging successful to : ArchRLogs/ArchR-correlateMatrices-371b05aa57e2d-Date-2022-12-23_Time-08-52-59.log
corGIM_MM
## DataFrame with 798 rows and 14 columns
##     GeneIntegrationMatrix_name MotifMatrix_name        cor        padj
##                    <character>      <character>  <numeric>   <numeric>
## 1                         HES4          HES4_95 -0.3948468 5.14257e-17
## 2                         HES5          HES5_98 -0.5198649 1.47068e-32
## 3                       PRDM16       PRDM16_211  0.0870686 1.00000e+00
## 4                         TP73         TP73_705         NA          NA
## 5                         HES2          HES2_19  0.2852406 6.53411e-08
## ...                        ...              ...        ...         ...
## 794                      TFDP3        TFDP3_309         NA          NA
## 795                     ZNF75D       ZNF75D_272   0.474335 3.83736e-26
## 796                       ZIC3         ZIC3_215         NA          NA
## 797                       SOX3         SOX3_759         NA          NA
## 798                      MECP2        MECP2_645   0.455404 9.57311e-24
##            pval GeneIntegrationMatrix_seqnames GeneIntegrationMatrix_start
##       <numeric>                    <character>                   <integer>
## 1   9.90860e-20                           chr1                      935552
## 2   2.83368e-35                           chr1                     2461684
## 3   5.40941e-02                           chr1                     2985742
## 4            NA                           chr1                     3569129
## 5   1.25898e-10                           chr1                     6484730
## ...         ...                            ...                         ...
## 794          NA                           chrX                   132352376
## 795 7.39377e-29                           chrX                   134429965
## 796          NA                           chrX                   136648346
## 797          NA                           chrX                   139587225
## 798 1.84453e-26                           chrX                   153363188
##     GeneIntegrationMatrix_end GeneIntegrationMatrix_strand
##                     <integer>                    <integer>
## 1                      934342                            2
## 2                     2460184                            2
## 3                     3355185                            1
## 4                     3652765                            1
## 5                     6472498                            2
## ...                       ...                          ...
## 794                 132350697                            2
## 795                 134419723                            2
## 796                 136654259                            1
## 797                 139585152                            2
## 798                 153287264                            2
##     GeneIntegrationMatrix_idx GeneIntegrationMatrix_matchName
##                     <integer>                     <character>
## 1                           8                            HES4
## 2                          53                            HES5
## 3                          59                          PRDM16
## 4                          64                            TP73
## 5                          81                            HES2
## ...                       ...                             ...
## 794                       562                           TFDP3
## 795                       576                          ZNF75D
## 796                       595                            ZIC3
## 797                       602                            SOX3
## 798                       680                           MECP2
##     MotifMatrix_seqnames MotifMatrix_idx MotifMatrix_matchName
##              <character>       <integer>           <character>
## 1                      z              95                  HES4
## 2                      z              98                  HES5
## 3                      z             211                PRDM16
## 4                      z             705                  TP73
## 5                      z              19                  HES2
## ...                  ...             ...                   ...
## 794                    z             309                 TFDP3
## 795                    z             272                ZNF75D
## 796                    z             215                  ZIC3
## 797                    z             759                  SOX3
## 798                    z             645                 MECP2

Step 3. Add Maximum Delta Deviation to the Correlation Data Frame

For each of these correlation analyses, we can annotate each motif with the maximum delta observed between clusters which we calculated in Step 1.

corGSM_MM$maxDelta <- rowData(seZ)[match(corGSM_MM$MotifMatrix_name, rowData(seZ)$name), "maxDelta"]
corGIM_MM$maxDelta <- rowData(seZ)[match(corGIM_MM$MotifMatrix_name, rowData(seZ)$name), "maxDelta"]

Step 4. Identify Positive TF Regulators

We can use all of this information to identify positive TF regulators. In the examples below, we consider positive regulators as those TFs whose correlation between motif and gene score (or gene expression) is greater than 0.5 with an adjusted p-value less than 0.01 and a maximum inter-cluster difference in deviation z-score that is in the top quartile.

We apply these selection criteria and do a little text juggling to isolate the TF names.

corGSM_MM <- corGSM_MM[order(abs(corGSM_MM$cor), decreasing = TRUE), ]
corGSM_MM <- corGSM_MM[which(!duplicated(gsub("\\-.*","",corGSM_MM[,"MotifMatrix_name"]))), ]
corGSM_MM$TFRegulator <- "NO"
corGSM_MM$TFRegulator[which(corGSM_MM$cor > 0.5 & corGSM_MM$padj < 0.01 & corGSM_MM$maxDelta > quantile(corGSM_MM$maxDelta, 0.75))] <- "YES"
sort(corGSM_MM[corGSM_MM$TFRegulator=="YES",1])
##  [1] "ASCL1"    "BCL11A"   "CEBPA-DT" "CEBPB"    "CEBPD"    "EBF1"    
##  [7] "EGR1"     "EGR2"     "EGR4"     "EOMES"    "ERF"      "ETS1"    
## [13] "ETV6"     "FUBP1"    "GATA1"    "GATA2"    "GATA5"    "IRF2"    
## [19] "JDP2"     "KLF11"    "LEF1"     "MECOM"    "MEF2A"    "MITF"    
## [25] "NFE2"     "NFIA"     "NFIC"     "NFIL3"    "NFIX"     "NFKB2"   
## [31] "PAX5"     "POU2F1"   "REL"      "SMAD1"    "SPI1"     "SPIB"    
## [37] "TAL1"     "TCF15"    "TFAP2C"   "YY1"      "ZEB1-AS1"

Having identified these positive TF regulators from gene scores and motif deviation z-scores, we can highlight them in a dot plot.

p <- ggplot(data.frame(corGSM_MM), aes(cor, maxDelta, color = TFRegulator)) +
  geom_point() + 
  theme_ArchR() +
  geom_vline(xintercept = 0, lty = "dashed") + 
  scale_color_manual(values = c("NO"="darkgrey", "YES"="firebrick3")) +
  xlab("Correlation To Gene Score") +
  ylab("Max TF Motif Delta") +
  scale_y_continuous(
    expand = c(0,0), 
    limits = c(0, max(corGSM_MM$maxDelta)*1.05)
  )

p
## Warning: Removed 7 rows containing missing values (geom_point).

We can perform the same analysis for the correlations derived from our GeneIntegrationMatrix.

corGIM_MM <- corGIM_MM[order(abs(corGIM_MM$cor), decreasing = TRUE), ]
corGIM_MM <- corGIM_MM[which(!duplicated(gsub("\\-.*","",corGIM_MM[,"MotifMatrix_name"]))), ]
corGIM_MM$TFRegulator <- "NO"
corGIM_MM$TFRegulator[which(corGIM_MM$cor > 0.5 & corGIM_MM$padj < 0.01 & corGIM_MM$maxDelta > quantile(corGIM_MM$maxDelta, 0.75))] <- "YES"
sort(corGIM_MM[corGIM_MM$TFRegulator=="YES",1])
##  [1] "BACH1"   "CEBPA"   "CEBPB"   "CEBPD"   "CEBPE"   "CTCF"    "EBF1"   
##  [8] "EOMES"   "ETS1"    "FOS"     "FOSL2"   "GATA1"   "GATA2"   "ID3"    
## [15] "IRF7"    "JDP2"    "KLF2"    "LEF1"    "MECOM"   "MEF2A"   "MITF"   
## [22] "NFE2"    "NFIA"    "NFIB"    "NFIC"    "NFIL3"   "NFIX"    "NFKB2"  
## [29] "NR4A1"   "PAX5"    "POU2F1"  "POU5F1B" "RFX3"    "RUNX1"   "SPI1"   
## [36] "STAT2"   "TCF3"    "TCF4"    "TCF7L1"
p <- ggplot(data.frame(corGIM_MM), aes(cor, maxDelta, color = TFRegulator)) +
  geom_point() + 
  theme_ArchR() +
  geom_vline(xintercept = 0, lty = "dashed") + 
  scale_color_manual(values = c("NO"="darkgrey", "YES"="firebrick3")) +
  xlab("Correlation To Gene Expression") +
  ylab("Max TF Motif Delta") +
  scale_y_continuous(
    expand = c(0,0), 
    limits = c(0, max(corGIM_MM$maxDelta)*1.05)
  )

p
## Warning: Removed 260 rows containing missing values (geom_point).