10.1 Cross-platform linkage of scATAC-seq cells with scRNA-seq cells
In order to integrate our tutorial scATAC-seq data with matched scRNA-seq data, we will use scRNA-seq data derived from the same hematopoietic cell types from Granja* et al (2019).
We have stored this scRNA-seq data as a 111 MB RangedSummarizedExperiment
object. However, ArchR also accepts unmodified Seurat
objects as input to the integration workflow. However, Seurat SCTransform-normalized data is not suitable for this integration workflow. Downloading and examining this object, we see that it has a gene expression counts matrix and associated metadata.
if(!file.exists("scRNA-Hematopoiesis-Granja-2019.rds")){
download.file(
url = "https://jeffgranja.s3.amazonaws.com/ArchR/TestData/scRNA-Hematopoiesis-Granja-2019.rds",
destfile = "scRNA-Hematopoiesis-Granja-2019.rds"
)
}
<- readRDS("scRNA-Hematopoiesis-Granja-2019.rds")
seRNA
seRNA## class: RangedSummarizedExperiment
## dim: 20287 35582
## metadata(0):
## assays(1): counts
## rownames(20287): FAM138A OR4F5 ... S100B PRMT2
## rowData names(3): gene_name gene_id exonLength
## colnames(35582): CD34_32_R5:AAACCTGAGTATCGAA-1
## CD34_32_R5:AAACCTGAGTCGTTTG-1 ...
## BMMC_10x_GREENLEAF_REP2:TTTGTTGCATGTGTCA-1
## BMMC_10x_GREENLEAF_REP2:TTTGTTGCATTGAAAG-1
## colData names(10): Group nUMI_pre ... BioClassification Barcode
The metadata contains a column called BioClassification
which contains the cell type classifications for each cell in the scRNA-seq dataset.
colnames(colData(seRNA))
## [1] "Group" "nUMI_pre" "nUMI"
## [4] "nGene" "initialClusters" "UMAP1"
## [7] "UMAP2" "Clusters" "BioClassification"
## [10] "Barcode"
Using table()
we can see how many cells are in each of the scRNA-seq cell type classifications.
table(colData(seRNA)$BioClassification)
##
## 01_HSC 02_Early.Eryth 03_Late.Eryth 04_Early.Baso 05_CMP.LMPP
## 1425 1653 446 111 2260
## 06_CLP.1 07_GMP 08_GMP.Neut 09_pDC 10_cDC
## 903 2097 1050 544 325
## 11_CD14.Mono.1 12_CD14.Mono.2 13_CD16.Mono 14_Unk 15_CLP.2
## 1800 4222 292 520 377
## 16_Pre.B 17_B 18_Plasma 19_CD8.N 20_CD4.N1
## 710 1711 62 1521 2470
## 21_CD4.N2 22_CD4.M 23_CD8.EM 24_CD8.CM 25_NK
## 2364 3539 796 2080 2143
## 26_Unk
## 161
There are two types of integration approaches that we can perform. Unconstrained integration is a completely agnostic approach that would take all of the cells in your scATAC-seq experiment and attempt to align them to any of the cells in the scRNA-seq experiment. While this is a feasible preliminary solution, we can improve the quality of our cross-platform alignment by constraining the integration process. To perform a constrained integration we use prior knowledge of the cell types to limit the search space of the alignment. For example, if we knew that Clusters A, B, and C in the scATAC-seq data corresponded to 3 different T cell clusters, and we knew that Clusters X and Y in the scRNA-seq data corresponded to 2 different T cell clusters, we could tell ArchR to specifically try to align cells from scATAC-seq Clusters A, B, and C to cell from scRNA-seq clusters X and Y. We illustrate these two approaches below, first performing an unconstrained integration to acheive preliminary cluster identities and then using this prior knowledge to perform a more refined constrained integration. However, please note that your own application must guide which type of integration you perform. Constrained integration may not be suited for all applications so the below sections are merely a guide for how to accomplish this and should not be viewed as a “tutorial” on how to integrate your own data.
10.1.1 Unconstrained Integration
To integrate scATAC-seq with scRNA-seq, we use the addGeneIntegrationMatrix()
function. As mentioned above, this function accepts either a Seurat
object or a RangedSummarizedExperiment
object via the seRNA
parameter. The first round of integration we perform will be a preliminary unconstrained integration and we will not store this in the Arrow files (addToArrow = FALSE
). We provide a name for the integration matrix that will be stored in the ArchRProject
via the matrixName
parameter. The other key parameters for this function provide column names in cellColData
where certain information will be stored: nameCell
will store the cell ID from the matched scRNA-seq cell, nameGroup
will store the group ID from the scRNA-seq cell, and nameScore
will store the cross-platform integration score.
<- addGeneIntegrationMatrix(
projHeme2 ArchRProj = projHeme2,
useMatrix = "GeneScoreMatrix",
matrixName = "GeneIntegrationMatrix",
reducedDims = "IterativeLSI",
seRNA = seRNA,
addToArrow = FALSE,
groupRNA = "BioClassification",
nameCell = "predictedCell_Un",
nameGroup = "predictedGroup_Un",
nameScore = "predictedScore_Un"
)## ArchR logging to : ArchRLogs/ArchR-addGeneIntegrationMatrix-371b01d951f4d-Date-2022-12-23_Time-06-41-47.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 06:41:48 : Running Seurat's Integration Stuart* et al 2019, 0.018 mins elapsed.
## 2022-12-23 06:41:48 : Checking ATAC Input, 0.022 mins elapsed.
## 2022-12-23 06:41:50 : Checking RNA Input, 0.047 mins elapsed.
## 2022-12-23 06:41:56 : Found 18601 overlapping gene names from gene scores and rna matrix!, 0.156 mins elapsed.
## 2022-12-23 06:41:56 : Creating Integration Blocks, 0.156 mins elapsed.
## 2022-12-23 06:41:57 : Prepping Interation Data, 0.161 mins elapsed.
## subThreading Disabled since ArchRLocking is TRUE see `addArchRLocking`
## 2022-12-23 06:41:58 : Computing Integration in 1 Integration Blocks!, 0 mins elapsed.
## 2022-12-23 06:41:58 : Block (1 of 1) : Computing Integration, 0 mins elapsed.
## 2022-12-23 06:41:59 : Block (1 of 1) : Identifying Variable Genes, 0.017 mins elapsed.
## 2022-12-23 06:42:01 : Block (1 of 1) : Getting GeneScoreMatrix, 0.044 mins elapsed.
## 2022-12-23 06:42:10 : Block (1 of 1) : Imputing GeneScoreMatrix, 0.19 mins elapsed.
## Getting ImputeWeights
## Warning: The following arguments are not used: row.names
## 2022-12-23 06:42:39 : Block (1 of 1) : Seurat FindTransferAnchors, 0.685 mins elapsed.
## 2022-12-23 06:43:37 : Block (1 of 1) : Seurat TransferData Cell Group Labels, 1.649 mins elapsed.
## 2022-12-23 06:43:39 : Block (1 of 1) : Seurat TransferData Cell Names Labels, 1.685 mins elapsed.
## 2022-12-23 06:43:54 : Block (1 of 1) : Saving TransferAnchors Joint CCA, 1.937 mins elapsed.
## 2022-12-23 06:43:57 : Block (1 of 1) : Completed Integration, 1.975 mins elapsed.
## 2022-12-23 06:43:58 : Block (1 of 1) : Plotting Joint UMAP, 2.001 mins elapsed.
## Length of unique values greater than palette, interpolating..
## 2022-12-23 06:44:32 : Completed Integration with RNA Matrix, 2.565 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-addGeneIntegrationMatrix-371b01d951f4d-Date-2022-12-23_Time-06-41-47.log
This unconstrained integration can be used as shown below to perform a more refined constrained integration.
You may want to assess the “quality” of the integration between your scATAC-seq and scRNA-seq data. ArchR provides two general ways to do this. The first is the option to ouput a UMAP plot of the joint CCA subspace. This is enabled through the plotUMAP
and UMAPParams
parameters to addGeneIntegrationMatrix()
. If plotUMAP = TRUE
, then a UMAP plot is saved for each block of the integration in a subfolder (based on matrixName
) of the “RNAIntegration” folder. These plots can be difficult to interpret, especially with the intercellular heterogeneity isn’t very strong. As an alternative, ArchR also stores the NAME.score
value from Seurat::TransferData()
in a column of cellColData
indicated by what you pass to the nameScore
parameter.
10.1.2 Constrained Integration
Now that we have our preliminary unconstrained integration, we will identify general cell types to profide a framework to further refine the integration results.
Given that this tutorial data is from hematopoietic cells, we would ideally constrain the integration to associate similar cell types togther. First, we will identify which cell types from the scRNA-seq data are most abundant in each of our scATAC-seq clusters. The goal of this is to identify the cells in both the scATAC-seq and scRNA-seq data that correspond to T cells and NK cells using the unconstrained integration so that we can perform a constrained integration using this information. To do this, we will create a confusionMatrix
that looks at the intersection of Clusters
and predictedGroup_Un
which contains the cell types as identified by scRNA-seq.
<- as.matrix(confusionMatrix(projHeme2$Clusters, projHeme2$predictedGroup_Un))
cM <- colnames(cM)[apply(cM, 1 , which.max)]
preClust cbind(preClust, rownames(cM)) #Assignments
## preClust
## [1,] "17_B" "C9"
## [2,] "25_NK" "C6"
## [3,] "16_Pre.B" "C10"
## [4,] "08_GMP.Neut" "C4"
## [5,] "11_CD14.Mono.1" "C1"
## [6,] "03_Late.Eryth" "C3"
## [7,] "20_CD4.N1" "C7"
## [8,] "01_HSC" "C5"
## [9,] "22_CD4.M" "C8"
## [10,] "09_pDC" "C11"
## [11,] "15_CLP.2" "C12"
## [12,] "12_CD14.Mono.2" "C2"
The above list shows which scRNA-seq cell type is most abundant in each of the 12 scATAC-seq clusters.
First, lets look at the cell type labels from our scRNA-seq data that were used in our unconstrained integration:
unique(unique(projHeme2$predictedGroup_Un))
## [1] "08_GMP.Neut" "25_NK" "16_Pre.B" "06_CLP.1"
## [5] "11_CD14.Mono.1" "07_GMP" "02_Early.Eryth" "22_CD4.M"
## [9] "05_CMP.LMPP" "03_Late.Eryth" "17_B" "09_pDC"
## [13] "20_CD4.N1" "10_cDC" "21_CD4.N2" "12_CD14.Mono.2"
## [17] "01_HSC" "15_CLP.2" "13_CD16.Mono" "19_CD8.N"
## [21] "23_CD8.EM"
From the above list, we can see that the clusters in the scRNA-seq data that correspond to T cells and NK cells are Clusters 19-25. We will create a string-based representation of these clusters to use in the downstream constrained integration.
#From scRNA
<- paste0(paste0(19:25), collapse="|")
cTNK
cTNK## [1] "19|20|21|22|23|24|25"
We can then take all of the other clusters and create a string-based representation of all “Non-T cell, Non-NK cell” clusters (i.e. Cluster 1 - 18).
<- paste0(c(paste0("0", 1:9), 10:13, 15:18), collapse="|")
cNonTNK
cNonTNK## [1] "01|02|03|04|05|06|07|08|09|10|11|12|13|15|16|17|18"
These string-based representations are pattern matching strings that we will use with grep
to extract the scATAC-seq clusters that correspond to these scRNA-seq cell types. The |
in the string acts as an or
statement so we end up searching for any row in the preClust
column of our confusion matrix that matches one of the scRNA-seq cluster numbers provided in the pattern match string.
For T cells and NK cells, this identifies scATAC-seq clusters C7, C8, and C9:
<- rownames(cM)[grep(cTNK, preClust)]
clustTNK
clustTNK## [1] "C6" "C7" "C8"
For Non-T cells and Non-NK cells, this identifies the remaining scATAC-seq clusters:
<- rownames(cM)[grep(cNonTNK, preClust)]
clustNonTNK
clustNonTNK## [1] "C9" "C10" "C4" "C1" "C3" "C5" "C11" "C12" "C2"
We then perform a similar opperation to identify the scRNA-seq cells that correspond to these same cell types. First, we identify the T cell and NK cells in the scRNA-seq data
<- colnames(seRNA)[grep(cTNK, colData(seRNA)$BioClassification)]
rnaTNK head(rnaTNK)
## [1] "PBMC_10x_GREENLEAF_REP1:AAACCCAGTCGTCATA-1"
## [2] "PBMC_10x_GREENLEAF_REP1:AAACCCATCCGATGTA-1"
## [3] "PBMC_10x_GREENLEAF_REP1:AAACCCATCTCAACGA-1"
## [4] "PBMC_10x_GREENLEAF_REP1:AAACCCATCTCTCGAC-1"
## [5] "PBMC_10x_GREENLEAF_REP1:AAACGAACAATCGTCA-1"
## [6] "PBMC_10x_GREENLEAF_REP1:AAACGAACACGATTCA-1"
Then, we identify the Non-T cell Non-NK cell cells in the scRNA-seq data.
<- colnames(seRNA)[grep(cNonTNK, colData(seRNA)$BioClassification)]
rnaNonTNK head(rnaNonTNK)
## [1] "CD34_32_R5:AAACCTGAGTATCGAA-1" "CD34_32_R5:AAACCTGAGTCGTTTG-1"
## [3] "CD34_32_R5:AAACCTGGTTCCACAA-1" "CD34_32_R5:AAACGGGAGCTTCGCG-1"
## [5] "CD34_32_R5:AAACGGGAGGGAGTAA-1" "CD34_32_R5:AAACGGGAGTTACGGG-1"
To prepare for this constrained integration, we create a nested list. This is a SimpleList
of multiple SimpleList
objects, one for each group that we would like to constrain. In this example, we have two groups: one group called TNK
that identifies the T cells and NK cells across the two platforms, and a second group called NonTNK
that identifies the Non-T cell Non-NK cell cells across the two platforms. Each of these SimpleList
objects has two vectors of cell IDs, one called ATAC
and one called RNA
as shown below:
<- SimpleList(
groupList TNK = SimpleList(
ATAC = projHeme2$cellNames[projHeme2$Clusters %in% clustTNK],
RNA = rnaTNK
),NonTNK = SimpleList(
ATAC = projHeme2$cellNames[projHeme2$Clusters %in% clustNonTNK],
RNA = rnaNonTNK
) )
We pass this list to the groupList
parameter of the addGeneIntegrationMatrix()
function to constrain our integration. Note that, in this case, we are still not adding these results to the Arrow files (addToArrow = FALSE
). We recommend checking the results of the integration thoroughly against your expectations prior to saving the results in the Arrow files. We illustrate this process in the next section of the book.
#~5 minutes
<- addGeneIntegrationMatrix(
projHeme2 ArchRProj = projHeme2,
useMatrix = "GeneScoreMatrix",
matrixName = "GeneIntegrationMatrix",
reducedDims = "IterativeLSI",
seRNA = seRNA,
addToArrow = FALSE,
groupList = groupList,
groupRNA = "BioClassification",
nameCell = "predictedCell_Co",
nameGroup = "predictedGroup_Co",
nameScore = "predictedScore_Co"
)## ArchR logging to : ArchRLogs/ArchR-addGeneIntegrationMatrix-371b07e265e98-Date-2022-12-23_Time-06-44-32.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 06:44:33 : Running Seurat's Integration Stuart* et al 2019, 0.018 mins elapsed.
## 2022-12-23 06:44:34 : Checking ATAC Input, 0.021 mins elapsed.
## 2022-12-23 06:44:35 : Checking RNA Input, 0.046 mins elapsed.
## 2022-12-23 06:44:42 : Found 18601 overlapping gene names from gene scores and rna matrix!, 0.169 mins elapsed.
## 2022-12-23 06:44:42 : Creating Integration Blocks, 0.169 mins elapsed.
## 2022-12-23 06:44:43 : Prepping Interation Data, 0.173 mins elapsed.
## subThreading Disabled since ArchRLocking is TRUE see `addArchRLocking`
## 2022-12-23 06:44:44 : Computing Integration in 2 Integration Blocks!, 0 mins elapsed.
## 2022-12-23 06:44:44 : Block (1 of 2) : Computing Integration, 0 mins elapsed.
## 2022-12-23 06:44:45 : Block (1 of 2) : Identifying Variable Genes, 0.015 mins elapsed.
## 2022-12-23 06:44:46 : Block (1 of 2) : Getting GeneScoreMatrix, 0.035 mins elapsed.
## 2022-12-23 06:44:51 : Block (1 of 2) : Imputing GeneScoreMatrix, 0.119 mins elapsed.
## Getting ImputeWeights
## Warning: The following arguments are not used: row.names
## 2022-12-23 06:44:56 : Block (1 of 2) : Seurat FindTransferAnchors, 0.203 mins elapsed.
## 2022-12-23 06:45:51 : Block (1 of 2) : Seurat TransferData Cell Group Labels, 1.108 mins elapsed.
## 2022-12-23 06:45:52 : Block (1 of 2) : Seurat TransferData Cell Names Labels, 1.121 mins elapsed.
## 2022-12-23 06:45:56 : Block (1 of 2) : Saving TransferAnchors Joint CCA, 1.195 mins elapsed.
## 2022-12-23 06:45:58 : Block (1 of 2) : Completed Integration, 1.227 mins elapsed.
## 2022-12-23 06:46:00 : Block (2 of 2) : Computing Integration, 1.254 mins elapsed.
## 2022-12-23 06:46:00 : Block (2 of 2) : Identifying Variable Genes, 1.27 mins elapsed.
## 2022-12-23 06:46:02 : Block (2 of 2) : Getting GeneScoreMatrix, 1.296 mins elapsed.
## 2022-12-23 06:46:09 : Block (2 of 2) : Imputing GeneScoreMatrix, 1.416 mins elapsed.
## Getting ImputeWeights
## Warning: The following arguments are not used: row.names
## 2022-12-23 06:46:30 : Block (2 of 2) : Seurat FindTransferAnchors, 1.766 mins elapsed.
## 2022-12-23 06:47:27 : Block (2 of 2) : Seurat TransferData Cell Group Labels, 2.704 mins elapsed.
## 2022-12-23 06:47:28 : Block (2 of 2) : Seurat TransferData Cell Names Labels, 2.732 mins elapsed.
## 2022-12-23 06:47:38 : Block (2 of 2) : Saving TransferAnchors Joint CCA, 2.901 mins elapsed.
## 2022-12-23 06:47:40 : Block (2 of 2) : Completed Integration, 2.935 mins elapsed.
## 2022-12-23 06:47:42 : Block (1 of 2) : Plotting Joint UMAP, 2.963 mins elapsed.
## 2022-12-23 06:48:05 : Block (2 of 2) : Plotting Joint UMAP, 3.341 mins elapsed.
## 2022-12-23 06:48:35 : Completed Integration with RNA Matrix, 3.847 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-addGeneIntegrationMatrix-371b07e265e98-Date-2022-12-23_Time-06-44-32.log
10.1.3 Comparing Unconstrained and Constrained Integrations
To compare our unconstrained and constrained integrations, we will color the cells in our scATAC-seq data based on the scRNA-seq cell type identified through integration. To do this, we will create a color palette using the built-in ArchR function paletteDiscrete()
.
<- paletteDiscrete(values = colData(seRNA)$BioClassification)
pal ## Length of unique values greater than palette, interpolating..
In ArchR, a palette is essentially a named vector where the values are hex codes corresponding to the color to be associated with the names.
pal## 01_HSC 02_Early.Eryth 03_Late.Eryth 04_Early.Baso 05_CMP.LMPP
## "#D51F26" "#502A59" "#235D55" "#3D6E57" "#8D2B8B"
## 06_CLP.1 07_GMP 08_GMP.Neut 09_pDC 10_cDC
## "#DE6C3E" "#F9B712" "#D8CE42" "#8E9ACD" "#B774B1"
## 11_CD14.Mono.1 12_CD14.Mono.2 13_CD16.Mono 14_Unk 15_CLP.2
## "#D69FC8" "#C7C8DE" "#8FD3D4" "#89C86E" "#CC9672"
## 16_Pre.B 17_B 18_Plasma 19_CD8.N 20_CD4.N1
## "#CF7E96" "#A27AA4" "#CD4F32" "#6B977E" "#518AA3"
## 21_CD4.N2 22_CD4.M 23_CD8.EM 24_CD8.CM 25_NK
## "#5A5297" "#0F707D" "#5E2E32" "#A95A3C" "#B28D5C"
## 26_Unk
## "#3D3D3D"
We can now visualize the integration by overlaying the scRNA-seq cell types on our scATAC-seq data based on the unconstrained integration.
<- plotEmbedding(
p1
projHeme2, colorBy = "cellColData",
name = "predictedGroup_Un",
pal = pal
)## ArchR logging to : ArchRLogs/ArchR-plotEmbedding-371b025505b29-Date-2022-12-23_Time-06-48-35.log
## If there is an issue, please report to github with logFile!
## Getting UMAP Embedding
## ColorBy = cellColData
## Plotting Embedding
## 1
## ArchR logging successful to : ArchRLogs/ArchR-plotEmbedding-371b025505b29-Date-2022-12-23_Time-06-48-35.log
p1
Similarly, we can visualize the integration by overlaying the scRNA-seq cell types on our scATAC-seq data based on the constrained integration.
<- plotEmbedding(
p2
projHeme2, colorBy = "cellColData",
name = "predictedGroup_Co",
pal = pal
)## ArchR logging to : ArchRLogs/ArchR-plotEmbedding-371b022702add-Date-2022-12-23_Time-06-48-39.log
## If there is an issue, please report to github with logFile!
## Getting UMAP Embedding
## ColorBy = cellColData
## Plotting Embedding
## 1
## ArchR logging successful to : ArchRLogs/ArchR-plotEmbedding-371b022702add-Date-2022-12-23_Time-06-48-39.log
p2
The differences between these the unconstrained and constrained integration is very subtle in this example, largely because the cell types of interest are already very distinct. However, you should notice differences, especially in the T cell clusters (Clusters 17-22).
To save an editable vectorized version of this plot, we use the plotPDF()
function.
plotPDF(p1,p2, name = "Plot-UMAP-RNA-Integration.pdf", ArchRProj = projHeme2, addDOC = FALSE, width = 5, height = 5)
## Plotting Ggplot!
## Plotting Ggplot!
We can now save our projHeme2
using the saveArchRProject()
function. Here, we set load = TRUE
to make sure that the projHeme2
object tracks with the project that we just saved.
<- saveArchRProject(ArchRProj = projHeme2, outputDirectory = "Save-ProjHeme2", load = TRUE)
projHeme2 ## Copying ArchRProject to new outputDirectory : /corces/home/rcorces/scripts/github/ArchR_Website_Testing/bookdown/Save-ProjHeme2
## Copying Arrow Files...
## Copying Arrow Files (1 of 3)
## Copying Arrow Files (2 of 3)
## Copying Arrow Files (3 of 3)
## Getting ImputeWeights
## Dropping ImputeWeights...
## Copying Other Files...
## Copying Other Files (1 of 5): Embeddings
## Copying Other Files (2 of 5): IterativeLSI
## Copying Other Files (3 of 5): IterativeLSI2
## Copying Other Files (4 of 5): Plots
## Copying Other Files (5 of 5): RNAIntegration
## Saving ArchRProject...
## Loading ArchRProject...
## Successfully loaded ArchRProject!
##
## / |
## / \
## . / |.
## \\\ / |.
## \\\ / `|.
## \\\ / |.
## \ / |\
## \\#####\ / ||
## ==###########> / ||
## \\##==......\ / ||
## ______ = =|__ /__ || \\\
## ,--' ,----`-,__ ___/' --,-`-===================##========>
## \ ' ##_______ _____ ,--,__,=##,__ ///
## , __== ___,-,__,--'#' ===' `-' | ##,-/
## -,____,---' \\####\\________________,--\\_##,/
## ___ .______ ______ __ __ .______
## / \ | _ \ / || | | | | _ \
## / ^ \ | |_) | | ,----'| |__| | | |_) |
## / /_\ \ | / | | | __ | | /
## / _____ \ | |\ \\___ | `----.| | | | | |\ \\___.
## /__/ \__\ | _| `._____| \______||__| |__| | _| `._____|
##