4.2 Iterative Latent Semantic Indexing (LSI)
In scRNA-seq, identifying variable genes is a common first step before dimensionality reduction (such as PCA). This is done because highly variable genes are more likely to be biologically important, and restricting the analysis to them reduces experimental noise. In scATAC-seq the data is binary, so you cannot identify variable peaks for dimensionality reduction in the same way. Rather than identifying the most variable peaks, we have tried using the most accessible features as input to LSI; however, the results when running multiple samples have shown high degrees of noise and low reproducibility.
To remedy this we introduced the “iterative LSI” approach (Satpathy*, Granja* et al. Nature Biotechnology 2019 and Granja*, Klemm* and McGinnis* et al. Nature Biotechnology 2019). This approach computes an initial LSI transformation on the most accessible tiles and identifies lower-resolution clusters that are not batch confounded. For example, when performed on peripheral blood mononuclear cells, this will identify clusters corresponding to the major cell types (T cells, B cells, and monocytes). ArchR then computes the average accessibility for each of these clusters across all features, identifies the most variable peaks across these clusters, and uses these features for LSI again. In this second iteration, the most variable peaks are more similar to the variable genes used in scRNA-seq LSI implementations. The user can set how many iterations of LSI should be performed. We have found that this approach minimizes observed batch effects and allows dimensionality reduction operations on a more reasonably sized feature matrix.
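The two-pass logic described above can be sketched on simulated data in base R. This is only a toy illustration under several assumptions, not ArchR's implementation: the helper run_lsi(), the simulated matrix, and the feature counts are all hypothetical, kmeans() stands in for the Seurat graph clustering that ArchR uses, the TF-IDF formula is one common variant, and the cluster averages are not depth-normalized.

```r
# Toy illustration of the iterative LSI idea; NOT ArchR's implementation.
set.seed(1)

# Simulate a binary accessibility matrix: 300 features x 90 cells, 3 cell groups.
n_feat <- 300
group <- rep(1:3, each = 30)
prob <- matrix(0.05, nrow = n_feat, ncol = 3)  # background accessibility
prob[1:40, 1] <- 0.6                           # features specific to group 1
prob[41:80, 2] <- 0.6                          # features specific to group 2
prob[81:120, 3] <- 0.6                         # features specific to group 3
prob[121:160, ] <- 0.7                         # open everywhere, uninformative
mat <- matrix(rbinom(n_feat * length(group), 1, prob[, group]), nrow = n_feat)

# Hypothetical helper: TF-IDF normalization followed by SVD.
run_lsi <- function(m, n_dims = 5) {
  m <- m[rowSums(m) > 0, , drop = FALSE]   # drop features with no signal
  depth <- pmax(colSums(m), 1)             # guard against empty cells
  tf <- sweep(m, 2, depth, "/")            # term frequency per cell
  idf <- ncol(m) / rowSums(m)              # inverse document frequency per feature
  tfidf <- log(tf * idf * 1e4 + 1)         # idf recycles down rows (one value per feature)
  sv <- svd(tfidf, nu = 0, nv = n_dims)
  sv$v %*% diag(sv$d[1:n_dims])            # cells x dims embedding
}

# Iteration 1: LSI on the most accessible features, then coarse clustering.
top_feats <- order(rowSums(mat), decreasing = TRUE)[1:100]
emb1 <- run_lsi(mat[top_feats, ])
clusters <- kmeans(emb1, centers = 3, nstart = 10)$cluster

# Average accessibility per cluster across ALL features.
cluster_means <- sapply(sort(unique(clusters)), function(k) {
  rowMeans(mat[, clusters == k, drop = FALSE])
})

# Iteration 2: LSI on the features that vary most across clusters.
feat_var <- apply(cluster_means, 1, var)
var_feats <- order(feat_var, decreasing = TRUE)[1:100]
emb2 <- run_lsi(mat[var_feats, ])

dim(emb2)  # 90 cells x 5 LSI dimensions
```

The point of the sketch is the change in feature selection between passes: the first pass can only rank features by total accessibility, while the second ranks them by variance across the clusters found in the first pass.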
To perform iterative LSI in ArchR, we use the addIterativeLSI() function. The default parameters should cover most cases but we encourage you to explore the available parameters and how they each affect your particular data set. See ?addIterativeLSI for more details on inputs. The most common parameters to tweak are iterations, varFeatures, and resolution. It is important to note that LSI is not deterministic. This means that even if you run LSI in exactly the same way with exactly the same parameters, you will not get exactly the same results. Of course, they will be highly similar, but not identical. So make sure to save your ArchRProject or the relevant LSI information once you’ve settled on an ideal dimensionality reduction.
For the purposes of this tutorial, we will create a reducedDims object called “IterativeLSI”.
projHeme2 <- addIterativeLSI(
    ArchRProj = projHeme2,
    useMatrix = "TileMatrix",
    name = "IterativeLSI",
    iterations = 2,
    clusterParams = list( # See Seurat::FindClusters
        resolution = c(0.2),
        sampleCells = 10000,
        n.start = 10
    ),
    varFeatures = 25000,
    dimsToUse = 1:30
)
## Checking Inputs…
## ArchR logging to : ArchRLogs/ArchR-addIterativeLSI-ea2d75dd89ad-Date-2020-04-15_Time-09-36-28.log
## If there is an issue, please report to github with logFile!
## 2020-04-15 09:36:29 : Computing Total Accessibility Across All Features, 0.003 mins elapsed.
## 2020-04-15 09:36:35 : Computing Top Features, 0.102 mins elapsed.
## ###########
## 2020-04-15 09:36:35 : Running LSI (1 of 2) on Top Features, 0.111 mins elapsed.
## ###########
## 2020-04-15 09:36:35 : Sampling Cells (N = 10002) for Estimated LSI, 0.112 mins elapsed.
## 2020-04-15 09:36:35 : Creating Sampled Partial Matrix, 0.112 mins elapsed.
## 2020-04-15 09:36:45 : Computing Estimated LSI (projectAll = FALSE), 0.276 mins elapsed.
## 2020-04-15 09:37:22 : Identifying Clusters, 0.887 mins elapsed.
## 2020-04-15 09:37:53 : Identified 5 Clusters, 1.407 mins elapsed.
## 2020-04-15 09:37:53 : Saving LSI Iteration, 1.407 mins elapsed.
## 2020-04-15 09:38:10 : Creating Cluster Matrix on the total Group Features, 1.696 mins elapsed.
## 2020-04-15 09:38:21 : Computing Variable Features, 1.867 mins elapsed.
## ###########
## 2020-04-15 09:38:21 : Running LSI (2 of 2) on Variable Features, 1.871 mins elapsed.
## ###########
## 2020-04-15 09:38:21 : Creating Partial Matrix, 1.871 mins elapsed.
## 2020-04-15 09:38:30 : Computing LSI, 2.018 mins elapsed.
## 2020-04-15 09:39:05 : Finished Running IterativeLSI, 2.606 mins elapsed.
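Because LSI is not deterministic, it can be worth saving the result at this point. The following is a minimal sketch, assuming the ArchR package is loaded and projHeme2 now contains the “IterativeLSI” reducedDims object; the output directory and file name are illustrative choices, not requirements.

```r
# Assumes ArchR is loaded and projHeme2 holds the "IterativeLSI" reducedDims object.

# Option 1: save the whole project. The directory name is illustrative.
projHeme2 <- saveArchRProject(
    ArchRProj = projHeme2,
    outputDirectory = "Save-ProjHeme2",
    load = TRUE
)

# Option 2: keep only the cell-by-dimension LSI matrix. The file name is illustrative.
lsiMat <- getReducedDims(ArchRProj = projHeme2, reducedDims = "IterativeLSI")
saveRDS(lsiMat, file = "IterativeLSI-matrix.rds")
```

Saving the matrix alone is lighter, but saving the project keeps the embedding together with the cell metadata it belongs to.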
If you see subtle batch effects downstream, another option is to add more LSI iterations and to start from a lower initial clustering resolution, as shown below. Additionally, the number of variable features can be lowered to increase focus on the more variable features.
We will name this reducedDims object “IterativeLSI2” for illustrative purposes but we won’t use it downstream.
projHeme2 <- addIterativeLSI(
    ArchRProj = projHeme2,
    useMatrix = "TileMatrix",
    name = "IterativeLSI2",
    iterations = 4,
    clusterParams = list( # See Seurat::FindClusters
        resolution = c(0.1, 0.2, 0.4),
        sampleCells = 10000,
        n.start = 10
    ),
    varFeatures = 15000,
    dimsToUse = 1:30
)
## Checking Inputs…
## ArchR logging to : ArchRLogs/ArchR-addIterativeLSI-ea2d349ff558-Date-2020-04-15_Time-09-39-05.log
## If there is an issue, please report to github with logFile!
## 2020-04-15 09:39:06 : Computing Total Accessibility Across All Features, 0.004 mins elapsed.
## 2020-04-15 09:39:09 : Computing Top Features, 0.06 mins elapsed.
## ###########
## 2020-04-15 09:39:10 : Running LSI (1 of 4) on Top Features, 0.07 mins elapsed.
## ###########
## 2020-04-15 09:39:10 : Sampling Cells (N = 10002) for Estimated LSI, 0.071 mins elapsed.
## 2020-04-15 09:39:10 : Creating Sampled Partial Matrix, 0.071 mins elapsed.
## 2020-04-15 09:39:17 : Computing Estimated LSI (projectAll = FALSE), 0.192 mins elapsed.
## 2020-04-15 09:39:42 : Identifying Clusters, 0.611 mins elapsed.
## 2020-04-15 09:40:05 : Identified 4 Clusters, 0.987 mins elapsed.
## 2020-04-15 09:40:05 : Saving LSI Iteration, 0.987 mins elapsed.
## 2020-04-15 09:40:26 : Creating Cluster Matrix on the total Group Features, 1.343 mins elapsed.
## 2020-04-15 09:40:38 : Computing Variable Features, 1.54 mins elapsed.
## ###########
## 2020-04-15 09:40:38 : Running LSI (2 of 4) on Variable Features, 1.542 mins elapsed.
## ###########
## 2020-04-15 09:40:38 : Sampling Cells (N = 10002) for Estimated LSI, 1.544 mins elapsed.
## 2020-04-15 09:40:38 : Creating Sampled Partial Matrix, 1.544 mins elapsed.
## 2020-04-15 09:40:47 : Computing Estimated LSI (projectAll = FALSE), 1.687 mins elapsed.
## 2020-04-15 09:41:09 : Identifying Clusters, 2.048 mins elapsed.
## 2020-04-15 09:41:31 : Identified 7 Clusters, 2.421 mins elapsed.
## 2020-04-15 09:41:31 : Saving LSI Iteration, 2.421 mins elapsed.
## 2020-04-15 09:41:54 : Creating Cluster Matrix on the total Group Features, 2.807 mins elapsed.
## 2020-04-15 09:42:05 : Computing Variable Features, 2.995 mins elapsed.
## ###########
## 2020-04-15 09:42:06 : Running LSI (3 of 4) on Variable Features, 3 mins elapsed.
## ###########
## 2020-04-15 09:42:06 : Sampling Cells (N = 10002) for Estimated LSI, 3.001 mins elapsed.
## 2020-04-15 09:42:06 : Creating Sampled Partial Matrix, 3.001 mins elapsed.
## 2020-04-15 09:42:13 : Computing Estimated LSI (projectAll = FALSE), 3.121 mins elapsed.
## 2020-04-15 09:42:33 : Identifying Clusters, 3.461 mins elapsed.
## 2020-04-15 09:42:55 : Identified 9 Clusters, 3.829 mins elapsed.
## 2020-04-15 09:42:55 : Saving LSI Iteration, 3.829 mins elapsed.
## 2020-04-15 09:43:13 : Creating Cluster Matrix on the total Group Features, 4.129 mins elapsed.
## 2020-04-15 09:43:26 : Computing Variable Features, 4.341 mins elapsed.
## ###########
## 2020-04-15 09:43:26 : Running LSI (4 of 4) on Variable Features, 4.346 mins elapsed.
## ###########
## 2020-04-15 09:43:26 : Creating Partial Matrix, 4.347 mins elapsed.
## 2020-04-15 09:43:33 : Computing LSI, 4.462 mins elapsed.
## 2020-04-15 09:43:55 : Finished Running IterativeLSI, 4.829 mins elapsed.