6.2 Iterative Latent Semantic Indexing (LSI)

In scRNA-seq, dimensionality reduction (such as PCA) is commonly computed on a set of highly variable genes. This is done because highly variable genes are more likely to be biologically important, and restricting the analysis to them reduces experimental noise. In scATAC-seq, the data is binary and thus you cannot identify variable peaks for dimensionality reduction. Rather than identifying the most variable peaks, we have tried using the most accessible features as input to LSI; however, the results when running multiple samples have shown high degrees of noise and low reproducibility. To remedy this we introduced the “iterative LSI” approach (Satpathy*, Granja* et al. Nature Biotechnology 2019 and Granja*, Klemm* and McGinnis* et al. Nature Biotechnology 2019).

This approach computes an initial LSI transformation on the most accessible tiles and identifies lower-resolution clusters that are not batch confounded. For example, when performed on peripheral blood mononuclear cells, this will identify clusters corresponding to the major cell types (T cells, B cells, and monocytes). Then ArchR computes the average accessibility for each of these clusters across all features. ArchR then identifies the most variable peaks across these clusters and uses these features for LSI again. In this second iteration, the most variable peaks are more similar to the variable genes used in scRNA-seq dimensionality reduction. The user can set how many iterations of LSI should be performed. We have found that this approach minimizes observed batch effects and allows dimensionality reduction operations on a more reasonably sized feature matrix.
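The two-round idea described above can be sketched in base R on simulated data. This is a toy illustration only, not ArchR's implementation: the simulated matrix, the helper `runLSI()`, the particular TF-IDF variant, the use of `kmeans()` in place of graph-based clustering, and all sizes are assumptions made for the example.

```r
# Toy sketch of two-round "iterative LSI" in base R (NOT ArchR's implementation).
set.seed(1)

# Simulate a binary features x cells matrix with 3 hypothetical cell groups.
nFeatures <- 600
nCells <- 150
group <- rep(1:3, each = nCells / 3)
prob <- matrix(0.05, nrow = nFeatures, ncol = nCells)
for (g in 1:3) {
  # Features 1-100, 101-200 and 201-300 are specific to groups 1, 2 and 3.
  prob[((g - 1) * 100 + 1):(g * 100), group == g] <- 0.5
}
# Features 301-400 are highly accessible in every cell and carry no group signal.
prob[301:400, ] <- 0.6
mat <- matrix(rbinom(nFeatures * nCells, size = 1, prob = prob), nrow = nFeatures)

# Hypothetical helper: TF-IDF normalization followed by SVD (one common LSI variant).
runLSI <- function(m, nDims = 10) {
  tf <- sweep(m, 2, pmax(colSums(m), 1), "/")   # term frequency per cell
  idf <- ncol(m) / pmax(rowSums(m), 1)          # inverse document frequency per feature
  tfidf <- log(1 + tf * idf * 1e4)
  sv <- svd(tfidf, nu = 0, nv = nDims)
  sweep(sv$v, 2, sv$d[seq_len(nDims)], "*")     # cells x nDims embedding
}

# Round 1: LSI on the most accessible features, then coarse clustering.
topFeatures <- order(rowSums(mat), decreasing = TRUE)[1:300]
lsi1 <- runLSI(mat[topFeatures, ])
clusters <- kmeans(lsi1, centers = 3, nstart = 10)$cluster

# Average accessibility of every feature within each coarse cluster.
clusterMeans <- sapply(sort(unique(clusters)), function(k) {
  rowMeans(mat[, clusters == k, drop = FALSE])
})

# Round 2: LSI on the features that vary most across the coarse clusters.
featureVar <- apply(clusterMeans, 1, var)
varFeatures <- order(featureVar, decreasing = TRUE)[1:300]
lsi2 <- runLSI(mat[varFeatures, ])

# Fraction of each feature set taken up by the uninformative always-open block.
print(mean(topFeatures %in% 301:400))
print(mean(varFeatures %in% 301:400))
```

In this toy setting, the always-accessible but uninformative block of features is selected in the first round because it ranks highest by total accessibility, whereas the second round should favor the group-specific features, which is the behavior the iterative approach is designed to achieve.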

To perform iterative LSI in ArchR, we use the addIterativeLSI() function. The default parameters should cover most cases but we encourage you to explore the available parameters and how they each affect your particular data set. See ?addIterativeLSI for more details on inputs. The most common parameters to tweak are iterations, varFeatures, and resolution.

For the purposes of this tutorial, we will create a reducedDims object called “IterativeLSI”.

projHeme2 <- addIterativeLSI(
    ArchRProj = projHeme2,
    useMatrix = "TileMatrix", 
    name = "IterativeLSI", 
    iterations = 2, 
    clusterParams = list( #See Seurat::FindClusters
        resolution = c(0.2), 
        sampleCells = 10000, 
        n.start = 10
    ), 
    varFeatures = 25000, 
    dimsToUse = 1:30
)
## Checking Inputs...
## ArchR logging to : ArchRLogs/ArchR-addIterativeLSI-371b04770b6f1-Date-2022-12-23_Time-05-55-38.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 05:55:41 : Computing Total Across All Features, 0.023 mins elapsed.
## 2022-12-23 05:55:42 : Computing Top Features, 0.042 mins elapsed.
## ###########
## 2022-12-23 05:55:45 : Running LSI (1 of 2) on Top Features, 0.086 mins elapsed.
## ###########
## 2022-12-23 05:55:45 : Sampling Cells (N = 10001) for Estimated LSI, 0.087 mins elapsed.
## 2022-12-23 05:55:45 : Creating Sampled Partial Matrix, 0.087 mins elapsed.
## 2022-12-23 05:55:51 : Computing Estimated LSI (projectAll = FALSE), 0.198 mins elapsed.
## 2022-12-23 05:56:23 : Identifying Clusters, 0.721 mins elapsed.
## Warning: The following arguments are not used: row.names
## 2022-12-23 05:56:37 : Identified 5 Clusters, 0.959 mins elapsed.
## 2022-12-23 05:56:37 : Saving LSI Iteration, 0.959 mins elapsed.
## 2022-12-23 05:56:54 : Creating Cluster Matrix on the total Group Features, 1.244 mins elapsed.
## 2022-12-23 05:57:10 : Computing Variable Features, 1.508 mins elapsed.
## ###########
## 2022-12-23 05:57:10 : Running LSI (2 of 2) on Variable Features, 1.51 mins elapsed.
## ###########
## 2022-12-23 05:57:10 : Creating Partial Matrix, 1.51 mins elapsed.
## 2022-12-23 05:57:17 : Computing LSI, 1.621 mins elapsed.
## 2022-12-23 05:57:41 : Finished Running IterativeLSI, 2.03 mins elapsed.

ArchR automatically checks each dimension to determine whether it is highly correlated with sequencing depth. The corCutOff parameter sets the correlation threshold above which a dimension is excluded. In some cases, the first dimension may be correlated with other technical noise that is not depth related. In general, the ArchR defaults are reasonable and don't need to be changed. However, if you think that your results make more sense when you manually exclude the first dimension, that is a reasonable thing to do. Biological intuition is important for adequately evaluating the results of dimensionality reduction, and if removing a specific dimension steers you closer to your expectation then that is fine. In most cases, the exclusion of a specific dimension doesn't have a strong effect because of the way that the iterative LSI method works compared to non-iterative implementations (e.g. in Signac). To manually exclude a specific dimension, you would alter the dimsToUse parameter.
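The kind of depth-correlation check described above can be illustrated with a small base R sketch. This is not ArchR's internal code: the simulated `depth` vector, the 5-dimensional `embedding`, the choice to correlate against log10 depth, and the cutoff value used here are all assumptions made for the example.

```r
# Toy sketch: drop reduced dimensions that track sequencing depth
# (illustrative only, not ArchR's internal code).
set.seed(1)
nCells <- 200
depth <- round(10^rnorm(nCells, mean = 4, sd = 0.3))  # hypothetical fragments per cell

# Hypothetical 5-dimensional embedding: dimension 1 is constructed to follow
# log10 depth, dimensions 2-5 are independent noise standing in for biology.
embedding <- cbind(
  scale(log10(depth))[, 1] + rnorm(nCells, sd = 0.2),
  matrix(rnorm(nCells * 4), ncol = 4)
)
colnames(embedding) <- paste0("LSI", 1:5)

# Named after the ArchR parameter; the value is chosen for this example.
corCutOff <- 0.75

# Absolute correlation of each dimension with log10 depth.
corToDepth <- abs(cor(embedding, log10(depth)))[, 1]

# Keep only dimensions whose correlation does not exceed the cutoff.
dimsKept <- which(corToDepth <= corCutOff)
filtered <- embedding[, dimsKept, drop = FALSE]

print(round(corToDepth, 2))
print(colnames(filtered))
```

In ArchR itself, manually excluding the first dimension would amount to passing something like dimsToUse = 2:30 to addIterativeLSI() in place of the 1:30 used in this tutorial.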

If you see downstream that you have subtle batch effects, another option is to add more LSI iterations and to start from a lower initial clustering resolution, as shown below. Additionally, the number of variable features can be lowered to increase focus on the more variable features.

We will name this reducedDims object “IterativeLSI2” for illustrative purposes but we won’t use it downstream.

projHeme2 <- addIterativeLSI(
    ArchRProj = projHeme2,
    useMatrix = "TileMatrix", 
    name = "IterativeLSI2", 
    iterations = 4, 
    clusterParams = list( #See Seurat::FindClusters
        resolution = c(0.1, 0.2, 0.4), 
        sampleCells = 10000, 
        n.start = 10
    ), 
    varFeatures = 15000, 
    dimsToUse = 1:30
)
## Checking Inputs...
## ArchR logging to : ArchRLogs/ArchR-addIterativeLSI-371b070a06410-Date-2022-12-23_Time-05-57-41.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 05:57:44 : Computing Total Across All Features, 0.022 mins elapsed.
## 2022-12-23 05:57:45 : Computing Top Features, 0.041 mins elapsed.
## ###########
## 2022-12-23 05:57:47 : Running LSI (1 of 4) on Top Features, 0.078 mins elapsed.
## ###########
## 2022-12-23 05:57:47 : Sampling Cells (N = 10001) for Estimated LSI, 0.079 mins elapsed.
## 2022-12-23 05:57:47 : Creating Sampled Partial Matrix, 0.079 mins elapsed.
## 2022-12-23 05:57:53 : Computing Estimated LSI (projectAll = FALSE), 0.183 mins elapsed.
## 2022-12-23 05:58:19 : Identifying Clusters, 0.612 mins elapsed.
## Warning: The following arguments are not used: row.names
## 2022-12-23 05:58:33 : Identified 4 Clusters, 0.842 mins elapsed.
## 2022-12-23 05:58:33 : Saving LSI Iteration, 0.842 mins elapsed.
## 2022-12-23 05:58:50 : Creating Cluster Matrix on the total Group Features, 1.126 mins elapsed.
## 2022-12-23 05:59:06 : Computing Variable Features, 1.389 mins elapsed.
## ###########
## 2022-12-23 05:59:06 : Running LSI (2 of 4) on Variable Features, 1.391 mins elapsed.
## ###########
## 2022-12-23 05:59:06 : Sampling Cells (N = 10001) for Estimated LSI, 1.392 mins elapsed.
## 2022-12-23 05:59:06 : Creating Sampled Partial Matrix, 1.392 mins elapsed.
## 2022-12-23 05:59:12 : Computing Estimated LSI (projectAll = FALSE), 1.492 mins elapsed.
## 2022-12-23 05:59:31 : Identifying Clusters, 1.813 mins elapsed.
## Warning: The following arguments are not used: row.names
## 2022-12-23 05:59:46 : Identified 6 Clusters, 2.06 mins elapsed.
## 2022-12-23 05:59:46 : Saving LSI Iteration, 2.06 mins elapsed.
## 2022-12-23 06:00:06 : Creating Cluster Matrix on the total Group Features, 2.388 mins elapsed.
## 2022-12-23 06:00:21 : Computing Variable Features, 2.649 mins elapsed.
## ###########
## 2022-12-23 06:00:21 : Running LSI (3 of 4) on Variable Features, 2.65 mins elapsed.
## ###########
## 2022-12-23 06:00:21 : Sampling Cells (N = 10001) for Estimated LSI, 2.652 mins elapsed.
## 2022-12-23 06:00:21 : Creating Sampled Partial Matrix, 2.652 mins elapsed.
## 2022-12-23 06:00:27 : Computing Estimated LSI (projectAll = FALSE), 2.752 mins elapsed.
## 2022-12-23 06:00:44 : Identifying Clusters, 3.027 mins elapsed.
## Warning: The following arguments are not used: row.names
## 2022-12-23 06:00:59 : Identified 9 Clusters, 3.275 mins elapsed.
## 2022-12-23 06:00:59 : Saving LSI Iteration, 3.275 mins elapsed.
## 2022-12-23 06:01:18 : Creating Cluster Matrix on the total Group Features, 3.598 mins elapsed.
## 2022-12-23 06:01:34 : Computing Variable Features, 3.863 mins elapsed.
## ###########
## 2022-12-23 06:01:34 : Running LSI (4 of 4) on Variable Features, 3.865 mins elapsed.
## ###########
## 2022-12-23 06:01:34 : Creating Partial Matrix, 3.866 mins elapsed.
## 2022-12-23 06:01:40 : Computing LSI, 3.967 mins elapsed.
## 2022-12-23 06:02:00 : Finished Running IterativeLSI, 4.3 mins elapsed.

You can list the available reducedDims objects in an ArchRProject using the slot extraction operator @:

projHeme2@reducedDims
## List of length 2
## names(2): IterativeLSI IterativeLSI2