Chapter 4 Dimensionality Reduction with ArchR
Dimensionality reduction with scATAC-seq is challenging due to the sparsity of the data. In scATAC-seq, a particular site can be accessible on one allele, both alleles, or neither allele. Even in higher-quality scATAC-seq data, the majority of accessible regions are not transposed, and this leads to many loci having 0 accessible alleles. Moreover, when we see (for example) three Tn5 insertions within a single peak region in a single cell, the sparsity of the data prevents us from confidently determining that this site in this cell is actually three times more accessible than the same site in another cell that has only one insertion. For this reason, many analytical strategies work on a binarized scATAC-seq data matrix. This binarized matrix still ends up being mostly 0s because transposition is rare. However, it is important to note that a 0 in scATAC-seq could mean either “non-accessible” or “not sampled”, and these two inferences are very different from a biological standpoint. Because of this, the 1s carry information and the 0s do not. This low information content is what makes scATAC-seq data sparse.
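To make the binarization step concrete, here is a minimal sketch using a toy sparse matrix and the Matrix R package. This is not ArchR's internal code; the object names are hypothetical and the data is randomly generated for illustration only.

```r
# Minimal sketch: binarizing a sparse insertion-counts matrix.
# `counts` is a hypothetical peaks-by-cells matrix with toy values.
library(Matrix)

set.seed(1)
counts <- rsparsematrix(
  nrow = 10, ncol = 5, density = 0.3,
  rand.x = function(n) sample(1:4, n, replace = TRUE)
)

# Any non-zero insertion count becomes 1; sparsity is preserved.
binarized <- counts
binarized@x <- rep(1, length(binarized@x))
```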
If you were to perform a standard dimensionality reduction, such as Principal Component Analysis (PCA), on this sparse insertion counts matrix and plot the top two principal components, you would not obtain the desired result because the sparsity causes high inter-cell similarity at all of the 0 positions. To get around this issue, we use a layered dimensionality reduction approach. First, we use Latent Semantic Indexing (LSI), an approach from natural language processing that was originally designed to assess document similarity based on word counts. This solution was created for natural language processing because the data is sparse and noisy (many different words and many low-frequency words). LSI was first introduced for scATAC-seq by Cusanovich et al. (Science 2015). In the case of scATAC-seq, different samples are the documents and different regions/peaks are the words. First, we calculate the term frequency by depth normalization per single cell. These values are then normalized by the inverse document frequency, which weights features by how often they occur in order to identify features that are more “specific” rather than commonly accessible. The resultant term frequency-inverse document frequency (TF-IDF) matrix reflects how important a word (aka region/peak) is to a document (aka sample). Then, through a technique called singular value decomposition (SVD), the most valuable information across samples is identified and represented in a lower-dimensional space. LSI allows you to reduce the dimensionality of the sparse insertion counts matrix from many thousands of features down to tens or hundreds of dimensions. A more conventional dimensionality reduction technique, such as Uniform Manifold Approximation and Projection (UMAP) or t-distributed Stochastic Neighbor Embedding (t-SNE), can then be used to visualize the data. In ArchR, these visualization methods are referred to as embeddings.
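To make the TF-IDF and SVD steps concrete, below is a minimal sketch in R using the Matrix and irlba packages on a toy binarized matrix. The specific IDF formula shown is one common variant; ArchR offers several TF-IDF normalization methods internally, and all names and values here are illustrative assumptions rather than ArchR's implementation.

```r
# Minimal LSI sketch: TF-IDF normalization followed by truncated SVD.
library(Matrix)
library(irlba)

set.seed(1)
# Toy binarized matrix: 1,000 peaks ("words") x 200 cells ("documents")
mat <- rsparsematrix(1000, 200, density = 0.05,
                     rand.x = function(n) rep(1, n))

# Drop peaks never observed so the IDF term is finite
mat <- mat[Matrix::rowSums(mat) > 0, ]

# Term frequency: depth-normalize each cell (each column sums to 1)
tf <- mat %*% Diagonal(x = 1 / Matrix::colSums(mat))

# Inverse document frequency: down-weight peaks accessible in many cells
idf <- log(1 + ncol(mat) / Matrix::rowSums(mat))
tfidf <- Diagonal(x = idf) %*% tf

# Truncated SVD retains the top 30 components as the LSI dimensions
svd_res <- irlba::irlba(tfidf, nv = 30)
lsi <- svd_res$v %*% diag(svd_res$d)  # cells x 30 reduced representation
dim(lsi)  # 200 x 30
```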
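In ArchR itself, these steps are wrapped in dedicated functions, with LSI performed iteratively as described later in this chapter. As a preview, a typical call sequence looks like the following; `proj` is a placeholder for an ArchRProject created in earlier chapters, and the parameter values shown are illustrative defaults rather than recommendations.

```r
# Sketch of the ArchR workflow (assumes an ArchRProject named `proj`).
library(ArchR)

# Iterative LSI on the genome-wide tile matrix
proj <- addIterativeLSI(
  ArchRProj = proj,
  useMatrix = "TileMatrix",
  name = "IterativeLSI",
  varFeatures = 25000,
  dimsToUse = 1:30
)

# UMAP embedding computed from the LSI reduced dimensions
proj <- addUMAP(
  ArchRProj = proj,
  reducedDims = "IterativeLSI",
  name = "UMAP"
)

# Visualize the embedding, coloring cells by sample
p <- plotEmbedding(ArchRProj = proj, colorBy = "cellColData",
                   name = "Sample", embedding = "UMAP")
```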