21.1 Importing data and setting up a Multiome project

First, it is worth noting that the datasets used in this chapter are different from the datasets used in the other chapters and we will be creating a new ArchRProject here. For this reason, and to make this chapter a standalone, we will repeat operations from other parts of the manual.

Lets start by loading ArchR and setting up our genome (hg38 in this case), our threads (which are system dependent so you should change that based on your computational resources), and our random seed.

library(ArchR)
addArchRGenome("hg38")
## Setting default genome to Hg38.
addArchRThreads(32)
## Setting default number of Parallel threads to 32.
addArchRLocking(locking = TRUE)
## Setting ArchRLocking to TRUE.
set.seed(1)

We can download the files associated with the multiome tutorial just like we did for the main tutorial but using tutorial = "Multiome".

inputFiles <- getTutorialData(tutorial = "Multiome")
## Downloading files to Multiome...

If we inspect inputFiles you can see that we have a .fragments.tsv.gz file and a filtered_feature_bc_matrix.h5 file for each sample. These two files are both standard outputs of the 10x Genomics cellranger-arc pipeline.

inputFiles
##                                            pbmc_sorted_3k 
##                "Multiome/pbmc_sorted_3k.fragments.tsv.gz" 
##                                          pbmc_unsorted_3k 
##              "Multiome/pbmc_unsorted_3k.fragments.tsv.gz" 
##                                            pbmc_sorted_3k 
##   "Multiome/pbmc_sorted_3k.filtered_feature_bc_matrix.h5" 
##                                          pbmc_unsorted_3k 
## "Multiome/pbmc_unsorted_3k.filtered_feature_bc_matrix.h5"

The first thing we need to do is separate out the scATAC-seq input files from the scRNA-seq input files. Here, we are doing that using grep() but you can also do it manually however you want. The important thing is to end up with a named vector of files corresponding to the scATAC-seq data and a named vector of files corresponding to the scRNA-seq data and that the names of the corresponding samples match!

atacFiles <- inputFiles[grep(pattern = "\\.fragments.tsv.gz$", x = inputFiles)]
rnaFiles <- inputFiles[grep(pattern = "\\.filtered_feature_bc_matrix.h5$", x = inputFiles)]

For clarity, the above code is the same as doing this manually:

atacFiles <- c("pbmc_sorted_3k" = "Multiome/pbmc_sorted_3k.fragments.tsv.gz", "pbmc_unsorted_3k" = "Multiome/pbmc_unsorted_3k.fragments.tsv.gz")
rnaFiles <- c("pbmc_sorted_3k" = "Multiome/pbmc_sorted_3k.filtered_feature_bc_matrix.h5", "pbmc_unsorted_3k" = "Multiome/pbmc_unsorted_3k.filtered_feature_bc_matrix.h5")

You can see that the names in these vectors match which is what will be used to link the files together upon import.

names(atacFiles)
## [1] "pbmc_sorted_3k"   "pbmc_unsorted_3k"
names(rnaFiles)
## [1] "pbmc_sorted_3k"   "pbmc_unsorted_3k"
all.equal(names(atacFiles), names(rnaFiles))
## [1] TRUE

As in the main tutorial, the first step is to create ArrowFiles from the scATAC-seq fragment files.

ArrowFiles <- createArrowFiles(
  inputFiles = atacFiles,
  sampleNames = names(atacFiles),
  minTSS = 4,
  minFrags = 1000,
  addTileMat = TRUE,
  addGeneScoreMat = TRUE
)
## Using GeneAnnotation set by addArchRGenome(Hg38)!
## Using GeneAnnotation set by addArchRGenome(Hg38)!
## ArchR logging to : ArchRLogs/ArchR-createArrows-371b0572fde0f-Date-2022-12-23_Time-09-40-36.log
## If there is an issue, please report to github with logFile!
## Cleaning Temporary Files
## subThreading Disabled since ArchRLocking is TRUE see `addArchRLocking`
## 2022-12-23 09:40:38 : Batch Execution w/ safelapply!, 0 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-createArrows-371b0572fde0f-Date-2022-12-23_Time-09-40-36.log

Then we create an ArchRProject object from those ArrowFiles

projMulti1 <- ArchRProject(ArrowFiles = ArrowFiles)
## Using GeneAnnotation set by addArchRGenome(Hg38)!
## Using GeneAnnotation set by addArchRGenome(Hg38)!
## Validating Arrows...
## Getting SampleNames...
## 
## Copying ArrowFiles to Ouptut Directory! If you want to save disk space set copyArrows = FALSE
## 1 2 
## Getting Cell Metadata...
## 
## Merging Cell Metadata...
## Initializing ArchRProject...
## 
##                                                    / |
##                                                  /    \
##             .                                  /      |.
##             \\\                              /        |.
##               \\\                          /           `|.
##                 \\\                      /              |.
##                   \                    /                |\
##                   \\#####\           /                  ||
##                 ==###########>      /                   ||
##                  \\##==......\    /                     ||
##             ______ =       =|__ /__                     ||      \\\
##         ,--' ,----`-,__ ___/'  --,-`-===================##========>
##        \               '        ##_______ _____ ,--,__,=##,__   ///
##         ,    __==    ___,-,__,--'#'  ==='      `-'    | ##,-/
##         -,____,---'       \\####\\________________,--\\_##,/
##            ___      .______        ______  __    __  .______      
##           /   \     |   _  \      /      ||  |  |  | |   _  \     
##          /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |    
##         /  /_\  \   |      /     |  |     |   __   | |      /     
##        /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
##       /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|
## 

Now that we’ve handled the scATAC-seq data, we turn out attention to the scRNA-seq data. ArchR provides the import10xFeatureMatrix() function which will perform this automatically for all of the filtered_feature_bc_matrix.h5 files in your data set. While this process may seem straightforward, there are a few things going on under-the-hood which are important to understand. First, the filtered_feature_bc_matrix.h5 created by cellranger-arc are not always perfectly matched across samples. For example, ArchR will throw an error if your various scRNA-seq input files do not match based on the gene names or the metadata columns because we view this as an unforseen incompatibility between data, potentially caused by alignment to different reference genomes or something similar. There are also more permissible mismatches that occur across these input files. Relatively frequently, the transcript-level data that is associated with each gene can vary slightly across samples for reasons that aren’t completely clear. For more information, see this post. Because of this, you have to decide what to do when these minor conflicts arise. We provide two options which are controlled by the strictMatch parameter. If you want to try to keep the genes with mis-matched metadata information, set strictMatch = FALSE which will coerce all samples to match the metadata information of the first sample from your input. Alternatively, if you would prefer to remove genes whose metadata is mis-matched across samples, you can set strictMatch = TRUE which will remove the offending genes from all samples. This is often a very small number of genes so it is unlikely to affect your analysis either way.

seRNA <- import10xFeatureMatrix(
  input = rnaFiles,
  names = names(rnaFiles),
  strictMatch = TRUE
)
## Importing Feature Matrix 1 of 2
## Merging individual RNA objects...
## 
## Merging pbmc_unsorted_3k
## Importing Feature Matrix 2 of 2
## Warning in import10xFeatureMatrix(input = rnaFiles, names = names(rnaFiles), : These features had no interval present and were given a fake GRanges location on `chrNA`:
## 
## MT-ND1,MT-ND2,MT-CO1,MT-CO2,MT-ATP8,MT-ATP6,MT-CO3,MT-ND3,MT-ND4L,MT-ND4,MT-ND5,MT-ND6,MT-CYB
## 
## Either modify the output SE or see `features` input to use a reference GRanges to add to these features!

In this tutorial data, there actually aren’t any examples of this transcript-level metadata mismatch. But if there were, ArchR would output warnings to tell you about the genes that were being excluded or coerced. However, what you do see is a warning that some of the features (genes) did not have an “interval” (or genomic location) present in the 10x input file. This is most frequently the case for genes encoded on the mitochondrial DNA but could in theory happen for other genes in other species as well. This causes problems for ArchR because we assume that every gene has a position in the genome. To get around this problem, ArchR will default to assigning these genes a fake genomic position on “chrNA”. All of these genes will then get excluded from downstream analyses. If you want to rescue these genes, then you must provide this information via the features parameter. This is most easily done using an Ensembl database object from BioConductor. The below code assumes that you have installed the EnsDb.Hsapiens.v86 package from BioConductor.

library(EnsDb.Hsapiens.v86)
seRNA <- import10xFeatureMatrix(
  input = rnaFiles,
  names = names(rnaFiles),
  strictMatch = TRUE,
  features = genes(EnsDb.Hsapiens.v86)
)
## Importing Feature Matrix 1 of 2
## Correcting missing intervals...
## Merging individual RNA objects...
## 
## Merging pbmc_unsorted_3k
## Importing Feature Matrix 2 of 2
## Correcting missing intervals...

After supplying the genes to the features param, you can see that all of the genes encoded on the mitochondrial genome are rescued.

The next thing that we want to do is add this scRNA-seq data to our ArchRProject via the addGeneExpressionMatrix() function. However, there is some tidying up that we need to do before we are ready to do that. Inherent in how the multiomic data is generated, you will likely have cells that pass scRNA-seq quality control but not scATAC-seq quality control and vice versa. This can cause problems downstream with other ArchR functions that expect every cell to have every data type. For example, if you were to add gene expression data for only a subset of cells and then try to perform addIterativeLSI() on the corresponding GeneExpressionMatrix, ArchR would not know what to do with the cells that were missing data in the GeneExpressionMatrix. Because our ArchRProject is anchored in the scATAC-seq data, we first check to see how many cells in our project (the ones that have passed scATAC-seq quality control) are not also in our scRNA-seq data

length(which(getCellNames(projMulti1) %ni% colnames(seRNA)))
## [1] 119

Though not many cells in this case, we need to remove these cells from our project before proceeding. We do this using the subsetArchRProject() function and saving this as a new ArchRProject called projMulti2. Again, this removal isn’t explicitly necessary but there are downstream ArchR functions that could break if all cells dont have both data types so the safest thing to do is remove them.

cellsToKeep <- which(getCellNames(projMulti1) %in% colnames(seRNA))
projMulti2 <- subsetArchRProject(ArchRProj = projMulti1, cells = getCellNames(projMulti1)[cellsToKeep], outputDirectory = "Save-ProjMulti2", force = TRUE)
## Copying ArchRProject to new outputDirectory : /corces/home/rcorces/scripts/github/ArchR_Website_Testing/bookdown/Save-ProjMulti2
## Copying Arrow Files...
## Getting ImputeWeights
## No imputeWeights found, returning NULL
## Copying Other Files...
## Saving ArchRProject...
## Loading ArchRProject...
## Successfully loaded ArchRProject!
## 
##                                                    / |
##                                                  /    \
##             .                                  /      |.
##             \\\                              /        |.
##               \\\                          /           `|.
##                 \\\                      /              |.
##                   \                    /                |\
##                   \\#####\           /                  ||
##                 ==###########>      /                   ||
##                  \\##==......\    /                     ||
##             ______ =       =|__ /__                     ||      \\\
##         ,--' ,----`-,__ ___/'  --,-`-===================##========>
##        \               '        ##_______ _____ ,--,__,=##,__   ///
##         ,    __==    ___,-,__,--'#'  ==='      `-'    | ##,-/
##         -,____,---'       \\####\\________________,--\\_##,/
##            ___      .______        ______  __    __  .______      
##           /   \     |   _  \      /      ||  |  |  | |   _  \     
##          /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |    
##         /  /_\  \   |      /     |  |     |   __   | |      /     
##        /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
##       /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|
## 

Finally, we are ready to add the gene expression data to our project. You’ll notice here another parameter called strictMatch. When strictMatch = TRUE, this will ensure that all cells in the ArchRProject are also represented in the seRNA object. This is a nice fail-safe to make sure that the above project subsetting went smoothly. If strictMatch = FALSE (the default), then this function will merely throw a warning telling you that not all of your cells have scRNA-seq information and that this could cause problems downstream.

projMulti2 <- addGeneExpressionMatrix(input = projMulti2, seRNA = seRNA, strictMatch = TRUE, force = TRUE)
## ArchR logging to : ArchRLogs/ArchR-addGeneExpressionMatrix-371b0156a1c81-Date-2022-12-23_Time-09-43-33.log
## If there is an issue, please report to github with logFile!
## Overlap w/ scATAC = 1
## 2022-12-23 09:43:34 :
## Overlap Per Sample w/ scATAC : pbmc_sorted_3k=2626,pbmc_unsorted_3k=2845
## 2022-12-23 09:43:34 :
## 2022-12-23 09:43:36 : Batch Execution w/ safelapply!, 0 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-addGeneExpressionMatrix-371b0156a1c81-Date-2022-12-23_Time-09-43-33.log

The last thing that we will do during this project setup phase is filter out any doublets.

projMulti2 <- addDoubletScores(projMulti2, force = TRUE)
## ArchR logging to : ArchRLogs/ArchR-addDoubletScores-371b03f48034a-Date-2022-12-23_Time-09-45-10.log
## If there is an issue, please report to github with logFile!
## 2022-12-23 09:45:12 : Batch Execution w/ safelapply!, 0 mins elapsed.
## 2022-12-23 09:45:12 : pbmc_sorted_3k (1 of 2) :  Computing Doublet Statistics, 0 mins elapsed.
## Warning: The following arguments are not used: row.names
## pbmc_sorted_3k (1 of 2) : UMAP Projection R^2 = 0.99705
## pbmc_sorted_3k (1 of 2) : UMAP Projection R^2 = 0.99705
## 2022-12-23 09:46:57 : pbmc_unsorted_3k (2 of 2) :  Computing Doublet Statistics, 1.757 mins elapsed.
## Warning: The following arguments are not used: row.names
## pbmc_unsorted_3k (2 of 2) : UMAP Projection R^2 = 0.996
## pbmc_unsorted_3k (2 of 2) : UMAP Projection R^2 = 0.996
## ArchR logging successful to : ArchRLogs/ArchR-addDoubletScores-371b03f48034a-Date-2022-12-23_Time-09-45-10.log
projMulti2 <- filterDoublets(projMulti2)
## Filtering 148 cells from ArchRProject!
##  pbmc_unsorted_3k : 80 of 2845 (2.8%)
##  pbmc_sorted_3k : 68 of 2626 (2.6%)