5.6 Filtering Doublets from an ArchRProject

After we have added information on the predicted doublets using addDoubletScores(), we can remove these predicted doublets using filterDoublets(). A key parameter of this filtering step is filterRatio, which sets the maximum number of predicted doublets to filter relative to the number of pass-filter cells. For example, if there are 5000 cells, the maximum number of filtered predicted doublets would be filterRatio * 5000^2 / (100000) (which simplifies to filterRatio * 5000 * 0.05). Because this maximum scales with the number of cells, filterRatio allows you to apply a consistent filter across multiple samples that may have different percentages of doublets because they were run with different cell loading concentrations. The higher the filterRatio, the greater the number of cells potentially removed as doublets.
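The arithmetic above can be checked with a few lines of R. This is a minimal sketch that only evaluates the formula from the text; max_filtered() is a hypothetical helper written for this illustration and is not part of ArchR, and its filterRatio = 1 default is chosen for the sketch.

```r
# Hypothetical helper (not an ArchR function): evaluates the cap from the
# formula in the text, filterRatio * nCells^2 / 100000.
max_filtered <- function(nCells, filterRatio = 1) {
  filterRatio * nCells^2 / 100000
}

cap_5k  <- max_filtered(5000)    # 250 cells, i.e. 5% of 5000
cap_10k <- max_filtered(10000)   # 1000 cells, i.e. 10% of 10000

# Expressed as a fraction of the sample, the cap grows linearly with cell number.
frac_5k  <- cap_5k / 5000
frac_10k <- cap_10k / 10000

print(c(cap_5k = cap_5k, cap_10k = cap_10k, frac_5k = frac_5k, frac_10k = frac_10k))
```

Because the cap scales with the square of the cell count, a sample with twice as many cells is allowed four times as many removals, which is how a single filterRatio can be applied to samples of different sizes.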

The other parameters that can be adjusted for doublet filtration are cutEnrich and cutScore. cutEnrich is a cutoff on the doublet enrichment, which is the number of simulated doublets identified as a nearest neighbor to the cell divided by the expected number given a random uniform distribution. cutScore is a cutoff on the doublet score, which is the -log10(pval) of the enrichment; we have found the score to be a worse predictor of doublets than the enrichment. Parameters for doublet filtering should be sensible, and that is why filterRatio is set to ensure that you do not over-filter. If you were to plot the distribution of DoubletEnrichment or DoubletScore (which are stored in cellColData), you should see a small population with high score or enrichment; these cells represent putative doublets. The goal here is to filter as many doublets as possible while not removing true singlets.
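To make the interplay between the cutoffs and filterRatio concrete, here is a toy sketch on simulated values. It is not ArchR's implementation: pick_putative_doublets() is a hypothetical helper, the data are random numbers rather than real doublet scores, the default argument values are placeholders for this sketch, and rounding the cap down to a whole number of cells is an assumption.

```r
# Toy data (simulated, not ArchR output): 950 low-enrichment cells and
# 50 cells with high enrichment and score standing in for putative doublets.
set.seed(1)
cells <- data.frame(
  DoubletEnrichment = c(runif(950, 0, 1.5), runif(50, 3, 10)),
  DoubletScore      = c(runif(950, 0, 5),   runif(50, 20, 100)),
  row.names         = paste0("cell_", seq_len(1000))
)

# Hypothetical helper: apply both cutoffs, then keep at most the
# filterRatio-based maximum, taking the highest-enrichment cells first.
pick_putative_doublets <- function(cells, cutEnrich = 1, cutScore = -Inf,
                                   filterRatio = 1) {
  n <- nrow(cells)
  ranked <- cells[order(cells$DoubletEnrichment, decreasing = TRUE), , drop = FALSE]
  pass <- ranked$DoubletEnrichment >= cutEnrich & ranked$DoubletScore >= cutScore
  cap <- floor(filterRatio * n^2 / 100000)  # assumed rounding
  head(rownames(ranked)[pass], cap)
}

# With 1000 cells and filterRatio = 1 the cap is 10 cells, so the cap limits removal.
capped <- pick_putative_doublets(cells)

# A stricter cutEnrich with a generous filterRatio: the cutoff limits removal instead.
by_cutoff <- pick_putative_doublets(cells, cutEnrich = 3, filterRatio = 10)

print(length(capped))
print(length(by_cutoff))
```

In the first call the cap protects against over-filtering even though many cells pass the enrichment cutoff; in the second, only the simulated high-enrichment population is removed because the cutoff is reached before the cap.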

To filter doublets, we use the filterDoublets() function. We save the output of this function call as a new ArchRProject for the purposes of this stepwise tutorial, but you can always overwrite your original ArchRProject object instead.

projHeme2 <- filterDoublets(projHeme1)
## Filtering 410 cells from ArchRProject!
##  scATAC_BMMC_R1 : 243 of 4932 (4.9%)
##  scATAC_CD34_BMMC_R1 : 107 of 3275 (3.3%)
##  scATAC_PBMC_R1 : 60 of 2453 (2.4%)

We can compare the number of cells in projHeme1 (pre-doublet-removal) and in projHeme2 (post-doublet-removal) and see that 410 cells have been removed during the doublet filtration process.

length(getCellNames(ArchRProj = projHeme1))
## [1] 10660
length(getCellNames(ArchRProj = projHeme2))
## [1] 10250

If you wanted to filter more cells from the ArchRProject, you would use a higher filterRatio or alternatively adjust cutEnrich or cutScore as described above.

projHemeTmp <- filterDoublets(projHeme1, filterRatio = 1.5)
## Filtering 614 cells from ArchRProject!
##  scATAC_BMMC_R1 : 364 of 4932 (7.4%)
##  scATAC_CD34_BMMC_R1 : 160 of 3275 (4.9%)
##  scATAC_PBMC_R1 : 90 of 2453 (3.7%)
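The per-sample counts in the two log outputs can be reproduced from the filterRatio formula alone. The sketch below uses only the cell numbers printed in this section; cap_per_sample() is a hypothetical helper, and rounding the maximum down to a whole cell is an assumption inferred from the logged numbers rather than something stated in the text.

```r
# Pass-filter cell counts per sample, copied from the log output in this section.
n_cells <- c(scATAC_BMMC_R1 = 4932, scATAC_CD34_BMMC_R1 = 3275, scATAC_PBMC_R1 = 2453)

# Hypothetical helper. Assumption: the maximum is rounded down to whole cells.
cap_per_sample <- function(n, filterRatio = 1) floor(filterRatio * n^2 / 100000)

removed_default <- cap_per_sample(n_cells)                     # first call above
removed_high    <- cap_per_sample(n_cells, filterRatio = 1.5)  # filterRatio = 1.5

summary_tbl <- data.frame(
  nCells            = n_cells,
  removed_ratio_1   = removed_default,
  pct_ratio_1       = round(100 * removed_default / n_cells, 1),
  removed_ratio_1.5 = removed_high,
  pct_ratio_1.5     = round(100 * removed_high / n_cells, 1)
)
print(summary_tbl)

# Cells left after the first filtering call.
cells_remaining <- sum(n_cells) - sum(removed_default)
print(cells_remaining)
```

Under these assumptions the computed maxima equal the logged counts for every sample in both runs, which suggests that filterRatio, rather than cutEnrich or cutScore, was the limiting factor for this dataset.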

Since projHemeTmp was only created for illustrative purposes, we remove it from our R session.

rm(projHemeTmp)