addIterativeLSI.Rd
This function will compute an iterative LSI dimensionality reduction on an ArchRProject.
addIterativeLSI(
ArchRProj = NULL,
useMatrix = "TileMatrix",
name = "IterativeLSI",
iterations = 2,
clusterParams = list(resolution = c(2), sampleCells = 10000, maxClusters = 6, n.start =
10),
firstSelection = "top",
depthCol = "nFrags",
varFeatures = 25000,
dimsToUse = 1:30,
LSIMethod = 2,
scaleDims = TRUE,
corCutOff = 0.75,
binarize = TRUE,
outlierQuantiles = c(0.02, 0.98),
filterBias = TRUE,
sampleCellsPre = 10000,
projectCellsPre = FALSE,
sampleCellsFinal = NULL,
selectionMethod = "var",
scaleTo = 10000,
totalFeatures = 5e+05,
filterQuantile = 0.995,
excludeChr = c(),
keep0lsi = FALSE,
saveIterations = TRUE,
UMAPParams = list(n_neighbors = 40, min_dist = 0.4, metric = "cosine", verbose = FALSE,
fast_sgd = TRUE),
nPlot = 10000,
outDir = getOutputDirectory(ArchRProj),
threads = getArchRThreads(),
seed = 1,
verbose = TRUE,
force = FALSE,
logFile = createLogFile("addIterativeLSI")
)
An ArchRProject object.
The name of the data matrix to retrieve from the ArrowFiles associated with the ArchRProject. Valid options are "TileMatrix" or "PeakMatrix".
The name to use for storage of the IterativeLSI dimensionality reduction in the ArchRProject as a reducedDims object.
The number of LSI iterations to perform.
A list of additional parameters to be passed to addClusters() for clustering within each iteration.
These parameters can be constant across iterations or specified for each iteration individually, so each parameter must have length 1 or length equal to the total number of iterations - 1. To use scran for clustering, pass method="scran" in this list.
The feature selection method for the first iteration of LSI. Either "top" for the most accessible (highest average) features or "var" for the most variable features. "top" should be used for all scATAC-seq data (binary), while "var" should be used for scRNA-seq and other non-binary data types.
A column in the ArchRProject that represents the coverage (scATAC = unique fragments, scRNA = unique molecular identifiers) per cell.
These values are used to minimize depth-related biases in the reduction. For scATAC we recommend "nFrags" and for scRNA we recommend "Gex_nUMI".
The number of variable features (N) to use for LSI. The top N features are selected according to selectionMethod.
A vector containing the dimensions to use in LSI. The total number of dimensions used in LSI will be max(dimsToUse). Setting this too high can affect downstream functionality, including increasing the time required to run addClusters().
A number or string indicating the order of operations in the TF-IDF normalization. Possible values are: 1 or "tf-logidf", 2 or "log(tf-idf)", and 3 or "logtf-logidf".
A boolean that indicates whether to z-score the reduced dimensions for each cell. This is useful for minimizing the contribution of strong biases (dominating early PCs) and lowly abundant populations. However, this may lead to stronger sample-specific biases since it is over-weighting latent PCs.
A numeric cutoff for the correlation of each dimension to the sequencing depth. If a dimension's correlation to sequencing depth is greater than the corCutOff, it will be excluded from analysis.
A boolean value indicating whether the matrix should be binarized before running LSI. This is often desired when working with insertion counts.
Two numerical values (between 0 and 1) that describe the lower and upper quantiles of bias (number of accessible regions per cell, determined by nFrags or colSums) used to filter cells prior to LSI. For example, a value of c(0.02, 0.98) filters out cells below the 2nd percentile and above the 98th percentile prior to LSI. These cells are then projected back into the LSI subspace. This prevents spurious 'islands' that are identified based on being extremely biased.
These quantiles are also used for sub-sampled LSI when determining which cells are used.
A boolean indicating whether to drop bias clusters when computing clusters during iterativeLSI.
An integer specifying the number of cells to sample in iterations prior to the last in order to perform a sub-sampled LSI and sub-sampled clustering. This greatly reduces memory usage and increases speed for early iterations.
A boolean indicating whether to reproject all cells into the sub-sampled LSI (see sampleCellsPre). Setting this to FALSE allows for using the sub-sampled LSI for clustering and variance identification. If TRUE, all cells are projected into the sub-sampled LSI and used for cluster and variance identification.
An integer specifying the number of cells to sample in order to perform a sub-sampled LSI in the final iteration.
The selection method to be used for identifying the top variable features. Valid options are "var" for log-variability or "vmr" for variance-to-mean ratio.
Each column in the matrix designated by useMatrix will be normalized to a column sum designated by scaleTo prior to variance calculation and TF-IDF normalization.
The number of features to consider for use in LSI after ranking the features by the total number of insertions.
These features are the only ones used throughout the variance identification and LSI. When using a TileMatrix, they serve as the equivalent of a defined peakSet.
A number between 0 and 1 that indicates the quantile above which features should be removed based on insertion counts prior to the first iteration of the iterative LSI paradigm. For example, if filterQuantile = 0.99, any features above the 99th percentile in insertion counts will be ignored for the first LSI iteration.
A character vector of chromosomes to exclude from the iterative LSI procedure.
A boolean indicating whether to keep cells with no reads in the features used for LSI.
A boolean value indicating whether the results of each LSI iteration should be saved as compressed .rds files in the designated outDir.
The list of parameters to pass to the UMAP function if saveIterations=TRUE. See the function uwot::umap().
If saveIterations=TRUE, the number of cells to sample to make a UMAP and plot for each iteration.
The output directory for saving LSI iterations if desired. Default is in the outputDirectory of the ArchRProject.
The number of threads to be used for parallel computing.
A number to be used as the seed for random number generation. It is recommended to keep track of the seed used so that you can reproduce results downstream.
A boolean value that determines whether standard output includes verbose sections.
A boolean value that indicates whether or not to overwrite relevant data in the ArchRProject object.
The path to a file to be used for logging ArchR output.
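The three LSIMethod options differ only in where the logarithm is applied during TF-IDF normalization. The base R sketch below is one reading of those option names on a toy binarized matrix. It is an illustration under stated assumptions, not ArchR's internal code: the toy matrix `mat`, the use of log(1 + x), and the reuse of a scale factor of 10000 inside the logarithm are all assumptions made for this example.

```r
# Toy binarized feature-by-cell matrix (rows = features, columns = cells).
# Hypothetical data, used only to make the arithmetic visible.
mat <- matrix(c(1, 0, 1,
                0, 1, 1,
                1, 1, 0,
                1, 1, 1), nrow = 4, byrow = TRUE)
scaleTo <- 10000  # assumed scale factor for the log(tf-idf) variant

# Term frequency: each cell's counts divided by that cell's total.
tf <- sweep(mat, 2, colSums(mat), "/")

# Inverse document frequency: number of cells over each feature's total.
idf <- ncol(mat) / rowSums(mat)

# 1 or "tf-logidf": TF multiplied by the log of the IDF.
# (A length-nrow vector recycles down columns, so row i is scaled by idf[i].)
tfLogIdf <- tf * log(1 + idf)

# 2 or "log(tf-idf)": TF times IDF first, then the log of the scaled product.
logTfIdf <- log(tf * idf * scaleTo + 1)

# 3 or "logtf-logidf": log of the TF multiplied by the log of the IDF.
logTfLogIdf <- log(tf + 1) * log(1 + idf)
```

In all three variants a zero entry stays zero, so the sparsity pattern of the input matrix is preserved.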
# Get Test ArchR Project
proj <- getTestProject()
# Iterative LSI
proj <- addIterativeLSI(proj, dimsToUse = 1:5, varFeatures=1000, iterations = 2, force=TRUE)
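The depth-based filters described for outlierQuantiles and corCutOff can be sketched in base R as follows. This is a minimal illustration, not ArchR's implementation: the `depth` vector stands in for the nFrags column, the `reduced` matrix is a made-up set of three components, and comparing the absolute correlation against the cutoff is an assumption of this sketch.

```r
# Hypothetical per-cell depth values standing in for the nFrags column.
depth <- (1:100) * 1000

# outlierQuantiles = c(0.02, 0.98): keep cells between the two quantiles.
outlierQuantiles <- c(0.02, 0.98)
cutoffs <- quantile(depth, probs = outlierQuantiles)
keepCells <- depth >= cutoffs[1] & depth <= cutoffs[2]

# Hypothetical reduced dimensions (cells x components).
reduced <- cbind(
  LSI1 = log10(depth),                               # driven by depth
  LSI2 = sin(seq_along(depth)),                      # weakly related to depth
  LSI3 = rep(c(-1, 1), length.out = length(depth))   # alternating signal
)

# corCutOff = 0.75: drop components whose correlation to depth exceeds it.
# Using the absolute correlation is an assumption of this sketch.
corCutOff <- 0.75
corToDepth <- apply(reduced, 2, function(d) abs(cor(d, depth)))
keepDims <- which(corToDepth <= corCutOff)
```

Here `keepCells` marks the cells that would enter the LSI directly, and `keepDims` holds the indices of the components that pass the depth-correlation cutoff.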