Create Arrow Files — createArrowFiles • ArchR

This function will create ArrowFiles from input files. These ArrowFiles are the main constituent for downstream analysis in ArchR.

createArrowFiles(
  inputFiles = NULL,
  sampleNames = names(inputFiles),
  outputNames = sampleNames,
  validBarcodes = NULL,
  geneAnnotation = getGeneAnnotation(),
  genomeAnnotation = getGenomeAnnotation(),
  minTSS = 4,
  minFrags = 1000,
  maxFrags = 1e+05,
  minFragSize = 10,
  maxFragSize = 2000,
  QCDir = "QualityControl",
  nucLength = 147,
  promoterRegion = c(2000, 100),
  TSSParams = list(),
  excludeChr = c("chrM", "chrY"),
  nChunk = 5,
  bcTag = "qname",
  gsubExpression = NULL,
  bamFlag = NULL,
  offsetPlus = 4,
  offsetMinus = -5,
  addTileMat = TRUE,
  TileMatParams = list(),
  addGeneScoreMat = TRUE,
  GeneScoreMatParams = list(),
  force = FALSE,
  threads = getArchRThreads(),
  parallelParam = NULL,
  subThreading = TRUE,
  verbose = TRUE,
  cleanTmp = TRUE,
  logFile = createLogFile("createArrows"),
  filterFrags = NULL,
  filterTSS = NULL
)

Arguments

inputFiles: A character vector containing the paths to the input files to use to generate the ArrowFiles. These files can be in one of the following formats: (i) scATAC tabix files, (ii) fragment files, or (iii) bam files.
sampleNames: A character vector containing the names to assign to the samples that correspond to the inputFiles. Each input file should receive a unique sample name. This list should be in the same order as inputFiles.
outputNames: The prefix to use for output files. Each input file should receive a unique output file name. This list should be in the same order as inputFiles. For example, if the prefix is "PBMC", the output file will be named "PBMC.arrow".
validBarcodes: A list of valid barcode strings to be used for filtering cells read from each input file (see getValidBarcodes() for 10x fragment files).
geneAnnotation: The geneAnnotation (see createGeneAnnotation()) to associate with the ArrowFiles. This is used downstream to calculate TSS Enrichment Scores etc.
genomeAnnotation: The genomeAnnotation (see createGenomeAnnotation()) to associate with the ArrowFiles. This is used downstream to collect chromosome sizes and nucleotide information etc.
minTSS: The minimum numeric transcription start site (TSS) enrichment score required for a cell to pass filtering for use in downstream analyses. Cells with a TSS enrichment score greater than or equal to minTSS will be retained. TSS enrichment score is a measurement of signal-to-background in ATAC-seq.
minFrags: The minimum number of mapped ATAC-seq fragments required per cell to pass filtering for use in downstream analyses. Cells containing greater than or equal to minFrags total fragments will be retained.
maxFrags: The maximum number of mapped ATAC-seq fragments allowed per cell to pass filtering for use in downstream analyses. Cells containing less than or equal to maxFrags total fragments will be retained.
minFragSize: The minimum fragment size to be included in the ArrowFile. Fragments shorter than this value are discarded. Must be less than maxFragSize.
maxFragSize: The maximum fragment size to be included in the ArrowFile. Fragments longer than this value are discarded. Must be greater than minFragSize.
QCDir: The relative path to the output directory for QC-level information and plots for each sample/ArrowFile.
nucLength: The length in basepairs that wraps around a nucleosome. This number is used for identifying fragments as sub-nucleosome-spanning, mono-nucleosome-spanning, or multi-nucleosome-spanning.
promoterRegion: An integer vector describing the number of basepairs upstream and downstream, c(upstream, downstream), of the TSS to include as the promoter region for downstream calculations such as the fraction of reads in promoters (FIP).
TSSParams: A list of parameters for computing TSS Enrichment scores. This includes window, the size in basepairs of the window centered at each TSS (default 101); flank, the size in basepairs of the flanking window (default 2000); and norm, the size in basepairs of the flank window to be used for normalization of the TSS enrichment score (default 100). For example, given window = 101, flank = 2000, norm = 100, the accessibility within the 101 bp surrounding the TSS will be normalized to the accessibility in the 100-bp bins from -2000 bp to -1901 bp and from +1901 bp to +2000 bp.
excludeChr: A character vector containing the names of chromosomes to be excluded from downstream analyses. In most human/mouse analyses, this includes the mitochondrial DNA (chrM) and the male sex chromosome (chrY). This does not, however, exclude the corresponding fragments from being stored in the ArrowFile.
nChunk: The number of chunks to divide each chromosome into to allow for low-memory parallelized reading of the inputFiles. Higher numbers reduce memory usage but increase compute time.
bcTag: The name of the field in the input bam file containing the barcode tag information. See scanBam() in Rsamtools.
gsubExpression: A regular expression used to clean up the barcode tag string read in from a bam file. For example, if the barcode is embedded in the readname or qname field, as in the mouse atlas data from Cusanovich* and Hill* et al. (2018), the gsubExpression would be ":.*". This would strip the colon and everything after it, leaving the remaining string as the barcode.
bamFlag: A vector of bam flags to be used for reading in fragments from input bam files. Should be in the format of a scanBamFlag passed to scanBam() in Rsamtools.
offsetPlus: The numeric offset to apply to the start (left-most Tn5 insertion) of a fragment to account for the precise Tn5 binding site. This parameter only applies to bam file input; fragment files are assumed to have already been offset, which is the standard for 10x output. See Buenrostro et al. Nature Methods 2013.
offsetMinus: The numeric offset to apply to the end (right-most Tn5 insertion) of a fragment to account for the precise Tn5 binding site. This parameter only applies to bam file input; fragment files are assumed to have already been offset, which is the standard for 10x output. See Buenrostro et al. Nature Methods 2013.
addTileMat: A boolean value indicating whether to add a "Tile Matrix" to each ArrowFile. A Tile Matrix is a counts matrix that, instead of using peaks, uses a fixed-width sliding window of bins across the whole genome. This matrix can be used in many downstream ArchR operations.
TileMatParams: A list of parameters to pass to the addTileMatrix() function. See addTileMatrix() for options.
addGeneScoreMat: A boolean value indicating whether to add a Gene-Score Matrix to each ArrowFile. A Gene-Score Matrix uses ATAC-seq signal proximal to the TSS to estimate gene activity.
GeneScoreMatParams: A list of parameters to pass to the addGeneScoreMatrix() function. See addGeneScoreMatrix() for options.
force: A boolean value indicating whether to force ArrowFiles to be overwritten if they already exist.
threads: The number of threads to be used for parallel computing.
parallelParam: A list of parameters to be passed for BiocParallel/batchtools parallel computing.
subThreading: A boolean value indicating whether, where possible, to use additional threads within each multi-threaded subprocess when the number of threads is greater than the number of input samples.
verbose: A boolean value that determines whether standard output should be printed.
logFile: The path to a file to be used for logging ArchR output.
cleanTmp: A boolean value that determines whether to clean the temp folder of all intermediate ".arrow" files.
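
The per-cell filtering described for minTSS, minFrags, and maxFrags can be illustrated with a minimal base R sketch. This is not ArchR's internal code: the `qc` table and its column names are hypothetical, and it assumes all three thresholds are inclusive.

```r
# Hypothetical per-cell QC table; column names are illustrative, not ArchR's.
qc <- data.frame(
  barcode = c("cellA", "cellB", "cellC", "cellD"),
  nFrags = c(500, 1000, 25000, 250000),
  tssEnrichment = c(8.2, 4.0, 3.1, 9.5),
  stringsAsFactors = FALSE
)

# Defaults taken from the createArrowFiles() signature.
minTSS <- 4
minFrags <- 1000
maxFrags <- 1e+05

# Assumption: a cell is kept only if it satisfies all three inclusive bounds.
keep <- qc$tssEnrichment >= minTSS &
  qc$nFrags >= minFrags &
  qc$nFrags <= maxFrags

passing <- qc$barcode[keep]
print(passing)
```

In this toy table only "cellB" passes: "cellA" has too few fragments, "cellC" falls below minTSS, and "cellD" exceeds maxFrags.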

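As a rough illustration of how nucLength could separate fragments into sub-nucleosome-spanning, mono-nucleosome-spanning, and multi-nucleosome-spanning classes, the base R sketch below bins fragment widths. The bin edges (below nucLength, from nucLength up to twice nucLength, and above) are an assumption made for illustration, and `fragWidths` is made-up data.

```r
nucLength <- 147

# Made-up fragment widths in basepairs.
fragWidths <- c(40, 146, 147, 200, 293, 294, 600)

# Assumed bins: [0, nucLength) = sub, [nucLength, 2 * nucLength) = mono,
# [2 * nucLength, Inf) = multi. right = FALSE makes the left edge inclusive.
fragClass <- cut(
  fragWidths,
  breaks = c(0, nucLength, 2 * nucLength, Inf),
  labels = c("sub", "mono", "multi"),
  right = FALSE
)

classCounts <- table(fragClass)
print(classCounts)
```
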
Examples


# Get Test Fragments
fragments <- getTestFragments()

# Create Arrow Files
arrowFiles <- createArrowFiles(
  inputFiles = fragments,
  sampleNames = "PBSmall",
  minFrags = 100,
  nChunk = 1,
  TileMatParams = list(tileSize = 10000),
  force = TRUE
)
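
The window, flank, and norm values described for TSSParams can be turned into explicit coordinate ranges relative to a TSS at position 0. The base R sketch below reproduces the ranges quoted in the argument description and then computes a toy signal-to-background ratio on a made-up insertion profile. The variable names and the toy score are illustrative assumptions, not ArchR's actual TSS enrichment computation.

```r
# Documented defaults for TSSParams.
tssParams <- list(window = 101, flank = 2000, norm = 100)

# Central window: `window` bp centered on the TSS (position 0).
half <- (tssParams$window - 1) / 2
centerRange <- c(-half, half)

# Background bins: the outermost `norm` bp of each flank.
upstreamNorm <- c(-tssParams$flank, -tssParams$flank + tssParams$norm - 1)
downstreamNorm <- c(tssParams$flank - tssParams$norm + 1, tssParams$flank)

# Toy insertion profile: 10 insertions per bp near the TSS, 1 elsewhere.
positions <- -tssParams$flank:tssParams$flank
signal <- ifelse(abs(positions) <= half, 10, 1)

inCenter <- positions >= centerRange[1] & positions <= centerRange[2]
inNorm <- positions <= upstreamNorm[2] | positions >= downstreamNorm[1]

# Illustrative ratio of mean central signal to mean background signal.
toyScore <- mean(signal[inCenter]) / mean(signal[inNorm])

print(centerRange)
print(upstreamNorm)
print(downstreamNorm)
print(toyScore)
```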