3.6 Creating Arrow Files

For the remainder of this tutorial, we will use data from a downsampled dataset of hematopoietic cells from Granja* et al. Nature Biotechnology 2019. This includes data from bone marrow mononuclear cells (BMMC), peripheral blood mononuclear cells (PBMC), and CD34+ hematopoietic stem and progenitor cells from bone marrow (CD34 BMMC).

This data is downloaded as fragment files, which contain the start and end genomic coordinates of all aligned sequenced fragments. Fragment files are one of the base file types of the 10x Genomics analytical platform (and other platforms) and can be easily created from any BAM file. See the 10x Genomics website for information on making your own fragment files for input to ArchR. Importantly, if you are creating your own fragment files, these files must be sorted by position and compressed using the bgzip utility. Despite having a .gz suffix, bgzipped files are different from ordinary gzipped files.
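As a rough illustration of the sorting requirement, the sketch below checks a small table of fragments in base R. The helper is_position_sorted() is hypothetical and not part of ArchR, the toy table follows the five-column 10x Genomics layout (chromosome, start, end, barcode, read count), and "sorted by position" is assumed here to mean that rows are grouped by chromosome with non-decreasing start coordinates within each chromosome.

```r
# Hypothetical helper (not part of ArchR): checks that fragments are grouped by
# chromosome and that start coordinates never decrease within a chromosome.
is_position_sorted <- function(frags) {
  chr <- as.character(frags[[1]])
  start <- frags[[2]]
  # Each chromosome should appear as one contiguous run of rows.
  contiguous <- !any(duplicated(rle(chr)$values))
  # Within each chromosome, start coordinates should be non-decreasing.
  ordered <- all(vapply(split(start, chr), function(s) !is.unsorted(s), logical(1)))
  contiguous && ordered
}

# Toy fragments (made-up values): chromosome, start, end, barcode, read count.
frags <- data.frame(
  chr = c("chr1", "chr1", "chr2"),
  start = c(100L, 250L, 50L),
  end = c(180L, 400L, 120L),
  barcode = c("AAAC", "AAAG", "AAAC"),
  count = c(1L, 2L, 1L)
)

sorted_ok <- is_position_sorted(frags)                  # rows already in order
shuffled_ok <- is_position_sorted(frags[c(2, 1, 3), ])  # chr1 rows swapped
```

A check like this only covers ordering; compression with bgzip still has to be done separately.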

While ArchR can handle BAM files as input, fragment files are the preferred input. If using cellranger, you should not use the output of aggr in ArchR because this groups samples together, thus eliminating the ability of ArchR to parallelize across samples. If you are using non-10x Genomics data, you should be extra careful to make sure that your input files match what ArchR expects. For example, Bio-Rad data is often aligned using multi-species references, and fragment files for ArchR can only contain one species. Similarly, if your data is from single-end sequencing, you would need to create pseudo-fragments that span a single base pair, though this is not something that we have tested or support.

Once we have our fragment files, we provide their paths as a character vector to createArrowFiles(). During creation, some basic metadata and matrices are added to each Arrow file, including a “TileMatrix” containing insertion counts across genome-wide 500-bp bins (see addTileMatrix()) and a “GeneScoreMatrix” that stores predicted gene expression based on weighting insertion counts in tiles near a gene promoter (see addGeneScoreMatrix()).
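To make the idea of 500-bp bins concrete, here is a minimal base R sketch that counts insertions per tile on a single chromosome. It is an illustration only and not how ArchR builds the TileMatrix internally; the insertion positions are made up, and the boundary convention (tile 1 covering positions 1 to 500) is an assumption.

```r
tile_size <- 500L

# Toy 1-based insertion positions on one chromosome (hypothetical values).
insertions <- c(12L, 499L, 500L, 501L, 1250L, 1260L)

# Assumed convention: tile 1 covers positions 1-500, tile 2 covers 501-1000, etc.
tile_index <- (insertions - 1L) %/% tile_size + 1L

# Number of insertions falling in each tile that was hit at least once.
tile_counts <- table(tile_index)
tile_counts
```

In a real matrix, each cell would get its own column of such counts, and tiles with no insertions would be stored implicitly as zeros in a sparse matrix.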

The tutorial data can be downloaded using the getTutorialData() function. The tutorial data is approximately 0.5 GB in size. If you have already downloaded the tutorial data to the current working directory, ArchR will skip the download.

library(ArchR)
library(parallel)
inputFiles <- getTutorialData("Hematopoiesis")
## Downloading files to HemeFragments...
## Downloading file scATAC_BMMC_R1.fragments.tsv.gz...
## Downloading file scATAC_CD34_BMMC_R1.fragments.tsv.gz...
## Downloading file scATAC_PBMC_R1.fragments.tsv.gz...
inputFiles
##                                       scATAC_BMMC_R1 
##      "HemeFragments/scATAC_BMMC_R1.fragments.tsv.gz" 
##                                  scATAC_CD34_BMMC_R1 
## "HemeFragments/scATAC_CD34_BMMC_R1.fragments.tsv.gz" 
##                                       scATAC_PBMC_R1 
##      "HemeFragments/scATAC_PBMC_R1.fragments.tsv.gz"

If we inspect inputFiles, we see that it is simply a named vector of fragment file paths. This vector is automatically created for convenience by the getTutorialData() function, but you will need to create it yourself for your own data like so:

inputFiles_example <- c("/path/to/fragFile1.tsv.gz", "/path/to/fragFile2.tsv.gz")
names(inputFiles_example) <- c("Sample1","Sample2")
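With many samples, typing each path by hand becomes tedious. One possible approach, sketched below in base R, is to list the fragment files in a directory and derive the sample names from the file names. The directory, the file names, and the ".fragments.tsv.gz" suffix are assumptions made for illustration; empty placeholder files are created in a temporary directory only so that the example runs on its own.

```r
# Hypothetical directory holding fragment files; a temporary folder with empty
# placeholder files is used here so the sketch is self-contained.
frag_dir <- file.path(tempdir(), "frag_demo")
dir.create(frag_dir, showWarnings = FALSE)
file.create(file.path(frag_dir, c("SampleA.fragments.tsv.gz",
                                  "SampleB.fragments.tsv.gz")))

# Collect full paths; the trailing "$" keeps index files such as .tbi out.
suffix <- "\\.fragments\\.tsv\\.gz$"
inputFiles_auto <- list.files(frag_dir, pattern = suffix, full.names = TRUE)

# Use the file name without the suffix as the sample name.
names(inputFiles_auto) <- sub(suffix, "", basename(inputFiles_auto))
inputFiles_auto
```

Deriving names this way keeps sample names and paths in sync, but it only works if your file naming scheme is consistent.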

As always, before starting a project we must set the ArchRGenome and default threads for parallelization. The tutorial data was aligned to the hg19 reference genome so we will use that. The number of threads used will depend on your system so you should adjust that accordingly.

addArchRGenome("hg19")
## Setting default genome to Hg19.
addArchRThreads(threads = 16) 
## Setting default number of Parallel threads to 16.
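If you are unsure how many threads your machine can offer, one option is to query the core count with detectCores() from the parallel package before calling addArchRThreads(). The rule used below (leave two cores free, never go below one) is an arbitrary assumption for illustration, not an ArchR recommendation.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to one thread.
n_cores <- detectCores()
threads <- if (is.na(n_cores)) 1L else max(1L, n_cores - 2L)
threads

# The value could then be passed on, e.g. addArchRThreads(threads = threads)
```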

Now we will create our Arrow files, which will take 10-15 minutes for the tutorial data. For each sample, this step will:

1. Read accessible fragments from the provided input files.
2. Calculate quality control information for each cell (i.e. TSS enrichment scores and nucleosome info).
3. Filter cells based on quality control parameters.
4. Create a genome-wide TileMatrix using 500-bp bins.
5. Create a GeneScoreMatrix using the custom geneAnnotation that was defined when we called addArchRGenome().

ArrowFiles <- createArrowFiles(
  inputFiles = inputFiles,
  sampleNames = names(inputFiles),
  minTSS = 4, # Don't set this too high because you can always increase it later
  minFrags = 1000, 
  addTileMat = TRUE,
  addGeneScoreMat = TRUE
)
## Using GeneAnnotation set by addArchRGenome(Hg19)!
## Using GeneAnnotation set by addArchRGenome(Hg19)!
## ArchR logging to : ArchRLogs/ArchR-createArrows-3c354112e256-Date-2024-12-03_Time-22-34-27.114667.log
## If there is an issue, please report to github with logFile!
## Cleaning Temporary Files
## subThreading Disabled since ArchRLocking is TRUE see `addArchRLocking`
## 2024-12-03 22:34:27.795516 : Batch Execution w/ safelapply!, 0 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-createArrows-3c354112e256-Date-2024-12-03_Time-22-34-27.114667.log
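
The cell filtering controlled by minTSS and minFrags can be pictured with a toy per-cell table in base R. This is a sketch of the idea rather than ArchR's internal code: the QC values and column names are made up, and treating both thresholds as inclusive is an assumption.

```r
# Thresholds matching the createArrowFiles() call above.
min_tss <- 4
min_frags <- 1000

# Toy per-cell QC table (hypothetical values and column names).
cell_qc <- data.frame(
  barcode = c("cellA", "cellB", "cellC", "cellD"),
  TSSEnrichment = c(8.2, 3.1, 12.5, 6.0),
  nFrags = c(5400, 7200, 800, 1000)
)

# Keep cells that meet both thresholds (inclusive comparison is assumed here).
keep <- cell_qc$TSSEnrichment >= min_tss & cell_qc$nFrags >= min_frags
passing <- cell_qc[keep, ]
passing$barcode
```

Because cells removed at this stage never enter the Arrow file, a permissive threshold here leaves room to filter more strictly later.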

One important parameter in createArrowFiles() is subThreading. However, subThreading does not override the ArchRLocking setting established with addArchRLocking(), mentioned above. Subthreading can only be enabled by setting subThreading = TRUE when ArchRLocking is set to FALSE, which is why the log output above reports that subthreading was disabled.

We can inspect the ArrowFiles object to see that it is actually just a character vector of Arrow file paths.

ArrowFiles
## [1] "scATAC_BMMC_R1.arrow"      "scATAC_CD34_BMMC_R1.arrow"
## [3] "scATAC_PBMC_R1.arrow"