3.5 Getting Set Up

The first thing we do is change to our desired our working directory, set the number of threads we would like to use, and load our gene and genome annotations. Depending on the configuration of your local environment, you may need to modify the number of threads used below in addArchRThreads(). By default ArchR uses half of the total number of threads available but you can adjust this manually as you see fit. As a reminder, ArchR is not supported on Windows.

For the purposes of this tutorial, we are assuming that you followed the instructions in the chapter on handling ArchR’s dependencies using renv and that you have set up a working directory for running this tutorial.

First, we load the ArchR library. If this fails, you have not properly installed ArchR and should revisit the installation instructions. We will also set our seed for randomized opperations for consistency. You can set your seed to whatever you would like (here we use 1), but we encourage you to be consistent and always set the same seed.

library(ArchR)
set.seed(1)

Next, we either enable or disable file locking for HDF5 files. File locking allows for processes to take advantage of subthreading but may not work on certain network-based file systems, such as network-based lustre file systems, that are often present on high-performance computing clusters. This has caused a lot of problems for specific users, in the past (perhaps best summarized here). Thus, it is extremely important to consistently enable or disable file locking based on your particular computational environment. Enabling file locking will prevent subthreading, disabling file locking will permit it. Regardless, whatever you chose for file locking is controlled by environment variables and only persists during the current R session so you should include a call to addArchRFileLocking() in the beginning of each new R session (just like set.seed() or addArchRThreads() etc.). The filesystem used to generate this tutorial works best with locking = TRUE so that is what we set below but you should use whatever you think is best for your system. If you have trouble with Arrow File generation on the tutorial data or with various HDF5 file accessibility errors, try setting locking = TRUE at the beginning of your project (before Arrow File generation).

addArchRLocking(locking = TRUE)
## Setting ArchRLocking to TRUE.

Next, we set the default number of threads for ArchR functions. This is something you will have to do during each new R session. We recommend setting threads to 1/2 to 3/4 of the total available cores. The memory usage in ArchR will often scale with the number of threads used so allowing ArchR to use more threads will also lead to higher memory usage.

addArchRThreads(threads = 1) 
## Setting default number of Parallel threads to 1.

Then, we set the genome to be used for gene and genome annotations. As above, this is something you will have to do during each new R session. Of course, this genome version must match the genome version that was used for alignment. For the data used in this tutorial, we will use the hg19 reference genome but ArchR natively supports additional genome annotations and custome genome annotations as outlined in the next section.

addArchRGenome("hg19")
## Setting default genome to Hg19.

ArchR requires gene and genome annotations to do things such as calculate TSS enrichment scores, nucleotide content, and gene activity scores. Because our tutorial dataset uses scATAC-seq data that has already been aligned to the hg19 reference genome, we have set “hg19” as the default genome in the previous section. However, ArchR supports “hg19”, “hg38”, “mm9”, and “mm10” natively and you can create your own genome and gene annotations using the createGeneAnnotation() and createGenomeAnnotation() functions.

Providing this information to ArchR is streamlined through the addArchRGenome() function. This function tells ArchR that, for all analyses in the current session, it should use the genomeAnnotation and geneAnnotation associated with the defined ArchRGenome. Each of the natively supported genomes are composed of a BSgenome object that defines the genomic coordinates and sequence of each chromosome, a GRanges object containing a set of blacklisted regions, a TxDb object that defines the positions and structures of all genes, and an OrgDb object that provides a central gene identifier and contains mappings between this identifier and other kinds of identifiers.

Below are examples of how to load gene and genome annotations for the natively supported genomes as well as information on their BSgenome and blacklist components.

The precompiled version of the hg19 genome in ArchR uses BSgenome.Hsapiens.UCSC.hg19, TxDb.Hsapiens.UCSC.hg19.knownGene, org.Hs.eg.db, and a blacklist that was merged using ArchR::mergeGR() from the hg19 v2 blacklist regions and from mitochondrial regions that show high mappability to the hg19 nuclear genome from Caleb Lareau and Jason Buenrostro. To set a global genome default to the precompiled hg19 genome:

addArchRGenome("hg19")

The precompiled version of the hg38 genome in ArchR uses BSgenome.Hsapiens.UCSC.hg38, TxDb.Hsapiens.UCSC.hg38.knownGene, org.Hs.eg.db, and a blacklist that was merged using ArchR::mergeGR() from the hg38 v2 blacklist regions and from mitochondrial regions that show high mappability to the hg38 nuclear genome from Caleb Lareau and Jason Buenrostro. To set a global genome default to the precompiled hg38 genome:

addArchRGenome("hg38")

The precompiled version of the mm9 genome in ArchR uses BSgenome.Mmusculus.UCSC.mm9, TxDb.Mmusculus.UCSC.mm9.knownGene, org.Mm.eg.db, and a blacklist that was merged using ArchR::mergeGR() from the mm9 v1 blacklist regions from Anshul Kundaje and from mitochondrial regions that show high mappability to the mm9 nuclear genome from Caleb Lareau and Jason Buenrostro. To set a global genome default to the precompiled mm9 genome:

addArchRGenome("mm9")

The precompiled version of the mm10 genome in ArchR uses BSgenome.Mmusculus.UCSC.mm10, TxDb.Mmusculus.UCSC.mm10.knownGene, org.Mm.eg.db, and a blacklist that was merged using ArchR::mergeGR() from the mm10 v2 blacklist regions and from mitochondrial regions that show high mappability to the mm10 nuclear genome from Caleb Lareau and Jason Buenrostro. To set a global genome default to the precompiled mm10 genome:

addArchRGenome("mm10")

3.5.1 Creating a Custom `ArchRGenome`

As described above, an ArchRGenome consists of a genome annotation and a gene annotation. If you create your own genome and gene annotations, then you would pass these directly to createArrowFiles() using the genomeAnnotation and geneAnnotation parameters, rather than calling addArchRGenome().

To create a custom genome annotation, we use createGenomeAnnotation(). To do this, you will need the following information:

A BSgenome object which contains the sequence information for a genome. These are commonly Bioconductor packages (for example, BSgenome.Hsapiens.UCSC.hg38) that can be easily found with google.
A GRanges genomic ranges object containing a set of blacklisted regions that will be used to filter out unwanted regions from downstream analysis. This is not required but is recommended. For information on how blacklists are created, see the publication on the ENCODE blacklists.

For example, if we wanted to create a custom genome annotation from Drosophila melanogaster, we would first identify and install and load the relevant BSgenome object.

if (!requireNamespace("BSgenome.Dmelanogaster.UCSC.dm6", quietly = TRUE)){
  BiocManager::install("BSgenome.Dmelanogaster.UCSC.dm6", update = FALSE)
}
library(BSgenome.Dmelanogaster.UCSC.dm6)
## Loading required package: BSgenome
## Loading required package: Biostrings
## Loading required package: XVector
## 
## Attaching package: 'XVector'
## The following object is masked from 'package:plyr':
## 
##     compact
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:grid':
## 
##     pattern
## The following object is masked from 'package:base':
## 
##     strsplit
## Loading required package: BiocIO
## Loading required package: rtracklayer
## 
## Attaching package: 'rtracklayer'
## The following object is masked from 'package:BiocIO':
## 
##     FileForFormat

Then we create a genome annotation from this BSgenome object.

genomeAnnotation <- createGenomeAnnotation(genome = BSgenome.Dmelanogaster.UCSC.dm6)
## Getting genome..
## Attempting to infer chromSizes..
## Attempting to infer blacklist..
## Blacklist not downloaded! Continuing without, be careful for downstream biases..

Examining this object shows the constituent parts of an ArchR genome annotation.

genomeAnnotation
## List of length 3
## names(3): genome chromSizes blacklist

To create a custom gene annotation, we use createGeneAnnotation(). To do this, you will need the following information:

A TxDb object (transcript database) from Bioconductor which contains information for gene/transcript coordinates.
An OrgDb object (organism database) from Bioconductor which provides a unified framework to map between gene names and various gene identifiers.

Continuing with our example from Drosophila melanogaster, we first install and load the relevant TxDb and OrgDb objects.

if (!requireNamespace("TxDb.Dmelanogaster.UCSC.dm6.ensGene", quietly = TRUE)){
  BiocManager::install("TxDb.Dmelanogaster.UCSC.dm6.ensGene", update = FALSE)
}
if (!requireNamespace("org.Dm.eg.db", quietly = TRUE)){
  BiocManager::install("org.Dm.eg.db", update = FALSE)
}
## 
library(TxDb.Dmelanogaster.UCSC.dm6.ensGene)
## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi
library(org.Dm.eg.db)

Then we create the gene annotation object.

geneAnnotation <- createGeneAnnotation(TxDb = TxDb.Dmelanogaster.UCSC.dm6.ensGene, OrgDb = org.Dm.eg.db)
## Getting Genes..
##   1 gene was dropped because it has exons located on both strands of the
##   same reference sequence or on more than one reference sequence, so
##   cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
## Determined Annotation Style = ENSEMBL
## Getting Exons..
## Getting TSS..

Examining this object shows the constituent parts of an ArchR gene annotation.

geneAnnotation
## List of length 3
## names(3): genes exons TSS

In some cases, information contained within the OrgDb object and the TxDb object might not match as expected. ArchR will attempt to default to an annotation style / keytype of “ENTREZID” or “ENSEMBL” to map between gene names within the TxDb object and various gene identifiers within the OrgDb object. For some organisms, (for ex. arabidopsis), you may need to specify a particular annotation style to the annoStyle parameter of createGeneAnnotation() to tell ArchR what the keytype is of the gene names contained within your TxDb object.

Alternatively, if you dont have a TxDb and OrgDb object, you can create a geneAnnotation object from the following information :

A “genes” object - a GRanges object containing gene coordinates (start to end). This object must have a symbols column matching the symbols column of the “exons” object described below.
An “exons” object - GRanges object containing gene exon coordinates. Must have a symbols column matching the symbols column of the “genes” object described above.
A GRanges object containing standed transcription start site (TSS) coordinates.

geneAnnotation <- createGeneAnnotation(
  TSS = geneAnnotation$TSS, 
  exons = geneAnnotation$exons, 
  genes = geneAnnotation$genes
)

In general, we cannot provide much support for non-standard genomes. However, some ArchR users have created a repository to hold custom genome annotations for some popular model organisms. We encourage you to check out and contribute to that user-developed custom genomes repository. If you are looking for information on how to forge your own custom gene annotation, you should read through previous issue posts for inspiration. For species with lots of small scaffolds, please make sure to remove genes within 2 kb of scaffold ends from the gene annotation, remove unassembled scaffolds from the genome annotation, and remove any scaffolds that lack genes as these will interfere with downstream analyses. When creating a custom genome for use with ArchR, it is imperative that all of the provided information matches. More explicitly, the chromosomes/seqnames present in your genomeAnnotation must match those present in your geneAnnotation and both must match the genome used for alignment. If you have contigs with no mapped reads, these may cause problems for ArchR and you should remove them entirely from your data and annotations.

3.5.2 Using Non-standard Genomes in ArchR

ArchR implements some checks to prevent you from doing things that we consider “out of the norm”. One of these checks forces the seqnames of your genome annotation to start with “chr”. This is true in most applications but there are some genomes (for example Maize) which do not use “chr” as a prefix for the different chromosomes. To perform any ArchR analyses on such a genome, you have to tell ArchR to ignore the chromosome prefixes. To do this, you must run addArchRChrPrefix(chrPrefix = FALSE) prior to creating your Arrow files. This will turn off the requirement for “chr” prefixes on seqnames globally in the current R session. Please keep in mind ArchR converts chromosome/seqnames to character by default. Thus, if your seqnames are just integers, you will need to supply them as characters i.e. useSeqnames = c("1", "2", "3") instead of useSeqnames = c(1, 2, 3). You can always check if chromosome prefixes are required in the current R session using getArchRChrPrefix().

3.5 Getting Set Up

3.5.1 Creating a Custom ArchRGenome

3.5.2 Using Non-standard Genomes in ArchR

3.5.1 Creating a Custom `ArchRGenome`